I’ve been wrestling with the backup architecture for our new SQL clusters. We’ve got:
• 3 Always-On Availability Groups, each with 3 servers (no shared storage between nodes).
• A handful of standalone SQL servers for test/dev/replication.
My original plan was to lean heavily on the Veeam SQL Plug-In to centralize our database backups. Previously, our DBA would back up the databases manually, and we’d back up those backups using standard VM-level Veeam jobs, which is a bandwidth and storage nightmare.
Using the plug-in would centralize backups into a Veeam repo, while still allowing our DBA to manage them with support from my team. The idea was to complement this with application-aware, copy-only storage snapshots as a failsafe in case something went wrong with the plug-in. We’d then spin off the SQL Plug-In backups to tape, just like the rest of our backups.
This is where things started to fall apart, along with a few oddities:
• For the SQL Plug-In you can only do repo-to-repo copy, which doesn’t help with tape.
• Or you can do file-to-tape periodically, which isn’t ideal.
• And SQL to object storage jobs aren’t supported until Veeam v13.
Current Setup
SQL Plug-In Backups (and Restores):
1. SQL Agent jobs configured identically across all nodes of each of the 3 clusters (which is a pain). The exact schedules may vary depending on the cluster or server.
• Nightly full backups of every database.
• 20-minute transaction log backups.
2. Nightly full restores to dev/test environments.
• (Don’t ask why devs need full copies of certain production databases in dev, we’ve fought this for years. It’s insane, but this is where the plug-in really shines for us.)
Storage Snapshots (Failsafe):
• Every 2 hours, Veeam triggers a full VM snapshot with a copy-only SQL backup.
• Retained for one week on our storage array.
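For a sense of scale, the schedules above work out as follows. This is only a quick Python sketch: the intervals come from the lists above, and the helper name `points_in` is made up for illustration.

```python
from datetime import timedelta

def points_in(window: timedelta, interval: timedelta) -> int:
    """Number of backup runs in `window` when one runs every `interval`."""
    return window // interval

# 20-minute transaction log backups over one day
log_backups_per_day = points_in(timedelta(days=1), timedelta(minutes=20))
# 2-hour storage snapshots retained for one week
snapshots_retained = points_in(timedelta(weeks=1), timedelta(hours=2))

print(log_backups_per_day)  # 72
print(snapshots_retained)   # 84
```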
VM Repo Backups:
• Standard VM backups, system drives only, no SQL data.
Tape Backups:
1. Standard SQL VM repo backups are included in our regular tape jobs, nothing special here, just a job to tape.
2. File-to-tape job for the entire SQL server backup path.
• We chose not to do repo-to-tape because we didn’t want to restore the entire repo just to get one database.
• Full backup on Sunday, incrementals through Saturday.
This setup doesn’t feel overly complex, but the SQL Veeam repo, which holds just two full backup jobs and the tlog backups, is around 62TB. Tape takes a while, and I think I’ll need to use two drives in our library instead of one to meet the full-backup window on Sunday. It’s doable, but not ideal. A SQL job to tape would be better.
Another issue: we don’t have enough storage to use VSS for these tape jobs. I’m not sure how big of a problem that is.
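The tape window can be roughed out with simple arithmetic. The sketch below assumes the Sunday full writes roughly the whole 62TB, that the load splits evenly across drives, and that each drive sustains 300 MB/s. That throughput figure is a placeholder assumption, not a measured value, so substitute whatever the library actually achieves.

```python
def tape_hours(data_tb: float, drives: int, mb_per_s_per_drive: float) -> float:
    """Hours to stream data_tb (decimal TB) across `drives` drives in parallel."""
    total_mb = data_tb * 1_000_000
    seconds = total_mb / (mb_per_s_per_drive * drives)
    return seconds / 3600

REPO_TB = 62          # repo size from the post
ASSUMED_MB_S = 300    # assumption: sustained per-drive throughput, measure your own

for drives in (1, 2):
    print(f"{drives} drive(s): {tape_hours(REPO_TB, drives, ASSUMED_MB_S):.1f} h")
# 1 drive(s): 57.4 h
# 2 drive(s): 28.7 h
```

Under these assumptions even two drives run past 24 hours, so the real throughput and the real size of the weekly full matter a lot here.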
We also see sporadic issues with the SQL Plug-In backups:
• Sometimes we get:
“Session failed: Error while generating metadata.”
It’ll fail for a while, then go back to working, which is concerning. It may be related to the tape job, since I can’t use VSS there. The job that fails is usually the full job that truncates the tlogs.
• Or we get errors like:
“Failed to backup database [name]. Error: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Write on ‘bf3839bf-1abf-4c0f-8166-0c37319a3762’ failed: 995 (The I/O operation has been aborted because of either a thread exit or an application request.)”
These could be caused by something the DBA is doing that I’m not aware of, but I just don’t trust the backups right now. I’m starting to think I may have made some poor architectural choices based on a misunderstanding of the SQL Plug-In’s limitations.
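One way to test the tape-job hunch is to line up the failure timestamps against the windows when the file-to-tape job was running. A minimal Python sketch follows; the timestamps are hypothetical sample data, and the function name is made up for illustration. Real values would come from the plug-in job history and the tape job sessions.

```python
from datetime import datetime

def failures_during_tape(failures, tape_windows):
    """Return failures whose timestamp falls inside any (start, end) tape window."""
    return [f for f in failures
            if any(start <= f <= end for start, end in tape_windows)]

# Hypothetical sample data; substitute times from the real job logs.
tape_windows = [(datetime(2025, 1, 5, 8, 0), datetime(2025, 1, 6, 20, 0))]
failures = [
    datetime(2025, 1, 5, 23, 10),  # inside the tape window
    datetime(2025, 1, 8, 23, 5),   # outside it
]

overlapping = failures_during_tape(failures, tape_windows)
print(f"{len(overlapping)} of {len(failures)} failures overlap a tape job")
```

If most failures land inside tape windows, that points at contention with the tape job; if they are spread evenly, something else is going on.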
Sorry for the wall of text, but I’m looking for advice on how to architect this better while still meeting our goals. Maybe I’m missing something obvious. I thought about just doing daily application-aware VM backups, with local SQL tlog backups on a dedicated drive, along with SQL Plug-In backups strictly for dev/test refreshes. But since these are clusters, I have a feeling I’d be backing up 3x the data, plus dev/test. That’s A LOT of storage. It has been absolutely exhausting trying to architect this with all the various limitations and scenarios.
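The 3x worry is easy to put numbers on. The sketch below is illustrative only: the per-AG database size is a hypothetical figure, and the result is source data before any compression or deduplication in the repo.

```python
def vm_backup_source_tb(db_tb_per_ag: float, ags: int, replicas: int) -> float:
    """Source data read by image-level backups when `replicas` VMs per AG include their SQL volumes."""
    return db_tb_per_ag * ags * replicas

ASSUMED_DB_TB_PER_AG = 2.0  # hypothetical size, not from the post

# All three replicas of each of the 3 AGs backed up with SQL data
print(vm_backup_source_tb(ASSUMED_DB_TB_PER_AG, ags=3, replicas=3))  # 18.0
# Only one replica per AG backed up with SQL data
print(vm_backup_source_tb(ASSUMED_DB_TB_PER_AG, ags=3, replicas=1))  # 6.0
```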
Petr Makarov (Veeam Software, Prague, Czech Republic)
Re: Struggling with SQL Backup Architecture, looking for advice
Hello,
Well, it's not a wall of text; it's actually a very useful reference architecture example. I believe we can split it into four parts:
1. Plug-in backup to tape
This is an important request that is relevant to all plug-ins, not just SQL. We have been working on it, but I cannot share an ETA at this time.
Please note that file-to-tape jobs are not supported for plug-in backups, and we cannot guarantee a successful restore or stable operation in this scenario. Furthermore, technical support will not address issues related to file-to-tape functionality when plug-in backups are processed this way. However, I can suggest two workarounds:
A) Use periodic image-level backups, including SQL data, with Copy-only mode enabled, and send these backups to tape.
B) Use an application backup copy job to copy backups to tape when the job is disabled.
2. Overall opinion about the setup
The setup looks good, and aside from the file-to-tape issue mentioned above, I don’t see any problems. By the way, nightly Dev/Test restores are quite common; I have seen other customers use a similar approach.
3. SQL Plug-in centralized management
You mentioned that it would be nice to have it. This feature will be available soon in the upcoming version 13.0.1. At a high level, it will be similar to our application policy for RMAN or SAP HANA: a Protection Group is used to roll out plug-ins, a policy allows you to protect SQL Server workloads with defined settings, and two recovery options (via Explorer or the standalone plug-in) will be available. Additionally, you will no longer need to configure SQL Server Agent jobs on each node. It will be sufficient to add Always On databases to a policy once, and we will automatically detect the preferred replica for backup.
4. Error messages
I would recommend checking if these messages persist after the file-to-tape job is disabled. If they do, the best approach is to open a support case and share the case ID with us. I don’t think we can effectively troubleshoot these issues via forum posts.
Thanks!