Backup of enterprise applications (Microsoft stack, IBM Db2, MongoDB, Oracle, PostgreSQL, SAP)
jmbi
Novice
Posts: 9
Liked: 1 time
Joined: Mar 26, 2020 1:52 pm

Struggling with SQL Backup Architecture, looking for advice

Post by jmbi »

I’ve been wrestling with the backup architecture for our new SQL clusters. We’ve got:
• 3 Always On Availability Groups, each with 3 servers (no shared storage between nodes).
• A handful of standalone SQL servers for test/dev/replication.

My original plan was to lean heavily on the Veeam SQL Plug-In to centralize our database backups. Previously, our DBA backed up the databases manually, and we backed up those backup files with standard VM-level Veeam jobs, which was a bandwidth and storage nightmare.
Using the plug-in would centralize backups into a Veeam repo while still letting our DBA manage them, with support from my team. The idea was to complement this with application-aware, copy-only storage snapshots as a failsafe in case something went wrong with the plug-in. We'd then spin the SQL Plug-In backups off to tape, just like the rest of our backups.
This is where things started to fall apart, along with a few oddities:
• The SQL Plug-In only supports repo-to-repo copy, which doesn't help with tape.
• Alternatively, you can run file-to-tape periodically, which isn't ideal.
• And SQL-to-object-storage jobs aren't supported until Veeam v13.

Current Setup
SQL Plug-In Backups (and Restores):

1. SQL Agent jobs configured identically across all cluster nodes for all 3 clusters, which is a pain (the jobs may vary depending on the cluster or server).
• Nightly full backups of every database.
• 20-minute transaction log backups.
2. Nightly full restores to dev/test environments.
• (Don’t ask why devs need full copies of certain production databases in dev, we’ve fought this for years. It’s insane, but this is where the plug-in really shines for us.)
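Keeping those Agent job definitions identical across nine cluster nodes is the painful part, so a drift check can help. Here's a minimal sketch of the comparison logic; in practice the job-step command text would be pulled from msdb.dbo.sysjobs / sysjobsteps on each node, but the node names, job names, and step text below are hypothetical placeholders:

```python
# Sketch: detect drift between SQL Agent job definitions across AG nodes.
# The per-node job dictionaries here are hypothetical inline data; a real
# check would query msdb.dbo.sysjobs / sysjobsteps on each node.

from itertools import combinations

def find_drift(jobs_by_node: dict[str, dict[str, str]]) -> list[str]:
    """Return messages for any job whose step text differs between nodes,
    or which exists on one node but not another."""
    problems = []
    for (n1, j1), (n2, j2) in combinations(jobs_by_node.items(), 2):
        for job in j1.keys() & j2.keys():       # jobs present on both nodes
            if j1[job] != j2[job]:
                problems.append(f"{job}: {n1} differs from {n2}")
        for job in j1.keys() ^ j2.keys():       # jobs missing on one side
            problems.append(f"{job}: missing on one of {n1}/{n2}")
    return problems

nodes = {  # hypothetical job-step text per node
    "sql-node1": {"NightlyFull": "veeam full", "TlogEvery20": "veeam log"},
    "sql-node2": {"NightlyFull": "veeam full", "TlogEvery20": "veeam log"},
    "sql-node3": {"NightlyFull": "veeam full  "},  # drifted, and a job missing
}
for msg in find_drift(nodes):
    print(msg)
```

Running something like this on a schedule at least turns "configured identically (we hope)" into something verifiable.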

Storage Snapshots (Failsafe):
• Every 2 hours, Veeam triggers a storage snapshot of the full VM with an application-aware, copy-only SQL backup.
• Retained for one week on our storage array.

VM Repo Backups:
• Standard VM backups, system drives only, no SQL data.

Tape Backups:
1. Standard SQL VM repo backups are included in our regular tape jobs; nothing special here, just a job to tape.
2. File-to-tape job for the entire SQL server backup path.
• We chose not to do repo-to-tape because we didn’t want to restore the entire repo just to get one database.
• Full backup on Sunday, incrementals through Saturday.

This setup doesn't feel overly complex, but the SQL Veeam repo, holding just the two full backup jobs and the tlog backups, is around 62TB. Tape takes a while, and I think I'll need to use two drives in our library instead of one to meet the full-backup window on Sunday. It's doable, but not ideal; a SQL job to tape would be better.
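The tape-window concern can be sanity-checked with rough arithmetic. A minimal sketch, assuming the 62 TB repo and a typical LTO-8 native streaming rate of ~300 MB/s per drive (an assumed spec figure, not a measurement of this library; real throughput depends on compression and whether the drives keep streaming):

```python
# Rough tape-window estimate for the Sunday full.
# Assumptions: 62 TB repo, ~300 MB/s native streaming per LTO drive.

REPO_TB = 62
DRIVE_MBPS = 300  # assumed native streaming rate per drive, MB/s

def window_hours(drives: int) -> float:
    """Hours to write the whole repo with N drives streaming in parallel."""
    total_mb = REPO_TB * 1024 * 1024          # TB -> MB (binary units)
    seconds = total_mb / (DRIVE_MBPS * drives)
    return seconds / 3600

print(f"1 drive:  {window_hours(1):.1f} h")   # roughly 60 h
print(f"2 drives: {window_hours(2):.1f} h")   # roughly 30 h
```

Under these assumptions a single drive needs on the order of 60 hours, which blows through any Sunday window, and even two drives land around 30 hours, which matches the "doable, but not ideal" feeling above.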

Another issue: we don't have enough storage to use VSS snapshots for these tape jobs, and I'm not sure how big a problem that is.

We also see sporadic issues with the SQL Plug-In backups:
• Sometimes we get:
“Session failed: Error while generating metadata.”
It fails for a while, then goes back to working, which is concerning; it may be related to the tape job, since I can't use VSS there. This is usually the full job that truncates the tlogs.
• Or we get errors like:
“Failed to backup database [name]. Error: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Write on ‘bf3839bf-1abf-4c0f-8166-0c37319a3762’ failed: 995 (The I/O operation has been aborted because of either a thread exit or an application request.)”

These could be caused by something the DBA is doing that I’m not aware of, but I just don’t trust the backups right now. I’m starting to think I may have made some poor architectural choices based on a misunderstanding of the SQL Plug-In’s limitations.

Sorry for the wall of text, but I'm looking for advice on how to architect this better while still meeting our goals. Maybe I'm missing something obvious.

I thought about just doing daily application-aware VM backups, with local SQL tlog backups on a dedicated drive, and keeping the SQL Plug-In backups strictly for dev/test refreshes. But since these are clusters, I have a feeling I'd be backing up 3x the data, plus the dev/test copies. That's A LOT of storage. Trying to architect this around all the various limitations and scenarios has been absolutely exhausting.
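That 3x fear can be put into numbers with a back-of-envelope comparison. The sizes below are hypothetical (assume ~20 TB of SQL data per AG, with each of the 3 nodes holding a replica of the same databases, and one refreshed dev/test copy); only the node and AG counts come from the setup described above:

```python
# Back-of-envelope storage comparison for the alternative plan.
# TB_PER_AG is a hypothetical data size; node/AG counts match the post.

TB_PER_AG = 20      # assumed SQL data per availability group
AGS = 3
NODES_PER_AG = 3

# Application-aware VM backups would capture every node, so each AG's
# replicated data gets stored once per node:
vm_level = TB_PER_AG * NODES_PER_AG * AGS

# The SQL Plug-In approach backs up each database once per AG, plus the
# dev/test refresh copy (assumed: one refreshed environment):
plugin_level = TB_PER_AG * AGS
devtest = TB_PER_AG

print(f"VM-level (all nodes): {vm_level} TB")
print(f"Plug-in + dev/test:   {plugin_level + devtest} TB")
```

With these assumed sizes the VM-level approach stores 180 TB against 80 TB for the plug-in route, which is the gap driving the hesitation about switching.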
