Backup of enterprise applications (Microsoft stack, IBM Db2, MongoDB, Oracle, PostgreSQL, SAP)
jmbi
Novice
Posts: 9
Liked: 1 time
Joined: Mar 26, 2020 1:52 pm

Struggling with SQL Backup Architecture, looking for advice

Post by jmbi »

I’ve been wrestling with the backup architecture for our new SQL clusters. We’ve got:
• 3 Always On Availability Groups, each with 3 servers (no shared storage between nodes).
• A handful of standalone SQL servers for test/dev/replication.

My original plan was to lean heavily on the Veeam SQL Plug-In to centralize our database backups. Previously, our DBA backed up the databases manually, and we backed up those backup files with standard VM-level Veeam jobs, which was a bandwidth and storage nightmare.
Using the plug-in would centralize backups into a Veeam repo while still letting our DBA manage them, with support from my team. The idea was to complement this with application-aware, copy-only storage snapshots as a failsafe in case something went wrong with the plug-in. We'd then spin the SQL Plug-In backups off to tape, just like the rest of our backups.
This is where things started to fall apart, along with a few oddities:
• The SQL Plug-In only supports repo-to-repo copy, which doesn't help with tape.
• Alternatively, you can run file-to-tape periodically, which isn't ideal.
• And SQL-to-object-storage jobs aren't supported until Veeam v13.

Current Setup
SQL Plug-In Backups (and Restores):

1. SQL Agent jobs configured identically across all cluster nodes for all 3 clusters, which is a pain (the jobs may vary depending on the cluster or server).
• Nightly full backups of every database.
• 20-minute transaction log backups.
2. Nightly full restores to dev/test environments.
• (Don’t ask why devs need full copies of certain production databases in dev, we’ve fought this for years. It’s insane, but this is where the plug-in really shines for us.)
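Keeping those Agent job definitions identical across nine cluster nodes is the painful part, so a drift check can help. Here's a minimal sketch of the comparison logic; in practice the job-step command text would be pulled from msdb.dbo.sysjobs / sysjobsteps on each node, but the node names, job names, and step text below are hypothetical placeholders:

```python
# Sketch: detect drift between SQL Agent job definitions across AG nodes.
# The per-node job dictionaries here are hypothetical inline data; a real
# check would query msdb.dbo.sysjobs / sysjobsteps on each node.

from itertools import combinations

def find_drift(jobs_by_node: dict[str, dict[str, str]]) -> list[str]:
    """Return messages for any job whose step text differs between nodes,
    or which exists on one node but not another."""
    problems = []
    for (n1, j1), (n2, j2) in combinations(jobs_by_node.items(), 2):
        for job in j1.keys() & j2.keys():       # jobs present on both nodes
            if j1[job] != j2[job]:
                problems.append(f"{job}: {n1} differs from {n2}")
        for job in j1.keys() ^ j2.keys():       # jobs missing on one side
            problems.append(f"{job}: missing on one of {n1}/{n2}")
    return problems

nodes = {  # hypothetical job-step text per node
    "sql-node1": {"NightlyFull": "veeam full", "TlogEvery20": "veeam log"},
    "sql-node2": {"NightlyFull": "veeam full", "TlogEvery20": "veeam log"},
    "sql-node3": {"NightlyFull": "veeam full  "},  # drifted, and a job missing
}
for msg in find_drift(nodes):
    print(msg)
```

Running something like this on a schedule at least turns "configured identically (we hope)" into something verifiable.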

Storage Snapshots (Failsafe):
• Every 2 hours, Veeam triggers a storage snapshot of the full VM with an application-aware, copy-only SQL backup.
• Retained for one week on our storage array.

VM Repo Backups:
• Standard VM backups, system drives only, no SQL data.

Tape Backups:
1. Standard SQL VM repo backups are included in our regular tape jobs; nothing special here, just a job to tape.
2. File-to-tape job for the entire SQL server backup path.
• We chose not to do repo-to-tape because we didn’t want to restore the entire repo just to get one database.
• Full backup on Sunday, incrementals through Saturday.

This setup doesn't feel overly complex, but the SQL Veeam repo, holding just the two full backup jobs and the tlog backups, is around 62TB. Tape takes a while, and I think I'll need to use two drives in our library instead of one to meet the full-backup window on Sunday. It's doable, but not ideal; a SQL job to tape would be better.
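The tape-window concern can be sanity-checked with rough arithmetic. A minimal sketch, assuming the 62 TB repo and a typical LTO-8 native streaming rate of ~300 MB/s per drive (an assumed spec figure, not a measurement of this library; real throughput depends on compression and whether the drives keep streaming):

```python
# Rough tape-window estimate for the Sunday full.
# Assumptions: 62 TB repo, ~300 MB/s native streaming per LTO drive.

REPO_TB = 62
DRIVE_MBPS = 300  # assumed native streaming rate per drive, MB/s

def window_hours(drives: int) -> float:
    """Hours to write the whole repo with N drives streaming in parallel."""
    total_mb = REPO_TB * 1024 * 1024          # TB -> MB (binary units)
    seconds = total_mb / (DRIVE_MBPS * drives)
    return seconds / 3600

print(f"1 drive:  {window_hours(1):.1f} h")   # roughly 60 h
print(f"2 drives: {window_hours(2):.1f} h")   # roughly 30 h
```

Under these assumptions a single drive needs on the order of 60 hours, which blows through any Sunday window, and even two drives land around 30 hours, which matches the "doable, but not ideal" feeling above.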

Another issue: we don't have enough storage to use VSS snapshots for these tape jobs, and I'm not sure how big a problem that is.

We also see sporadic issues with the SQL Plug-In backups:
• Sometimes we get:
“Session failed: Error while generating metadata.”
It fails for a while, then goes back to working, which is concerning; it may be related to the tape job, since I can't use VSS there. This is usually the full job that truncates the tlogs.
• Or we get errors like:
“Failed to backup database [name]. Error: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Write on ‘bf3839bf-1abf-4c0f-8166-0c37319a3762’ failed: 995 (The I/O operation has been aborted because of either a thread exit or an application request.)”

These could be caused by something the DBA is doing that I’m not aware of, but I just don’t trust the backups right now. I’m starting to think I may have made some poor architectural choices based on a misunderstanding of the SQL Plug-In’s limitations.

Sorry for the wall of text, but I'm looking for advice on how to architect this better while still meeting our goals. Maybe I'm missing something obvious.

I thought about just doing daily application-aware VM backups, with local SQL tlog backups on a dedicated drive, and keeping the SQL Plug-In backups strictly for dev/test refreshes. But since these are clusters, I have a feeling I'd be backing up 3x the data, plus the dev/test copies. That's A LOT of storage. Trying to architect this around all the various limitations and scenarios has been absolutely exhausting.
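That 3x fear can be put into numbers with a back-of-envelope comparison. The sizes below are hypothetical (assume ~20 TB of SQL data per AG, with each of the 3 nodes holding a replica of the same databases, and one refreshed dev/test copy); only the node and AG counts come from the setup described above:

```python
# Back-of-envelope storage comparison for the alternative plan.
# TB_PER_AG is a hypothetical data size; node/AG counts match the post.

TB_PER_AG = 20      # assumed SQL data per availability group
AGS = 3
NODES_PER_AG = 3

# Application-aware VM backups would capture every node, so each AG's
# replicated data gets stored once per node:
vm_level = TB_PER_AG * NODES_PER_AG * AGS

# The SQL Plug-In approach backs up each database once per AG, plus the
# dev/test refresh copy (assumed: one refreshed environment):
plugin_level = TB_PER_AG * AGS
devtest = TB_PER_AG

print(f"VM-level (all nodes): {vm_level} TB")
print(f"Plug-in + dev/test:   {plugin_level + devtest} TB")
```

With these assumed sizes the VM-level approach stores 180 TB against 80 TB for the plug-in route, which is the gap driving the hesitation about switching.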
