I’m a new VEEAM customer, but I’ve done a ton of reading in both the VEEAM documentation and here on the forums.
Last November/December, we ran a successful POC for VEEAM in our dev/test/stage environment and ended up purchasing the full VEEAM Availability Suite at the beginning of the year. Since then I’ve been trying to get everything set up for production, but I’ve run into significant challenges and headaches. Many of these problems didn’t show up until I added more jobs/VMs into the equation.
My backup jobs have just been going way too slow, and I can’t get them to finish within any reasonable backup window. My jobs almost exclusively list the bottleneck as Source at 99%, and the processing rate is typically 20 MB/s – 100 MB/s. 100+ MB/s is pretty rare, though I’ve seen it when only one job is running, for example.
My environment is:

Production:
-VMware as the hypervisor
-NetApp FAS 3250s with VMs on either 10K or 15K RPM SAS disks, depending on the aggregate
-NFS 3.0 datastores
-NetApp FAS 8020s with VMs on SATA aggregates with Flash Pool
-NFS 3.0 datastores
-10GbE everywhere

VEEAM proxy:
-Physical Cisco UCS B200 M3 in the same chassis as the production servers
I originally tried to set my jobs up as incremental with synthetic fulls, transforming previous backup chains into rollbacks. I quickly learned that my repository, a NetApp FAS2040 (using CIFS), wasn’t going to be able to handle that load, especially not with my regular jobs already bottlenecking at the source and running slow. I switched to active fulls, and while that was definitely better, my jobs are still really slow, still with 99% source bottlenecks. I also tried the periodic health check option, and that made things really slow as well (my jobs would pretty much never finish on time if I kept it enabled).
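For anyone curious why the transform hurt so much: as I understand it, converting a previous chain into rollbacks happens entirely on the repository, and each changed block costs roughly three I/Os there. Here is the rough sketch I used to convince myself; the daily change, block size, and IOPS budget are all numbers I plugged in as assumptions, not measurements:

```python
# Why the transform option hammered the FAS2040: transforming a chain into
# rollbacks is repository-side work, and each transformed block costs
# roughly 3 I/Os (read the old block from the full, write the new block
# into the full, write the displaced block out to the rollback file).
# All figures below are assumptions for illustration only.
changed_gb_per_day = 200   # assumed daily changed data across all jobs
block_kb = 512             # assumed backup block size
io_multiplier = 3          # one read + two writes per transformed block

blocks = changed_gb_per_day * 1024 * 1024 // block_kb
total_ios = blocks * io_multiplier
# Assume a small CIFS filer sustains ~200 random IOPS on this workload.
hours = total_ios / 200 / 3600
print(f"~{total_ios:,} repository I/Os -> ~{hours:.1f} h at 200 IOPS")
```

That extra repository load lands on top of the regular backup writes, which is consistent with what I saw on the FAS2040.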
I read VEEAM forum threads with similar issues and really no end resolution to them:
https://forums.veeam.com/veeam-backup-replication-f2/netapp-source-as-bottleneck-t27025.html
https://forums.veeam.com/vmware-vsphere-f24/netapp-backup-performance-t26635.html
I even tried switching my VMware datastores to FCoE/VMFS over 10GbE so it could utilize multipathing and ALUA, to see if that helped. Things didn’t change at all, and I even configured the VEEAM proxy for FCoE Direct Storage Access as well (marginal difference, if any).
I also tried creating virtual VEEAM proxies in hot-add mode, one per ESXi host. That didn’t help either: it used a ton of resources and showed the same 99% source bottleneck and slowness. I’ve also messed around a ton with dedupe, compression settings, etc., with no big difference.
In the end, I suppose I’m just hitting throughput limitations on my disks, although my production servers don’t show any apparent issues. I set up NetApp Harvest, and the throughput it reported was fairly consistent with the processing rates I was getting from VEEAM, considering the multiple running jobs, etc.
So where I’m at now is figuring out how to still make this product work for us. Luckily, since VEEAM integrates with NetApp snapshots, I’m going to use snapshots to back up my dev/test environment and just use VEEAM for the restore/management of those snaps. Then I’ll use VEEAM backup jobs strictly for production.
I really just need two types of jobs:
31 Restore Points – I need a month
7 Restore Points – I need a week
Where I’m struggling is deciding how often to take my active full backups. I’ve read a ton about people just using forever incremental, and given my performance issues that would certainly be the best scenario for me, as long as it won’t cause me and the DBAs issues with either corrupted backups or really long restore times.
I could use some suggestions on whether to set up my jobs like this:
7 Restore Points – Forever Incremental
31 Restore Points – Forever Incremental
7 Restore Points – Incremental with weekly Active Full’s on Saturday
31 Restore Points – Incremental with Active Fulls on the first Saturday of the month (maybe something else?)
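To compare these options, I sketched the worst-case restore chain for each in Python. This is my own back-of-the-envelope model, not anything from VEEAM; it just counts how many backup files a restore of the oldest point would have to walk in a forward-incremental chain:

```python
# Rough comparison of the four retention options above.
# "Files read to restore the oldest point" is a proxy for restore
# complexity in a forward-incremental chain: the restore reads the full
# plus every increment between it and the chosen point.

def files_to_restore_oldest(restore_points, full_interval=None):
    """Worst-case number of backup files read to restore the oldest point.

    full_interval: days between active fulls; None = forever incremental.
    """
    if full_interval is None:
        # Forever incremental: one full + (N - 1) increments in the chain.
        return restore_points
    # Periodic fulls: the oldest point sits at most (full_interval - 1)
    # increments past the full that precedes it.
    return min(restore_points, full_interval)

schemes = {
    "7 RP, forever incremental":   (7, None),
    "31 RP, forever incremental":  (31, None),
    "7 RP, weekly active full":    (7, 7),
    "31 RP, monthly active full":  (31, 28),
}

for name, (rp, interval) in schemes.items():
    print(f"{name:<28} -> up to {files_to_restore_oldest(rp, interval)} files")
```

By this count, the monthly active full barely shortens the worst-case chain versus 31-point forever incremental, which is part of why I’m unsure it’s worth the extra full-backup load.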
Are there major downsides to having a chain of incremental backups this long (31 days)? I’ve read about long chains causing issues, but I’ve never seen at what length they mean. Some of the VMs being kept for 31 days will be 1-2 TB SQL servers that will use the 15-minute transaction log backups for point-in-time restores, so I could use some suggestions on this. Also, am I putting myself at risk by not running the periodic health checks? They just take forever to run, and I’m not sure my jobs will ever finish if I turn them on.
Lastly, I need a backup copy job to get certain jobs/VMs to my DR site, where I can keep 7 restore points and 3 months of weekly backups. For this, I was considering a backup copy job with 7 restore points, running every 24 hours, with the “Keeping the following restore points for archival purposes” option set to 14 weekly backups. Thoughts?
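For sizing the DR repository, this is the quick arithmetic I’m working from. The per-VM sizes are placeholders to swap for real numbers from the job statistics, and I’m assuming (as I understand it) that the weekly archival points in a backup copy job are stored as full backup files, so they dominate the footprint:

```python
# Rough space estimate for the proposed backup copy job. Both sizes are
# made-up placeholders -- substitute real figures from the job statistics.
full_backup_tb = 1.5       # assumed size of one full restore point
daily_increment_tb = 0.08  # assumed size of one daily increment

# Backup copy chain: 1 full + 6 increments for the 7 short-term points,
# plus 14 weekly archival points kept as full files.
short_term_tb = full_backup_tb + 6 * daily_increment_tb
archive_tb = 14 * full_backup_tb
total_tb = short_term_tb + archive_tb
print(f"short-term chain:   {short_term_tb:.2f} TB")
print(f"14 weekly archives: {archive_tb:.2f} TB")
print(f"total at DR site:   {total_tb:.2f} TB")
```

Even at these placeholder sizes, the 14 weekly fulls are over ten times the short-term chain, so the archival setting is what really drives the DR capacity I need.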
Thanks for reading. If anyone has any insight on my NetApp source problems as well, please feel free to chime in with suggestions. I’ve been through a ton of headaches the last month or so.