On Monday night, we apparently had some backup issues. Typically when I see backup issues, it's a one-off with Veeam and the next backup cycle fires off fine. Unfortunately, that's not the case this time.
I have 4 jobs that fire off one after another starting around 8:00 PM. Jobs 1 and 2 each contain 20 to 60 VMs. Jobs 3 and 4 contain a single VM each, but those VMs are very large. The fulls for these jobs (not synthetic) fire off on Saturdays, and then we do incrementals for the rest of the week. Job 1 starts at 8:00 PM and backs up anywhere from 50 to 200 GB a night out of 2.7 TB total. Job 1 usually takes about 30 to 45 minutes. Job 2 starts immediately after Job 1, typically around 8:30 PM or 9:00 PM, and finishes around 10:00 PM or so. For the past 3 nights Job 1 has functioned flawlessly. Job 2 has not.
Job 2 processes 3-4 of the 60 VMs as usual, and then, immediately after the 3rd or 4th VM completes successfully, the job appears to hang. I watched the job last night, and again around 9:05 PM, after the 3rd VM backed up successfully (it usually processes a few at a time), the job simply stopped making forward progress. I cancelled the job "immediately", and after about an hour and a half of waiting for the cancel to go through, I decided to punt and rebooted the Veeam B&R server.
Further details
- Veeam Backup Server runs Windows Server 2012 R2 with Veeam 9.5.0.823
- Veeam Backup Server has two 10GbE uplinks in an LACP trunk. There are two VLANs on the trunk: one for the production network and one for the backup network.
- Veeam Backup Server is also a proxy.
- Veeam Backup Server backs up to a DD2500.
- All VMs being backed up are in the same vCenter
- All VMs being backed up are in the same cluster
- Recently patched the PSC and vCSA responsible for the cluster to 6.0.30800
- Job 1 works as expected
- Job 2 successfully backs up the first 3-4 VMs in the group and then stalls without backing up the rest
- Job 3 was manually triggered and appears to be functioning as expected. Update: It processed 2.8 TB of data, read 286 GB, and transferred 286 GB in about 19 minutes.
- Job 4 was manually triggered and it backed up successfully within the expected timeframe and grabbed the appropriate amount of data
- Average backup processing speeds are usually between 250 MB/s and 400 MB/s
- During the "stall time" on Job 2, the repository seems to be accessible and is exhibiting no discernible performance issues.
- During the "stall time" on Job 2, the source (VMFS volumes) seems to be accessible and doesn't seem to be exhibiting any performance issues.
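As a rough sanity check on the Job 3 numbers above, the transfer can be converted to an average rate and compared against the usual 250 to 400 MB/s. This is only a sketch: it assumes "gigs" means decimal GB (1 GB = 1000 MB) and that the 19 minutes covers the whole transfer. The helper name `throughput_mb_per_s` is made up for illustration and is not anything from Veeam.

```python
def throughput_mb_per_s(gigabytes: float, minutes: float, mb_per_gb: int = 1000) -> float:
    """Average throughput in MB/s for `gigabytes` moved over `minutes`.

    mb_per_gb defaults to 1000 (decimal GB, an assumption); pass 1024 for binary units.
    """
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return gigabytes * mb_per_gb / (minutes * 60)


# Job 3: 286 GB transferred in about 19 minutes (figures from the list above).
job3_rate = throughput_mb_per_s(286, 19)

# Usual processing speed range quoted above: 250 to 400 MB/s.
in_usual_range = 250 <= job3_rate <= 400

print(f"Job 3 average rate: {job3_rate:.0f} MB/s, within usual range: {in_usual_range}")
```

Under those assumptions Job 3 comes out at roughly 251 MB/s, which sits at the low end of the normal range but inside it. That fits with Job 3 behaving as expected and suggests the stall is specific to Job 2 rather than a general throughput problem.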