-
- Veteran
- Posts: 1267
- Liked: 456 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
V11: Huge backup copy jobs stalling
Case 04822737
With V11 we started to switch to much larger backup and backup copy jobs. Because of the scalability enhancements we started to group our VMs by similar backup settings and VM types. Previously we had created 30 jobs based on VM datastore, which always proved quite problematic because of Storage DRS.
Also, we took it slow. First we created a large immediate copy job with ~1000 VMs but with 20 source jobs - we had no issue whatsoever! We got more courageous and created a large backup job with ~1500 VMs to a test repo, all in one job. This also caused no issues. This was all with V10.
After the upgrade to V11 we implemented our plan in production. For the backups it still worked quite well. Sadly, problems came up with the copy job that copies the backups of the ~1500 VM job! The job hangs at the individual VM level at 0 % (network traffic will be encrypted) or at 99 % (finalizing). Source and target repo show gaps without any load. So we focused on Veeam SQL.
We found:
- a lot of commands locking each other
- wait times: 49 % Buffer Latch, 49 % Latch
- The stored procedure ReportExpandedBackupsByBackupIdsView doing 800,000 logical reads all the time, taking 15-20 seconds
- SQL Query Store telling us there is an index missing in Backup.Model.SqlOIBs
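To put the logical-read figure into perspective, here is a rough back-of-the-envelope sketch. It assumes SQL Server's fixed 8 KiB page size and that one logical read corresponds to one page access served from the buffer pool; the same page can be counted more than once, so this is an upper bound on distinct data, not a measurement. The function name is made up for illustration.

```python
# Rough estimate of how much page data 800,000 logical reads touch per call.
# Assumption: SQL Server's fixed 8 KiB page size, one logical read = one page access.
PAGE_SIZE_BYTES = 8 * 1024


def logical_reads_to_gib(logical_reads: int) -> float:
    """Convert a logical-read count into GiB of buffer-pool page accesses."""
    return logical_reads * PAGE_SIZE_BYTES / 1024**3


reads_per_call = 800_000  # figure reported in the post above
print(f"{logical_reads_to_gib(reads_per_call):.1f} GiB per call")
```

Under these assumptions every call walks through roughly 6 GiB worth of pages, which would fit with the latch waits and the missing-index hint.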
I know 1500 VMs is a lot. Still, I believe Veeam can make this work!
We provided logs, SQL trace and screenshots to Veeam support and they told us "why don't you just split jobs". To be honest, we don't want to do this and we want to help Veeam get this working. Is that something Veeam is interested in? Or do we have to roll back everything?
Right now jobs take very long but we can "survive" for a while like this!
Markus
-
- Veteran
- Posts: 1267
- Liked: 456 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: V11: Huge backup copy jobs stalling
BTW, in that configuration (3 jobs with 269, 1432 and 1081 VMs, all starting at the same time, backed up by six 6-core Linux proxies) we were able to back up these 2782 VMs in about 3 hours! All while the immediate copy jobs still ran and two LTO8 drives streamed at 300 - 400 MB/s. This is a new record for us!
We love V11 and hope we get the copy issue fixed!
-
- Veeam Software
- Posts: 21171
- Liked: 2157 times
- Joined: Jul 11, 2011 10:22 am
- Full Name: Alexander Fogelson
Re: V11: Huge backup copy jobs stalling
Hi Markus, we're definitely interested and will take a look - thanks!
-
- Veteran
- Posts: 1267
- Liked: 456 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: V11: Huge backup copy jobs stalling
Support again told me: "To alleviate the load on SQL server hosting Veeam DB and improving job processing speed I would recommend you to split this single job into several jobs of lesser size (around 200 VMs in each)."
We are not really happy with that answer...
-
- Veeam Software
- Posts: 50
- Liked: 12 times
- Joined: Oct 21, 2010 8:54 am
- Full Name: Dmitry Vedyakov
- Contact:
Re: V11: Huge backup copy jobs stalling
Hi Markus. R&D will look into this case more deeply. Thank you for sharing this issue. We are constantly working on improving our products.
BTW, is the issue that a BCJ with the 1500-VM backup job as its source works much worse than several BCJs or several BJs (fewer VMs per backup job)? And was it working well in v10?
-
- Veteran
- Posts: 1267
- Liked: 456 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: V11: Huge backup copy jobs stalling
In V10 we did initial testing of the following:
- 1500 VMs in one BJ (but no job copying those) -> Works very well and still works very well in V11
- 1000 VMs from multiple BJs in one BCJ -> Works very well in V10 and V11 - this is our new default and has worked now for several weeks. All our "other backup jobs" have been consolidated into one big BCJ now.
Sadly, after those tests we did not expect an issue with a large BJ being copied by a large BCJ.

-
- Veteran
- Posts: 1267
- Liked: 456 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: V11: Huge backup copy jobs stalling
This was not the first BCJ case for SQL issues since our upgrade to V11. In case 04819367 we tried to delete the backups of a BCJ that copied 20 individual source backups with about 1000 VMs in total.
Doing so rendered the Veeam interface unusable because of major database locks. Running jobs also stalled.
That was before we fully implemented our new very large BCJ.
I cannot remember such issues in V10. We did a lot of tests with ReFS and deleting backups, and that never stalled Veeam like this.
-
- Veteran
- Posts: 1267
- Liked: 456 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: V11: Huge backup copy jobs stalling
One more interesting finding: Yesterday we found that even primary backups seemed to show gaps in the transfer. Today we did the same backup again and found that the backup has no gaps at all and the transfer is running at a steady 1.3 GB/s via NBD!
The only difference is that currently the copy job is still deleting GFS backup points.
It seems that the backup copy job when scanning for restore points can have an adverse impact on the primary backup job! I believe if we get the immediate backup copy job bottleneck fixed the whole system will run much smoother.
BTW a tape job has no issues whatsoever with the situation (many VMs in 3 source jobs).
-
- Product Manager
- Posts: 14833
- Liked: 1785 times
- Joined: Feb 04, 2013 2:07 pm
- Full Name: Dmitry Popov
- Location: Prague
- Contact:
Re: V11: Huge backup copy jobs stalling
Hello Markus,
Thanks for sharing the news, the QA team is still investigating the issue. I'll update this thread once I hear anything back.
-
- Veteran
- Posts: 1267
- Liked: 456 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: V11: Huge backup copy jobs stalling
Interesting. While the copy job is running, the Veeam backup manager executable takes 321 GB of RAM... Is that normal?
-
- Chief Product Officer
- Posts: 32374
- Liked: 7727 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
-
- Veteran
- Posts: 1267
- Liked: 456 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: V11: Huge backup copy jobs stalling
Our monitoring found that it got worse and worse with every backup run... Looks like a leak. Going to reboot now and will see how it develops.
-
- Product Manager
- Posts: 14833
- Liked: 1785 times
- Joined: Feb 04, 2013 2:07 pm
- Full Name: Dmitry Popov
- Location: Prague
- Contact:
Re: V11: Huge backup copy jobs stalling
Markus,
I talked with the QA managers today. It seems the investigation is going pretty well and they're already working on a private fix for some of the discovered issues. Please keep working with our support (I believe they are still waiting for some information to investigate the memory leak). Thanks!
-
- Veteran
- Posts: 1267
- Liked: 456 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: V11: Huge backup copy jobs stalling
Will install the first private fix in a few hours, as soon as I have a slot where at least no primary jobs are running.
I will have 2800 VMs in the queue to be copied then. Let's see what this fix can do.

-
- Veteran
- Posts: 1267
- Liked: 456 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: V11: Huge backup copy jobs stalling
The first patch already increased the average network rate of the copy job from 1 GBit/s to nearly 6 GBit/s - I see no gaps anymore!
Currently trying to do a copy while the backup is running.
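For comparison with the MB/s figures quoted elsewhere in this thread, a small conversion sketch. It assumes decimal units (1 Gbit = 1000 Mbit, 8 bit = 1 byte) and treats "nearly 6 GBit/s" as 6 for simplicity; the helper name is only illustrative.

```python
def gbit_to_mbyte_per_s(gbit_per_s: float) -> float:
    """Convert Gbit/s to MB/s using decimal units (1 Gbit = 1000 Mbit, 8 bit = 1 byte)."""
    return gbit_per_s * 1000 / 8


# Average copy job network rates reported before and after the first patch.
for label, rate in (("before patch", 1.0), ("after patch", 6.0)):
    print(f"{label}: {rate:g} Gbit/s = {gbit_to_mbyte_per_s(rate):.0f} MB/s")
```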
-
- Veteran
- Posts: 1267
- Liked: 456 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: V11: Huge backup copy jobs stalling
Just a quick test result - instead of running for more than 12 hours our backup copy job now finishes in slightly over 2 hours - WOW!
For an issue that only came up less than 2 weeks ago and which has an impact only on "crazy" customers with lots of VMs in one job, the result of the patch that Veeam quickly provided is astonishing! Still, I believe that this will benefit all customers in the end.
Big thanks to Veeam support and developers! You are just great!
Let's continue to work on this until we have optimized everything!
-
- Product Manager
- Posts: 14833
- Liked: 1785 times
- Joined: Feb 04, 2013 2:07 pm
- Full Name: Dmitry Popov
- Location: Prague
- Contact:
Re: V11: Huge backup copy jobs stalling
Hello Markus,
Awesome! Thank you for the kind words and all your help with this investigation. I shared your feedback with the folks and they are very proud. Cheers!
-
- Veteran
- Posts: 1267
- Liked: 456 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: V11: Huge backup copy jobs stalling
I just tested the second (or fourth if you count SQL indices) patch. It was for long-running synthetic full preparation.
Last week the synthetics of 3 jobs with about 3000 VMs were distributed over 3 days and took 6 hours 47 minutes in total.
Now we did the synthetics all on one day to really test the patch. Still, in total it took only 54 minutes!
Quite impressive!
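As a quick sanity check of the improvement, a minimal sketch that only uses the two durations given above; the function name is made up for illustration.

```python
from datetime import timedelta


def speedup(before: timedelta, after: timedelta) -> float:
    """Return how many times faster the 'after' run is compared to 'before'."""
    return before / after  # dividing two timedeltas yields a float


before_patch = timedelta(hours=6, minutes=47)  # synthetics spread over 3 days, summed
after_patch = timedelta(minutes=54)            # all synthetics on one day

print(f"{speedup(before_patch, after_patch):.1f}x faster")
```

That works out to roughly a 7.5x reduction in synthetic full processing time, and the patched run had all three jobs on the same day.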
-
- Enthusiast
- Posts: 38
- Liked: 13 times
- Joined: Mar 22, 2013 10:35 am
- Contact:
Re: V11: Huge backup copy jobs stalling
Those are some impressive scaling numbers, lovely feedback, thanks Markus!
@ Veeam team: could you drop a line here when these get bundled into a CU pretty please?

-
- Chief Product Officer
- Posts: 32374
- Liked: 7727 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: V11: Huge backup copy jobs stalling
We're not planning for more CUs at this time, as the current one addresses all common support issues. The next build will be a minor release (11a), and we're targeting August for it. Just in time for folks who will be coming back from vacation.
This will also give us more time to test all these recent optimizations for very large environments thoroughly prior to rolling them out to all users.

-
- Veteran
- Posts: 1267
- Liked: 456 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: V11: Huge backup copy jobs stalling
The nice thing is that at the current pace we will get all these scalability issues fixed with support until then. I hope the next fixes will solve the retention processing and backup infrastructure availability detection performance issues. But all the major issues are already nicely fixed.

-
- Veeam Software
- Posts: 15
- Liked: 5 times
- Joined: Nov 18, 2019 3:35 pm
- Full Name: Chris Evans
- Contact:
Re: V11: Huge backup copy jobs stalling
mkretzer wrote: ↑Jun 04, 2021 3:24 pm
Just a quick test result - instead of running for more than 12 hours our backup copy job now finishes in slightly over 2 hours - WOW!
For an issue that only came up less than 2 weeks ago and which has an impact only on "crazy" customers with lots of VMs in one job, the result of the patch that Veeam quickly provided is astonishing! Still, I believe that this will benefit all customers in the end.
Big thanks to Veeam support and developers! You are just great!
Let's continue to work on this until we have optimized everything!

I created this forum account the day I was hired at Veeam (nearly 2 years ago) and have never posted on here, but I went through the trouble of digging up my username/password just so I could post here and say how much I appreciate mkretzer constantly providing details on his tests AND the results. Just wanted to throw some love at mkretzer on my very first forum post because it's having clients like him that makes working for Veeam just so damn enjoyable.

-
- Chief Product Officer
- Posts: 32374
- Liked: 7727 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: V11: Huge backup copy jobs stalling
Markus is the real legend 

-
- Veteran
- Posts: 1267
- Liked: 456 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: V11: Huge backup copy jobs stalling
Thanks very much - I love working with people in the industry who love their work like I do - working toward a common goal.
To be honest, this is one of the reasons we stay with Veeam. Every now and then another company shows us their "superior, much better than Backup & Replication" product. My question is always: "In the case of an issue that only affects us at first, are you willing to put in many hours to find a way to optimize your product?"
