mkretzer
Veteran
Posts: 1267
Liked: 456 times
Joined: Dec 17, 2015 7:17 am
Contact:

V11: Huge backup copy jobs stalling

Post by mkretzer »

Case 04822737

With V11 we started to switch to much larger backup and backup copy jobs. Because of the scalability enhancements we started to group our VMs by similar backup settings and VM types. Before that, we had 30 jobs based on VM datastore, which always proved to be quite problematic because of Storage DRS.

Also, we took it slow. First we created a large immediate copy job with ~1000 VMs but with 20 source jobs - we had no issue whatsoever! We got more courageous and created a large backup job with ~1500 VMs to a test repo all in one job. This also caused no issues. This was all with V10.

After the upgrade to V11 we implemented our plan in production. For the backups it still worked quite well. Sadly, problems came up with the copy job copying the backups of the ~1500 VM job! The job hangs at the individual VM level at 0 % (network traffic will be encrypted) or at 99 % (finalizing). Source and target repo show gaps without any load. So we focused on Veeam SQL.

We found:
- a lot of commands locking each other
- wait times: 49 % Buffer Latch, 49 % Latch
- The stored procedure ReportExpandedBackupsByBackupIdsView doing 800000 logical reads all the time, taking 15-20 seconds
- SQL Query store telling us there is an index missing in Backup.Model.SqlOIBs

I know 1500 VMs is a lot. Still, I believe Veeam can make this work!
We provided logs, SQL trace and screenshots to Veeam support and they told us "why don't you just split jobs". To be honest we don't want to do this and we want to help Veeam get this working. Is that something Veeam is interested in? Or do we have to roll back everything?
Right now jobs take very long but we can "survive" for a while like this!

Markus
mkretzer
Veteran
Posts: 1267
Liked: 456 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V11: Huge backup copy jobs stalling

Post by mkretzer »

BTW in that configuration (3 jobs with 269, 1432 and 1081 VMs, all starting at the same time, backed up by 6 6-core Linux proxies) we were able to back up these 2782 VMs in about 3 hours! All while the immediate copy jobs still ran and two LTO8 streamed at 300 - 400 MB/s. This is a new record for us!
We love V11 and hope we get the copy issue fixed!
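
For a rough sense of scale, the numbers above work out as follows. This is only back-of-the-envelope arithmetic in Python using the figures quoted in this post; the 3-hour duration is approximate, so the derived rates are approximate too.

```python
# Figures quoted in the post above; "about 3 hours" is an approximation.
job_sizes = [269, 1432, 1081]   # VMs in each of the 3 backup jobs
duration_hours = 3.0            # approximate total run time

total_vms = sum(job_sizes)
vms_per_hour = total_vms / duration_hours
seconds_per_vm = duration_hours * 3600 / total_vms

print(f"total VMs:        {total_vms}")
print(f"VMs per hour:     {vms_per_hour:.0f}")
print(f"seconds per VM:   {seconds_per_vm:.1f} (averaged over the whole run)")
```

The per-VM figure is an average over all parallel tasks, not the time a single VM actually takes.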
foggy
Veeam Software
Posts: 21171
Liked: 2157 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson

Re: V11: Huge backup copy jobs stalling

Post by foggy »

Hi Markus, we're definitely interested and will take a look - thanks!
mkretzer
Veteran
Posts: 1267
Liked: 456 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V11: Huge backup copy jobs stalling

Post by mkretzer »

Support again told me: "To alleviate the load on SQL server hosting Veeam DB and improving job processing speed I would recommend you to split this single job into several jobs of lesser size (around 200 VMs in each)."

We are not really happy with that answer...
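
For illustration, this is roughly what that recommendation would mean for a 1500 VM job. A minimal Python sketch: the VM names and the `split_into_jobs` helper are made up for the example and are not related to any Veeam API.

```python
def split_into_jobs(vm_names, max_per_job=200):
    """Split a list of VM names into consecutive groups of at most max_per_job."""
    return [
        vm_names[i:i + max_per_job]
        for i in range(0, len(vm_names), max_per_job)
    ]


# Hypothetical VM names, only to get a list of 1500 entries.
vms = [f"vm-{n:04d}" for n in range(1, 1501)]

jobs = split_into_jobs(vms, max_per_job=200)
print(f"{len(jobs)} jobs, sizes: {[len(job) for job in jobs]}")
```

So instead of one job there would be eight to create and maintain, which is exactly the kind of overhead the large-job approach was meant to avoid.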
Dima V.
Veeam Software
Posts: 50
Liked: 12 times
Joined: Oct 21, 2010 8:54 am
Full Name: Dmitry Vedyakov
Contact:

Re: V11: Huge backup copy jobs stalling

Post by Dima V. »

Hi Markus. R&D will look into this case more deeply. Thank you for sharing this issue. We are constantly working on improving our products.

BTW, is the issue that a BCJ with a 1500 VM backup job as its source works far worse than several BCJs or several BJs (fewer VMs per backup job)? And was it working well in v10?
mkretzer
Veteran
Posts: 1267
Liked: 456 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V11: Huge backup copy jobs stalling

Post by mkretzer »

In V10 we did initial testing of the following:

- 1500 VMs in one BJ (but no job copying those) -> Works very well and still works very well in V11
- 1000 VMs from multiple BJ in one BCJ -> Works very well in V10 and V11 - this is our new default and has worked now for several weeks. All our "other backup jobs" have been consolidated into one big BCJ now.

Sadly, after those tests we did not expect an issue with a large BJ being copied by a large BCJ :-(.
mkretzer
Veteran
Posts: 1267
Liked: 456 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V11: Huge backup copy jobs stalling

Post by mkretzer »

This was not the first BCJ case for SQL issues since our upgrade to V11. In case 04819367 we tried to delete the backups of a BCJ that copied 20 individual source backups with about 1000 VMs in total.

Doing so rendered the Veeam interface unusable because of major database locks. Running jobs also stalled.
That was before we fully implemented our new very large BCJ.

I cannot remember such issues in V10. We did a lot of tests with ReFS and deleting backups, that never stalled Veeam like this.
mkretzer
Veteran
Posts: 1267
Liked: 456 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V11: Huge backup copy jobs stalling

Post by mkretzer »

One more interesting finding: Yesterday we found that even primary backups seemed to show gaps in the transfer. Today we did the same backup again and found that the backup has no gaps at all and the transfer is running at a steady 1.3 GB/s via NBD!

The only difference is that currently the copy job is still deleting GFS backup points.

It seems that the backup copy job when scanning for restore points can have an adverse impact on the primary backup job! I believe if we get the immediate backup copy job bottleneck fixed the whole system will run much smoother.

BTW a tape job has no issues whatsoever with the situation (many VMs in 3 source jobs).
Dima P.
Product Manager
Posts: 14833
Liked: 1785 times
Joined: Feb 04, 2013 2:07 pm
Full Name: Dmitry Popov
Location: Prague
Contact:

Re: V11: Huge backup copy jobs stalling

Post by Dima P. »

Hello Markus,

Thanks for sharing the news, the QA team is still investigating the issue. I'll update this thread once I hear anything back.
mkretzer
Veteran
Posts: 1267
Liked: 456 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V11: Huge backup copy jobs stalling

Post by mkretzer »

Interesting. While the copy job is running, the Veeam backup manager executable takes 321 GB of RAM... Is that normal?
Gostev
Chief Product Officer
Posts: 32374
Liked: 7727 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: V11: Huge backup copy jobs stalling

Post by Gostev »

No.
mkretzer
Veteran
Posts: 1267
Liked: 456 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V11: Huge backup copy jobs stalling

Post by mkretzer »

Our monitoring found that it got worse and worse with every backup run... Looks like a leak. Going to reboot now and will see how it develops.
Dima P.
Product Manager
Posts: 14833
Liked: 1785 times
Joined: Feb 04, 2013 2:07 pm
Full Name: Dmitry Popov
Location: Prague
Contact:

Re: V11: Huge backup copy jobs stalling

Post by Dima P. »

Markus,

Talked with the QA managers today. It seems the investigation is going pretty well and they're already working on a private fix for some of the discovered issues. Please keep working with our support (I believe they are still waiting for some information to investigate the memory leak). Thanks!
mkretzer
Veteran
Posts: 1267
Liked: 456 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V11: Huge backup copy jobs stalling

Post by mkretzer »

Will install the first private fix in a few hours, as soon as I have a slot where there are at least no primary jobs running.

I have 2800 VMs in the queue to be copied then. Let's see what this fix can do :-)
mkretzer
Veteran
Posts: 1267
Liked: 456 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V11: Huge backup copy jobs stalling

Post by mkretzer » 2 people like this post

The first patch already increased the average network rate of the copy job from 1 GBit/s to nearly 6 GBit/s - I see no gaps anymore!
Currently trying to do a copy while the backup is running.
mkretzer
Veteran
Posts: 1267
Liked: 456 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V11: Huge backup copy jobs stalling

Post by mkretzer » 7 people like this post

Just a quick test result - instead of running for more than 12 hours our backup copy job now finishes in slightly over 2 hours - WOW!

For an issue that only came up less than 2 weeks ago and which has an impact only on "crazy" customers with lots of VMs in one job, the result of the patch that Veeam quickly provided is astonishing! Still, I believe that this will benefit all customers in the end.

Big thanks to Veeam support and developers! You are just great!

Let's continue to work on this until we have optimized everything!
Dima P.
Product Manager
Posts: 14833
Liked: 1785 times
Joined: Feb 04, 2013 2:07 pm
Full Name: Dmitry Popov
Location: Prague
Contact:

Re: V11: Huge backup copy jobs stalling

Post by Dima P. » 3 people like this post

Hello Markus,

Awesome! Thank you for the kind words and all your help with this investigation. I shared your feedback with the folks and they are very proud. Cheers!
mkretzer
Veteran
Posts: 1267
Liked: 456 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V11: Huge backup copy jobs stalling

Post by mkretzer » 7 people like this post

I just tested the second (or fourth if you count SQL indices) patch. It was for long-running synthetic full preparation.

Last week the synthetics of 3 jobs with about 3000 VMs were distributed over 3 days and took 6 hours 47 minutes in total.
Now we did the synthetics all on one day to really test the patch. Still, in total it took only 54 minutes!

Quite impressive!
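
To put the before/after numbers from this thread side by side, here is a small Python sketch of the speedup arithmetic. The synthetic full durations are taken from this post; the copy job durations ("more than 12 hours" and "slightly over 2 hours") are rounded to 12 and 2 hours, so that ratio is only a rough estimate.

```python
from datetime import timedelta


def speedup(before: timedelta, after: timedelta) -> float:
    """Return how many times faster 'after' is compared to 'before'."""
    return before / after  # timedelta / timedelta yields a float


# Synthetic full preparation, as reported in this post.
synthetic_before = timedelta(hours=6, minutes=47)
synthetic_after = timedelta(minutes=54)

# Backup copy job, rounded from the earlier post (approximate values).
copy_before = timedelta(hours=12)
copy_after = timedelta(hours=2)

synthetic_speedup = speedup(synthetic_before, synthetic_after)
copy_speedup = speedup(copy_before, copy_after)

print(f"synthetic full: about {synthetic_speedup:.1f}x faster")
print(f"backup copy:    about {copy_speedup:.1f}x faster")
```

Note that the synthetic comparison is not quite like for like: the old runs were spread over 3 days, while the new run did everything in one day.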
thomas.biesmans
Enthusiast
Posts: 38
Liked: 13 times
Joined: Mar 22, 2013 10:35 am
Contact:

Re: V11: Huge backup copy jobs stalling

Post by thomas.biesmans »

Those are some impressive scaling numbers, lovely feedback, thanks Markus!

@ Veeam team: could you drop a line here when these get bundled into a CU pretty please? :)
Gostev
Chief Product Officer
Posts: 32374
Liked: 7727 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: V11: Huge backup copy jobs stalling

Post by Gostev » 2 people like this post

We're not planning for more CUs at this time, as the current one addresses all common support issues. The next build will be a minor release (11a), we're targeting August for it. Just in time for folks who will be coming back from vacation :D

This will also give us more time to test all these recent optimizations for very large environments thoroughly prior to rolling them out to all users.
mkretzer
Veteran
Posts: 1267
Liked: 456 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V11: Huge backup copy jobs stalling

Post by mkretzer » 3 people like this post

The nice thing is that at the current pace we will get all these scalability issues fixed with support until then. I hope the next fixes will solve the retention processing and backup infrastructure availability detection performance issues. But all the major issues are already nicely fixed :-)
c.evans
Veeam Software
Posts: 15
Liked: 5 times
Joined: Nov 18, 2019 3:35 pm
Full Name: Chris Evans
Contact:

Re: V11: Huge backup copy jobs stalling

Post by c.evans »

mkretzer wrote: Jun 04, 2021 3:24 pm Just a quick test result - instead of running for more than 12 hours our backup copy job now finishes in slightly over 2 hours - WOW!

For an issue that only came up less than 2 weeks ago and which has an impact only on "crazy" customers with lots of VMs in one job, the result of the patch that Veeam quickly provided is astonishing! Still, I believe that this will benefit all customers in the end.

Big thanks to Veeam support and developers! You are just great!

Let's continue to work on this until we have optimized everything!
I created this forum account the day I was hired at Veeam (nearly 2 years ago) and have never posted on here, but I went through the trouble of digging up my username/password just so I could post here and say how much I appreciate mkretzer constantly providing details on his tests AND the results. Just wanted to throw some love at mkretzer on my very first forum post because it's having clients like him that makes working for Veeam just so damn enjoyable :)
Gostev
Chief Product Officer
Posts: 32374
Liked: 7727 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: V11: Huge backup copy jobs stalling

Post by Gostev » 2 people like this post

Markus is the real legend :D
mkretzer
Veteran
Posts: 1267
Liked: 456 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V11: Huge backup copy jobs stalling

Post by mkretzer » 3 people like this post

Thanks very much - I love working with people in the industry who love their work like I do - working toward a common goal. :-)

To be honest, this is one of the reasons we stay with Veeam. Every now and then another company shows us their "superior, much better than Backup & Replication" product. My question is always: "In the case of an issue that only affects us at first, are you willing to put in many hours to find a way to optimize your product?"