SharePoint root site suddenly takes days to fail to backup Case 05013518

Post by **AlexHeylin** » Sep 29, 2021 6:10 pm

We're running VBO for multiple tenants with a Tenant = Tenant job = Tenant Repo strategy. For one tenant which has been happily completing its daily incremental jobs in about 1.5 hours, we're now seeing the root SharePoint site taking so long to process that effectively the job stalls and never completes.

This job has not completed since the 2nd September. The job has run for something like 300+ hours total in that time, and with individual runs of up to 90 hours per run. You can't say we're being impatient and haven't given it a chance to do the work. We're committed to a daily RP but this job hasn't generated a complete RP in nearly a month.

If we could see significant usage of disk IO for the DB, or network IO for the comparison, or heavy CPU for the comparison then at least we'd be able to see it is doing the best it can within the resources available (and look to increase available resources).
However, right now this is what I see:
- Disk IO for the DB 3% average disk time
- CPU for proxy: 13% average
- Network IO for proxy: 700 bytes / second (~ 7 Kbps)

The proxy is running with 64 threads (32 HT cores), 32 backup accounts for this tenant (it was using one before this issue occurred), and while memory usage is high no page faults are occurring. Backup is to local JET, server has 1Gb Internet and I've seen it pull 600 Mb from O365 before - so resources don't seem to be the issue.

The log says each thread is taking about 11 minutes to process 100 items. That seems ridiculously slow given there are over 250,000 items in this SP site - that would give an expected run time of 27,500 thread-minutes (~7.2 hours at 64 threads / 32 accounts) - for a job that was completing on this site in 1.5 hours using 64 threads but only a single backup account. That would be bearable if the job actually completed in 8 hours - however it doesn't complete even after 90+ hours.
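The back-of-envelope estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the post: the function name and parameters are made up for illustration, and it assumes the logged rate of 11 minutes per 100 items holds for every thread and that all 64 threads stay busy for the whole run.

```python
def expected_runtime_hours(items, batch_size, minutes_per_batch, threads):
    """Return (thread_minutes, wall_clock_hours) for a perfectly parallel run."""
    # Total work if a single thread did everything.
    thread_minutes = items / batch_size * minutes_per_batch
    # Wall-clock time if that work is spread evenly over all threads.
    hours = thread_minutes / threads / 60
    return thread_minutes, hours


thread_minutes, hours = expected_runtime_hours(
    items=250_000, batch_size=100, minutes_per_batch=11, threads=64
)
print(f"{thread_minutes:,.0f} thread-minutes, about {hours:.1f} hours")
# prints: 27,500 thread-minutes, about 7.2 hours
```

If the work ends up on far fewer threads than the 64 assumed here, the wall-clock figure grows in proportion.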

It doesn't help that VBO GUI doesn't report backup progress in terms of items scanned and not copied - only in terms of items copied, which adds to the impression it's not doing much.

Support seem to be saying this is expected performance. If that's true, we need to dump VBO and get a backup solution that can cope with this.

Please can someone in PM take a look at this case and advise if we / support are barking up the wrong tree here, or if this is something R&D can address?

We don't want to change product - but at this rate we're going to have no choice unless a fix is forthcoming from Veeam soon.

Thanks

Sep 29, 2021 8:30 pm

I had that too.
Initial Backup of 700000 items (1TB).
Veeam support has explained to me that each version of these items is another item for Veeam to back up.

I was told:
1) to exclude the root spo site from the backup job
2) create a new backup job with only the root spo site, use the old backup repo with the other spo sites as a target
3) run both jobs
4) as soon as the new job with the root site is finished, you can disable or delete the new job and remove the exclusion from the original job

This way, all non-root spo sites like teams could be processed until the root spo site could be processed 100%.
The new job with the root site was completed after 3-4 days. Not everything was downloaded again, because it was already in the repo from the other job. After reverting back to the original job, daily backup runtime was back to normal.

Make sure that you use 10-20 backup accounts for the spo backup.

https://helpcenter.veeam.com/docs/vbo36 ... tml?ver=50

Sep 30, 2021 8:50 am

Hi Alex,

I don't want you to change the product either )
What Mildur suggests above is one of the workarounds; but let me check your case details first before coming to any conclusions.

Thanks!

Post by **AlexHeylin** » Oct 06, 2021 8:04 pm

Thanks. Support told me to do that too - however it doesn't really solve the problem, it just stops it from preventing the scheduled backup of other items.
The other problem with this "solution" (if I've understood what support have told me) is that it causes a "full backup" of the SP site in question. Now, if an incremental was taking days to complete, and a "full" requires a complete recheck of every version of every file which takes much longer - how does that help get the backup of the SP site completed faster? It seems to me that this approach allows other objects to back up, but at the cost of even higher runtime for the SP site in question. As of today, this client of 266k files has now been backing up for over a month! They've got virtually the same file count they had before this, yet VBO has been reporting deleted item counts as high as 1M items. How can 266k items produce 1M deleted items in a job?

I say "full backup" because it's not what I'd call a full backup. It's a full compare with only any missing data being copied into the repo. A true full backup would copy all data from the source to the repo regardless.

I have two questions that arise from this.
1. If files are moved in SharePoint, can't VBO easily track that and just update its records off that info instead of rescanning every file in both its new location and its old one, and checking every version of each file against the repo?

2. If VBO has this problem then shouldn't the scheduler be designed to treat each O365 object (mailbox / OneDrive / Sharepoint site) as a separate job / sub job, so that one object stalling due to this known issue doesn't kill the whole job?

Right now I don't see how we can expect this solution to scale to the use we expect. This is one of our smaller clients, and compared to some of our clients their file storage is tiny both in numbers and data size. Many of our clients are looking to migrate their file servers to SharePoint. We have one client with ~ 150 TB of file data in billions of files. We cannot say to the client "you have to split each of your file shares (of 5M++ files) into SP sites of no more than 250k (for example) files just because Veeam won't back it up properly otherwise". I think VBO needs to seriously rethink some of this.

Post by **AlexHeylin** » Oct 11, 2021 1:20 pm

Just to add to this - for the sake of anyone else who gets this. Each time the job is restarted (or the backup of the SP site is moved to another job etc) it starts the whole backup process over from the beginning. So if it was going to take 6 days to complete, and on day 4 you restart the job - it will take another 6 days to complete, not 2 days. Support repeatedly told us to restart the job / make changes that required job restarts. Think carefully before doing this.

If resources are implicated as a bottleneck and you're going to add more and restart the job, my advice is to throw everything you have at it in the first instance.
Don't be tempted to add a few more resources and restart the job (remember - it starts again at the beginning). Give it everything you can and only restart the job once to activate the additional resources.

Read the sizing guidelines, but go for 2 threads per hyperthreading core, plenty of RAM (24-32GB+), and LOTS of additional backup accounts.
https://github.com/AlexHeylin/PowerShel ... counts.ps1 will help you quickly and easily set up additional accounts. We found that even after going from 1 account to 48 accounts we were still getting throttling - 128 additional accounts ($AdditionalAccountSets = 16) isn't too many! Bear in mind that (as far as I understand it) O365 throttles (delays) requests a long way before it starts responding with codes that say "no" - which is how Veeam knows throttling is kicking in. The more additional backup accounts you have, the less this is a problem. With 32 backup accounts, the job ran for 99 hours with no sign of completing anytime soon. After increasing to 128 backup accounts, the job completed in 77 hours and subsequent runs complete in 20 mins, whereas they used to take 90 mins using just one backup account.

While the underlying problem remains hard to solve, there's a lot VBO as software, and Veeam support could do to deal with this less painfully. Automatically creating additional backup accounts when required would help quite a lot here - for a start.

Post by **A.Rogers** » Oct 18, 2021 6:17 pm

Did you track down the reason why it started taking so long? When my jobs drop into rescans I see the same thing, but it only affects a single tenant. We have been seeing thousands of errors like the ones below, as if something is changing the files, but if you look at the files, the modified dates are all years ago. It is all historical data as well.

17/10/2021 00:00:18 87 (5684) Starting download of resource: /sites/Weekly Stats/September 2019/W2/Week 2.xlsx, URI: xxx
17/10/2021 00:00:21 87 (5684) Warning: HTTP status code 412 returned for /sites/Weekly Stats/September 2019/W2/Week 2.xlsx. Download will be retried
17/10/2021 00:00:50 87 (5684) Failed to process item: /sites/Weekly Stats/September 2019/W2/Week 2.xlsx. Item has been changed during backup (old version: 1:1, new version: 1:2)
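For anyone wanting to see how widespread these "changed during backup" failures are, a rough Python sketch like the one below could tally them per file. It is based only on the three log lines quoted above: the line layout (date, time, two numeric fields, then the message) is inferred from those samples, the meaning of the two numeric fields is an assumption, and `count_changed_items` is a made-up helper name, not anything shipped with VBO.

```python
import re
from collections import Counter

# Layout inferred from the sample lines: "DD/MM/YYYY HH:MM:SS <n> (<n>) <message>".
# What the two numeric fields mean is an assumption (possibly thread / process ids).
LINE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<field_a>\d+) \((?P<field_b>\d+)\) (?P<message>.*)$"
)
PREFIX = "Failed to process item:"
MARKER = ". Item has been changed during backup"


def count_changed_items(lines):
    """Count 'Item has been changed during backup' failures per file path."""
    changed = Counter()
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # skip anything that does not look like a log line
        message = match.group("message")
        if message.startswith(PREFIX) and MARKER in message:
            # The path sits between the prefix and the marker text.
            path = message[len(PREFIX):].split(MARKER, 1)[0].strip()
            changed[path] += 1
    return changed


sample = [
    "17/10/2021 00:00:18 87 (5684) Starting download of resource: /sites/Weekly Stats/September 2019/W2/Week 2.xlsx, URI: xxx",
    "17/10/2021 00:00:21 87 (5684) Warning: HTTP status code 412 returned for /sites/Weekly Stats/September 2019/W2/Week 2.xlsx. Download will be retried",
    "17/10/2021 00:00:50 87 (5684) Failed to process item: /sites/Weekly Stats/September 2019/W2/Week 2.xlsx. Item has been changed during backup (old version: 1:1, new version: 1:2)",
]
print(count_changed_items(sample).most_common(10))
```

Sorting by count makes it easier to see whether the failures cluster in a few folders or are spread across the whole site.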

Post by **AlexHeylin** » Oct 25, 2021 11:53 am

We're not completely clear on what happened. According to the tenant they did move "some" files around and moved a fair chunk of old job folders to an "archive" folder to tidy up their main shared filing (so the file paths changed). However, the file counts we were getting from VBO seemed higher than we'd expect. I don't have logs to prove this, but my intuition is that the incremental backup that occurred following this probably took much longer than the original full backup did in the first place. Forcing a "full" backup by moving the SP site to a new job as instructed by support made matters worse: while everything else backed up as scheduled, the affected SP site then had to do a "full" backup, which as I understand it isn't actually a full backup in the normal sense - it's an incremental based on the data in O365, rather than based on the change log / index of the data in O365. Having to check each file against the repo added a load of extra work. I think if we'd left the SP site in the original job and thrown at it all the resources we ended up giving the tenant anyway, it would probably have been quicker to complete (and certainly wouldn't have taken a month, which is what it took going the route support instructed).

Post by **Jamesn-CB** » Oct 28, 2021 12:52 am

We had a similar scenario. Recently almost resolved. Started 6 months ago.

Client created 2 new SP sites. They'd moved their data across from many smaller existing SP sites that were protected fine, ending up with 2 big sites: 1TB/300K items, and 4TB/4M items.

Initially support advised to create standalone jobs for the 2 sites. We saw these jobs begin to take 80-100/300-400 hours for each of the sites to run. We had other sites in other tenants with similar data amounts and item counts as the smaller of the 2 sites in question. We checked MS SharePoint limitations carefully, and also checked the site configuration carefully. The main difference was this client had a high rate of change.

We continued to see the long backup times.
After we escalated with support, someone was able to explain in more detail some of the potential bottlenecks in the system. At support's suggestion, we created new cache repos and S3 bucket connections for the 2 standalone jobs, one set each.

There was some improvement after the Fulls completed. But not really enough to call it solved from our point of view. The info and guidance from support was helpful in leading us to our next steps.

Support agreed the service accounts numbers seemed to be enough. But didn't object when I suggested creating more. We doubled it and saw the most improvement so far. Still not where we want it, so it's now getting significantly more.

Post by **AlexHeylin** » Nov 01, 2021 12:25 am

My experience suggests that in this scenario 128++ backup accounts is a good starting point. As I understand it - Veeam support only observe throttling is an issue (and advise to act on it) when they see hard throttle stop errors coming back from O365. By then it's FAR too late. Soft throttling happens a LONG time before that and requests are delayed by O365 but do complete after the throttle delay. That delay seems to be a major source of the time it takes for the job to run.

At the end of last week we moved a sizable client from file server to SharePoint, and I saw the backup ran longer than normal. I'll check it in the morning and I won't be surprised if it's still running from the first run after syncing the file server to SP.

The only advantage I've seen from creating separate jobs for slow / large / high-change SP sites is that they don't stop everything else backing up. However, doing it when you've already got a backlogged backup seems to only make the backlog situation much worse for those SP sites. I suspect massively increasing the backup accounts, even if this requires a job restart, is the most effective way forward. After all, if the accounts are being throttled, splitting the jobs without adding LOTS more backup accounts just makes the jobs contend with each other.

My advice for anyone who knows in advance there's going to be a lot of change is to add LOTS of backup accounts well before the change occurs so VBO has plenty of resources ready. The stupid thing is that we've now got some clients where they have 6 x as many backup accounts as they do O365 users. For our larger customers we might easily create 1000 - 2000 backup accounts based on our experience so far with throttling.

I really think VBO should take this in-code, rather than rely on a community script that's not even Veeam hosted / supplied.
The backup estimator certainly has data that should enable it to give a good starting number of accounts to use.

Post by **Mike Resseler** » Nov 02, 2021 3:30 pm

Alex,

I understand your request of taking this into account, but recently we have discussed this with Microsoft and in the future it might even become worse, where initial backups could take months and longer. The SharePoint Online team wants to throttle it even more and make sure that you can only use 1 "application" at the same time. That being said, we continue to work with MSFT to make sure that the delay is not too large, but if we are talking about TB of data, it probably will take a long time, even if we can work fully through the Graph API, which should be the future.

Post by **AlexHeylin** » Nov 15, 2021 8:18 pm

Mike,

If it's any help - you can push back on MS saying "We're being told by end customers that they won't migrate to O365 without reliable and speedy backup." I'm VERY happy to sign off on a note like that.

If MS carry on like this, we're going to start pushing back to customers and writing caveats into O365 migrations saying why this could be a problem and how it won't be our fault and there may be little we can do about it, and that they need to blame MS for doing this deliberately. Right now our best use cases for migration to O365 are practically unable to migrate due to lack of reliable and speedy backup at their scale (100TB++). You might also mention MS jacking the O365 price up and shafting their resellers, at the same time as actively obstructing O365 backup. They seem to forget that we influence the end customers - not MS.

I understand why MS don't want this load going through their front end systems - however the need for backup remains, and they need to facilitate it reasonably not actively block the APIs and throttle the connections to unusable speed in an attempt to make it impractical.

Thanks for keeping on trying!

Post by **AlexHeylin** » Nov 18, 2021 3:12 pm

...and we're back to 40+ hour runs again on the same tenant. However this time I can see clearly something we thought we saw before but couldn't reproduce. After 40+ hours running, VBO is only using one thread to do this work. In that case no wonder this is WAY too slow.
New support Case #05139337

Post by **AlexHeylin** » Nov 18, 2021 7:00 pm

It looks like backing up a SharePoint site only uses one thread. In this case that means the proxy machine is sat there virtually idle, with capacity to run 61 more threads than VBO is using to process this work.

Is anyone able to confirm if a single SP site is only processed by a single thread, and if that's by design?

If so - I suggest that throttling from O365 isn't the biggest challenge VBO faces - it's lack of adequate threading for dealing with large / high change SharePoint sites. Running a job at 1.6% of the speed it could be running at is definitely "suboptimal" and bound to lead to problems in some cases.

Post by **AlexHeylin** » Nov 18, 2021 8:41 pm

<reads VBO proxy logs and despairs>
There appear to be many optimisations VBO is missing out here.
It keeps dropping to one thread, leaving 61 available threads unused.
Then it wastes 35 minutes using that one thread to hammer the throttle stop (429) on one backup account while ignoring the other 162 backup accounts it has available to use to get around throttling.

Is a single thread supposed to become obsessed with using a single backup account and ignore all the others it has available to use?

If it's supposed to change accounts - is that change supposed to take 35 minutes?

Is the throttle back off (response to getting a 429) supposed to be an apparently random time between 30 seconds and 6 minutes?
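To make the behaviour being asked about concrete, here is a minimal Python sketch of the alternative: move to the next backup account as soon as one gets a 429, rather than retrying the same account. This is purely illustrative and says nothing about how VBO is actually implemented; `Throttled`, `fetch_with_rotation`, `fake_fetch` and the account names are all hypothetical.

```python
class Throttled(Exception):
    """Stands in for an HTTP 429 (throttled) response."""


def fetch_with_rotation(fetch, accounts, item):
    """Try each account at most once, moving on as soon as one is throttled."""
    for account in accounts:
        try:
            return fetch(account, item)
        except Throttled:
            # Do not keep retrying the throttled account; try the next one.
            continue
    raise Throttled(f"all {len(accounts)} accounts were throttled for {item!r}")


def fake_fetch(account, item):
    """Hypothetical stand-in for a real download call."""
    if account == "backup01":
        raise Throttled(account)  # pretend this account is being throttled
    return f"{item} fetched via {account}"


print(fetch_with_rotation(fake_fetch, ["backup01", "backup02", "backup03"], "Week 2.xlsx"))
# prints: Week 2.xlsx fetched via backup02
```

A real client would also need to honour any back-off interval the service asks for once every account is throttled; that part is left out here.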

I see a few other threads run sometimes, but all they do is validate certificate and they're never heard of again.

It takes ~ 97 seconds to get from
18/11/2021 20:31:49 85 (11532) Sync time: 00:00:18.7919631
to
18/11/2021 20:33:26 85 (11532) Total data: 0
in the log. That's ~3x longer than it took to poll O365 for the run.

What's it doing for that ~97 seconds?
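The gap between the two log lines can be checked with a short Python snippet. It assumes the timestamps are day-first (DD/MM/YYYY), as the dates elsewhere in this thread suggest, and that the first 19 characters of each line are the timestamp; `elapsed_seconds` is just an illustrative helper name.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%d/%m/%Y %H:%M:%S"  # assumed day-first format


def elapsed_seconds(first_line, second_line):
    """Seconds between the timestamps at the start of two log lines."""
    start = datetime.strptime(first_line[:19], TIMESTAMP_FORMAT)
    end = datetime.strptime(second_line[:19], TIMESTAMP_FORMAT)
    return (end - start).total_seconds()


gap = elapsed_seconds(
    "18/11/2021 20:31:49 85 (11532) Sync time: 00:00:18.7919631",
    "18/11/2021 20:33:26 85 (11532) Total data: 0",
)
print(gap)  # prints: 97.0
```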

Given the current operation, it's not hard to see why sometimes you get seemingly never ending runs of otherwise quick jobs.

Post by **AlexHeylin** » Nov 20, 2021 2:43 pm

It's been confirmed in support Case #05139337 that for incremental backups, a SharePoint site is processed by only a single thread.

This means by far the biggest problem we face is not resources or configuration - it's program design.
This needs to be addressed urgently if Veeam want to stay competitive in the O365 backup market.

It's clear that we now understand why VBO has trouble scaling with large SharePoint sites. As designed right now, it is completely unsuitable for some of our use cases - typically those tenants with higher license counts, and just about any case where a fileserver to O365 migration or extensive use of SharePoint for file storage is planned. Unfortunately those are the cases where it's easiest to sell the benefits, and justify the costs, of backing up O365.

@Mike Resseler - please confirm you've opened an urgent enhancement request for full and extensive use of all available resources in processing all work. Multiple threads per SharePoint site is essential, and those threads should efficiently use all available backup accounts etc to avoid as much throttling delay as possible.

I'm rather disappointed that I had to pretty much diagnose this myself after support had already had a good look at this situation in the first ticket. Perhaps the knowledge of this single thread limitation is not known widely enough by the entire VBO support team. I've also not heard mention of it within this forum, even from product management. Perhaps someone had to actually go and read the code to check operation - and that's forgivable. I hope with the logging changes due in v6 this situation would have become apparent to support then anyway.

Post by **A.Rogers** » Nov 21, 2021 9:48 pm

Your investigation results seem surprising bearing in mind that the Veeam advice is to create a lot of service accounts to help with poor performance. But I can't say I have really seen much improvement from this in SharePoint backups, and I suffer massively backing up large tenants. Exchange backups always seem to fly through in my experience. If only the SP backups were half as fast as EXO, I would be a very happy man.

Post by **AlexHeylin** » Nov 22, 2021 5:29 pm

Sure - many backup accounts / backup applications help avoid throttling (though I suspect there's room for improvement there too). Unfortunately they do nothing when a job is processing each SharePoint site with a single thread (and from what I've seen a single account until certain conditions are met - which imposes a ~30 minute delay each time throttling causes a 429 response).

No-one at Veeam is saying this... but I get the impression it was not common knowledge within Veeam that this is how it worked, and they might even have had to go and read the code to confirm my suspicion this was how it was working.

Nov 23, 2021 8:55 am

Hi Alex,

"Unfortunately they do nothing when a job is processing each SharePoint site with a single thread (and from what I've seen a single account until certain conditions are met - which imposes a ~30 minute delay each time throttling causes a 429 response)."

Please do not confuse the SharePoint site and the SharePoint list.

Veeam Backup for Office 365 can process a SharePoint site using multiple threads, but during incremental backup, we only utilize one proxy thread per one SharePoint list.
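A toy Python model shows why "one thread per list" matters when one list dwarfs the others. This is not Veeam's scheduler: it simply assumes each list is handled start to finish by one thread, borrows the 11 minutes per 100 items rate mentioned earlier in the thread as a per-item cost, and uses invented sizes for the smaller lists.

```python
import heapq


def incremental_makespan(list_sizes, threads, minutes_per_item):
    """Minutes until the last thread finishes, one thread per whole list."""
    # Greedy assignment: give the next-largest list to the least-loaded thread.
    loads = [0.0] * threads
    heapq.heapify(loads)
    for size in sorted(list_sizes, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + size * minutes_per_item)
    return max(loads)


# One large document library plus a few small lists (small sizes are invented).
sizes = [261_000, 500, 300, 200]
rate = 11 / 100  # minutes per item, taken from the rate quoted earlier

per_list = incremental_makespan(sizes, threads=64, minutes_per_item=rate)
evenly_spread = sum(sizes) * rate / 64  # if items could be split across threads

print(f"one thread per list: {per_list / 60:.1f} hours")
print(f"evenly spread:       {evenly_spread / 60:.1f} hours")
```

In this model the run time is set entirely by the largest list, so adding threads or accounts does not shorten it; only splitting the list's items across threads would.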

Post by **AlexHeylin** » Nov 23, 2021 3:14 pm

I've been advised to create a specific Enhancement Request post for this, so that's here veeam-backup-for-office-365-f47/feature ... 77845.html

Post by **AlexHeylin** » Nov 23, 2021 3:19 pm

Hi Petr,
Please excuse my terminology if it's not correct. SharePoint is not my specialism. I'm not going to pretend to know the difference between a SharePoint Site and a SharePoint List - but I do definitely know the difference between the backup completing in 20 mins, and it failing to complete in 20 days. I would be really happy if I didn't know that.

Post by **Mike Resseler** » Nov 26, 2021 3:29 pm

Alex,

Don't worry. I apologize that there were issues with your support experience. I responded to your other thread. Protecting a list with multiple threads is on our roadmap. Just for my information and future planning, can you let me know how many "items" there are in your lists? (Average or more or less is fine...)

Thanks

Post by **infused** » Dec 05, 2021 9:04 am

Are you guys running your instances in Azure? Whereabouts are you being throttled?

Running ours in Azure and we never seem to have throttling issues. Backing up some pretty big stuff only using one application per tenant.

Sounds like you're being limited at the edge.

Post by **Mildur** » Dec 05, 2021 9:07 am

We are using it on-premise, because we want to have the data out of M365.
It's quite possible that Microsoft throttling is not at the same level as for on-premise if you are copying data to their own infrastructure.

Dec 06, 2021 10:25 am

@infused

MSFT has guaranteed us that throttling rules outside or inside Azure are the same. In the end, Azure and M365 are two different clouds. (They are internally fully separated). However, you are not the first that tells us this, but our testing shows there is no difference. Might be (again) something they are changing in the backend.

Post by **infused** » Dec 07, 2021 4:15 am

Yeah, they are different. My tests can confirm. While being in separate networks, I was told they are basically side by side. I can pull at 6gbit/s from our Azure box, where I'd barely get over 150mb/s on prem. There's definitely something going on.

Dec 09, 2021 10:51 am

That is a big difference...

Post by **AlexHeylin** » Dec 13, 2021 1:21 pm

For us O365 throttling has minimal impact on this solution as long as we give it enough accounts / apps. When it does impact, sometimes that impact is mostly from VBO doing suboptimal account / app swapping (spending 30 minutes hammering the same account / app and ignoring the ~150 others it has available). After 30 mins / 10 (?) retries it swapped to another account / app.

All our four instances run on-prem. Some at customer prem on their hardware, and one at our prem as hosted backup for most of our tenants.

@infused - if you're only pulling 150Mb for on-prem during the first backup of a tenant (no other backup should be used for measurement as they're not reproducible / comparable) then you've got a problem of some sort or really need to tune your settings. We have an instance that easily floods out the 500 / 500 Mb line in that customer prem, and even with VBO set to throttle to 200 Mb still peaks at 350Mb and does more like 300Mb average. With the throttle off, it was pulling an average of 450Mb down from O365 and 300Mb up to S3 at the same time. For that tenant - that's probably maxing the available bandwidth out allowing for other uses.

I know VBO can pull 1Gb on-prem because I've done it on our own environment. I suspect it actually did more by getting load shared over our multiple Internet lines - but I no longer have the data to confirm.

Bear in mind that due to the highly "back and forth" nature of the backup process, latency will have a large impact on maximum throughput. All our instances run at < 5ms to 8.8.8.8.

Post by **AlexHeylin** » Dec 13, 2021 2:24 pm

@Mike - for our most problematic SP site (file server replacement, the one in the case) we see 261k files. As I understand it, that means at least 261k items in a single list. That's not even a large site or tenant compared to what we'd like to move to SP / OneDrive.
For reference, that's only 349GB of files - with a lot of photos so I can see tenants easily hitting 500k files in a list. We've got one tenant we'd like to move to O365 where a single SP site / list would currently have 6M items (7TB) in it, that would be 10M within a couple of years. Seeing 20M items in a single list would not surprise me, and that's ~77 times larger than this list that's currently problematic. This tenant would have 20 - 50 sites / lists totalling 30-50 TB. That's assuming they keep their archiving process - if not those numbers get even bigger over time.

In terms of quantity of list change - with the tenant in this case restructuring their file paths (effectively removing two levels of folders to deal with a path length issue), we were seeing changes of 10k - 100k items moving between incremental runs of the VBO job. I'm unclear if an item move is effectively one change (item A moved from B to C), or two (item A got deleted & item D got created) at the level VBO interrogates O365.

Post by **Mike Resseler** » Dec 21, 2021 3:27 am

@AlexHeylin

Unfortunately, 261k files in a single list (actually, a document library, but let's consider it the same) is a very big number, although MSFT states 30M is the maximum. Sizing in this case isn't even important. It is the number of items.

As said above, we will continue to investigate this further. But if you look here: https://docs.microsoft.com/en-us/office ... ine-limits, you see that MSFT advises a max of 300k items.

And the biggest issue is that if you connect to the site, you have a limit of 5k per connection.

Post by **AlexHeylin** » Dec 29, 2021 12:58 pm

Hi Mike,

I partly take your point - though if MS says 300k, but VBO becomes unworkable at 50-100k then again the question arises about VBO's suitability.

At this point it seems sensible to try and tell customers to stop migrating file servers to O365 - but I don't think that's going to fly. To the point we might lose customers over it. I'll pass on the limits you're suggesting to our staff.
Thanks
