Due to a catastrophic local storage failure, all performance extents suffered "unplanned removal" from the SOBR. We added a temporary extent and relied on the capacity tier while the local storage was rebuilt. Once rebuilt, the three local performance extents were re-added to the SOBR and rescanned, the temp extent was removed, and the SOBR was rescanned again. At that point the DB, indexes, etc. should all have been in sync - that's the purpose of a rescan: "go find what's there and make it work".
Since then we've been plagued by ongoing "Local index is not synchronized with object storage, please rescan the scale-out backup repository" errors, and offloads have had a high failure rate.
We've applied DB changes like:
Code:
UPDATE [dbo].[Backup.Model.BackupArchiveIndices] set version = '210' where archive_index_id = '2a1aa577-b9d7-4043-9662-fd4fd933674a'
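(For anyone following along: before running that UPDATE we check what's currently stored - a minimal sanity-check query, assuming the same table and columns as the UPDATE above; the GUID is just the affected archive index from our environment.)

Code:
-- Check the recorded index version for the affected extent before changing it
SELECT archive_index_id, version
FROM [dbo].[Backup.Model.BackupArchiveIndices]
WHERE archive_index_id = '2a1aa577-b9d7-4043-9662-fd4fd933674a';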
The issues persist. Today I noticed this:
Code:
Index has been recently resynchronized and can only be accessed in 4 hours 45 minutes
Does that mean "I see a resync is needed, but one was done recently, so I'm going to ignore the new results for 4.75 hours"? Why?
Generally - SOBR offloads generate errors in normal operation, and they should not. If there is a "normal" level of errors, perhaps there should also be a threshold at which something MUCH more vocal happens than logging each error to the Windows event log and putting the results in a semi-hidden job that no-one looks at.
Please remember that for many SOBR + object storage use cases, the object tier is "the" offsite copy. It's important that offload happens in a timely manner and that real failures are reported loudly.
Without wanting to be unreasonable here - SOBR offload, index sync, etc. just need to be more resilient and more reliable. Errors should not be "expected" in normal production operation. I see there are changes to object storage in v12, but I've not yet had a chance to catch up on whether they affect SOBR operation, and what the migration path to "the new world" is, especially with immutability configured in S3.
Thanks guys