Discussions related to using object storage as a backup target.
dmtinklenb
Influencer
Posts: 15
Liked: 1 time
Joined: Jan 11, 2021 6:26 pm

RMAN Offloads Jobs to AWS S3 Failing During Archive Cleanup

Post by dmtinklenb »

Case# 05484819

I opened a support case for this issue over a month ago, and we are no closer to a resolution than we were on the day we opened it. In fact, the issues have gotten worse.

I have two separate RMAN scale-out backup repositories (SOBRs), each with its own capacity tier bucket in AWS S3 storage. I have the SOBRs set to immediately copy to AWS with immutability set at 14 days.

I noticed in June that the offload jobs for these SOBRs would sit at 99% completion for hours and then usually fail with the error:

Error: REST API error: 'DeleteMultipleObjects request failed to delete object.

I opened a case with support, and in looking through the logs the support engineer saw multiple S3 Slowdown errors during the archive cleanup process. The data was being copied to AWS successfully without any errors, but once the cleanup process started, the S3 Slowdown errors would start showing up in the logs. His conclusion was that these slowdown errors were causing the issue and that the archive cleanup wasn't taking place. He had me change the permissions on our AWS bucket, which didn't help. He then had me change a registry setting, S3ConcurrentTaskLimit, from 64 (default) to 32. Didn't help. Then change to 16. Didn't help. Then change to 8. Didn't help, and at this point all of our offload-to-AWS jobs were backing up and not able to complete.

The next option he gave was to add another registry key, S3MultiObjectDeleteLimit, set to 100 (default is 1000). Didn't help. We then started experimenting with different combinations of the two registry keys to try to find a sweet spot where we wouldn't see the S3 slowdown messages anymore. Nothing has worked. And in the month since we started doing all this experimenting, our S3 bucket sizes have increased by over 15 TB and the object counts for both buckets have increased by over 300M objects.

Because of this, indexing for these two offload jobs has increased to over 6 hours, after which archive cleanup runs for another 12 hours or more and then finally fails with the above error.
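For readers trying to picture what these two settings trade off, here is a rough Python sketch of a cleanup loop with a batch size and a concurrency limit, plus exponential backoff on throttling. It is only a conceptual model: the names `SlowDown`, `cleanup`, and `delete_batch` are made up for this example, and it does not represent how VB&R actually implements S3ConcurrentTaskLimit or S3MultiObjectDeleteLimit.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class SlowDown(Exception):
    """Hypothetical stand-in for an S3 throttling (slow down) response."""


def chunked(keys, batch_size):
    """Split keys into lists of at most batch_size items."""
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]


def delete_with_backoff(delete_batch, batch, max_retries=5,
                        base_delay=0.5, sleep=time.sleep):
    """Call delete_batch(batch), retrying with exponential backoff on SlowDown."""
    for attempt in range(max_retries + 1):
        try:
            return delete_batch(batch)
        except SlowDown:
            if attempt == max_retries:
                raise  # give up after the final retry
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...


def cleanup(keys, delete_batch, batch_size=1000, concurrency=64,
            sleep=time.sleep):
    """Delete keys in batches, with at most `concurrency` batches in flight.

    delete_batch is a caller-supplied function (an assumption of this sketch)
    that deletes one batch and returns the number of objects removed.
    """
    batches = chunked(keys, batch_size)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [
            pool.submit(delete_with_backoff, delete_batch, b, sleep=sleep)
            for b in batches
        ]
        return sum(f.result() for f in futures)
```

In this model, lowering `concurrency` or `batch_size` reduces the request rate but also stretches total cleanup time, which is roughly the trade-off we kept running into.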

I'm at my wit's end, so any suggestion or assistance anyone can give would be greatly appreciated.
PetrM
Veeam Software
Posts: 3258
Liked: 525 times
Joined: Aug 28, 2013 8:23 am
Full Name: Petr Makarov
Location: Prague, Czech Republic

Re: RMAN Offloads Jobs to AWS S3 Failing During Archive Cleanup

Post by PetrM »

Hello,

First of all, I'd like to say that I wouldn't consider all these attempts to fix the issue with different registry settings a waste of time. This extensive testing helped our engineers narrow down the issue by ruling out the most probable root causes. Sometimes we encounter sophisticated cases that require a lot of time and attention from the support team to carry out root cause analysis (RCA), and it looks like the case described above is a tricky one. Anyway, I will contact our support team leaders to request escalation of this case.

Thanks!

Re: RMAN Offloads Jobs to AWS S3 Failing During Archive Cleanup

Post by dmtinklenb »

So, after working with Veeam support for 4 months on this case, it turned out to be a bug in VB&R: the S3 archive cleanup performed by VB&R was sending delete commands to AWS without the version ID in the request, which causes AWS to create delete markers for those objects. Without the version IDs, those delete markers never get removed.
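For anyone unfamiliar with the behavior: on a versioning-enabled S3 bucket, a delete request that names only the key adds a delete marker on top of the existing versions, while a request that also names a version ID permanently removes that specific version (a delete marker included). Below is a minimal boto3-style sketch of the difference. The helper names are made up for illustration, `s3` is assumed to be a client created with `boto3.client("s3")`, and this is not the code from the hotfix. I would not run anything like this against a Veeam-managed bucket without guidance from support.

```python
def build_delete_payloads(versions, batch_size=1000):
    """Build DeleteObjects payloads that pin every key to a version ID.

    S3 accepts at most 1000 keys per DeleteObjects request.
    """
    payloads = []
    for i in range(0, len(versions), batch_size):
        chunk = versions[i:i + batch_size]
        payloads.append({
            "Objects": [
                # Omitting "VersionId" here is what would create a delete
                # marker instead of removing the version.
                {"Key": v["Key"], "VersionId": v["VersionId"]}
                for v in chunk
            ],
            "Quiet": True,
        })
    return payloads


def find_delete_markers(s3, bucket, prefix=""):
    """Collect delete markers under a prefix via ListObjectVersions."""
    markers = []
    paginator = s3.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for marker in page.get("DeleteMarkers", []):
            markers.append(
                {"Key": marker["Key"], "VersionId": marker["VersionId"]}
            )
    return markers


def delete_versions(s3, bucket, versions):
    """Permanently delete the given key/version pairs in batches."""
    for payload in build_delete_payloads(versions):
        s3.delete_objects(Bucket=bucket, Delete=payload)
```

Counting what `find_delete_markers` returns for a prefix is one way to see whether markers are piling up the way they did in our buckets.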

The crazy thing is that I opened this case on June 14. The files in the hotfix show a modification date of June 20, which tells me this issue was known around the same time I opened the case. Why then did we have to go four months back and forth between Veeam and AWS support before Veeam finally admitted that there was a bug with an available bugfix?

Re: RMAN Offloads Jobs to AWS S3 Failing During Archive Cleanup

Post by PetrM »

Hello,

My assumption is that the main challenge was proving that your problem matches the one fixed in June. In rare cases, it takes a long time to conclude that the symptoms fully fit the conditions under which a known issue occurs. We always validate the solutions we propose and must be sure that a fix is absolutely safe and will not provoke other problems, which may require additional testing in our environments. I'm going to ask our support team leaders to shed light on this and to contact you for detailed feedback, which will no doubt help us improve our case-handling processes in the future.

Thanks!