Wasabi US-East-2 Issues

Post by **RobMiller86** » Sep 27, 2022 12:41 pm

Is anyone else seeing problems offloading to Wasabi US-East-2? It started around 10-14 days ago. Multiple deployments of ours that offload to US-East-2 are hitting timeout errors. Veeam offload jobs sit for hours transferring nothing while still occupying job slots. We see errors such as:

HTTP exception: Retrieving message chunk header, error code: 110
Exception from server: HTTP exception: Retrieving message chunk header, error code: 110
Checkpoint cleanup failed Details: HTTP exception: Retrieving message chunk header, error code: 110
Could not allocate processing resources within allotted timeout (14400 sec)

It's just a mess, with offload jobs piling up and not completing. I have Veeam and Wasabi cases open, but so far neither is really going anywhere. Veeam says they are tracking an increasing number of issues with customers offloading to Wasabi US-East-2. Wasabi tells me to just run new active fulls for everything, which Veeam says not to do, and I don't want to do that either. At the end of the day, we are left with very long-running Veeam jobs that sit and sit doing nothing while the whole copy operation grinds to a halt. I have to run the performance extent of the SOBR at twice the concurrent jobs of the capacity extent at Wasabi so that inbound copy jobs to the SOBR still run and aren't waiting for a job slot held by the never-ending Wasabi offloads. Jobs sit at 99% forever, or just stop sending anything at random other percentages of completion. I'll look at an offload job and 4 VM backups are sitting there at various percentages of completion, not moving at all, with no traffic sent in hours. I do believe this is a Wasabi issue, but when Wasabi has problems, Veeam doesn't handle the failures well, and the whole thing grinds to a halt.

Veeam Ticket: 05638326
Wasabi Ticket: 73029

Anyone else seeing any issues with Wasabi lately?

Post by **Regnor** » Sep 27, 2022 1:20 pm

We don't have any buckets in US-EAST-2, but I have one customer in EU-CENTRAL-2 who is experiencing a similar issue.
For the past 2 weeks, offloads start, upload some data, and then get stuck without any error; after that point no further progress is made and you need to abort the offload tasks.
I haven't seen any real errors in the log besides the timeout errors you've posted. The Veeam ticket number is #05634023.
That's one of many customers in that region, so we haven't contacted Wasabi so far.

Post by **veremin** » Sep 27, 2022 4:18 pm

Thank you, Max and Rob, for posting the support ticket numbers. We will check the cases internally with the QA team and see what might be the root cause of the issue.

I will keep the topic updated.

Post by **veremin** » Sep 30, 2022 2:48 pm

We double-checked the solution on our side and have not found any issues. So for similar problems, we recommend reaching out to the Wasabi support team directly. Thanks!

Post by **Regnor** » Oct 02, 2022 9:14 am

Thanks Vladimir for checking! We'll next disable any security solutions and see how it goes. If this doesn't help we will contact Wasabi in parallel.

Post by **veremin** » Oct 04, 2022 4:19 pm

You're welcome, Max; we'd appreciate it if you updated us on the results of your investigation. Thanks!

Post by **RobMiller86** » Oct 06, 2022 1:57 pm

So far we have made no progress. I have submitted Veeam logs and bucket logs to Wasabi, but there are no root cause analysis results yet. This is all I have from Wasabi so far:

Thanks for sending over those results. In our efforts to investigate this we have noticed that these timeouts reported in Veeam generally look like they are associated with cleanups of the bucket end. We know that there is an API call that Veeam makes "MultipleDeleteObjects" as a part of the cleanup process. It is possible that due to the large amount of "child" API calls that are in one of these calls (up to 1000 DELETE requests in a MultipleDeleteObjects "parent" call), that it is taking a long time to complete and therefore failing due to timeout.

We are looking into our handling of this MultipleDeleteObjects request as a cause of this issue. To help with the investigation, would you mind enabling bucket logging, reproducing the issue within Veeam, and then sending over the application and bucket logs from the relevant timeframe?

Post by **RobMiller86** » Oct 06, 2022 4:40 pm

Thanks for sending over those logs. As previously mentioned, we have noticed that these timeouts reported in Veeam generally look like they are associated with cleanups of the bucket end. The timeout is coming from Veeam (we're not sending the timeout), but we are led to believe from looking into this that the amount of time it takes for us to process the delete operations is taking longer than Veeam's allotted timeout period.

In the meantime while we continue to investigate on our end, what you will need to do is increase the S3 timeout within the Veeam application by setting a higher value for the "S3RequestRetryTotalTimeoutSec" registry key (manual entry) and lower any other high process concurrency based on the expertise of the Veeam Support team. Please note that increasing "S3RequestRetryTotalTimeoutSec" will lead to additional time on the jobs/operations.

Here is an example of the registry change:

("S3RequestRetryTotalTimeoutSec"=dword:00007200)

Please contact Veeam to confirm this value and for assistance in making this update. I'd be happy to join the call with Veeam to discuss what we know so far and make that change to the registry.

I'm going to try this registry change, but it does already take a LONG time to timeout, so it still seems odd to me that offloading now takes forever at Wasabi.

Post by **RobMiller86** » Oct 06, 2022 4:58 pm

I'm not even sure precisely where to create this key. It doesn't exist on the Veeam server. They did say it was a manual entry but didn't say where to create it. However, I also see errors in Veeam such as:

Timed out waiting for backup infrastructure resources to become available (14400 sec)

7200 looks to be half of that, which would only make things worse if it's the same timeout.
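
One thing worth checking about the example value, as a minimal sketch: in .reg file syntax a `dword:` value is written in hexadecimal, so `dword:00007200` imported as-is would be 29184 seconds rather than 7200. Whether Wasabi meant 7200 decimal is not stated in their message, and it is not confirmed here that this key controls the same timeout as the 14400 sec message above. The helper names below are made up for illustration.

```python
def reg_dword_to_int(text: str) -> int:
    """Parse a .reg-style 'dword:XXXXXXXX' value; the digits are hexadecimal."""
    prefix = "dword:"
    if not text.lower().startswith(prefix):
        raise ValueError(f"not a dword value: {text!r}")
    return int(text[len(prefix):], 16)


def int_to_reg_dword(value: int) -> str:
    """Format a decimal number the way a .reg file expects a DWORD."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("DWORD values must fit in 32 bits")
    return f"dword:{value:08x}"


# The value from the example, read as a .reg file would read it.
print(reg_dword_to_int("dword:00007200"))  # 29184

# What 7200 and 14400 seconds would look like in .reg notation.
print(int_to_reg_dword(7200))   # dword:00001c20
print(int_to_reg_dword(14400))  # dword:00003840
```

If the value is typed into regedit by hand instead, the result depends on whether the Hexadecimal or Decimal base is selected in the DWORD dialog, so it is worth confirming the intended number with Veeam support before applying it.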

Post by **vscp0514** » Oct 06, 2022 5:16 pm

We've seen the same issues over the past few weeks with offloading to Wasabi US-East-2.

This was impacting around 15 separate VBR instances, but thankfully has cleared up to only 3-4 VBR instances continually having issues.

Really hope to get this cleared up soon, tired of babysitting and having to manually stop/start offloads/moves. We also have 1-2 VBR servers that are still weeks behind on offloads now.

Veeam case: 05635736
Wasabi case: 72733

RobMiller86 wrote: ↑Oct 06, 2022 4:40 pm I'm going to try this registry change, but it does already take a LONG time to timeout, so it still seems odd to me that offloading now takes forever at Wasabi.

Wasabi passed the same registry edit over to me; it made no difference on the few servers I tested with. The location should be HKEY_LOCAL_MACHINE\SOFTWARE\Veeam\Veeam Backup and Replication\ as usual.

Post by **veremin** » Oct 07, 2022 9:54 am

It does not look like implementing these registry keys will solve region-specific issues. Anyway, we will check the cases opened with our support team and verify the cause. Thanks!

Post by **RobMiller86** » Oct 10, 2022 2:22 pm

I am removing the registry key. It has only made the issues worse as they never resolve and only take longer to fail. Still no resolution at this point.
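
Why a larger timeout can look worse when the far end never answers can be sketched with a toy retry loop. This is not Veeam's implementation: the function name, the fixed 600-second attempt duration, and the simulated clock are all assumptions chosen only to show the trade-off.

```python
from typing import Optional, Tuple


def run_with_budget(attempt_seconds: int, budget_seconds: int,
                    succeeds_on: Optional[int]) -> Tuple[bool, int, int]:
    """Simulate retrying a request until it succeeds or the total budget is spent.

    attempt_seconds: simulated time each attempt blocks before returning.
    budget_seconds:  total time allowed across all attempts.
    succeeds_on:     1-based attempt that finally succeeds, or None if the
                     server never answers in time.
    Returns (succeeded, simulated_elapsed_seconds, attempts_made).
    """
    elapsed = 0
    attempts = 0
    while elapsed + attempt_seconds <= budget_seconds:
        attempts += 1
        elapsed += attempt_seconds
        if succeeds_on is not None and attempts >= succeeds_on:
            return True, elapsed, attempts
    return False, elapsed, attempts


for budget in (7200, 14400):
    never = run_with_budget(600, budget, succeeds_on=None)
    slow = run_with_budget(600, budget, succeeds_on=20)
    print(budget, "never answers:", never, "answers on 20th try:", slow)
```

In this model a larger budget only helps when the operation would eventually succeed within it; when it never succeeds, the job simply holds its slot longer before failing.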

Post by **veremin** » Oct 10, 2022 5:52 pm

Our understanding is that the issue does not have anything to do with our software, and implementing registry keys won't resolve the problem. I will update the topic if further investigation shows something different. Thanks!

Post by **Regnor** » Oct 10, 2022 6:29 pm

We've received a hotfix from Veeam because of an SSL issue. I hope to see tomorrow if it changed anything.

@Vladimir: If it's an issue on Wasabi's side, it would be great if Veeam would catch this and time out at some point.

Post by **Regnor** » Oct 11, 2022 6:21 am

Update: With the hotfix, the offload tasks no longer get stuck. At least one task did complete, which never happened in the last weeks.
But now we're seeing new errors which need to be checked.

Post by **veremin** » Oct 11, 2022 11:00 am

Max, the QA team checked your ticket and found that you are affected by another issue for which you were provided the hotfix - the problem is related to our software and is different than the one reported in this thread.

To solve the original problem (sudden cleanup performance decrease spotted in certain regions), QA engineers believe it's better to reach Wasabi representatives. The corresponding reg keys increase the interval during which a backup server waits for a response from the cloud, but it does not solve the underlying problem.

Thanks!

Post by **DeanCTS** » Oct 13, 2022 2:43 am

I would advise everyone to reduce the number of concurrent tasks against the Wasabi scale-out backup repository in order to expedite resolution for everyone involved. In our environment I've reduced the number from 16 to 12.
The issue we are observing looks very similar to an active DDoS attack, but in this case it is most likely capacity limitations that Wasabi is dealing with, potentially caused by an unexpected influx of high-volume customers.

Post by **Regnor** » Oct 13, 2022 7:24 am

@Vladimir: At least our problems seem to be fixed after implementing the hotfix. Or Wasabi changed something on their side at the same time.
The offloads are now completing again, so it looks very promising at the moment.

Only the checkpoint cleanup is now failing. Is this related to the cleanup performance decrease you were mentioning, or is this something different?

Post by **veremin** » Oct 13, 2022 11:25 am

Originally you came across the certificate request timeout issue for which you were provided the hotfix.

Regnor wrote: ↑Oct 13, 2022 7:24 am Only the checkpoint cleanup is now failing. Is this related to the cleanup performance decrease you were mentioning, or is this something different?

Does it fail with the "timeout exceeded" error or similar? If so, you might experience the problem discussed in this thread.

Thanks!

Post by **Regnor** » Oct 13, 2022 11:56 am

No, it's failing with Access denied ("DeleteMultipleObjects request failed to delete object [...] error: AccessDenied, message: 'Access Denied' ")

Post by **veremin** » Oct 13, 2022 12:13 pm

It does not look like the issue discussed here, so I'd continue working with our support team. I will reach our QA engineers again tomorrow (currently, most of the team members are unavailable) and ask them to assist with the investigation. Thanks!

Post by **Regnor** » Oct 14, 2022 5:57 pm

Then thanks for your support Vladimir. I'll update this topic as soon as our problem is resolved.

Post by **DeanCTS** » Oct 15, 2022 12:18 pm

Has anyone had any luck getting past this issue? We're down to 8 simultaneous connections from 16; hopefully this will be resolved.

Post by **veremin** » Oct 17, 2022 10:18 am

Any chance you have reached out to a Wasabi representative already? As far as we know, they are aware of the issue and are investigating it now.

We can summon @vbelagodu and see whether he has any updates on this matter.

Thanks!

Post by **RobMiller86** » Oct 17, 2022 11:57 am

We still have no resolution. It's been almost a month and still nothing concrete. The last thing they told us to do was to reduce the number of deletes Veeam performs per request from the default of 1000 down to 500:

I just got off of a call with my manager and a couple members of our Engineering team. While we continue to investigate and test on our end, we do have a potential solution to help mitigate the issue in the meantime. Since we believe the issue is related to the amount of keys that Veeam is sending during a DeleteMultipleObjects API call, we can limit this number in Veeam from the default of 1,000 objects:

HKEY_LOCAL_MACHINE\SOFTWARE\Veeam\Veeam Backup and Replication
S3MultiObjectDeleteLimit
Dword
default value: 1000

Would you mind setting the value to 500 and letting me know if the situation improves?

Nothing has worked, and I have to babysit this every day. Frustration levels are high. All I get is "please submit more logs" over and over even though the behavior is the same.
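
The effect of a per-request delete cap like the one Wasabi suggested can be pictured with a small sketch. It does not show Veeam's internals or call any S3 API; the function name and the 2,500 made-up object keys are assumptions used only to show how a cap of 1000 versus 500 changes the number and size of delete requests.

```python
from typing import Iterator, List, Sequence


def batch_keys(keys: Sequence[str], limit: int) -> Iterator[List[str]]:
    """Split object keys into batches of at most `limit` keys each."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    for start in range(0, len(keys), limit):
        yield list(keys[start:start + limit])


# Hypothetical object keys queued for cleanup.
keys = [f"blocks/{i:06d}" for i in range(2500)]

for limit in (1000, 500):
    sizes = [len(batch) for batch in batch_keys(keys, limit)]
    print(f"limit={limit}: {len(sizes)} requests, sizes={sizes}")
```

With the smaller cap each request carries less work, so it is more likely to finish inside a fixed timeout, at the cost of more requests overall. Whether that helps in practice depends on where the slowdown actually is.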

Post by **vscp0514** » Oct 17, 2022 12:55 pm

In the same position as Rob.

I have one VBR instance so backed up on cleanup jobs that the performance tier is out of space and scheduled backups are disabled. I'm waiting on Veeam support to help investigate the cleanup queue to see if it's actually decreasing or if I'm stuck in a loop. It seems to be VBR instance specific, but I've gotten cleanup stable enough on this VBR server that I can run cleanup for 6-8 hours without errors, which I've been doing for days.

My team has wasted way too many hours babysitting VBR instances with no end in sight. I understand Wasabi failing to respond in a timely manner to delete requests is not Veeam's fault, but I had hoped that a month in we'd have a workaround/stopgap in place from Veeam.

Post by **veremin** » Oct 17, 2022 1:01 pm

We have reached the Wasabi team to get an update on the issue resolution status. We will post back once we have more information. Thanks!

Post by **RobMiller86** » Oct 18, 2022 12:35 pm

One of my VMs, a fairly IO intensive VM, hasn't been offloaded in a week. Then I see Veeam is saying this every time, and calls it a success?

Timed out waiting for the index lock to release, backup chain will be processed during the next offload cycle

Why would Veeam call this a success during every offload cycle?

Post by **vscp0514** » Oct 18, 2022 5:14 pm

After working further with a Tier 2 escalation SOBR specialist, I have some new keys in place which are making a difference and allow me to offload/cleanup. It's slow, but at least I'm finally making progress and working around the Wasabi issues (which are still occurring).

This was for HTTP exception: WinHttpQueryDataAvaliable: 12002 errors, specifically on cleanup.

@robmiller86 If that is one of the common errors you are seeing, try escalating to the tier 2 SOBR team. Otherwise, maybe @veremin can review our tickets, see if the same keys are applicable, and have someone reach out to you.

Post by **veremin** » Oct 19, 2022 9:07 am

As mentioned, the keys are a temporary solution – they increase the waiting timeout but do not address the clean-up performance decrease issue (reported in several Wasabi regions). For that, we are still looking for an update from the vendor.

Thanks!
