Discussions related to using object storage as a backup target.
AlexHeylin
Veeam Legend
Posts: 563
Liked: 173 times
Joined: Nov 15, 2019 4:09 pm
Full Name: Alex Heylin
Contact:

[Feature Request] VBR to back off in response to S3 throttle rate replies instead of throwing hard errors

Post by AlexHeylin »

Arising from object-storage-f52/intermittent-failure ... 90440.html
post393293.html?hilit=throttle#p393293
post496168.html?hilit=throttle#p496168
and other experiences

Background
By default, VBR does not apply a concurrent job limit to S3 repos (and the docs don't tell you to set one), so it pushes data to them as fast as it can.

Current behaviour
Sometimes, for legitimate reasons, the storage replies with a "rate limit exceeded" error. VBR treats this as a terminal error and stops the task (and possibly the job), which it then marks as failed. The most commonly suggested "solution" is to apply a concurrent job limit on the S3 repo. This is crude: it imposes a hard limit that artificially slows down the repo all the time, not just under throttling conditions. It also requires manual tuning, and depending on the storage and load, the setting required to avoid ALL terminal errors could impose far lower throughput than the repo is capable of most of the time.

Requested behaviour
Sometimes the storage replies with a "rate limit exceeded" error. VBR should treat this the way VB365 treats the throttling messages it gets from M365: by backing off and then continuing to process as fast as the storage will allow. This auto-tuning would let the repo always run as fast as the storage allows at that specific time. It would remove the need to manually tune the concurrent job limit (or any other throttling limit), and it would ensure that storage stress merely slows jobs down so they still complete successfully, rather than ending in a terminal error.
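
To illustrate the kind of auto-tuning we're asking for, here's a rough sketch (Python pseudocode - obviously not Veeam code, and the class and method names are invented for the example):

Code:

# Back off hard when the storage pushes back, creep back up on success,
# so the repo always runs about as fast as the storage will allow right now.
class AdaptiveLimiter:
    def __init__(self, max_slots=64):
        self.max_slots = max_slots
        self.slots = max_slots          # current concurrent-request budget

    def on_success(self):
        self.slots = min(self.slots + 1, self.max_slots)   # additive increase

    def on_throttle_reply(self):
        self.slots = max(self.slots // 2, 1)               # multiplicative decrease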

With the move to object storage as a first-class, and recommended, repo type, reliability and speed become more important. We're seeing partners move from VCC-B to "tenant to object" as a means of getting away from some of the well-known challenges with VCC-B. We'd really like not to have a new set of headaches due to the implementation of the object repos.

Please note, the above is from memory and discussion with a fellow engineer, so some details may be slightly incorrect, but the overall theme should be correct.
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: [Feature Request] VBR to back off in response to S3 throttle rate replies instead of throwing hard errors

Post by Gostev » 1 person likes this post

Hi, Alex.

What HTTP error code are you talking about specifically when you say the following:
the storage replies with a "rate limit exceeded" response "error"
Is there any way for you to find out?

I'm asking because VBR has handled 503 errors with an exponential backoff algorithm since the beginning of our object storage integration, so the capability is already there in the engine. If there are other HTTP error codes that should be treated identically, then it should be a very simple change on our side that we could potentially deliver even as a hotfix for you to try.
AlexHeylin
Veeam Legend
Posts: 563
Liked: 173 times
Joined: Nov 15, 2019 4:09 pm
Full Name: Alex Heylin
Contact:

Re: [Feature Request] VBR to back off in response to S3 throttle rate replies instead of throwing hard errors

Post by AlexHeylin »

Hi Gostev,

We've looked through our servers and can't find a direct answer for you immediately. We're going to continue looking and check our old cases, as I know we've had at least one case for this issue before, but I can't find it currently.

Based on the other reports linked above, those cases are
# 06358590
# 06234303

From post496168.html?hilit=throttle#p496168, which is for Wasabi, we have

Code:

Failed to pre-process the job Error: REST API error: 'S3 error: Your account exceeded the limit for s3:PutObjectRetention requests. Please try again later.
Code: RequestRateLimitExceeded', error code: 403
This was posted as an expected result by chrisWasabi, who seems to work for them.

So if you meant 503 (not 403) in your message, it looks like while Amazon S3 might use 503, Wasabi uses 403. All our S3 is Wasabi. I expect the details of their implementation are something their R&D team can discuss with your R&D team, as they're a major partner.

Please let us know if that's enough info for you, so we can discontinue our searching.

Many thanks

Alex
AlexHeylin
Veeam Legend
Posts: 563
Liked: 173 times
Joined: Nov 15, 2019 4:09 pm
Full Name: Alex Heylin
Contact:

Re: [Feature Request] VBR to back off in response to S3 throttle rate replies instead of throwing hard errors

Post by AlexHeylin »

Hi Gostev,

My colleague found this screenshot, which is typical of the issue we've seen:

[Screenshot: SOBR offload job failing with a 503 error from Wasabi]

It's likely that every time we've seen this, it has been via SOBR offloading. We only started using S3 directly (not via SOBR) relatively recently, and I think we haven't seen this on direct S3, but we've definitely seen it several times on different SOBRs on different servers / VBR installs. Again - we can only speak to Wasabi, not any other S3 / object storage.

Based on which server this is and the date, it's running the latest v12 public build.

Thanks

Alex
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: [Feature Request] VBR to back off in response to S3 throttle rate replies instead of throwing hard errors

Post by Gostev » 1 person likes this post

Thank you, Alex!

@veremin could you check with the devs if our exponential backoff algorithm is for 503 errors only?
And if so, let's add error 403 into the same algorithm and create a 12.1 hotfix for users to test.

Separately, in the screenshot above it is suspicious that a 503 error causes the job to fail instead of invoking said exponential backoff algorithm.
Could it be that the algorithm got broken at some point? Or perhaps it does not back off long enough and fails too early (timeout too short)?
AlexHeylin
Veeam Legend
Posts: 563
Liked: 173 times
Joined: Nov 15, 2019 4:09 pm
Full Name: Alex Heylin
Contact:

Re: [Feature Request] VBR to back off in response to S3 throttle rate replies instead of throwing hard errors

Post by AlexHeylin »

Thanks Gostev,

Just in case you missed it - the screenshot above shows a 503 during offload. From what you said, this should already be handled?

Alex
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: [Feature Request] VBR to back off in response to S3 throttle rate replies instead of throwing hard errors

Post by Gostev »

We were typing at the same time :) I was expanding my previous post with the same.
veremin
Product Manager
Posts: 20284
Liked: 2258 times
Joined: Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin
Contact:

Re: [Feature Request] VBR to back off in response to S3 throttle rate replies instead of throwing hard errors

Post by veremin »

Sure, I will verify the details above with the R&D team this week.
veremin
Product Manager
Posts: 20284
Liked: 2258 times
Joined: Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin
Contact:

Re: [Feature Request] VBR to back off in response to S3 throttle rate replies instead of throwing hard errors

Post by veremin » 3 people like this post

There is indeed a backoff algorithm used for handling 503 errors – upon receiving the error, the product waits for a certain amount of time before trying to retransmit the request. If it continues to receive "service unavailable" errors for 30 minutes, the product stops.

Based on the screenshot, that is exactly what happened in that case – the product received 503, tried to retry it for half an hour, and then failed.
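
In pseudocode, the logic is roughly the following (an illustrative sketch, not the actual implementation; the per-retry delays shown here are invented - only the 30-minute overall budget matches what I described above):

Code:

import time

def retry_on_503(send_request, total_timeout=1800):    # 30-minute overall budget
    """Retry while the storage answers 503; give up once the budget is spent."""
    deadline = time.monotonic() + total_timeout
    delay = 1.0
    while True:
        status, body = send_request()
        if status != 503:
            return status, body             # success, or a non-retriable error
        if time.monotonic() >= deadline:
            raise TimeoutError("still 503 after 30 minutes - the task fails")
        time.sleep(delay)
        delay = min(delay * 2, 60.0)        # exponential backoff, capped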

As to 403 errors: by general definition, access-denied errors should not be retriable.

However, we understand that the implementation of standards varies from vendor to vendor, so we have provided some Wasabi-specific exceptions – for instance, we already retry the 403 (ConnectionLimitExceeded) error.

Adding a new 403 error to this scope should not be a big problem, so we’ve tracked it for future patches.

Also, it might be worth reaching out to the storage provider and checking their plans for structuring the error codes according to common standards.

Thanks!
AlexHeylin
Veeam Legend
Posts: 563
Liked: 173 times
Joined: Nov 15, 2019 4:09 pm
Full Name: Alex Heylin
Contact:

Re: [Feature Request] VBR to back off in response to S3 throttle rate replies instead of throwing hard errors

Post by AlexHeylin »

Thanks Veremin.

Just checking - are these Wasabi variations only applied if the repo was created as a "Wasabi" repo, or do they apply to generic S3 too?
Bear in mind the Wasabi GUI option is recent and many of us are using repos created before it existed (which cannot be easily recreated).

My interpretation of that screenshot is that the job ran OK for 30 mins, then got a 503 and ended with an error. I don't think it's entirely fair to say the job ran, almost immediately got a 503, and retried for 30 mins before ending with an error, based just on what that screenshot shows, as there's no evidence of any retrying in the job results screen. It might simply be that the retrying isn't reported in the job results, but it led us to interpret it as stopping after a single 503. We're happy to supply logs if you want to check.

It certainly seems like a bit more interoperability optimisation could be done, and given Wasabi is a big tech partner for Veeam, it would make sense for the two tech teams to talk directly about this and ensure all the strategies in the code are optimal - for speed, storage load, and job outcome.

Thanks!
veremin
Product Manager
Posts: 20284
Liked: 2258 times
Joined: Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin
Contact:

Re: [Feature Request] VBR to back off in response to S3 throttle rate replies instead of throwing hard errors

Post by veremin »

AlexHeylin wrote: Just checking - are these Wasabi variations only applied if the repo was created as a "Wasabi" repo, or do they apply to generic S3 too?
Those are Wasabi-specific exceptions.
AlexHeylin wrote: My interpretation of that screenshot is that the job ran OK for 30 mins, then got a 503 and ended with an error.
It didn't transfer any data during this period, which suggests it ran into connectivity issues immediately. However, you can open a ticket with logs covering the time you got this issue, and we can re-verify this assumption.

Thanks!
AlexHeylin
Veeam Legend
Posts: 563
Liked: 173 times
Joined: Nov 15, 2019 4:09 pm
Full Name: Alex Heylin
Contact:

Re: [Feature Request] VBR to back off in response to S3 throttle rate replies instead of throwing hard errors

Post by AlexHeylin » 1 person likes this post

Thanks veremin. Regarding these being implemented only if the repo is created as a Wasabi repo - I see two problems with that approach.

1. Most of your clients are using upgraded installs (not new v12 installs with a Wasabi option), and many are using repos created as generic S3 before there was a Wasabi-specific option. How are these supposed to benefit from the optimisations without recreating the repo?
This likely explains some of our issues, because the repos were created as generic S3 under v11, when that was the only option.

2. Wasabi can be white-labelled (rebranded) by a reseller. By requiring use of the Wasabi repo type to get these optimisations, you're effectively forcing the partner to undo their expensive white label and reveal that the storage is Wasabi underneath. Both options - use generic S3 and hit errors due to interop issues, or use the Wasabi type and reveal the storage vendor underlying the white label - are bad.

Why not just implement these interop improvements across both generic S3 and the Wasabi branded repo, so that the Wasabi version is just branding and minor simplification?
This would cleanly resolve both of the cases above.

We're going to gather info on whether we're seeing this on "Wasabi" repos or generic S3 repos. It's likely we're asking you to make changes you've already made, because we're using generic S3 repos, many dating from before there was a Wasabi option. If so, what's the easiest and lowest-impact way to get VBR to recognise these as Wasabi repos?


As a general request: when making changes, yes, it's great for things to be fixed for new installs, but given most of your customers are running upgraded installs, I suggest it's even better if they're fixed for upgraded installs. Fixing both is ideal. Please can Veeam bear this in mind in future and design fixes for both, not only for new installs - a situation we've run into repeatedly over the last couple of years, and one that has caused us a lot of pain and work.

In addition - we've seen nothing that says "If you're using Wasabi, you MUST change to the Wasabi repo type for interop reasons, because we wrote interop fixes in there only - so don't use generic S3 for Wasabi as you have been". Fixing things that require a change in user action doesn't fix them unless the user knows. The only mention of "wasabi" in the v12 release notes is:
Once backups are created, they can be copied (for redundancy) or offloaded (for long-term retention) to one of
the following hot object storage types using the scale-out backup repository Capacity Tier:
• Amazon S3 (including AWS Snowball Edge)
• Google Cloud Storage
• IBM Cloud Object Storage
• Microsoft Azure Blob Storage (including Microsoft Azure Data Box)
• Wasabi Hot Cloud Storage
• Any S3-compatible object storage (on-premises appliance, or cloud storage provider)
This type of change seems to be exactly what I'd expect to be in the release notes. IMO release notes are likely to be read by existing customers (in lieu of, or alongside, a What's New doc), whereas the full docs are not likely to be re-read by existing customers unless they know something specific has changed. Even the full docs don't say "You MUST use the Wasabi repo type with Wasabi or you'll have interoperability problems".

Thanks

Alex
veremin
Product Manager
Posts: 20284
Liked: 2258 times
Joined: Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin
Contact:

Re: [Feature Request] VBR to back off in response to S3 throttle rate replies instead of throwing hard errors

Post by veremin » 1 person likes this post

I would like to clarify my previous answers regarding 403 errors.

Generally, these errors indicate "access denied" issues and are not meant to be retried. This is a standard practice followed by most vendors.

However, Wasabi takes a slightly different approach and also uses 403 errors to indicate performance problems. When such errors occur, Wasabi generates a 403 and includes details in its response (for example, ConnectionLimitExceeded).

In version 10, we implemented an algorithm to analyze the response for 403 errors and automatically retry them if specific details are present (such as ConnectionLimitExceeded). This mechanism has been functioning for a couple of versions, and it works irrespective of how Wasabi Hot Cloud Storage is added to the backup server, whether through a branded or general wizard.
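
Conceptually, the check works along these lines (an illustrative boto3 sketch, not our actual code; the helper name, bucket, and exact retriable code set are made up for the example - only ConnectionLimitExceeded and the RequestRateLimitExceeded code from the log earlier in this thread are taken from the discussion):

Code:

import time

import boto3
from botocore.exceptions import ClientError

# Detail codes that indicate throttling rather than a genuine access-denied.
RETRIABLE_403_CODES = {"ConnectionLimitExceeded", "RequestRateLimitExceeded"}

def put_object_with_retry(s3, delay=1.0, total_timeout=1800, **kwargs):
    """Retry PutObject only when the 403 carries a throttling detail code."""
    deadline = time.monotonic() + total_timeout
    while True:
        try:
            return s3.put_object(**kwargs)
        except ClientError as exc:
            code = exc.response["Error"]["Code"]
            if code not in RETRIABLE_403_CODES or time.monotonic() >= deadline:
                raise               # genuine access-denied, or budget exhausted
            time.sleep(delay)
            delay = min(delay * 2, 60.0)    # exponential backoff, capped

# Example usage (credentials come from the environment; bucket is hypothetical):
s3 = boto3.client("s3", endpoint_url="https://s3.wasabisys.com")
put_object_with_retry(s3, Bucket="example-bucket", Key="probe", Body=b"data")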

I referred to these exceptions as Wasabi-specific only because other object storage systems do not handle 403 errors in the same way.

I hope this clarifies the situation.

Thank you!
AlexHeylin
Veeam Legend
Posts: 563
Liked: 173 times
Joined: Nov 15, 2019 4:09 pm
Full Name: Alex Heylin
Contact:

Re: [Feature Request] VBR to back off in response to S3 throttle rate replies instead of throwing hard errors

Post by AlexHeylin »

Thanks Veremin, that's a much better situation than your previous answer suggested. Do you mean 403, 503, or both?
Wasabi seems to use 403 for request-rate throttling and 503 for connection throttling. Both cause jobs to fail if not handled correctly.

We've seen a number of instances where jobs fail citing a 503 as the reason. See my post from 29 Nov 2023, 14:19, where the job failed due to a 503 from Wasabi.

It looks like 403 is handled, but 503 is still treated as a hard error and fails the job. My understanding is that the connection limit Wasabi applies is dynamic, based on endpoint load, so setting a static connection limit / simultaneous job limit is suboptimal.

Please can you check on this, because the issue is manifesting and IMO could be handled better than hard-failing the job.

Just to confirm - the "Wasabi" version of the S3 setup in VBR etc. is simply branding and simplified config... it does NOT affect operation, so once configured it's identical in operation to generic S3?

Thanks
veremin
Product Manager
Posts: 20284
Liked: 2258 times
Joined: Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin
Contact:

Re: [Feature Request] VBR to back off in response to S3 throttle rate replies instead of throwing hard errors

Post by veremin » 1 person likes this post

AlexHeylin wrote: Please can you check on this, because the issue is manifesting and IMO could be handled better than hard-failing the job.
I believe I've already addressed the questions about the 503 handling algorithm in my original response.

There is indeed a backoff process in place that handles these errors. When a 503 error is received, the product waits for a specified amount of time before attempting to retransmit the request. If the product continues to receive "service unavailable" errors for 30 minutes, it stops trying.
AlexHeylin wrote: Just to confirm - the "Wasabi" version of the S3 setup in VBR etc. is simply branding and simplified config... it does NOT affect operation, so once configured it's identical in operation to generic S3?
Correct.

I hope this clears up any confusion.

Thanks!
AlexHeylin
Veeam Legend
Posts: 563
Liked: 173 times
Joined: Nov 15, 2019 4:09 pm
Full Name: Alex Heylin
Contact:

Re: [Feature Request] VBR to back off in response to S3 throttle rate replies instead of throwing hard errors

Post by AlexHeylin »

Ah! I misinterpreted the logs as "I ran for 30 mins, then got a 503 and failed" - and missed that detail of your response. Agreed - all working. Is the "30 mins" hard-coded, registry-controllable, or GUI-controllable? Thanks!
veremin
Product Manager
Posts: 20284
Liked: 2258 times
Joined: Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin
Contact:

Re: [Feature Request] VBR to back off in response to S3 throttle rate replies instead of throwing hard errors

Post by veremin » 1 person likes this post

There is a registry key to tweak it - S3RequestRetryTotalTimeoutSec. Thanks!
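
For reference, raising it might look like this (a sketch only: it assumes the standard VBR registry hive on the backup server and a DWORD value in seconds - please confirm the exact key path and value type with support before relying on it):

Code:

import winreg

# Assumed location of VBR registry values; run on the backup server as admin.
KEY_PATH = r"SOFTWARE\Veeam\Veeam Backup and Replication"

with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                        winreg.KEY_SET_VALUE) as key:
    # Default retry window is 30 minutes (1800 s); this raises it to 1 hour.
    winreg.SetValueEx(key, "S3RequestRetryTotalTimeoutSec", 0,
                      winreg.REG_DWORD, 3600)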
AlexHeylin
Veeam Legend
Posts: 563
Liked: 173 times
Joined: Nov 15, 2019 4:09 pm
Full Name: Alex Heylin
Contact:

Re: [Feature Request] VBR to back off in response to S3 throttle rate replies instead of throwing hard errors

Post by AlexHeylin »

Perfect - thanks! :-D