While performing delete operations on some old Disk (Imported) backups, we regularly run into the error below (support case #04509715).
Code:
[23.11.2020 12:07:24.566] < 15100> aws| WARN|HTTP request failed, retry in [1] seconds, attempt number [1], total retry timeout left: [1800] seconds
[23.11.2020 12:07:24.566] < 15100> aws| >> |Amazon REST error: 'S3 error: Please reduce your request rate.
[23.11.2020 12:07:24.566] < 15100> aws| >> |Code: SlowDown', error code: 503
[23.11.2020 12:07:24.566] < 15100> aws| >> |--tr:Request ID: 5A909BF002AA5C88
[23.11.2020 12:07:24.566] < 15100> aws| >> |Other: HostId: 'uYjWmLE8lZnZmOLtmxcItYP7bkSChsuEDAUrPWkEFZTFPj/A2zHPSYJSU3fqZjrUfP7RZBQ/Z60='
[23.11.2020 12:07:26.066] < 39288> aws| WARN|HTTP request failed, retry in [34] seconds, attempt number [6], total retry timeout left: [1750] seconds
[23.11.2020 12:07:26.066] < 39288> aws| >> |Amazon REST error: 'S3 error: Please reduce your request rate.
[23.11.2020 12:07:26.066] < 39288> aws| >> |Code: SlowDown', error code: 503
[23.11.2020 12:07:26.066] < 39288> aws| >> |--tr:Request ID: 7F98B851D7101500
[23.11.2020 12:07:26.066] < 39288> aws| >> |Other: HostId: 'QryEJu8LL2hUhC4ReZGdUyCqjDtJ/WCRtx8avkTW8lafD+BI1YnWJpqAyaJrHiKzQ9dGrakfc9A
The response from AWS support:

From 10:30am to 11:30am I can see more than 3500 DELETE requests per second against the bucket.
S3 allows deleting multiple objects with a single HTTP request using the "DeleteObjects" API. The single HTTP request is what CloudWatch tracks; internally, however, S3 processes a DELETE operation for each object in the original request. This explains why you see only about 350 requests in the metrics, while I see more than 3500 "internal" operations.
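To illustrate the point, this is roughly what such a multi-object delete looks like through the API. It is only a boto3 sketch with made-up bucket and key names; Veeam's actual implementation is not shown here:

Code:
import boto3

s3 = boto3.client("s3")

# Hypothetical keys; a single DeleteObjects request can carry up to 1000 of them.
keys_to_delete = [f"backups/block-{i:06d}.blk" for i in range(1000)]

# CloudWatch records this as one request, but S3 still performs one internal
# DELETE per key, so 1000 keys here mean 1000 internal delete operations.
s3.delete_objects(
    Bucket="example-veeam-offload-bucket",
    Delete={"Objects": [{"Key": k} for k in keys_to_delete], "Quiet": True},
)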
Internal S3 resources for this request rate aren't assigned up front. Instead, as the request rate against a prefix increases gradually, Amazon S3 automatically scales to handle the increased request rate. A sudden burst of deletes outpaces that scaling, which is why you are seeing these 503 errors.
I suggest you contact Veeam support and, if possible, work with them so the backup application gradually increases its request rate and retries failed requests using an exponential backoff algorithm, as explained in this documentation [1].
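For illustration, a minimal sketch of the retry pattern being described, using boto3 with exponential backoff and full jitter. The function and parameter names are hypothetical and this is not how Veeam actually implements its deletes:

Code:
import random
import time

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def delete_batch_with_backoff(bucket, keys, max_attempts=8):
    # Retry a DeleteObjects call with exponential backoff and full jitter
    # whenever S3 answers 503 SlowDown.
    for attempt in range(max_attempts):
        try:
            return s3.delete_objects(
                Bucket=bucket,
                Delete={"Objects": [{"Key": k} for k in keys], "Quiet": True},
            )
        except ClientError as err:
            if err.response.get("Error", {}).get("Code") != "SlowDown":
                raise
            # Double the wait each attempt, cap at 60 seconds, randomize fully.
            time.sleep(min(60, 2 ** attempt) * random.random())
    raise RuntimeError(f"still throttled after {max_attempts} attempts")

A real implementation would also need to check the per-object "Errors" list in the DeleteObjects response, since individual keys can fail even when the request itself succeeds.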
Additionally, you can ask them to distribute objects and requests across multiple prefixes, which is a best practice in rare cases where the supported request rates are exceeded.
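A hash-based key layout is one common way to do that distribution; the sketch below is purely illustrative (the fanout value and key format are made up):

Code:
import hashlib

def prefixed_key(original_key: str, fanout: int = 16) -> str:
    # Map each key deterministically to one of `fanout` two-character
    # prefixes so S3 can partition the load across them.
    shard = int(hashlib.md5(original_key.encode()).hexdigest(), 16) % fanout
    return f"{shard:02x}/{original_key}"

Each partitioned prefix gets its own request-rate allowance, so spreading writes and deletes across prefixes raises the total rate the bucket can sustain.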
[1] Error retries and exponential backoff in AWS - https://docs.aws.amazon.com/general/lat ... tries.html
I'm working on this with support and we have already changed the S3MultiObjectDeleteLimit registry key to 500.
What I wanted to bring up is the comment from AWS support: "gradually increase the request rate, and retry failed requests using an exponential backoff algorithm". Is this something Veeam is aware of, and will it be implemented at some point?