- edh
- Veeam Legend
- Posts: 420
- Liked: 131 times
- Joined: Nov 02, 2020 2:48 pm
- Full Name: Manuel Rios
- Location: Madrid, Spain
- Contact:
s3.PutObjectRetention millions request per hour.
Hi,
As we moved to S3 storage, we ran into another issue: "s3.PutObjectRetention" requests in the millions per hour.
It looks like a metadata lock refresh for immutability, but these small requests (around 254 B each) saturate the whole configuration.
Is there any way to delay or throttle them? They can mean backups can't be uploaded during these "nightmare hours"...
Our S3 is a distributed MinIO that we tested at 10-13 Gbps with PUTs of normal objects...
Regards
Manuel
			
			
									
						
Service Provider | VMCE
			
- Gostev
- Chief Product Officer
- Posts: 32784
- Liked: 7990 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: s3.PutObjectRetention millions request per hour.
Hi, I recommend you check with MinIO on this.
Only the actual storage can know whether it's overloaded, and it's trivial for the storage to delay/throttle processing of the "offending" requests to address a temporary overload. Alternatively, the object storage should respond with the 503 Slow Down HTTP error code, which makes Veeam engage its exponential backoff algorithm for retries. All of this is industry-standard S3 behavior, and most object storage systems use one of these approaches or a combination of both.
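Purely for illustration, the client side of "503 Slow Down plus exponential backoff" amounts to something like this minimal Python sketch (function and parameter names are made up; it shows the general pattern, not Veeam's actual implementation):
Code: Select all
import random
import time

def put_with_backoff(send_request, max_attempts=8, base_delay=1.0, max_delay=60.0):
    """Retry a request with exponential backoff while the storage answers 503 Slow Down."""
    status = None
    for attempt in range(max_attempts):
        status = send_request()
        if status != 503:
            return status
        # Exponential backoff with jitter: ~1s, 2s, 4s, ... capped at max_delay.
        delay = min(max_delay, base_delay * (2 ** attempt))
        time.sleep(delay + random.uniform(0, delay / 2))
    return status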
Just to be clear, what do you mean by "normal objects"?
			
			
									
						
										
- edh
- Veeam Legend
- Posts: 420
- Liked: 131 times
- Joined: Nov 02, 2020 2:48 pm
- Full Name: Manuel Rios
- Location: Madrid, Spain
- Contact:
Re: s3.PutObjectRetention millions request per hour.
Thanks Gostev.
For us, "normal size" means 1 MB to 4 MB objects, but these PUT requests are 254 bytes each. Hmm, if Veeam handles the retries... maybe I can implement something around Nginx to force a throttle...
			
			
									
						
							For us "normal size" 1MB to 4MB objects
 , but this Put request are for 254bytes. Mmm if Veeam handle the retrys... maybe i can implement something arround Nginx for force a throthle...
, but this Put request are for 254bytes. Mmm if Veeam handle the retrys... maybe i can implement something arround Nginx for force a throthle...Service Provider | VMCE
			
- Gostev
- Chief Product Officer
- Posts: 32784
- Liked: 7990 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: s3.PutObjectRetention millions request per hour.
Got it. This PUT request does not actually put any data; it instructs the object storage to extend immutability on existing objects, so it only needs to pass metadata.
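In S3 API terms, the call carries just the object key plus the new retention settings. With boto3, for example, it would look roughly like this (the endpoint, bucket, key and retention values are made up for illustration):
Code: Select all
from datetime import datetime, timedelta, timezone

import boto3

# Hypothetical endpoint; adjust to your MinIO deployment and credentials setup.
s3 = boto3.client("s3", endpoint_url="https://minio.example.local")

# No object data is re-uploaded: only the retention mode and the new
# retain-until date travel in the request body (a few hundred bytes).
s3.put_object_retention(
    Bucket="veeam-repo",                    # example bucket name
    Key="backups/example.blk",              # example object key
    VersionId="example-version-id",         # on versioned buckets, which version to lock
    Retention={
        "Mode": "COMPLIANCE",
        "RetainUntilDate": datetime.now(timezone.utc) + timedelta(days=14),
    },
)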
I've just asked the devs for the full list of HTTP error codes that trigger the exponential backoff algorithm; maybe it will help with your workaround idea.
			
			
									
						
										
- Mildur
- Product Manager
- Posts: 11023
- Liked: 3026 times
- Joined: May 13, 2017 4:51 pm
- Full Name: Fabian K.
- Location: Switzerland
- Contact:
Re: s3.PutObjectRetention millions request per hour.
Hello
I moved this topic to our object storage subforum.
Best,
Fabian
			
			
									
						
Product Management Analyst @ Veeam Software
			
- Gostev
- Chief Product Officer
- Posts: 32784
- Liked: 7990 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: s3.PutObjectRetention millions request per hour.
The following HTTP error codes trigger the exponential backoff algorithm in Veeam:
			
			
									
						
										
						Code: Select all
Error 400 with one of the following statuses:
Request Timeout, ExpiredToken, TokenRefreshRequired
Error 402 /* Service Quotas Exceeded */
Error 403 with one of the following statuses:
ConnectionLimitExceeded, Throttle, RequestTimeTooSkewed
Error 408 /* Request Timeout */
Error 409 with one of the following statuses:
OperationAborted
Error 429 with one of the following statuses:
SlowDown, CloudKmsQuotaExceeded
Error 500 /* Internal Server Error */
Error 502 /* Bad Gateway */
Error 503 /* Service Unavailable, Slow Down */
Error 504 /* Gateway Time-out */
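Restated as a quick Python sketch (the names are made up and this just transcribes the table above into a lookup; it is not actual product code):
Code: Select all
from typing import Optional

# HTTP codes that trigger backoff regardless of the S3 status text.
BACKOFF_ANY_STATUS = {402, 408, 500, 502, 503, 504}

# HTTP codes that trigger backoff only for specific S3 statuses.
BACKOFF_STATUSES = {
    400: {"Request Timeout", "ExpiredToken", "TokenRefreshRequired"},
    403: {"ConnectionLimitExceeded", "Throttle", "RequestTimeTooSkewed"},
    409: {"OperationAborted"},
    429: {"SlowDown", "CloudKmsQuotaExceeded"},
}

def triggers_backoff(http_code: int, s3_status: Optional[str] = None) -> bool:
    """True if a response with this code/status should be retried with exponential backoff."""
    if http_code in BACKOFF_ANY_STATUS:
        return True
    return s3_status in BACKOFF_STATUSES.get(http_code, set())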
- edh
- Veeam Legend
- Posts: 420
- Liked: 131 times
- Joined: Nov 02, 2020 2:48 pm
- Full Name: Manuel Rios
- Location: Madrid, Spain
- Contact:
Re: s3.PutObjectRetention millions request per hour.
Hi,
Just as an update: after several hours, we're rolling back to immutable block storage. There's no elegant way to accomplish this without moving to QLC SSD drives and spending around €200K on new storage.
Below are the stats of the distributed cluster (+1 PB), fully saturated by small requests. The same cluster without immutable buckets can handle 10 Gbps, tested with Veeam VBO 365:
			
			
									
						
Code: Select all
Duration: 11m20s ▱▱▱
RX Rate:↑ 5.2 GiB/m
TX Rate:↓ 107 MiB/m
RPM    :  5026.2
-------------
Call                      Count          RPM     Avg Time  Min Time  Max Time  Avg TTFB  Max TTFB  Avg Size     Rate /min    Errors
s3.PutObject              25037 (44.0%)  2209.5  33.152s   2.866s    4m36s     33.152s   4m36s     ↑2.4M ↓1B    ↑5.2G ↓3.2K  0
s3.PutObjectRetention     23733 (41.7%)  2094.4  15.131s   3.026s    38.324s   15.131s   38.324s   ↑254B        ↑520K        0
s3.HeadObject             3012 (5.3%)    265.8   111.1ms   952µs     24.522s   111ms     24.522s   ↑121B        ↑31K         0
s3.GetObject              2011 (3.5%)    177.5   725.5ms   543µs     19.232s   691.2ms   19.232s   ↑121B ↓606K  ↑21K ↓105M   0
s3.DeleteObject           1397 (2.5%)    123.3   1.195s    962µs     30.508s   1.195s    30.508s   ↑121B        ↑15K         0
s3.ListObjectVersions     1022 (1.8%)    90.2    4.18s     1.9ms     35.586s   4.18s     35.586s   ↑121B ↓8.4K  ↑11K ↓753K   0
s3.ListObjectsV2          411 (0.7%)     36.3    9.76s     1.9ms     5m5s      9.76s     5m5s      ↑121B ↓38K   ↑4.3K ↓1.3M  0
s3.GetBucketLocation      197 (0.3%)     17.4    799µs     589µs     3.8ms     780µs     3.8ms     ↑121B ↓128B  ↑2.1K ↓2.2K  0
s3.DeleteMultipleObjects  87 (0.2%)      7.7     3.342s    4.1ms     30.318s   3.342s    30.318s   ↑1.6K ↓116B  ↑13K ↓890B   0
s3.HeadBucket             22 (0.0%)      1.9     696µs     607µs     1.2ms     656µs     1.1ms     ↑121B        ↑234B        0
s3.ListObjectsV1          18 (0.0%)      1.6     3.051s    1.202s    10.067s   3.051s    10.067s   ↑121B ↓315B  ↑192B ↓500B  0
s3.ListBuckets            8 (0.0%)       0.7     1.7ms     1.3ms     3ms       1.6ms     3ms       ↑159B ↓3.7K  ↑112B ↓2.6K  0
We tried a workaround with Nginx map and limit_req zone configs returning a 429 error to force throttling on the Veeam agents / VBR servers, but it didn't work.
From our perspective, the main problem is that when a task finishes, it then starts applying the object retention logic to update the metadata of every object in the backup.
These small requests (under 1 KB) saturate I/O at the physical layer and don't let other customers process their s3.PutObject calls to the repository, creating a global bottleneck on the repository.
Just 1 TB at 4 MB blocks is about 250,000 requests, but since we can't set or force the customer block size, it is normally 1 MB, which means about 1 million requests per TB stored in immutable S3. At the proposed scale, 1 PB would be around 1,000,000,000 requests, overloading any available I/O; by that simple math, 1 PB would take about 55 hours to have its metadata fully updated.
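As a quick back-of-the-envelope check of those object counts (assuming one s3.PutObjectRetention call per stored block and binary TB/PB):
Code: Select all
MB = 1024 ** 2  # bytes
TB = 1024 ** 4
PB = 1024 ** 5

print(TB // (4 * MB))  # 1 TB at 4 MB blocks -> 262,144 (~250k retention requests)
print(TB // (1 * MB))  # 1 TB at 1 MB blocks -> 1,048,576 (~1M retention requests)
print(PB // (1 * MB))  # 1 PB at 1 MB blocks -> 1,073,741,824 (~1 billion retention requests)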
Of course this doesn't happen with NVMe options, but NVMe is still not affordable in €/GB, at least for our market.
I checked how VBO handles this, and it looks like it is implemented at the repository level, applying retention at a scheduled time.
We tried this setup to give our customers a more scalable solution and allow them "instant" restore options, but it doesn't work, at least at our small scale.
How could it be improved? I don't know, but since Veeam already creates system and capacity XML files via SOSAPI, maybe instead of using s3.PutObjectRetention it could create a hidden XML file that updates the full object locking in a single PUT.
Or maybe allow SOBR admins to schedule when the PutObjectRetention calls should run; or, as Veeam has a registry key to limit the number of objects deleted per request, add a key to limit the number of s3.PutObjectRetention requests per second on the proxy.
I'm pretty sure I'm not the only sysadmin running into these headaches with S3 storage... now I know why Dell dropped ECS....
Regards,
Manuel
Service Provider | VMCE
			
- Gostev
- Chief Product Officer
- Posts: 32784
- Liked: 7990 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: s3.PutObjectRetention millions request per hour.
This is not about S3 storage in general, just the one you're testing (MinIO). Every object storage has a different architecture, as vendors prioritize different use cases.
Some vendors put a big focus on performance and scale with large numbers of objects, and we even use a couple of such vendors in our performance testing labs.
Ironically, Dell ECS was actually one of the best in the early days. I didn't know it was discontinued.
			
			
									
						
										
- edh
- Veeam Legend
- Posts: 420
- Liked: 131 times
- Joined: Nov 02, 2020 2:48 pm
- Full Name: Manuel Rios
- Location: Madrid, Spain
- Contact:
Re: s3.PutObjectRetention millions request per hour.
Yeah, we tested Dell ECS, but they use PostgreSQL to store metadata; not bad, but scaling up to billions of entries doesn't work. The main problem with MinIO is that they don't use OS buffers, just direct I/O, and they don't support LVM cache or other cache layers.... And their answer is always: fork the project and code whatever you need.
			
			
									
						
Service Provider | VMCE
			
- Gostev
- Chief Product Officer
- Posts: 32784
- Liked: 7990 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: s3.PutObjectRetention millions request per hour.
LOL!! This made my day.
			
			
									
						
										
- karsten123
- Service Provider
- Posts: 654
- Liked: 165 times
- Joined: Apr 03, 2019 6:53 am
- Full Name: Karsten Meja
- Contact:
Re: s3.PutObjectRetention millions request per hour.
What about using RAID 0 per disk to make use of the controller cache?
			
			
									
						
										
- edh
- Veeam Legend
- Posts: 420
- Liked: 131 times
- Joined: Nov 02, 2020 2:48 pm
- Full Name: Manuel Rios
- Location: Madrid, Spain
- Contact:
Re: s3.PutObjectRetention millions request per hour.
They have a warning in their documentation about not using intermediate caches or RAID controllers, typical warnings like Ceph has... that's why we didn't try RAID 0. Our servers have 8 GB of controller cache, which maybe isn't much, but it would absorb those 254-byte requests pretty easily... not today, but maybe some other day we'll test it on some Dell demo servers. For now we've just finished our new deployment to solve this issue, and we're getting 8 GBps of incoming traffic with 0.3 I/O wait... Veeam transport + Linux + XFS + RAID 60 works like a charm.
			
			
									
						
							Service Provider | VMCE
			
- Gostev
- Chief Product Officer
- Posts: 32784
- Liked: 7990 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact: