clintbergman
Service Provider
Posts: 14
Liked: 60 times
Joined: May 27, 2016 6:03 pm
Full Name: Clint Bergman
Contact:

Long delays during SOBR Offload tasks

Post by clintbergman »

Case #04414173

We have been working with support at both Veeam and our object storage provider for a while now, and we seem to be getting closer to solving this issue. Plenty of roadblocks and smaller issues have been resolved along the way in our effort to get to the bottom of it. The case has been escalated to the next tier in Veeam support at this point, but I wanted to post some of the symptoms and get feedback from the community as well.

Brief environment summary:
~70 VMs / 10TB of backup data
ReFS repo that otherwise performs admirably with synthetic fulls, etc.
500Mbps internet connection, throttled to 200Mbps with Veeam's network throttling
Immutability is enabled on the S3 bucket in use

Our SOBR offload tasks have not been completing in an acceptable time frame. There are long delays during processing while no data is being offloaded.
When the offload does send data it runs well and is able to saturate the imposed 200Mbps cap. The majority of the time is spent in ~5-30 minute delays that show up in the log files in a few different areas:

#1 "Mutex owned by someone else. Waiting for release" - 18 minute delay snippet below
[09.11.2020 12:20:21] <38> Info [Mutex] Created mutex [Global\VBRRepositoryArchiveKey[4cbc270c-2446-4e61-abd5-f463a08ca285]]
[09.11.2020 12:20:21] <38> Info [Mutex] Mutex [Global\VBRRepositoryArchiveKey[4cbc270c-2446-4e61-abd5-f463a08ca285]] is owned by someone else. Waiting for release
[09.11.2020 12:38:10] <38> Info [Mutex] Acquired mutex [Global\VBRRepositoryArchiveKey[4cbc270c-2446-4e61-abd5-f463a08ca285]]

#2 Entries related to RepositoryKey, a ~3 minute example here:
[09.11.2020 12:38:10] <38> Info [AP] (89aa) command: 'Invoke: ArchRepo.GetRepositoryKey { (EString) Spec = <S3ConnSpec><BucketAddr BucketName="veeam-srmc-sobr-cap-lock01" AmazonS3RestClientId="d242d27b-4c2e-44ee-b618-0592185ec354" /><Repository FolderName="SOBR Zadara 01" DisplayName="SOBR_Zadara01_CapTier" /><Options UseIAStorageClassForBlockObjects="False" UseOneZoneIAStorageClassForBlockObjects="False" ObjectLockEnabledOnBucket="True" /></S3ConnSpec>; }'
[09.11.2020 12:41:32] <94> Info [AP] (89aa) output: <VCPCommandResult result="true" exception="" />
[09.11.2020 12:41:32] <99> Info [AP] (89aa) output: <VCPCommandArgs><Item key="KeySet" type="EBinaryBlob" value="...SNIP..." /><Item key="KeySetId" type="EString" value="fe5173ae63d10488dfc8cfa68f53d12a" /></VCPCommandArgs>
[09.11.2020 12:41:32] <99> Info [AP] (89aa) output: >

According to the S3 logs, the requests sent to object storage during these times are mostly being responded to within milliseconds, which seems to indicate that Veeam is hung up somewhere along the way. Has anyone experienced issues like this in your offload work, or do you have insights to share?

Many thanks in advance, and I'll be sure to post the eventual solution.
HannesK
Product Manager
Posts: 14314
Liked: 2888 times
Joined: Sep 01, 2014 11:46 am
Full Name: Hannes Kasparick
Location: Austria
Contact:

Re: Long delays during SOBR Offload tasks

Post by HannesK »

Hello,
clintbergman wrote: and our object storage provider
The key information is missing here... I have seen undersized and simply badly performing object storage systems.

What does the support of the storage vendor say?

I mean, 10TB does not sound like much. But I don't know what else the storage is doing in addition to your backups.

Best regards,
Hannes
clintbergman
Service Provider
Posts: 14
Liked: 60 times
Joined: May 27, 2016 6:03 pm
Full Name: Clint Bergman
Contact:

Re: Long delays during SOBR Offload tasks

Post by clintbergman »

Offsite object storage is being provided by Zadara. We had them on the call with Veeam support. We have worked through a few performance issues with them up to this point, but while we were on the line yesterday (and according to their logs), things were performing well on their end. I requested a copy of the logs from their side, but have not yet received them.

That said, something seems off. I have a job that began "Invoke: ArchRepo.ArchiveCleanup" over 13 hours ago and posts a "[11.11.2020 08:05:06] <19> Info [AP] Waiting for completion. Id: [0x352c296], Time: 13:00:00.1916305" line every 30 minutes. According to the performance monitoring I have access to for our object storage, it is only processing 8 delete operations per second, which doesn't seem right. I updated our ticket with them on that last night.

Still waiting to hear back on our Veeam escalation; surely they'll have better insight than I do on next steps.
clintbergman
Service Provider
Posts: 14
Liked: 60 times
Joined: May 27, 2016 6:03 pm
Full Name: Clint Bergman
Contact:

Re: Long delays during SOBR Offload tasks

Post by clintbergman »

Over 48h and no response from Veeam on our escalated case.

Latest from object storage support is that it looks like multi-delete may be overloading their system:

"You are correct, the 503 response code from our Object Storage does have the same meaning as per AWS, 'slow down please' (introduced in our version 20.01-308). The Engineering issue where this feature was introduced is, curiously enough, in handling the multi-deletes from Veeam (default 1000). Veeam has the behaviour of retrying the same set of deletions if a timeout or error is received, and failing the job. Is there any indication on the Veeam side that the 503 responses are being acknowledged and the requests being backed up?"

From an older thread (post371339.html), it looks like there's an "S3MultiObjectDeleteLimit" registry value that may still be adjustable. Would adjusting that down (to, say, 500 from the default of 1000) require restarting any services, or the jobs in question? And how could we tell that it had taken effect?
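
For reference, my assumption (happy to be corrected) is that the value lives under the usual Veeam B&R registry hive on the backup server and would be set with something like the command below; the value name comes from that older thread, and it does not exist by default:

Code:

rem assumption: standard Veeam B&R hive; the value is not present until you create it
reg add "HKLM\SOFTWARE\Veeam\Veeam Backup and Replication" /v S3MultiObjectDeleteLimit /t REG_DWORD /d 500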
Gostev
Chief Product Officer
Posts: 31516
Liked: 6693 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Long delays during SOBR Offload tasks

Post by Gostev »

Normally, the majority of registry values require the Veeam Backup Service to be restarted. There are some exceptions, but since I'm not aware of this specific value, it is safer to assume that a restart is required.

You can only tell that it has taken effect from the debug logs, but I don't have information on which one.

Normally, your support engineer should help you both set it and validate that it has taken effect. But they don't usually start playing with registry values until the issue is confirmed to be related to what the registry value controls.

So far, from your explanation it very much looks like the object storage can't keep up with the S3 API calls, or with the actual delete operations. We actually recommend performing a delete scalability test before committing to any particular vendor, because one particular NoSQL database engine used in some object storage systems can't keep up well with deletes due to an architectural issue with tombstones. This issue is specifically highlighted in the sticky compatibility topic in this forum.
Gostev
Chief Product Officer
Posts: 31516
Liked: 6693 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Long delays during SOBR Offload tasks

Post by Gostev »

P.S. I must note that it is not necessarily a good idea to use this registry value, as reducing it may actually make the situation much worse. The thing is, reducing the number of objects deleted per single API call means the backup server will have to issue more S3 API calls to delete the same number of objects: for example, removing one million objects takes 1,000 multi-delete calls at the default of 1,000 objects per call, but 2,000 calls at 500. However, from what you posted above, it sounds like the very issue this object storage is having is an inability to keep up with the number of S3 API calls.
clintbergman
Service Provider
Posts: 14
Liked: 60 times
Joined: May 27, 2016 6:03 pm
Full Name: Clint Bergman
Contact:

Re: Long delays during SOBR Offload tasks

Post by clintbergman »

Thanks for the info, Gostev. I'll hold off on fiddling with any registry values for now. The issue has been sent to engineering at Zadara, and hopefully we'll hear from our escalation point at Veeam tomorrow as well.

How would someone go about running a delete scalability test? Is that part of the 'Object Ready + Immutability' certification? Along with Hitachi and Cloudian, Zadara was one of the first to be listed (https://www.veeam.com/kb3118) and our pilots went well. But things do change from time to time, I suppose.
Gostev
Chief Product Officer
Posts: 31516
Liked: 6693 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Long delays during SOBR Offload tasks

Post by Gostev »

Veeam Ready validation tests for S3-compatible storage assume on-prem usage by a single customer. They do not involve testing multiple tenants using the same object storage device, which obviously multiplies the load proportionally. Normally, it is up to the object storage provider to determine overall scalability and multi-tenant capacity of their specific object storage setup before deciding how many tenants to subscribe to the given S3 endpoint.
clintbergman
Service Provider
Posts: 14
Liked: 60 times
Joined: May 27, 2016 6:03 pm
Full Name: Clint Bergman
Contact:

Re: Long delays during SOBR Offload tasks

Post by clintbergman »

Makes sense, thanks for the clarification.

I tried searching the forums, and the internet in general, for "delete scalability testing" and haven't found anything concrete. If Veeam recommends performing such testing, are there methods/tools recommended by Veeam to run that validation? Something like Cosbench? Or does that fall too deeply into the "it depends" chasm to be specific about? Having a few tools and methods may prove useful in testing/validating solutions to our issues here.

Appreciate everyone's time on this, thank you.
HannesK
Product Manager
Posts: 14314
Liked: 2888 times
Joined: Sep 01, 2014 11:46 am
Full Name: Hannes Kasparick
Location: Austria
Contact:

Re: Long delays during SOBR Offload tasks

Post by HannesK » 1 person likes this post

The idea is to just throw 10, 20, or 50TB of backups onto the storage and then press "Delete from disk" in the Backups section.

[Image: screenshot of the "Delete from disk" option in the Backups view]
Gostev
Chief Product Officer
Posts: 31516
Liked: 6693 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Long delays during SOBR Offload tasks

Post by Gostev »

Veeam Ready validation testing includes creating a few TB of backups on object storage, and then deleting it.

For a multi-tenant scenario, the same test should be multiplied according to the number of tenants: for example, 10 jobs each creating a few TB of backups on object storage, and then deleting all backups at once to simulate the worst-case scenario. This will test both multiple concurrent uploads and multiple concurrent deletes.
poulpreben
Certified Trainer
Posts: 1024
Liked: 448 times
Joined: Jul 23, 2012 8:16 am
Full Name: Preben Berg
Contact:

Re: Long delays during SOBR Offload tasks

Post by poulpreben » 2 people like this post

clintbergman wrote: Nov 13, 2020 2:43 am I tried searching the forums, and the internet in general, for "delete scalability testing" and haven't found anything concrete. If Veeam recommends performing such testing, are there methods/tools recommended by Veeam to run that validation? Something like Cosbench? Or does that fall too deeply into the "it depends" chasm to be specific about? Having a few tools and methods may prove useful in testing/validating solutions to our issues here.
Personally, I gave up on Cosbench. It's way too complicated.

An easier way to determine delete performance is to use the warp utility by MinIO. Here is the command that we have used to check if different vendors can scale to the number of HTTP DELETE operations that are required:

Code:

docker run --rm --name warp -it minio/warp delete --host <internal IP - preferably without TLS> --access-key <akey> --secret-key <skey> --region default --objects 128000 --obj.size 128k --concurrent 4
We did leave the concurrency low on purpose, but you may increase it as you see fit.
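
If you want to approximate the multi-tenant scenario Gostev describes above, you could also launch several of these benchmarks in parallel against separate buckets, along these lines (just a sketch; the tenant count, bucket names and object counts are arbitrary placeholders, not values we have validated):

Code:

# hypothetical sketch: 10 parallel warp delete benchmarks, one bucket per simulated tenant
for i in $(seq 1 10); do
  docker run --rm -d --name warp-tenant-$i minio/warp delete \
    --host <internal IP - preferably without TLS> --access-key <akey> --secret-key <skey> \
    --bucket warp-tenant-$i --objects 128000 --obj.size 128k --concurrent 4
done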
clintbergman
Service Provider
Posts: 14
Liked: 60 times
Joined: May 27, 2016 6:03 pm
Full Name: Clint Bergman
Contact:

Re: Long delays during SOBR Offload tasks

Post by clintbergman » 1 person likes this post

Thank you for the feedback and suggestions, everyone. Support sent me an "internal tool" this morning to run the delete tests with, and we're working through the process and the output.
clintbergman
Service Provider
Posts: 14
Liked: 60 times
Joined: May 27, 2016 6:03 pm
Full Name: Clint Bergman
Contact:

Re: Long delays during SOBR Offload tasks

Post by clintbergman »

I haven't been able to run the warp benchmarks yet, but I have spent some time running the deletion test tool Veeam support provided. That has left me wondering how one can determine (even roughly) "the number of HTTP DELETE operations that are required" for our use case. I'm really trying to understand what acceptable object storage performance looks like for us, so I have an idea of how hard I need to lean on our capacity tier provider to make changes on their end.

Obviously the answer is "it depends", and the acceptable performance level is going to be somewhat specific to the use case of copy-mode capacity tiering with immutability, the data change rates in the backup chains, how long an offload window one has, etc. But if "generally probably OK" is anywhere from 60-300 seconds per 100,000 deletes, then fine, at least that's something to aim for.
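
As a back-of-the-envelope calculation it would look something like the sketch below (the object count and offload window are made-up numbers for illustration, not figures from our environment):

Code:

# hypothetical example: ~500,000 objects expire per day and must be
# deleted within a 4-hour offload window
OBJECTS=500000
WINDOW=$((4*3600))                                 # 14400 seconds
echo "$((OBJECTS / WINDOW)) deletes/sec required"  # ~34 deletes/sec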

Surely 90 minutes to delete 100,000 objects (roughly 18 deletes/sec, regardless of the reason why) falls outside the bounds of acceptable performance, no?
Especially when I can run that same test against a bone-stock MinIO container backed by a couple of SATA drives and delete those same 100,000 objects in 75 seconds... right? Or am I way, way off base here?
clintbergman
Service Provider
Posts: 14
Liked: 60 times
Joined: May 27, 2016 6:03 pm
Full Name: Clint Bergman
Contact:

Re: Long delays during SOBR Offload tasks

Post by clintbergman » 1 person likes this post

So a few updates here:
1 - I've been informed by Veeam support that the general performance concerns and optimizations "appear to be a known issue that we are researching and addressing." So, we're pretty much on standby while R&D looks into it.

2 - I’ve been running some different scenarios using https://github.com/minio/warp.

Using the specs received from our object storage provider (64 concurrent connections and a 256K object size), I ran through a 100,000-object delete scenario and got the following results, abbreviated to just the averages for brevity. Here's the command I ran:

Code:

sudo docker run --rm --name warp -it minio/warp delete --host <host> --tls --access-key <key> --secret-key <key> --region us-east-1 --objects 100000 --obj.size 256k --concurrent 64
Operation: PUT
* Average: 31.57 MiB/s, 129.32 obj/s
-------------------
Operation: DELETE
* Average: 659.60 obj/s

The whole run took 32 minutes, mostly capped by my 300Mbps internet upload speed going through the PUTs. The delete operation took only 4 minutes of that time, and exceeded the 200 deletes/sec our provider told us to expect "on the low end". For comparison’s sake, the run I did with 4 concurrent connections @ 128K got the following results:

Code:

sudo docker run --rm --name warp -it minio/warp delete --host <host> --tls --access-key <key> --secret-key <key> --region us-east-1 --objects 100000 --obj.size 128k --concurrent 4
Operation: PUT
* Average: 1.07 MiB/s, 8.79 obj/s
-------------------
Operation: DELETE
* Average: 39.48 obj/s

I suppose that leaves me wondering why concurrency makes such a vast difference. I'm definitely still learning quite a bit when it comes to object storage. @poulpreben, you mentioned that you left concurrency low intentionally - why is that? And what warp output with concurrency set to 4 have you considered sufficient for your use cases?