Discussions related to using object storage as a backup target.
velo
Novice
Posts: 5
Liked: 2 times
Joined: Aug 14, 2023 5:12 pm
Full Name: Chris Christensen
Contact:

New Veeam 12 deployment with SOBR offloads to Wasabi timing out

Post by velo » 1 person likes this post

We recently moved to a brand new deployment using Veeam 12 with on-site object storage backed by an Object First appliance and Wasabi for offsite.

Previous Environment

This environment is still around; the jobs are just disabled. We never experienced these types of timeout issues there. All backup-related servers, repos, and proxies sit inside our backup VLAN (same as our new environment).
  • Virtual VBR server running Veeam V11
  • Physical Windows-based proxy/repo which handled direct storage access for our backup and replication jobs
  • Backup copy jobs were configured to move backups offsite.
  • Offsite repo (Windows) is part of a SOBR which uses AWS as its capacity extent
  • SOBR offloads happened after 14 days
  • Internet service is a 1Gbps symmetrical connection. In Veeam we throttle to 300Mbps during business hours and 600Mbps outside of business hours.
New Environment
    • VBR server is a new physical server (1x 16-core CPU, 64 GB RAM) which also acts as our proxy and has dual 10G connectivity into iSCSI (for direct storage access backups) and into our backup VLAN
    • Object first appliance with 64TB raw storage which has dual 10G SFP+ connectivity into our backup VLAN.
    • Primary repo is a SOBR with the Object First appliance as our performance extent and Wasabi as capacity. The SOBR does an immediate copy and also moves workloads after 21 days.
    • We are only using backup jobs now since we do an immediate copy to the capacity tier and we move GFS points off after 21 days.
    • Internet service is a 1Gbps symmetrical connection. In Veeam we throttle to 300Mbps during business hours and 600Mbps outside of business hours.
During testing everything ran fine for the immediate copy. We tested backing up and copying our largest VM (roughly 4 TB), whose VBK file was a little over 2 TB. We also tested a small subset of VMs (4-5) with VBKs ranging from about 20-40 GB each.

Our production workload has 8 multi-VM jobs, ranging from 15 VMs up to 30ish. The total raw size of these ranges from just under 1 TB to 2.5 TB. It also has 2 single-VM jobs (due to special retention requirements) that are on the large side (a little over 4 TB each). The behavior we are seeing is that VM offloads will time out after hitting 99% with the error below.

    Code:

    The HTTP exception: WinHttpQueryDataAvaliable: 12002: The operation timed out, error code: 12002
    
    Exception from server: HTTP exception: WinHttpQueryDataAvaliable: 12002: The operation timed out, error code: 12002
    
    Unable to retrieve next block transmission command. Number of already processed blocks: [1397].
We were also seeing offloads report success but include the message below, which meant they never actually offloaded.

    Code:

    Resource not ready: object storage repository SOBR Capacity Tier
We fixed that by applying the setting from this forum post to our capacity extent:

    object-storage-f52/since-upgrading-to-v ... 85104.html

We do have a Veeam support case open (06234303) and a good amount of troubleshooting has been done thus far. Saturday night we added the below reg keys to our VBR/proxy server and manually re-triggered the offload.

    S3RequestTimeoutSec Value (decimal): 900

    S3MultiObjectDeleteLimit Value (decimal): 200

    S3RequestRetryTotalTimeoutSec Value (decimal): 9000
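For anyone staging the same change, the three values can be captured in a .reg file instead of edited by hand. This is a plain-Python sketch that only renders the file text; the key path is the one quoted later in this thread, and you should confirm names and values with Veeam support before applying anything.

```python
# Sketch: render a .reg file for the timeout-related values suggested by
# Veeam support in this thread. Key path taken from the post further down;
# verify with Veeam support before importing on a production VBR server.

KEY = r"HKEY_LOCAL_MACHINE\SOFTWARE\Veeam\Veeam Backup and Replication"
VALUES = {
    "S3RequestTimeoutSec": 900,
    "S3MultiObjectDeleteLimit": 200,
    "S3RequestRetryTotalTimeoutSec": 9000,
}

def build_reg_file(key: str, values: dict) -> str:
    """Render a Windows .reg file setting each value as a REG_DWORD."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{key}]"]
    for name, val in values.items():
        lines.append(f'"{name}"=dword:{val:08x}')  # .reg stores DWORDs in hex
    return "\n".join(lines) + "\n"

print(build_reg_file(KEY, VALUES))
```

Importing the generated file with regedit (or `reg import`) sets all three values in one step and leaves an auditable artifact of exactly what was changed.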

There was a bit of hope because the first 5 VM offloads actually completed this time, but it has since started failing with the same timeout error above. The reg values seem to have only delayed the error. We've also reached out to Wasabi to make sure things are good on their side, and they let us know that over the last 7 days they've seen 1 PUT error out of the 9 million PUT requests we've made. They asked us to go back to Veeam support for further troubleshooting.

I am wondering if anyone else has experienced anything like this before and what you did to resolve it. Currently the case is in the hands of the advanced support team.
velo

Re: New Veeam 12 deployment with SOBR offloads to Wasabi timing out

Post by velo »

Is this going to be approved so that I can get some help on this?
Andreas Neufert
VP, Product Management
Posts: 6749
Liked: 1408 times
Joined: May 04, 2011 8:36 am
Full Name: Andreas Neufert
Location: Germany
Contact:

Re: New Veeam 12 deployment with SOBR offloads to Wasabi timing out

Post by Andreas Neufert »

Which Wasabi region are you using?
Do you use latest V12 Updates? 12.0.0.1420 P20230718 https://www.veeam.com/kb4420
velo

Re: New Veeam 12 deployment with SOBR offloads to Wasabi timing out

Post by velo »

We are in us-west-1 and we are running the latest Veeam 12 build that came out last month.
matt_778
Enthusiast
Posts: 25
Liked: 2 times
Joined: Feb 08, 2010 9:25 am
Full Name: Matt
Contact:

Re: New Veeam 12 deployment with SOBR offloads to Wasabi timing out

Post by matt_778 » 1 person likes this post

Well written post. FYI, I had issues with Wasabi on Veeam 12 using backup copy jobs (no SOBR). This was resolved by limiting the repo's concurrent tasks to 3. Below is the text from the Wasabi ticket:

We can confirm that all of our servers are hard limited to 250 connections per IP per minute, a limit recently increased from 100 connections.
This is not actually an issue, despite what the Veeam agent suggests on multiple occasions, but a limitation that most storage providers implement in order to provide equal service across all customers.

What is strange is that the Veeam support agent has not suggested any solution to rectify this issue; Veeam is not an app developed by Wasabi, and the Veeam support team should be more familiar with the product than we are.

To resolve this issue, we suggest separating the backup jobs in time so that they are not all initiated at the same moment, but shifted by a margin of one or more hours.

The other solution, previously suggested by Veeam's more experienced agents, involves a registry modification to limit the number of connections.
This may have been implemented as an option in the latest Veeam 12 release, so I would strongly suggest confirming the following with Veeam support.
As an additional note, these limit settings were previously suggested for 100 connections per IP per minute, so the values may differ for the 250-connection case.

Once again, the following might be suggested by the Veeam team, but I would strongly suggest first taking into consideration what Veeam support recommends (it might differ from what is written below, and it might differ in the latest version):
_____________________________________________________________________________________________________________________
HKEY_LOCAL_MACHINE\SOFTWARE\Veeam\Veeam Backup and Replication\
Name: S3ConcurrentTaskLimit
Type: REG_DWORD
Default value: 64
Description: Number of parallel HTTP requests (tasks) for data upload to the S3-compatible object storage (archive tier). Applied either on the VBR server (and then automatically to all extents/gateways), or set specifically on the extent/gateway. The key on the VBR server has higher priority.

The value should be temporarily reduced to a low number like 4. This should rule out the possibility of high loads being the issue.
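To see why a low task limit helps against a 250 connections/IP/minute cap, here is a back-of-envelope model. The assumption that each upload task maps to roughly one connection stream per minute is mine, not Veeam's or Wasabi's; treat the numbers as illustration only.

```python
# Rough sanity check (my own simplified model, not official Veeam/Wasabi math):
# given Wasabi's hard cap of 250 connections per IP per minute, how high can
# S3ConcurrentTaskLimit go before the estimated connection rate hits the cap?

WASABI_CONN_PER_IP_PER_MIN = 250  # hard limit quoted in the ticket above

def max_task_limit(conns_per_task_per_min: float, headroom: float = 0.5) -> int:
    """Highest task limit that keeps estimated connections under the cap,
    with a safety headroom (0.5 = consume only half the budget)."""
    budget = WASABI_CONN_PER_IP_PER_MIN * headroom
    return int(budget // conns_per_task_per_min)

# If each task re-establishes ~2 connections per minute, the default of 64
# tasks fits comfortably; with heavy connection churn (~30/task/min) the
# budget collapses to a handful of tasks, similar to the suggested value of 4.
print(max_task_limit(2))   # → 62
print(max_task_limit(30))  # → 4
```

The point of the exercise: whether 64 parallel tasks is safe depends entirely on how often each task opens a fresh connection, which is why support asks you to test with a deliberately low value first.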
mlemo1080
Lurker
Posts: 1
Liked: never
Joined: Aug 21, 2023 11:46 am
Full Name: Mark Lemons
Contact:

Re: New Veeam 12 deployment with SOBR offloads to Wasabi timing out

Post by mlemo1080 »

Had a very similar issue with copy jobs and this timeout. It turned out to be a Wasabi issue; I worked with support at Wasabi. They made some back-end changes to the DB on the Wasabi side, and that immediately fixed the issue. I had to run quite a few scripts on the AWS CLI side first, but after that did not work they sent the case to the backend team, and it was resolved in 24 to 48 hours.
AlexHeylin
Veeam Legend
Posts: 563
Liked: 173 times
Joined: Nov 15, 2019 4:09 pm
Full Name: Alex Heylin
Contact:

Re: New Veeam 12 deployment with SOBR offloads to Wasabi timing out

Post by AlexHeylin » 1 person likes this post

I've also had timeout issues with Wasabi, and it seems like there's a known config requirement to limit the number of concurrent jobs on the Wasabi repo to 2-3. Any higher may cause problems depending on your workload / architecture etc. IIRC this is not documented, but many support staff seem to know it - or work it out pretty quickly. Hopefully this helps, even if only as a workaround. Better handling and some form of co-operative rate limiting would be much better than this hard limit, which probably means there are times Wasabi could be accepting more work but the hard limit prevents it from being dispatched.
chrisWasabi
Technology Partner
Posts: 22
Liked: 35 times
Joined: Feb 23, 2021 3:42 pm
Contact:

Re: New Veeam 12 deployment with SOBR offloads to Wasabi timing out

Post by chrisWasabi » 3 people like this post

If you hit any source IP / rate limit, you will receive an error code that explicitly says you have exceeded your concurrent connections.

Here is an example.

Code:

Failed to pre-process the job Error: REST API error: 'S3 error: Your account exceeded the limit for s3:PutObjectRetention requests. Please try again later.
Code: RequestRateLimitExceeded', error code: 403
If you are timing out, this is typically due to a saturated network connection or a network provider routing issue (although other things could cause this, like a misconfiguration of versioning). Lowering the number of jobs or decreasing the S3 concurrent connections can alleviate the issue since you are lowering your bandwidth. Reaching out to your network provider is also an option; if there is a routing issue, they may be able to resolve it. If lowering the jobs or decreasing the concurrent connections seems to fix the issue, you must adjust these based on your bandwidth capacity.
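The advice above boils down to "reduce concurrency and tolerate transient timeouts". The second half of that idea can be sketched as a generic retry-with-exponential-backoff wrapper; this is plain Python of my own, not Veeam's internal retry logic, and the flaky upload is simulated.

```python
# Generic retry-with-backoff sketch (illustrative, not Veeam's internals):
# retry a transient failure with exponentially increasing delays.

import time

def with_retries(fn, attempts=5, base_delay=1.0, transient=(TimeoutError,)):
    """Call fn(); on a transient error, sleep base_delay * 2**n and retry.
    Re-raises the error once all attempts are exhausted."""
    for n in range(attempts):
        try:
            return fn()
        except transient:
            if n == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** n)

# Simulated flaky upload: times out twice, then succeeds on the third call.
calls = {"n": 0}
def flaky_put():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("12002: The operation timed out")
    return "ok"

print(with_retries(flaky_put, base_delay=0.01))  # → ok
```

Registry values like S3RequestTimeoutSec and S3RequestRetryTotalTimeoutSec effectively tune the same two knobs: how long each attempt may run and how long the retry loop as a whole may keep going.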

If there are timeouts, this could also mean that the request is not making it to Wasabi to capture any logs of the issue.

If Wasabi has you run AWS CLI scripts and then does something on the back end to clean up DeleteMarkers, you ran a non-supported Veeam configuration by turning on versioning on a non-immutable bucket. This placed DeleteMarkers on anything deleted instead of deleting it, drastically increasing the count of objects in your bucket. Having Wasabi clean up the DeleteMarkers manually on the backend is a workaround for the misconfiguration and puts you into the "unpredictable system behavior may occur" category. The S3 API only allows listing objects 1,000 at a time, so a bucket that should have 13 million objects in it may, through versioning, end up storing 500 million. The listing can take a long time and time out due to the scanning required for an unusually large number of objects.

The first thing to look at would be whether the bucket has Versioning enabled but not Object Lock enabled.
You can confirm this in your Bucket Settings on the Wasabi Console.
If you see a tab called "Object Locking," you are all good. If you have versioning enabled and see a tab called "Compliance," then your bucket is misconfigured.
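The scale problem described above, listing at most 1,000 entries per API call, is easy to quantify. A tiny model (pure Python, no AWS SDK; the object counts are the figures quoted in the post) shows how delete markers multiply the round trips any cleanup or tiering operation has to make:

```python
# Why an over-versioned bucket times out on listings: S3-style version
# listing returns at most 1,000 entries per page, so every extra delete
# marker adds to the number of sequential API round trips required.

PAGE_SIZE = 1000  # S3 ListObjectVersions page limit

def pages_to_list(total_entries: int, page_size: int = PAGE_SIZE) -> int:
    """Number of sequential API calls needed to enumerate every entry."""
    return -(-total_entries // page_size)  # ceiling division

live_objects = 13_000_000     # what the bucket "should" hold
with_markers = 500_000_000    # what versioning caused to be stored

print(pages_to_list(live_objects))  # → 13000 calls
print(pages_to_list(with_markers))  # → 500000 calls, ~38x more round trips
```

Each page is a separate request with its own latency, so going from 13,000 to 500,000 sequential pages turns a minutes-long scan into one that can easily exceed any HTTP timeout.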
tyler.jurgens
Veeam Legend
Posts: 290
Liked: 128 times
Joined: Apr 11, 2023 1:18 pm
Full Name: Tyler Jurgens
Contact:

Re: New Veeam 12 deployment with SOBR offloads to Wasabi timing out

Post by tyler.jurgens » 1 person likes this post

Note: There is also a bug in v11 of Veeam that can cause delete markers to show up. Veeam support can provide a patch (or just upgrade to v12) to stop them from being generated.

There is also a script that can be run to clean these up.

See: post496171.html#p496171
Tyler Jurgens
Veeam Legend x2 | vExpert ** | VMCE | VCP 2020 | Tanzu Vanguard | VUG Canada Leader | VMUG Calgary Leader
Blog: https://explosive.cloud
Twitter: @Tyler_Jurgens BlueSky: @tylerjurgens.bsky.social
velo

Re: New Veeam 12 deployment with SOBR offloads to Wasabi timing out

Post by velo »

Appreciate everyone's replies. We ended up engaging Object First's support along with Veeam and discovered that we were getting timeout errors reading from the Ootbi repo as well as trying to offload to Wasabi. We ended up doing the following, which was suggested on a joint call with Veeam and the Object First team.

We removed all reg keys and set things back to default. We then created a new standalone repo on the Ootbi appliance and moved all of our backup jobs to it. Most of the backups completed without error, but there was an agent job which completed on the initial full and then failed with the timeout error on subsequent incremental backups. We also had a couple of jobs where 1 or 2 VMs would consistently fail with a timeout error. The Object First team helped us deploy a firmware update to all of the Ootbi disks. On the next run, the jobs with 1 or 2 failing VMs started completing. The agent job itself continued to fail until it was rebuilt; the rebuild was done to keep the job running, and logs from the old job were sent to both the Veeam and Object First support teams. Veeam support found that this particular agent job was unable to find 2 metadata objects: it would check a primary location for the metadata, which would time out, then move to the secondary location and time out again. The teams are still investigating.

The Object First team just got back to us with the next set of troubleshooting steps from their R&D team. Currently the physical VBR/proxy, the Ootbi appliance, and the switch stack they land in are all set for jumbo frames. They've asked us to take the MTU down from 9000 to 1500 as a troubleshooting step. This work is currently under way, and I will report back to this post as things progress to keep everyone informed.
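For anyone following along with a similar jumbo-frame suspicion: one quick check (my own suggestion, not something from the support call) is to ping with the don't-fragment bit set and an MTU-sized payload. The helper below just computes the payload size; the printed commands are illustrative and `<ootbi-appliance-ip>` is a placeholder.

```python
# Verifying MTU end-to-end: an IPv4 ping packet is the ICMP payload plus a
# 20-byte IP header and an 8-byte ICMP header. To fill a given MTU exactly,
# the payload must be MTU - 28 bytes. With don't-fragment set, a larger
# payload fails on any hop that cannot carry the full frame.

def ping_payload(mtu: int) -> int:
    """ICMP payload size that yields an mtu-sized IPv4 packet."""
    return mtu - 20 - 8

for mtu in (1500, 9000):
    size = ping_payload(mtu)
    # Windows: ping -f -l <size> <host>    Linux: ping -M do -s <size> <host>
    print(f"MTU {mtu}: ping -f -l {size} <ootbi-appliance-ip>")
```

If the 8972-byte ping fails while the 1472-byte one succeeds, some device in the path is not actually passing jumbo frames, which matches the symptom pattern described above.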

I will say, both support teams have been great. Veeam started out a little slow but has been better engaged since my escalation request. The Object First team has been great as well; the issue started out on their side, and they've taken a lot of ownership of it. I am just hoping we can get things back to normal by the end of this week.
velo

Re: New Veeam 12 deployment with SOBR offloads to Wasabi timing out

Post by velo » 1 person likes this post

Progress

After dropping the MTU from 9000 down to 1500 across all devices, we tested the failing agent job and it processed without issue. We then took our capacity tier out of maintenance mode and triggered a manual tiering; the tiering job has been running overnight without any failures.

The Object First team will be digging in to see what the issue was with jumbo frames. At this point they appear to be the culprit.
AlexHeylin

Re: New Veeam 12 deployment with SOBR offloads to Wasabi timing out

Post by AlexHeylin »

Thanks for coming back with the cause.
