Previous Environment
This environment is still around; the jobs are just disabled. We never experienced these types of timeout issues with it. All backup-related servers, repos and the proxy sit inside our backup VLAN (same as our new environment).
- Virtual VBR server running Veeam V11
- Physical Windows based proxy/repo which handled the direct storage access for our backup and replication jobs
- Backup copy jobs were configured to move backups offsite.
- Offsite repo (Windows) is part of a SOBR which uses AWS as its capacity extent
- SOBR offloads happened after 14 days
- Internet service is a 1Gbps symmetrical connection. In Veeam we throttle to 300Mbps during business hours and 600Mbps outside of business hours.
Current Environment
- VBR server is a new physical server (1x 16-core CPU, 64GB of RAM) which also acts as our proxy and has dual 10G connectivity into iSCSI (for direct storage access backups) and into our backup VLAN
- Object First appliance with 64TB raw storage which has dual 10G SFP+ connectivity into our backup VLAN
- Primary repo is a SOBR with the Object First appliance as our performance extent and Wasabi as capacity. The SOBR does an immediate copy and also moves workloads after 21 days.
- We are only using backup jobs now, since we do an immediate copy to the capacity tier and move GFS points off after 21 days (a simplified sketch of this tiering policy follows the list).
- Internet service is a 1Gbps symmetrical connection. In Veeam we throttle to 300Mbps during business hours and 600Mbps outside of business hours.
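To make the copy/move behavior concrete, here is a simplified sketch (Python, purely illustrative; the 21-day window and the immediate copy come from our config above, and Veeam's actual logic also depends on backup chain state, which this ignores):

```python
from datetime import datetime, timedelta

OPERATIONAL_WINDOW = timedelta(days=21)  # our SOBR move-policy window

def tier_for_restore_point(created: datetime, now: datetime) -> set[str]:
    """Simplified view of where a restore point lives under our SOBR policy.

    Copy: every point is copied to the capacity tier (Wasabi) right away.
    Move: points older than the 21-day window are moved off the performance
    tier (Object First), leaving only the capacity-tier copy.
    NOTE: real Veeam behavior also considers whether the chain is sealed;
    this sketch ignores that.
    """
    tiers = {"capacity (Wasabi)"}                 # immediate copy: always in capacity tier
    if now - created <= OPERATIONAL_WINDOW:
        tiers.add("performance (Object First)")   # still inside the operational window
    return tiers

# Example: a 30-day-old GFS point should only remain in the capacity tier.
print(tier_for_restore_point(datetime(2024, 1, 1), datetime(2024, 1, 31)))
```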
Our production workload has 8 multi-VM jobs, ranging from 15 VMs up to around 30. The total raw size of these ranges from just under 1TB to 2.5TB. It also has 2 single-VM jobs (due to special retention requirements) that are on the large side (a little over 4TB each). The behavior we are seeing is that the jobs' VM offloads will time out after hitting 99% offload with the error below.

The HTTP exception: WinHttpQueryDataAvaliable: 12002: The operation timed out, error code: 12002
Exception from server: HTTP exception: WinHttpQueryDataAvaliable: 12002: The operation timed out, error code: 12002
Unable to retrieve next block transmission command. Number of already processed blocks: [1397].

We were also seeing offloads complete successfully but with the message below, which meant they never actually offloaded.

Resource not ready: object storage repository SOBR Capacity Tier

We solved that once by applying the setting from this forum post to our capacity extent:
object-storage-f52/since-upgrading-to-v ... 85104.html
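For rough context on the data volumes versus the throttle, the back-of-the-envelope numbers work out as follows (Python, illustrative only; real offloads move compressed, deduped increments, so far less than raw size actually crosses the WAN):

```python
# Worst-case transfer time if a job's full raw size had to cross the throttled link.
raw_sizes_tb = [1.0, 2.5, 4.0]  # smallest/largest multi-VM jobs plus the single-VM jobs
for rate_mbps, label in [(300, "business hours"), (600, "off hours")]:
    for size_tb in raw_sizes_tb:
        seconds = (size_tb * 8e6) / rate_mbps   # TB -> megabits, divided by Mbps
        print(f"{size_tb} TB @ {rate_mbps} Mbps ({label}): ~{seconds / 3600:.1f} h")
```

That is the absolute worst case; day-to-day offloads are incremental and much smaller, which is part of why the failures look to us like per-request timeouts rather than a raw throughput problem.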
We do have a Veeam support case open (06234303) and there's been a good amount of troubleshooting done thus far. Saturday night we added the below reg keys to our VBR/proxy server and manually re-triggered the offload (a scripted version of the keys is sketched after the list).
S3RequestTimeoutSec Value (decimal): 900
S3MultiObjectDeleteLimit Value (decimal): 200
S3RequestRetryTotalTimeoutSec Value (decimal): 9000
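For reference, a scripted version of those keys would look roughly like this (Python sketch; it assumes the usual Veeam B&R key location, HKLM\SOFTWARE\Veeam\Veeam Backup and Replication, and DWORD values, and that raising the per-request timeout and total retry window gives slow responses more room to complete; verify the exact path and value types with support before running anything like it):

```python
import winreg

# Assumed location for Veeam B&R tuning keys on the VBR server; values set as DWORDs.
# Run elevated on the VBR server; the job may need to be re-triggered to pick them up.
KEY_PATH = r"SOFTWARE\Veeam\Veeam Backup and Replication"
S3_TUNING_KEYS = {
    "S3RequestTimeoutSec": 900,
    "S3MultiObjectDeleteLimit": 200,
    "S3RequestRetryTotalTimeoutSec": 9000,
}

with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0, winreg.KEY_SET_VALUE) as key:
    for name, value in S3_TUNING_KEYS.items():
        winreg.SetValueEx(key, name, 0, winreg.REG_DWORD, value)
        print(f"set {name} = {value}")
```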
There was a bit of hope because the first 5 VM offloads actually completed this time, but it has since started failing with that same timeout error above. The reg values seem to have only delayed the error. We've also reached out to Wasabi to make sure things are good on their side, and they let us know that over the last 7 days they've seen 1 PUT error out of the 9 million PUT requests we've made. They asked us to go back to Veeam support for further troubleshooting.
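Something that can be checked independently of Veeam is how long PUTs to the Wasabi endpoint take from the VBR server itself, to see whether slow responses show up outside of the product. A rough sketch (boto3; the endpoint URL, bucket name and object size are placeholders, not our actual values, and credentials are assumed to come from the default AWS credential chain):

```python
import time
import boto3
from botocore.config import Config

# Placeholders: substitute the real Wasabi region endpoint and capacity-tier bucket.
ENDPOINT = "https://s3.us-east-1.wasabisys.com"
BUCKET = "example-capacity-tier-bucket"

s3 = boto3.client(
    "s3",
    endpoint_url=ENDPOINT,
    config=Config(connect_timeout=30, read_timeout=900, retries={"max_attempts": 3}),
)

payload = b"x" * (8 * 1024 * 1024)  # 8 MiB test object
for i in range(5):
    start = time.monotonic()
    s3.put_object(Bucket=BUCKET, Key=f"latency-test/obj-{i}", Body=payload)
    print(f"PUT {i}: {time.monotonic() - start:.2f}s")
```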
I am wondering if anyone else has experienced anything like this before and what you did to resolve it. Currently the case is in the hands of the advanced support team.