Discussions related to using object storage as a backup target.
premeau
Influencer
Posts: 18
Liked: 4 times
Joined: Aug 26, 2016 4:30 pm
Full Name: SPremeau
Contact:

Odd Capacity Tier Offload Failures

Post by premeau »

(Case #07560142 has been opened on this issue, but seeking some crowdsourced ideas.)

I did a round of updates to my Veeam server (and network infrastructure) during the week after Thanksgiving. The primary upgrades were a newer "server" (to be within current hardware recommendations) and the installation of V12.3.0.310.

Since then, I have been receiving the following error at a variable time after the offload job starts:

"Error: HTTP exception: WinHttpReceiveResponse: 12002: The operation timed out, error code: 12002 Shared memory connection was closed."

Once the error has appeared on any thread, the entire job appears to be doomed for failure, and (much) worse, subsequent jobs appear to get stuck immediately until the server is restarted.
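
One way to gather evidence for a timeout like this, independent of the offload job, is to log plain TCP connect times from the repository server to the object storage endpoint. The following is only a minimal Python sketch: `probe_tcp`, `log_probes`, and the hostname `s3.example.invalid` are hypothetical placeholders (substitute the endpoint configured for the bucket), and a successful TCP connect does not prove that the S3 requests themselves would succeed.

```python
import socket
import time


def probe_tcp(host, port=443, timeout=5.0, connect=socket.create_connection):
    """Attempt one TCP connection; return (ok, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError as exc:  # covers timeouts, refused connections, DNS failures
        return False, time.monotonic() - start, repr(exc)
    conn.close()
    return True, time.monotonic() - start, ""


def log_probes(host, count=10, interval=30.0, probe=probe_tcp, sleep=time.sleep):
    """Run `count` probes, `interval` seconds apart, and return the results."""
    results = []
    for i in range(count):
        ok, elapsed, error = probe(host)
        print(f"{time.strftime('%H:%M:%S')} ok={ok} {elapsed:.3f}s {error}")
        results.append((ok, elapsed, error))
        if i < count - 1:
            sleep(interval)
    return results
```

Calling `log_probes("s3.example.invalid")` with the real endpoint while an offload is running would show whether connect failures or slow connects line up with the moment the job reports error 12002.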

The troubleshooting process with Support has done the following:
  • Adjusted various concurrency and timeout settings (presumably to avoid throttling, which Backblaze support says is not occurring)
  • Confirmed that my original destination bucket had the incorrect B2 retention setting (it was created before B2 offered Object Lock, but the same issue occurs when offloading to a brand-new bucket created following the current setup guide(s))
  • Had me chase down potentially related issues in my environment (e.g. network adapter power-save settings, firewall logs, etc.), which did reveal a few problems, but none with a clear impact on the issue.
While this is a SOHO environment, so there is a chance I could be overwhelming something and causing an intermittent issue, the fact that offloads (only) from this repository server (which is also the main backup server) stay wedged until the server is restarted seems to point to a bug of some sort. (And the failure to recover cleanly after whatever the triggering issue is, is definitely an issue.)

The site has 300 Mbps symmetric bandwidth (via AT&T fiber), so bandwidth should not be an issue.

(SOBR offloads from a repository server at another (VPN connected) site managed by the same main server when offloaded to a different bucket at the same Backblaze S3 endpoint have not been impacted.)

Another (potentially unrelated) issue I am seeing is the need to perform several SOBR repository rescans after a reboot before both the locally attached performance-tier disk and the backups are available.

Is this an issue that anyone else has experienced?
david.domask
Veeam Software
Posts: 2652
Liked: 615 times
Joined: Jun 28, 2016 12:12 pm
Contact:

Re: Odd Capacity Tier Offload Failures

Post by david.domask »

Hi premeau,

Thank you for sharing the case number and details, and sorry to hear about the challenges.

No known issues at the moment match the described behavior, so please continue working with Support on the matter.

> Another (potentially unrelated) issue I am seeing is the need to perform several SOBR repository rescans after the reboot process for both the locally attached performance-tier disk and backups to be available after the reboot.

Can you elaborate on this a bit more? Does the rescan fail or produce a warning?

Similarly, how is the VBR server specced CPU/memory-wise? The behavior you're describing, with rescans needing multiple tries and other operations failing until a reboot, has me thinking in the direction of VBR server resources, though this is just a hunch. I would quickly review the System Requirements and check the VBR server resources: since it's both the VBR server and a repository, the requirements are cumulative across all roles the server has, and that's an easy one to overlook.
David Domask | Product Management: Principal Analyst

Re: Odd Capacity Tier Offload Failures

Post by premeau »

> Can you elaborate on this a bit more? Does the rescan fail or produce a warning?

The rescan process never fails.

When I open the console and go to "Backup Infrastructure" the repository is shown as unavailable.
After the first (successful) rescan, everything is back online, but if I attempt an SOBR Offload, there will be 0 backups to offload because all of the backup files are unavailable.
After the second (successful) rescan, the backups will be back, but the offload will fail because the "capacity extent containing the required backup files is offline".
Sometimes that requires another rescan, sometimes just retrying the offload again is all that's required.

It's almost as if the various components are not "keeping up" with the status of the other components.

> Similarly, how is the VBR server specced CPU/Memory-wise?

It's a small 4-core (older generation i7) CPU with 32GB of RAM. Most processes are limited to 2 workers, so I believe I "just fit". That said, Windows Performance Monitor does not show any hardware pressure, with "low" CPU usage and >50% memory free.

Re: Odd Capacity Tier Offload Failures

Post by premeau »

To update the thread ... (un)fortunately, the issue evaporated during troubleshooting, so there may never be a full explanation of the issue(s).

That said, after the issue evaporated, I created new buckets in Backblaze's newly opened CA-East region (which is half the network distance from both of my backup repository sites) to try to put extra stress on the system again, and saw a vast increase in offload speeds.
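
A speed-up from a closer region is consistent with the general rule that a single TCP stream cannot move more than one window of data per round trip, so halving the round-trip time roughly doubles the per-stream ceiling. The sketch below only illustrates that arithmetic: the function name `max_stream_mbps`, the 64 KiB window, and the 40 ms and 20 ms round-trip times are assumptions for illustration, not values measured in this thread, and real offloads use multiple concurrent connections and window scaling.

```python
def max_stream_mbps(window_bytes, rtt_ms):
    """Upper bound on one TCP stream's throughput: one window per round trip."""
    bits_per_round_trip = window_bytes * 8
    round_trips_per_second = 1000 / rtt_ms
    return bits_per_round_trip * round_trips_per_second / 1_000_000


# Assumed values: 64 KiB window, farther region at 40 ms, closer region at 20 ms.
far = max_stream_mbps(window_bytes=65536, rtt_ms=40)
near = max_stream_mbps(window_bytes=65536, rtt_ms=20)
print(f"far: {far:.1f} Mbit/s, near: {near:.1f} Mbit/s, ratio: {near / far:.1f}x")
```

With these assumed numbers the ceiling moves from roughly 13 Mbit/s to roughly 26 Mbit/s per stream, which is why latency, and not just the 300 Mbps line rate, can dominate offload speed.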

Re: Odd Capacity Tier Offload Failures

Post by david.domask »

Hi premeau,

Thank you for the update, and glad to hear that it resolved. Given that it was timeouts, I half-wonder if it was a temporary issue on the Backblaze side that they ended up resolving. The WinHTTP 12002 timeout appears first in the error, and I suppose that's why our data mover agent aborted, as the connection was no longer possible.

But glad to hear that it's up and running again, and good to hear that the closer region helped. For posterity, do you mind sharing statistics on the speeds before/after?

- Approximately how much data in total was sent to Backblaze during the tests
- Speeds with other regions
- Speeds with CA-East

Napkin math is fine here, just curious how much improvement we're talking :)
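
For anyone wanting to do the same napkin math, the calculation is just data sent divided by elapsed time. A minimal Python sketch, where `average_mbps`, `speedup`, and the 100 GB / 120 minute / 60 minute inputs are made-up placeholders and not figures from this thread:

```python
def average_mbps(gigabytes_sent, minutes_elapsed):
    """Average throughput in Mbit/s, using decimal units (1 GB = 8000 Mbit)."""
    return gigabytes_sent * 8000 / (minutes_elapsed * 60)


def speedup(before_mbps, after_mbps):
    """How many times faster the second run was than the first."""
    return after_mbps / before_mbps


# Placeholder inputs only; replace with the job statistics from each region.
before = average_mbps(gigabytes_sent=100, minutes_elapsed=120)
after = average_mbps(gigabytes_sent=100, minutes_elapsed=60)
print(f"before: {before:.0f} Mbit/s, after: {after:.0f} Mbit/s, "
      f"speedup: {speedup(before, after):.1f}x")
```

Comparing the result against the 300 Mbps line rate mentioned earlier in the thread gives a quick sense of how much of the link each region was actually able to use.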
David Domask | Product Management: Principal Analyst