What's the Veeam Way: Confirm SOBR offload either in progress or completed

Post by **AlexHeylin** » Mar 29, 2023 1:02 pm

Currently we're monitoring for SOBR upload errors logged to the Windows Event log. This doesn't work well because errors are "expected", so it's hard to tell "expected" errors from "I'm broken and you need to fix me" errors. For example, we had 22 offload failures on one of our SOBRs in the last 24 hours. That's more than we usually see, but it doesn't tell us whether human intervention is required.

What's "the Veeam way" for a monitoring system to confirm SOBR offload has completed, or is still in progress, or has real errors which need manual intervention to resolve?
We've got VSPC if that helps.

Thanks

Post by **Gostev** » Mar 29, 2023 1:27 pm

The daily SOBR status email report provides a good summary.

@Egor Yakovlev please also check if some of those Windows Event log events should really be warnings, or not logged at all. Temporary connection issues might be better not mentioned at all, unless of course they are already logged only after lots of fighting and retries? It is just that 22 failures would indicate we're too spammy, unless there were actual major backup infrastructure or Internet access or object storage issues during those 24 hours.

Please include @veremin in review to understand what can be optimized.

Post by **Egor Yakovlev** » Mar 29, 2023 2:12 pm

Sounds good, queued for investigation.
/Cheers!

Post by **veremin** » Mar 29, 2023 2:15 pm

Sure, we will have a call with Egor this week to review the current situation with offloading errors, warnings and reporting. Thanks!

Post by **AlexHeylin** » Mar 29, 2023 2:47 pm

Wow - what a response guys - thanks!!

While this thread isn't specifically about this case, you might find helpful background in Case #05930792 and Case #04800922. We don't normally open cases for this, but it's suboptimal to live with it as we have been.

Gostev wrote: ↑Mar 29, 2023 1:27 pm The daily SOBR status email report provides a good summary.

That looks like a good place to start for us to use as "OK / go look at it" indication - thanks.

If you want me to open a case with some logs, including Windows event logs, let me know.

Thanks

Alex

Post by **sykerzner** » Apr 03, 2023 1:26 pm

Hi

This may seem like overkill, but here is something we cobbled together based on similar conversations and suggestions on the forums.

"%SystemRoot%\system32\WindowsPowerShell\v1.0\powershell.exe" -noprofile -command "import-module Veeam.Backup.PowerShell; $sobrOffload = [Veeam.Backup.Model.EDbJobType]::ArchiveBackup; $sessions =[Veeam.Backup.Core.CBackupSession]::GetByTypeAndTimeInterval($sobrOffload,'9/1/2022', (Get-Date).adddays(1)) ; $taskgroups = $sessions.gettasksessions() | where {($_.progress.TransferedSize -gt '0') -and($_.status -eq 'Success')} |group-object -property Name; $lastSuccessTasks = foreach ($Task in $Taskgroups) {$task.group | sort -property {$_.progress.stoptimelocal} | select -last 1 -Property JobName, Name, Status, @{l='EndTime';e={$_.progress.StopTimeLocal}}, @{l='Duration'; e={$_.progress.duration}}, @{l='TransferedSize (GB)'; e={$_.progress.TransferedSize/1GB}} }; 'Task Count:' ; ($lastsuccessTasks | measure-object).count ; $lastsuccessTasks| sort jobname | convertto-csv"

This gives you information for each offload "task" (one for each backup "Job"): name, ID, and the last time the job succeeded and actually sent data. The task name will sometimes change to "name of the SOBR Offload" depending on whether the task was independent or not, but the ID stays the same (I don't know of a great way to deal with that).

Forgive the ugly look. The simplest way to run it via our RMM daily was as a one-line CMD.
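
For anyone feeding that output into a monitoring system, here is a minimal sketch in Python of how the CSV could be checked for tasks that have not had a successful, data-carrying offload recently. It is not part of the original script. The column names `JobName`, `Name` and `EndTime` come from the `select` in the one-liner. The function name `stale_tasks`, the 24-hour default and the US-style `time_format` are assumptions; the real date format depends on the culture settings of the backup server.

```python
import csv
import io
from datetime import datetime, timedelta


def stale_tasks(report_text, now, max_age=timedelta(hours=24),
                time_format="%m/%d/%Y %I:%M:%S %p"):
    """Return (JobName, Name, EndTime) for every task whose last successful
    offload is older than max_age.

    time_format is an assumption: adjust it to match how the backup server
    renders StopTimeLocal.
    """
    lines = report_text.splitlines()
    # Skip the 'Task Count:' preamble and any '#TYPE' line that ConvertTo-Csv
    # may emit; the CSV proper starts at the header row naming JobName.
    start = next((i for i, line in enumerate(lines) if "JobName" in line), None)
    if start is None:
        raise ValueError("no CSV header found in report output")

    stale = []
    for row in csv.DictReader(io.StringIO("\n".join(lines[start:]))):
        end_time = datetime.strptime(row["EndTime"], time_format)
        if now - end_time > max_age:
            stale.append((row["JobName"], row["Name"], end_time))
    return stale
```

A non-empty result would be the "go look at it" signal; an empty list means every task has offloaded data within the window.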

Post by **AlexHeylin** » Apr 03, 2023 1:30 pm

Thanks very much sykerzner! I'll give that a go - certainly a great place to start

Post by **veremin** » Apr 03, 2023 3:14 pm

Hey, Alex, we discussed the issue further, and in order to change the behavior or suggest something further we'd like to get the exact failure that got logged 22 times. This should help us to re-verify the logic behind the particular event. Thanks!

Post by **AlexHeylin** » Apr 04, 2023 12:31 pm

Hi,
I've uploaded both the Veeam logs and Veeam Backup windows event log (which is what we've been looking at) to Case #05930792
Thanks!

Post by **veremin** » Apr 04, 2023 1:49 pm

Thanks for the reference, we will review the provided information and post back. Thanks!

Apr 07, 2023 12:12 pm

We've contacted your support engineer recently.

Next week we will review the event logs and see whether the given error is logged with the necessary priority (error instead of a warning) and the necessary number of times. This will help us to understand if there is room for improvement.

I will update the topic once I have more information.

Thanks!

Post by **AlexHeylin** » Apr 18, 2023 9:22 am

Just to share today's alerts from our monitoring, based on the event logs:
13 SOBR offload failures on OUR-SP-SOBR1 in last 24 hours. Offsite backups may be incomplete! Most recent 2023-04-18 07:43:27
2 SOBR offload failures on TENANT1-SOBR1 in last 24 hours. Offsite backups may be incomplete! Most recent 2023-04-17 22:39:21
2 SOBR offload failures on TENANT2-SOBR1 in last 24 hours. Offsite backups may be incomplete! Most recent 2023-04-17 09:01:52
2 SOBR offload failures on TENANT3-SOBR1 in last 24 hours. Offsite backups may be incomplete! Most recent 2023-04-18 04:59:46

We're looking to move over to the VSPC alerts, though integrating those into our systems / process is rather challenging.

Post by **AlexHeylin** » Apr 19, 2023 10:59 am

It looks like there are various contributors to these message counts:

Object storage cleanup failed: Timed out waiting for the backup files to be released, cancelling the job

This warning seems to be due to an internal Veeam design / scaling issue.

Resource not ready: object storage repository S3-EXT-NAME for SOBR-NAME Timed out waiting for backup infrastructure resources to become available (14400 sec)

This seems to be due to VBR trying to run too many offload jobs at the same time. In this case there appear to have been three instances of "SOBR Offload" plus an "SOBR-NAME Offload" for each SOBR running simultaneously (five total). Due to the required (but undocumented) concurrent task limit on the S3 repo (to avoid rate limit errors from S3 vendor) this pushes the bottleneck back to the object storage repository S3-EXT-NAME being "unavailable". At least, that's my interpretation.

18/04/2023 23:59:01 :: Removing checkpoint d4d19b02-323f-41cd-81fe-1bd7b354b1a2 from Capacity Tier...
19/04/2023 01:20:25 :: Checkpoint cleanup failed Details: HTTP exception: WinHttpQueryDataAvaliable: 12002: The operation timed out, error code: 12002

REST API error: 'S3 error: We encountered an internal error. Please retry the operation again later. Code: InternalError', error code: 500 Other: Detail: 'Could not find pool number 2269 in extent B-643390/O-f5db5c2dbe717ca6/S-1',

18/04/2023 23:40:30 :: Checkpoint cleanup failed Details: HTTP exception: WinHttpQueryDataAvaliable: 12002: The operation timed out, error code: 12002
18/04/2023 23:40:32 :: Object storage cleanup failed: HTTP exception: WinHttpQueryDataAvaliable: 12002: The operation timed out, error code: 12002
Shared memory connection was closed.
18/04/2023 23:40:32 :: Object storage cleanup failed: HTTP exception: WinHttpQueryDataAvaliable: 12002: The operation timed out, error code: 12002
Exception from server: HTTP exception: WinHttpQueryDataAvaliable: 12002: The operation timed out, error code: 12002
18/04/2023 23:40:47 :: Offload finished with warning at 18/04/2023 23:40:47

And other related transient errors from S3. Up to a point, VBR should just accept these as normal, retry, and only report them as errors if they fail repeatedly.
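
Until something like that exists in the product, the same idea can be approximated on the monitoring side. The sketch below is Python and purely illustrative: it assumes the monitoring system can already reduce each offload attempt to a `(repository, succeeded)` pair in time order, and the function name and the threshold of 3 are made up for the example. It raises a repository only when its latest run of consecutive failures reaches the threshold, so a failure followed by a successful retry stays quiet.

```python
from collections import defaultdict


def repositories_needing_attention(events, threshold=3):
    """events: iterable of (repository, succeeded) tuples in time order.

    Returns the sorted names of repositories whose most recent run of
    consecutive failures has reached the threshold.
    """
    streak = defaultdict(int)
    for repository, succeeded in events:
        if succeeded:
            # A later success clears earlier transient failures.
            streak[repository] = 0
        else:
            streak[repository] += 1
    return sorted(name for name, count in streak.items() if count >= threshold)
```

Compared with a raw "N failures in 24 hours" count, this trades sensitivity for signal: a repository that fails and recovers ten times a day never alerts, which may or may not be what you want.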

18/04/2023 07:43:27 :: Failed to offload backup. Error: Failed to call RPC function 'FcRenameFile': The process cannot access the file because it is being used by another process. Failed to rename file from [D:\Veeam\Backups\xxxxxxxxxxxxxxxxxxxxxxx\xxxxxxxxxxxxxxxxxxxxxx\xxxxxxxx.vbm.temp] to [D:\Veeam\Backups\xxxxxxxxxxxxxxxxxxxxxxx\xxxxxxxxxxxxxxxxxxxxxx\xxxxxxxx.vbm].
File 'D:\Veeam\Backups\xxxxxxxxxxxxxxxxxxxxxxx\xxxxxxxxxxxxxxxxxxxxxx\xxxxxxxx.vbm.temp' locked by 0 processes:.
File 'D:\Veeam\Backups\xxxxxxxxxxxxxxxxxxxxxxx\xxxxxxxxxxxxxxxxxxxxxx\xxxxxxxx.vbm' locked by 0 processes:.
18/04/2023 07:43:27 :: Failed to upload meta into master agent.

We're plagued by this occasional error and can't find the cause. All AV exclusions are in place, set as aggressively as possible and applied to both the file paths and the file names. Windows Defender is uninstalled.
We find the "locked by 0 processes" very suspicious. Does that mean it's locked by zero Veeam processes, or zero processes in total (in which case the whole message could be wrong, as it means the file is NOT locked open as it says)?

Post by **pirx** » Apr 20, 2023 10:45 am

This is v11, a few common errors/warnings:

Error (not sure if this has to be an error as the blackout period was set on purpose)

20.04.2023 06:19:18 :: Processing xxxxx Error: Job was stopped due to backup window setting

Error (should this really be an error?)

08.04.2023 23:14:55 :: Processing xxxx Error: Stopped by job 'xxxx' (Backup)

Warning

19.04.2023 22:27:46 :: Object storage cleanup failed: Failed to retrieve certificate from https://s3.dualstack.ap-southeast-1.amazonaws.com

Error (very common across all our different locations with buckets in different regions, not sure why the above is a warning and this is an error)

19.04.2023 17:00:44 :: Processing xxxxx Error: Failed to retrieve certificate from https://s3.dualstack.ap-southeast-1.amazonaws.com

Error (random but very common, I guess it has to be an error, but as this happens only randomly we usually ignore it)

09.04.2023 05:00:35 :: Processing xxxxx Error: HTTP exception: WinHttpSendRequest: 12030: The connection with the server was terminated abnormally, error code: 12030

Warning (happens a lot, we tweaked some settings in the past but without 100% solution, so we just ignore it)

08.04.2023 19:45:31 :: Object storage cleanup failed: REST API error: 'S3 error: Please reduce your request rate.
Code: SlowDown', error code: 503
Other: HostId: 'xxxxxx

Post by **AlexHeylin** » Apr 20, 2023 10:48 am

I agree that "Job was stopped due to backup window setting" should not be an error. It's an indication that the system is working as designed / configured.
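
In the meantime, a monitoring system can apply that judgement itself. Here is a rough Python sketch that sorts messages into "by design", "transient" and "actionable" buckets. The patterns are copied from messages quoted in this thread; the grouping into categories is my own assumption and not Veeam guidance, and anything unmatched is returned as "unknown" so a human still looks at it.

```python
import re

# Patterns come from log lines quoted in this thread. The category assigned
# to each one is an assumption for illustration, not official guidance.
RULES = [
    ("by_design", re.compile(r"Job was stopped due to backup window setting")),
    ("transient", re.compile(r"error code: (12002|12030)\b")),
    ("transient", re.compile(r"Code: (SlowDown|InternalError)")),
    ("actionable", re.compile(r"scale-out backup repository rescan is required")),
]


def classify(message):
    """Return the category of the first matching rule, or 'unknown'."""
    for category, pattern in RULES:
        if pattern.search(message):
            return category
    return "unknown"
```

Keeping the rules in a plain list makes it easy to add a pattern each time a new "expected" message turns up, without touching the matching logic.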

Post by **AlexHeylin** » May 04, 2023 12:12 pm

Still causing drama several days a week...

Processing 0c31728d-5c3c-46fa-925f-9edbe89621b7 Error: Timed out waiting for backup infrastructure resources to become available (14400 sec)

These seem to be routine, and due to other offload jobs running. Bear in mind we've been told by support to limit concurrent jobs on the S3 repo to 2 to deal with another message, very likely this one:

REST API error: 'S3 error: Please reduce your request rate.

This whole design of "run LOADS of offload jobs, often at the same time, have them ignore that other jobs are already running, then log errors when they timeout" just seems "highly suboptimal".

Error: Backup file version mismatch: scale-out backup repository rescan is required.

Oh!!!

Given this was in sync previously, and nothing other than VBR has touched either the performance or capacity tiers, it is "very disappointing" that this seems to keep happening, long after the upgrade to v12 was supposed to improve all this.

If a rescan really is required - why doesn't VBR queue one up and suspend all the offload jobs (which will likely fail anyway) until it's completed?

Post by **AlexHeylin** » May 04, 2023 1:54 pm

New Case #06045964

Post by **AlexHeylin** » May 04, 2023 2:07 pm

The rescan has spat out a load of warnings like

Failed to import backup Backup Copy xxxxxxxxx\yyyyyyy - zzzzzz Details: The existing index has a different backup id

These are presumably because the SP side is sulking about a tenant having built a new backup server and remapped the new backups to the old chain, having upgraded from "per-machine data single metadata" to "per-machine data per-machine metadata".

SPs need a system that works and is more reliable and less needy than this!

Thanks

Alex
