Hello,
This is currently open under Case #05496393, still at L1, so we are getting nowhere yet.
We are running into a strange situation in VBR for a customer we are managing. We do daily health checks on every backup we produce. We migrated the VBR server to a dedicated physical server; we are targeting a local repo (DAS), two immutable storages, one Cloud Connect repo, and a simple copy job to an onsite NAS (as it has no other purpose). The issue we are seeing is that every second job takes really long, and sometimes the difference is huge. For example, this is a primary job:
Fast 24/6, slow 23/6, fast 22/6, slow 21/6 (job duration screenshots).
You might say that this is not a huge issue, but it also affects the Cloud Connect repo: 45 minutes vs 12 hours.
The immutable Linux repos behave the same way; to give an example from one of them: mostly 50 minutes vs 14 hours.
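For anyone who wants to reproduce the pattern on their own jobs, a minimal Python sketch along these lines pulls per-day session durations from the VBR REST API. The endpoint paths, api-version header and field names are assumptions based on recent VBR releases; the server name, credentials and job name are placeholders:

```python
# Minimal sketch: pull recent session durations for one job from the VBR REST API
# to see the fast/slow alternation in one list. Endpoint paths, the api-version
# header and field names are assumptions based on recent VBR releases; server name,
# credentials and job name are placeholders.
import requests
from datetime import datetime

VBR = "https://vbr-server:9419"           # placeholder backup server
HEADERS = {"x-api-version": "1.1-rev0"}   # adjust to your VBR release

def parse_ts(ts):
    # Trim to whole seconds so fromisoformat copes with any fractional precision.
    return datetime.fromisoformat(ts[:19])

def get_token(user, password):
    r = requests.post(f"{VBR}/api/oauth2/token", headers=HEADERS, verify=False,
                      data={"grant_type": "password", "username": user, "password": password})
    r.raise_for_status()
    return r.json()["access_token"]

def session_durations(token, job_name):
    h = dict(HEADERS, Authorization=f"Bearer {token}")
    r = requests.get(f"{VBR}/api/v1/sessions", headers=h, verify=False,
                     params={"nameFilter": job_name, "limit": 30})
    r.raise_for_status()
    for s in r.json().get("data", []):
        if s.get("endTime") and s.get("creationTime"):
            start, end = parse_ts(s["creationTime"]), parse_ts(s["endTime"])
            yield start, (end - start).total_seconds() / 3600

if __name__ == "__main__":
    token = get_token("DOMAIN\\backup_ro", "***")
    for start, hours in sorted(session_durations(token, "Primary VM Backup")):
        print(f"{start:%Y-%m-%d %H:%M}  {hours:5.1f} h")
```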
Support is blaming the storage, but there is clearly more to it, as the pattern is always the same. If we skip one day of backups, the next one continues the trend. Reboots don't change a thing, the CC repo is at about 10% utilization, and the DAS storage keeps up fine on both the long and the short checks. We tried disabling key rotation in BEM. We tried rebooting the servers before the backups. The jobs are timed so that each one finishes with plenty of time to prepare the infrastructure for the next job (and the next repo). We also built the infrastructure not to rely on any gateways, so the repos have a "brain of their own" and do not use the network of the backup server/proxy (hence we are using Linux repos, direct-attached storage and Cloud Connect).
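Since support keeps pointing at the storage, one way to answer that with data is to watch read throughput on one of the Linux repos while a slow health check is running. A minimal sketch, assuming the repo volumes sit on devices like sda/sdb (adjust the device names to your environment):

```python
# Minimal sketch: sample /proc/diskstats on the Linux repo during a health check
# and print read throughput, to see whether the repo is actually busy or sitting idle.
# Device names are environment-specific assumptions. Stop with Ctrl+C.
import time

DEVICES = ("sda", "sdb")      # placeholder: the XFS reflink volumes backing the repo
SECTOR_BYTES = 512            # /proc/diskstats counts 512-byte sectors
INTERVAL = 10                 # seconds between samples

def read_sectors():
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            name, sectors_read = fields[2], int(fields[5])
            if name in DEVICES:
                stats[name] = sectors_read
    return stats

prev = read_sectors()
while True:
    time.sleep(INTERVAL)
    cur = read_sectors()
    for dev in DEVICES:
        mb_s = (cur[dev] - prev[dev]) * SECTOR_BYTES / INTERVAL / 1024 / 1024
        print(f"{time.strftime('%H:%M:%S')}  {dev}: {mb_s:7.1f} MB/s read")
    prev = cur
```

If the devices show near-zero reads for most of a 12-14 hour check, the time is being spent somewhere other than the repo storage.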
This only happens on VM backups (Hyper-V); the Windows Agent health checks (backup server itself + hypervisor backups) complete within 2 minutes after the backups finish. There is also replication throughout the day (6 am to 8 pm; the VM backups themselves start one hour later, so the replication won't affect them), Oracle RMAN backups run after the backups are taken (they are not that large), and SQL log backups run until the main job truncates the SQL logs (only one primary job does this). SureBackup is scheduled to run during the day, after the health checks (the long ones) are finished. File copy jobs are scheduled to run before the backups. So everything is timed well so as not to stress the host, network, proxy or the repos; there are no overlapping jobs that could affect this.

Filesystems are ReFS with 64 KB blocks and the i flag on the DAS, XFS 4K with reflink on both Linux repos; the NAS is targeted over NFS (but is not a primary concern), and VCSP is using ReFS as well. Replication metadata is targeted to NVMe storage and replicated to a server with a ReFS filesystem (but this is not the theme of this topic and there is no problem with the replicas). The Veeam server has 4 NICs connected to Cisco switches via LACP on the management and production networks; it can saturate full gigabit speeds and we have no problem with that. The management agent for VCSP is installed and we can access the VBR server from the VCSP console. Veeam ONE is installed on the same server, latest version; the SQL database is MSSQL 2019 Standard with LPIM enabled, and Veeam BEM is installed. VBO is also installed, targeting the DAS storage, but it is stopped during the backup and replication windows (increments are about 50-150 MB every 30 minutes). The DAS has a RAID card configured as RAID 10 with caching enabled. HPE server with ECC memory and Windows Server 2019 Std.
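For completeness, this is roughly how we sanity-check that nothing overlaps the VM backup window. The windows below are placeholders, not our real schedule:

```python
# Minimal sketch: sanity-check that no job window overlaps the VM backup window.
# The times below are placeholders, not the real schedule.
from datetime import time

windows = {
    "Replication":  (time(6, 0),  time(20, 0)),
    "File copy":    (time(19, 0), time(20, 30)),
    "VM backups":   (time(21, 0), time(23, 0)),
    "Oracle RMAN":  (time(23, 30), time(23, 59)),
}

def overlaps(a, b):
    # Assumes same-day windows (no wrap past midnight).
    return a[0] < b[1] and b[0] < a[1]

backup = windows["VM backups"]
for name, win in windows.items():
    if name != "VM backups" and overlaps(win, backup):
        print(f"WARNING: {name} overlaps the VM backup window")
```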
Any suggestions as to what might be happening with our VM backups?
Re: Long health checks on every second day
Just to add one note: the long backups finish after midnight, while the short ones always finish on the same day. Maybe that adds something to the puzzle (perhaps some internal mechanism in Veeam gets triggered and that's why it takes so long).
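A quick way to confirm that correlation from the session history (reusing the durations from the REST sketch above, or times copied from the console) is to flag which runs cross midnight. The start/end times below are placeholders, not the real session data:

```python
# Minimal sketch: check whether the slow runs are exactly the ones crossing midnight.
# The sample sessions below are placeholders; feed in real start/end times from the
# console or from the REST sketch above.
from datetime import datetime

sessions = [
    ("2021-06-21 21:00", "2021-06-22 11:05"),   # slow run
    ("2021-06-22 21:00", "2021-06-22 21:48"),   # fast run
    ("2021-06-23 21:00", "2021-06-24 10:52"),   # slow run
    ("2021-06-24 21:00", "2021-06-24 21:45"),   # fast run
]

for start_s, end_s in sessions:
    start, end = datetime.fromisoformat(start_s), datetime.fromisoformat(end_s)
    hours = (end - start).total_seconds() / 3600
    crosses = end.date() > start.date()
    print(f"{start.date()}  {hours:5.1f} h  crosses midnight: {crosses}")
```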