Host-based backup of VMware vSphere VMs.
ashleyw
Service Provider
Posts: 181
Liked: 30 times
Joined: Oct 28, 2010 10:55 pm
Full Name: Ashley Watson
Contact:

overlapping jobs caused by long duration of active fulls.

Post by ashleyw »

Hi, we are backing up around 30-40 TB of data.
The processing rate is roughly 1 TB per hour when active full backups are in full swing, and we have active fulls set to run on the first Friday of every month. (We are exploring different strategies to increase throughput, but our test time is limited without destroying existing data sets.) We are using 4 MB block sizes and other tuning to reduce the RAM overhead of the Veeam de-dupe tables, to prevent memory errors like the one below:

Code: Select all

Out of memory: Killed process <x> (veeamagent) total-vm: <a>kB, anon-rss: <b>kB, file-rss: <c>kB, shmem-rss: <d>kB, UID:0 pgtables: <e>kB oom_score_adj:-100 
Some of our largest jobs have between 8 and 15 TB of workload, and our total workload is around 450 VMs split over 8 jobs.

Sometimes, depending on the load on our storage array, the active full backups overlap the daily schedules, so some of the daily incrementals end up queuing, then slotting in during wait states and interrupting the jobs in the full backup cycle.

How do we deal with this scenario, other than abandoning the Veeam schedules and triggering the daily incrementals, monthly active fulls and weekly synthetic fulls via PowerShell from Windows Task Scheduler, while reducing the number of concurrent tasks hitting the repository? Or alternatively, disabling the daily jobs at the start of the active full cycle and re-enabling them at the end of the cycle/chain?
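
For reference, that second option is roughly what we had in mind. A minimal sketch, assuming the Veeam Backup & Replication PowerShell module is available on the backup server; the "Daily-*" pattern and the full-backup job names are placeholders, not our real job names:

Code: Select all

# Sketch only: pause the dailies, run the monthly active fulls one at a time,
# then re-enable the dailies. Job names are placeholders.
Import-Module Veeam.Backup.PowerShell

# 1. Disable the daily jobs so they cannot slot in during the active full cycle.
$dailyJobs = Get-VBRJob | Where-Object { $_.Name -like "Daily-*" }
foreach ($job in $dailyJobs) { Disable-VBRJob -Job $job }

# 2. Run the monthly active fulls sequentially, so the repository only ever
#    sees one full-backup workload at a time.
foreach ($name in @("Job-01", "Job-02", "Job-03")) {
    $job = Get-VBRJob -Name $name
    Start-VBRJob -Job $job -FullBackup   # waits for the job to finish before continuing
}

# 3. Re-enable the dailies once the whole active full chain is done.
foreach ($job in $dailyJobs) { Enable-VBRJob -Job $job }

This would be triggered from Windows Task Scheduler on the first Friday of the month instead of the built-in monthly schedule.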

Also, in terms of the threading model, our throughput is obviously far higher during the incremental and synthetic full stages (we are running XFS layered on top of a ZFS zpool to get the advantages of COW plus the reliability of ZFS).
Is there any chance specific task types could be weighted differently, with that weighting feeding into the scheduling algorithms?
e.g.
an incremental task has trivial workload impact and a weighting factor of 1.
an active full backup task has high workload impact and a weighting factor of 4.
and the repository could have a configurable load factor of, say, 16 - so in essence the number of tasks hitting the repository would be related to the types of workloads in flight (e.g. 16 in-flight incrementals, or 4 active fulls, or a combination of both up to the load factor limit).
So on a standard incremental run-through there would be a higher number of parallel tasks, but when running active fulls a lower number of concurrent tasks would run, to prevent the repository from being overloaded.
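
To make the idea concrete, a rough illustration of the accounting (the weights and load factor here are just the example numbers above, not anything Veeam exposes today):

Code: Select all

# Illustration only - hypothetical weights and load factor from the example above.
$loadFactor = 16
$weights    = @{ Incremental = 1; ActiveFull = 4 }

# A new task would only be admitted if its weight still fits under the load factor.
function Test-CanStartTask {
    param([string]$Type, [int]$CurrentLoad)
    return (($CurrentLoad + $weights[$Type]) -le $loadFactor)
}

# 16 in-flight incrementals fit (16 x 1), or 4 active fulls (4 x 4),
# or a mix such as 2 active fulls + 8 incrementals (2x4 + 8x1 = 16).
Test-CanStartTask -Type ActiveFull -CurrentLoad 12   # True  (12 + 4 = 16)
Test-CanStartTask -Type ActiveFull -CurrentLoad 13   # False (13 + 4 = 17)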

Any thoughts would be appreciated -thanks.
karsten123
Service Provider
Posts: 370
Liked: 82 times
Joined: Apr 03, 2019 6:53 am
Full Name: Karsten Meja
Contact:

Re: overlapping jobs caused by long duration of active fulls.

Post by karsten123 »

What is the requirement for active fulls? Go with synthetic fulls. Done. :)
ashleyw
Service Provider
Posts: 181
Liked: 30 times
Joined: Oct 28, 2010 10:55 pm
Full Name: Ashley Watson
Contact:

Re: overlapping jobs caused by long duration of active fulls.

Post by ashleyw »

Thanks. The requirement for active fulls is to make sure the chains are usable and that there is no corruption in the chains or on the underlying storage appliance.
How does everyone else deal with these types of problems with workload sizes in the tens of TB?
We were heavily burned years ago on ReFS, which we had adopted to get the COW benefits for synthetics etc., and that forced us onto XFS, so I'd never want to assume that backup chains are fine just because the file system says so.
karsten123
Service Provider
Posts: 370
Liked: 82 times
Joined: Apr 03, 2019 6:53 am
Full Name: Karsten Meja
Contact:

Re: overlapping jobs caused by long duration of active fulls.

Post by karsten123 » 1 person likes this post

As Anton would say: an enterprise-grade RAID controller and health checks from within Veeam.
SureBackup is a thing, too.
Never had problems with proper hardware underneath.
ashleyw
Service Provider
Posts: 181
Liked: 30 times
Joined: Oct 28, 2010 10:55 pm
Full Name: Ashley Watson
Contact:

Re: overlapping jobs caused by long duration of active fulls.

Post by ashleyw »

Thanks. After many hours of testing I have resolved the issue.
We now have consistent throughput of over 1 GB per second on active fulls with our storage server - a Rocky 8 repository that was previously using XFS layered onto a ZFS zvol, but is now running a 22-disk RAID 10: XFS on top of an mdadm-provisioned RAID 10 md0 device.

In case it helps anyone else:
In our case the underlying problem was that we had to disable NCQ, along these lines:

Code: Select all

# set the queue depth to 1 on each data drive, effectively disabling NCQ
for drive in sd{b..x}; do
    echo 1 > /sys/block/$drive/device/queue_depth
done

Once we had disabled NCQ (Native Command Queuing), the throughput took off and we saw a 5-10x improvement, which means our entire active full workload can now complete in under 24 hours, and we have no issues on our cheap-as-chips Dell server (which is actually a VMware host with the repository and Linux proxies as VMs, but with the repository using PCI passthrough of the HBA330 controller so that the Linux repository can see the drives natively).

I've been trying to isolate this specific issue for a long time, so I'm happy I had the time to persist with the investigations.

We have always preferred JBOD controllers as they are cheap and mean we aren't dependent on specific vendors or hardware. JBOD controllers and software RAID have been the basis for ZFS and mdadm for years now, so I understand that some people may prefer hardware-based controllers, but I don't think they are a requirement when the rest of the stack is typically software-defined these days.

Thanks again.