Host-based backup of VMware vSphere VMs.
colohost
Service Provider
Posts: 35
Liked: 3 times
Joined: Jan 14, 2019 10:09 pm
Full Name: Colo Host
Contact:

Improve throughput for single large VMs, with high latency?

Post by colohost »

Hi all, I'm not sure yet where my actual issue is: latency, packet loss, HPE Catalyst acceleration, a combination of all of them, etc.

I'm trying to perform backups from VB&R 12 across a 10 Gb point-to-point circuit to a separate data center, with an HPE StoreOnce device as the target on the remote end. It's set to mandate fixed-block chunking, i.e. optimized for Veeam 12. I don't have the capacity on the local end to back up first and then replicate. I'm intentionally using a Windows 11 proxy on the source side because that version of Windows allows BBR for TCP congestion control, which gives far greater throughput in the face of latency and small amounts of packet loss (<1%). With iperf between this system and the remote end, multi-gig throughput is no problem. Latency is about 55 ms.
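
For reference, here's the rough bandwidth-delay math I used to convince myself the pipe itself isn't the problem. This is just my own back-of-the-envelope Python with my numbers, nothing from any vendor doc:

    # Bandwidth-delay product for a 10 Gbps link at 55 ms RTT.
    # Shows how much data has to be "in flight" for a single TCP
    # stream to fill the pipe at this latency.
    link_gbps = 10          # point-to-point circuit speed
    rtt_ms = 55             # measured round-trip time

    bdp_bytes = (link_gbps * 1e9 / 8) * (rtt_ms / 1000)
    print(f"BDP: {bdp_bytes / 1e6:.1f} MB in flight to fill the link")
    # -> ~68.8 MB; with a small or default TCP window a single stream
    # stalls long before that, which is why BBR plus window scaling
    # (or many parallel streams, as iperf shows) is needed to see
    # multi-gig throughput here.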

If I run a big backup job across it, it works pretty well while there are numerous VMs being backed up. I achieve a cumulative throughput of what I think is likely the StoreOnce's receive limit, in the ~300 MB/sec (2-3 Gbps) range of real throughput when an active full is running; even more if you factor in the data reduction, etc.

Where I'm not seeing good performance is once most of the VMs are done and there are just a few unusually large 3-5 TB VMs left. Those run quite slowly, maybe 30-35 MB/sec, which means the backup job takes half a day. The same VMs, when they're all that's left and headed to a local StoreOnce unit on the source side, tend to run in the 250 MB/sec range, so the overall job is not impacted by much.

I have my job set to exclude swap, exclude deleted blocks, compression Optimal (recommended), and storage optimization 512 KB to achieve better dedupe ratios and smaller backups.

Since I have the bandwidth, albeit with latency, I'm wondering whether I should instead put the burden on the StoreOnce by switching to dedupe-friendly compression and 4 MB chunks, so there are 8x fewer StoreOnce Catalyst transactions?
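
Back-of-the-envelope on the transaction count, just to show where the 8x comes from (my own rough numbers, assuming one Catalyst transaction per block before compression/dedupe):

    # Compare write transactions per TB of source data for
    # 512 KB vs 4 MB Veeam block sizes (illustrative only).
    tb = 1e12
    for block in (512 * 1024, 4 * 1024 * 1024):
        print(f"{block // 1024:>5} KB blocks -> {tb / block:,.0f} transactions per TB")
    #   512 KB -> ~1.9 million transactions per TB
    #  4096 KB -> ~0.24 million transactions per TB (8x fewer)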

Figured I'd ask first before playing with settings since each trial run requires another 12+ hours.

Thanks!
HannesK
Product Manager
Posts: 14322
Liked: 2890 times
Joined: Sep 01, 2014 11:46 am
Full Name: Hannes Kasparick
Location: Austria
Contact:

Re: Improve throughput for single large VMs, with high latency?

Post by HannesK »

Hello,
in general, it's expected that many machines in parallel perform better than one machine alone. A combination of all of the above sounds like a good explanation to me.
"compression Optimal (recommended), and storage optimization 512 KB to achieve better dedupe ratios and smaller backups"
Those are not the recommended settings for StoreOnce; the software suggests different values by default. Yes, the backups (the size you read) are smaller with a 512 KB block size, but dedupe appliances absorb the overhead of large blocks (4 MB, as recommended) because they deduplicate.

Best regards,
Hannes
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Improve throughput for single large VMs, with high latency?

Post by Gostev »

@HannesK btw, is it just me being blind, or does this User Guide article not explain the compression level? I only see block size and backup chain recommendations... while the absence of compression is arguably the most important factor in achieving a decent dedupe ratio :)
colohost
Service Provider
Posts: 35
Liked: 3 times
Joined: Jan 14, 2019 10:09 pm
Full Name: Colo Host
Contact:

Re: Improve throughput for single large VMs, with high latency?

Post by colohost »

Thanks. I hadn't heard back from support on this either, so I figured I may as well tinker rather than sit around. I changed the block size to 4 MB, left compression on Optimal, and it did improve performance of an active full considerably with the data already existing on the target and the StoreOnce still set to the default low-bandwidth Catalyst mode preference; very little went across the wire. However, I deleted those images and began a new active full, and even with these settings, here I am again with it crawling along at 33 MB/sec per VMDK once all the small VMs finished. While the small ones were running, I actually hit better than 1000 MB/sec effective throughput with the new settings, so the change helped quite a bit with the heavy parallel streams.

There seems to be something artificially limiting the single-VMDK backup speed when crossing this 55 ms 10 Gb link, which does not occur against the same model appliance, with the same settings, local to the data center. I have to imagine it's simply overhead from the Catalyst protocol: perhaps there's a process of allocating a block, sending a 4 MB chunk, waiting for acknowledgement, and repeating, and the wait states involved, combined with the 55 ms latency, cap the raw data throughput (the part that can't be accelerated by the data already existing on the target) at around 33 MB/sec.
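
If I model it as one or two synchronous round trips per 4 MB chunk, the numbers line up suspiciously well. Again, this is just my own toy math, not anything from the Catalyst spec:

    # Toy model: if each 4 MB chunk must be acknowledged before the next
    # one is sent, throughput is capped at chunk_size / (round_trips * RTT).
    # This is NOT documented Catalyst behavior; it only shows the shape
    # of a latency-bound transfer.
    chunk_mb = 4
    rtt_s = 0.055
    for round_trips in (1, 2):
        print(f"{round_trips} RTT(s)/chunk -> {chunk_mb / (round_trips * rtt_s):.0f} MB/sec ceiling")
    # 1 RTT/chunk  -> ~73 MB/sec
    # 2 RTTs/chunk -> ~36 MB/sec, which is right around the 33 MB/sec I'm seeing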

If that is the issue, perhaps I should try setting it to Extreme compression and take a penalty in proxy and StoreOnce CPU usage, in the hope that an appreciably larger amount of data fits into each 4 MB chunk.
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Improve throughput for single large VMs, with high latency?

Post by Gostev »

Compression is usually bad for deduplicating storage though, as it cannot dedupe compressed data well. So by going in this direction, you essentially just got yourself a slow and yet expensive backup storage with zero benefits over a general-purpose server (which is much faster and cheaper). To get good benefits from the "dedupe" part, compression should normally be disabled.
colohost
Service Provider
Posts: 35
Liked: 3 times
Joined: Jan 14, 2019 10:09 pm
Full Name: Colo Host
Contact:

Re: Improve throughput for single large VMs, with high latency?

Post by colohost »

Should I switch my jobs to None for compression in that case? Veeam doesn't suggest that change to me, even though it does suggest other changes when it considers settings suboptimal.

Could that possibly get me over this 33 MB/sec per-VMDK throughput issue?
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Improve throughput for single large VMs, with high latency?

Post by Gostev »

It's storage-specific indeed, and I don't know the StoreOnce specifics all too well. Just sharing general considerations here: I don't see how Extreme compression at the source can improve anything at all when it comes to a deduplicating storage appliance.

The 33 MB/sec per-VMDK throughput issue is extremely unlikely to have anything to do with the backup storage, though. It's just too slow for any on-prem storage at all, so the root cause is most likely elsewhere in the infrastructure.
HannesK
Product Manager
Posts: 14322
Liked: 2890 times
Joined: Sep 01, 2014 11:46 am
Full Name: Hannes Kasparick
Location: Austria
Contact:

Re: Improve throughput for single large VMs, with high latency?

Post by HannesK »

I quoted too much... block size is the setting I meant. That should be 4 MB.

Compression is overridden by the "decompress backup file data blocks before storing" option in the repository settings (if the default / recommended settings are used).

Switching between high and low bandwidth mode sometimes helps. I don't know the impact on Catalyst of the packet loss that was mentioned above.

Checking with HPE support outside of Veeam could also be an option. I remember deduplication appliance vendors having tools to test performance; then one can see the theoretical maximum speed vs. what Veeam achieves.