mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am
Contact:

V11 unbuffered access -> restores faster, but...

Post by mkretzer »

We just upgraded to V11. In V10, the unbuffered access for tape backups led to an immense performance boost (500 MB/s for two streams to LTO8).

We expected a similar performance boost with full restores. We did a restore test of a VM with multiple disks. At first, the restore looked much better (~700 MB/s), so unbuffered access really seems to help (restore to all-flash). What's strange is that after the first disks finished restoring, the remaining disk restores did not speed up.

We get ~135 MB/s (better than the ~80 MB/s in V10) - yet writing the same backups to tape with unbuffered access in V10 is about twice as fast...
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: V11 unbuffered access -> restores faster, but...

Post by Gostev »

Sounds like you're talking about tape restores, so I'm moving this to the Tape sub-forum.

You should have support investigate the performance counters to find out what the issue is with the following disks.
Please include the support case ID so that we could follow the investigation too.
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V11 unbuffered access -> restores faster, but...

Post by mkretzer »

Hello,

No! I'm talking about why, in V11, a normal restore from disk is not as fast as tape backup was in V10 (both read unbuffered from the Veeam repo)...

Markus
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: V11 unbuffered access -> restores faster, but...

Post by Gostev »

OK, moving it back then!

In general, "unbuffered access" is something that plays with write I/O only (in other words, when backup files are created). While during restore, there are no writes to buffer (or not to buffer) in principle. So you really lost me there for a moment :) but I see now that you probably wanted to say "async I/O" (this one in turn plays only with read I/O).

Nevertheless, regardless of terminology, the issue you're observing is clear. Please have support investigate performance metrics during the restore of the following disks, and it should be easy to see where the bottleneck is. Maybe the processes that restore the following disks are not initialized correctly for async I/O. Although that's a bit unlikely, as all disks are stored in the same backup file and it is only opened for read once at the beginning of the process. But who knows!
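
As an aside, for anyone unfamiliar with the terms: on Linux, "unbuffered" roughly corresponds to opening the target file with O_DIRECT (the Windows analogue is FILE_FLAG_NO_BUFFERING), so the page cache is bypassed and the storage stack sees each write as it is issued. The sketch below only illustrates that OS-level mechanism; it is Linux-only, the file names are made up, and it is not how the Veeam data mover is actually implemented.

    import mmap
    import os

    BLOCK = 1024 * 1024  # 1 MiB; O_DIRECT needs sizes aligned to the device sector size

    # O_DIRECT also needs an aligned buffer; anonymous mmap memory is page-aligned.
    buf = mmap.mmap(-1, BLOCK)
    buf.write(b"\xab" * BLOCK)

    # Buffered write: data lands in the page cache first and is flushed later.
    with open("buffered.bin", "wb") as f:
        f.write(bytes(BLOCK))

    # Unbuffered write: bypasses the page cache, so the storage controller sees the I/O directly
    # (requires a filesystem that supports O_DIRECT).
    fd = os.open("unbuffered.bin", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
    try:
        os.write(fd, buf)
    finally:
        os.close(fd)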
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V11 unbuffered access -> restores faster, but...

Post by mkretzer »

Yes... That's what I wanted to say... So before I open the case: the I/O mode/performance should be similar for full restores and for tape backups from the repo to tape in V11, correct?

Case 04813360
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: V11 unbuffered access -> restores faster, but...

Post by Gostev »

I would not be so sure, as it's totally separate logic: one is highly optimized for reading from LTO tape devices, which are practically identical (most tape drives out there come from just a couple of vendors); while the other is highly optimized to perform equally well across a very large variety of disk-based storage, whose performance can differ by a factor of 100 or more (plus a few different protocols on top of that).

I would rather focus on investigating the very specific issue of the following disks restoring much slower than the first one. This does not seem right, and the reason should be quite easy for us to understand from extended performance debug logs.
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V11 unbuffered access -> restores faster, but...

Post by mkretzer »

No, I think you misunderstood. All disks restored at a similar rate. That's what I find strange: restoring 1 disk should be faster than restoring 5 disks (per stream!) if the source and target storage are the same.
What I mean is: if 5 streams do 500 MB/s and one stream does 100 MB/s, something is not optimized.
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: V11 unbuffered access -> restores faster, but...

Post by Gostev »

Ah... no, this is perfectly normal and is just how enterprise storage works: it scales with multiple I/O threads, and you can never saturate it with a single thread.
I'm not too strong on the theory here, but I'm sure @tsightler can give us a quick explanation when he gets a moment.
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V11 unbuffered access -> restores faster, but...

Post by mkretzer »

Yes, I know that - but why does it work better for single-stream backup from disk to tape? It should be the same now in V11 (at least that was my hope after we talked about this "feature" in the past!).
Multi-stream was never an issue (not even in version 9 or so).
tsightler
VP, Product Management
Posts: 6013
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: V11 unbuffered access -> restores faster, but...

Post by tsightler »

Hey @mkretzer, can you tell me a little more about the restore process? What transport mode are you using? What disk provisioning (thin, thick lazy/eager zeroed)? The problem could easily be on the writing side as much as the reading side, as VMFS has some "interesting" semantics, and I'm actually not sure if we're doing any async writes on the restore side (or even if VMDK supports them). However, 135 MB/s seems pretty slow for a single disk restore to all-flash to me, so I'd be interested in some more details. I'll try to dig them out of the case when I get a chance, but it might be easier if you happen to see this and post here.
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V11 unbuffered access -> restores faster, but...

Post by mkretzer »

Hi tsightler,

Hotadd, once with an old Windows hotadd proxy, once with a fresh Linux proxy, optimized following your recommendations.

Lazy zeroed. I had the same theory that VMFS is the problem here, as I saw 6 ms latency on the disk but only 450 µs on the storage side. Could be VAAI zeroing taking its toll (which is sometimes not included in storage-side statistics)... Will try a restore to a different storage model as soon as I have the chance.
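
Just to spell out what that gap implies (rough arithmetic, using only the two numbers observed above):

    # Difference between what VMFS sees and what the array reports for the same writes.
    vmfs_latency_ms = 6.0     # observed at the virtual disk
    array_latency_ms = 0.45   # observed on the storage side
    overhead_ms = vmfs_latency_ms - array_latency_ms
    print(f"~{overhead_ms:.1f} ms per request spent outside the array")  # ~5.5 ms

If that ~5.5 ms per request is zeroing/metadata work in the VMFS layer, it would never show up in the array's own statistics.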

Logs are still being created; it takes well over an hour for just one day.

Markus
tsightler
VP, Product Management
Posts: 6013
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: V11 unbuffered access -> restores faster, but...

Post by tsightler »

I just did a quick smoke test restore in my lab of a 75GB VM with a single disk. It's a system with a daily 20% change rate that is 90% full and doesn't compress very well, so I like to use it for quick throughput tests. It gets really fragmented, as I've only run weekly synthetic fulls (block clone on XFS) for over a year; with a very high and random change rate, you can imagine how fragmented these files are. With hotadd, the restore managed 443 MB/s with a Linux proxy (Ubuntu 20.04) and 440 MB/s with the Windows proxy, although interestingly, the Windows proxy was more inconsistent and had a higher max throughput of just over 500 MB/s. All throughput numbers were verified via network traffic graphs.

My lab hardware is modest compared to the setup you have, so seeing only 135 MB/s seems pretty slow, and I'd think VAAI zeroing is the most likely culprit.
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V11 unbuffered access -> restores faster, but...

Post by mkretzer »

Yeah - getting > 400 MB/s for a single VM disk restore from the repo over the network to the Veeam server's NTFS, while a bunch of copies, backups and synthetics are running on the repo. Bottleneck statistics for restores in the GUI would be nice!
sherzig
Veeam Software
Posts: 206
Liked: 46 times
Joined: Dec 05, 2018 2:44 pm
Contact:

Re: V11 unbuffered access -> restores faster, but...

Post by sherzig »

I remember that when restoring into a datastore configured on a Hitachi LDEV, the array didn't like thin provisioned volumes. I advised Markus (I remember he asked something about Hitachi) to start an Entire VM Restore using Eager Zeroed Thick as the disk format, to see if it speeds up the restore.
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V11 unbuffered access -> restores faster, but...

Post by mkretzer » 1 person likes this post

With Eager Zeroed we get nearly 3 times the restore performance! I'll test with our old DataCore system next.

Support also confirmed "So the problem does not seem to be with reading the data, but writing it, where Async I/O is not used and the operation depends entirely on VMware side."

Could async I/O be used on the VMware side in theory? I guess it would not help if zeroing is really the issue...
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: V11 unbuffered access -> restores faster, but...

Post by Gostev »

As I've already said above, in V11 the async I/O feature plays only with read I/O (it means issuing read requests in advance for the blocks we know we will need in the future, even before the immediately required blocks are retrieved). The "VMware side" during restores, on the other hand, does write I/O.

Optimal writing remains the RAID controller's job: it groups all outstanding writes in its memory and executes them in the most efficient order to reduce the number of I/O operations required. The client's system cache only gets in the way here and makes things worse when you have smart, enterprise-grade RAID controllers in the storage device, which is why V11 uses unbuffered access.
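
To illustrate the read-ahead idea in code (purely a sketch of the concept, not the actual data mover; the block size, queue depth and the consume() helper are all made up): instead of waiting for block N to arrive before asking for block N+1, several read requests are kept in flight so their latencies overlap.

    import concurrent.futures
    import os

    BLOCK = 1024 * 1024   # assumed 1 MiB read size
    QUEUE_DEPTH = 8       # how many reads to keep in flight

    def read_block(path, index):
        # Each worker uses its own descriptor so pread offsets don't interfere.
        fd = os.open(path, os.O_RDONLY)
        try:
            return os.pread(fd, BLOCK, index * BLOCK)
        finally:
            os.close(fd)

    def consume(data):
        pass  # placeholder: hand the block to the write/target side

    def restore_stream(path):
        size = os.path.getsize(path)
        blocks = range((size + BLOCK - 1) // BLOCK)
        # The pool keeps up to QUEUE_DEPTH reads outstanding; map() yields results in order.
        # (For brevity this submits every block up front; a real mover would bound the window.)
        with concurrent.futures.ThreadPoolExecutor(max_workers=QUEUE_DEPTH) as pool:
            for data in pool.map(lambda i: read_block(path, i), blocks):
                consume(data)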
tsightler
VP, Product Management
Posts: 6013
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: V11 unbuffered access -> restores faster, but...

Post by tsightler » 4 people like this post

I wouldn't completely discount the advantages of async I/O for writes. While it may at first seem like there would be limited benefit, there can be cases where async writes offer a performance advantage (for decades, databases like Oracle have used unbuffered I/O in combination with async reads/writes to maximize performance), and I've often wondered if this is one of them. Basically, writes still suffer from the same issue as reads (i.e. request latency limits maximum throughput); it's typically not as bad for writes, but moving to unbuffered writes may make this particular aspect even more important.

The primary advantage of async I/O is the ability to have multiple outstanding requests in flight, thus keeping the storage more fully utilized. For example, if I perform reads from a server with 48 drives in a RAID 60 array, the latency for any given read request is likely to be 5-10 ms, because that's the request latency of a single disk; however, assuming a 256K stripe size, a single request for a 512K block would only utilize 2 drives, leaving the other 46 drives doing nothing for those 5-10 ms. When using synchronous I/O with a single stream, that's exactly what is happening; however, if I issue 24 requests, I can potentially get all 24 done in that same 5-10 ms. Obviously that assumes a perfect world where each of those 24 blocks is on a different set of 2 disks, and that perfection would never happen in the real world, but in the end some of the blocks will certainly be spread across other disks, so the benefits of async I/O are great for reads, and they get greater as the number of disks increases.
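
To put rough numbers on that example (all values are assumptions taken from the description above):

    # Latency-limited throughput for the hypothetical 48-drive RAID 60 example.
    block_mb = 0.5        # 512K per request
    latency_s = 0.0075    # 7.5 ms, middle of the 5-10 ms range

    single_stream = block_mb / latency_s       # ~67 MB/s with one request in flight
    queue_of_24 = 24 * block_mb / latency_s    # ~1600 MB/s best case with 24 in flight
    print(f"{single_stream:.0f} MB/s vs {queue_of_24:.0f} MB/s")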

Writes do have different behavior: they are usually buffered in some way (even when "unbuffered" in the OS) and, generally, latency runs from the time the write is issued to the time it is committed to a non-volatile store (such as the NVRAM on a RAID controller or the redundancy cache on a storage system). So, indeed, the impact of async writes is generally smaller overall, but it's not non-existent, since having many outstanding requests still gives both the OS I/O elevator and the RAID/SAN controller more blocks to potentially merge/optimize.

However, in the VMFS case specifically, I've often suspected that async writes could help quite a bit, as I believe the bulk of the delay is due to the VAAI zeroing effect for lazy-zeroed or thin provisioned VMDKs. This is based on the fact that we see disabling VAAI significantly improve restore performance in general and, more importantly, you can see the latency spike caused by VAAI; for example, you note 6 ms latency during your restores even though the storage itself is clearly not there. I believe this latency is coming from the UNMAP/WRITE_SAME requests as each new segment is written and, when using sync I/O, where each new write won't be issued until the prior one is acknowledged, 6 ms latency would limit a restore to ~167 MB/s (give or take), which is pretty close to what you are seeing.
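
That ~167 MB/s ceiling falls straight out of the latency if you assume 1 MB writes issued strictly one at a time (the 1 MB write size is an assumption here):

    # One synchronous 1 MB write every 6 ms caps throughput regardless of how fast the array is.
    write_mb = 1.0
    latency_s = 0.006
    print(f"{write_mb / latency_s:.0f} MB/s")  # ~167 MB/s, close to the observed 135 MB/s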

My belief is that async writes could combat this VAAI zeroing overhead by allowing VMware to have many such requests outstanding at once. This theory is further bolstered by the fact that restoring multiple disks is faster than a single disk, so obviously the VAAI zeroing requests are not a datastore-level bottleneck, but a latency bottleneck. I'll try to do some testing and see if I can prove this theory in any way, as it's been in the back of my head for years, since back when we first started noticing the VAAI performance impact on restores.