pizzim13
Enthusiast
Posts: 94
Liked: 6 times
Joined: Apr 21, 2011 7:37 pm
Contact:

Full VM restore through a backup proxy is single threaded

Post by pizzim13 »

When doing a full VM restore, I noticed that only one core (of eight) on my virtual proxy was in use. Are there any config/registry key changes to make this a multi-threaded operation, or is this a software limitation?
Shestakov
Veteran
Posts: 7328
Liked: 781 times
Joined: May 21, 2014 11:03 am
Full Name: Nikita Shestakov
Location: Prague
Contact:

Re: Full VM restore through a backup proxy is single threade

Post by Shestakov »

Do you use a physical or a virtual proxy?
If you need better performance, my advice is to use a virtual proxy (Hot Add mode).
pizzim13
Enthusiast
Posts: 94
Liked: 6 times
Joined: Apr 21, 2011 7:37 pm
Contact:

Re: Full VM restore through a backup proxy is single threade

Post by pizzim13 »

It's a virtual proxy using Hot add. The restore process of a virtual machine will only use 100% of one core. Can this process be multi-threaded?
tsightler
VP, Product Management
Posts: 6011
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Full VM restore through a backup proxy is single threade

Post by tsightler »

I'm a little surprised to see Veeam using 100% of a CPU during a restore as the restore process is normally very lightweight from a CPU perspective. I wonder if it's not actually Veeam using the CPU but rather network traffic not using RSS and thus all being serviced by CPU0. Can you tell me a little more about your environment?

What compression is the backup job using?
How fast is the restore running (MB/s)?
What OS version is the proxy?
What virtual NIC (vmxnet3 or e1000)?
What physical CPU is used?

Thanks!
pizzim13
Enthusiast
Posts: 94
Liked: 6 times
Joined: Apr 21, 2011 7:37 pm
Contact:

Re: Full VM restore through a backup proxy is single threade

Post by pizzim13 »

Compression: Optimal
Restore rate: 150 MB/s
Dedupe: LAN target
OS: Windows Server 2012 R2
NIC: vmxnet3
CPU: X7560 @ 2.27GHz

When restoring at a rate of 150 MB/s, one core on the virtual proxy flatlines at 100%. CPU interrupts are around 325 MHz. The proxies are configured with eight cores. When backing up, I see the proxies using multiple cores.
tsightler
VP, Product Management
Posts: 6011
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Full VM restore through a backup proxy is single threade

Post by tsightler » 1 person likes this post

This seems unusual. I just performed a test restore and performance was spread evenly across all four cores of my proxy VM. My restore performance was not as high as yours due to constraints on my target datastore in my lab; however, I was still able to reach ~75MB/s. If the process were limited to one core, I would have anticipated seeing one core busy and the other three idle, but instead I had an almost perfectly distributed CPU load. This was on 2012 instead of 2012 R2, but I wouldn't expect that to make a difference.

Is there anything else on the system that might cause this behavior, anti-virus agents or something perhaps? Is your repository located on this same proxy, or is it receiving data from another machine?
dellock6
VeeaMVP
Posts: 6139
Liked: 1932 times
Joined: Jul 26, 2009 3:39 pm
Full Name: Luca Dell'Oca
Location: Varese, Italy
Contact:

Re: Full VM restore through a backup proxy is single threade

Post by dellock6 » 1 person likes this post

I would first of all use a tool like Process Explorer to check whether it's really Veeam causing the CPU spikes, and follow up on the ideas Tom dropped here to check for other possible causes.
Luca Dell'Oca
Principal EMEA Cloud Architect @ Veeam Software

@dellock6
https://www.virtualtothecore.com/
vExpert 2011 -> 2022
Veeam VMCE #1
jbsengineer
Enthusiast
Posts: 25
Liked: 3 times
Joined: Nov 10, 2009 2:45 pm
Contact:

Re: Full VM restore through a backup proxy is single threade

Post by jbsengineer »

Hey Tom,

I work with the OP, and we wanted to put up a quick forum post before deep-diving into the Veeam support realm. Thanks for testing. From what I understand, a single thread will not necessarily pin itself to a single core; a single execution will pin itself to a single core, while multiple chained individual executions will be placed across multiple cores, but the total usage will not exceed that of a single core. In our case we restored 1432 VMs over a 24-hour period, and not one single restore job exceeded the CPU usage of a single core's clock speed. It was very apparent when we witnessed it, as the CPU would flatline at the max clock speed of the core without another bottleneck surrounding the process.

This forced us to scale our restores horizontally instead of stacking them vertically, which caused us to hit what appears to be a file-locking limitation on the repositories. So essentially our restores are bottlenecked by the single-threaded rehydration process and by a file-locking limitation on the repositories. We would love to be bottlenecked by disk!

What is interesting to us is that on a backup the veeamagent is certainly multi-threaded, given that we can easily peg a six-core machine with one stream; it is just not the case the other way around. I believe that in Veeam 6 the repositories did the rehydration, and I cannot remember whether it was multi-threaded then (before the rehydration moved over to the proxies).

Going to open a call with Veeam shortly. Thanks for all your help gents!
jbsengineer
Enthusiast
Posts: 25
Liked: 3 times
Joined: Nov 10, 2009 2:45 pm
Contact:

Re: Full VM restore through a backup proxy is single threade

Post by jbsengineer »

I was just able to confirm with Veeam that the rehydration process is single threaded.

Still waiting for more clarification on real world limitations of a repository.
Gostev
Chief Product Officer
Posts: 31521
Liked: 6699 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Full VM restore through a backup proxy is single threade

Post by Gostev »

Hmm, I would not expect the restore process to create so much CPU load as to present problems and cause a bottleneck in the first place (as noted earlier, restore is very light on CPU). Can you tell us what "bottlenecked" restore speed we are talking about here as the result?

Also, if I understand you right, in your case a single restore process is limited to one core's worth of CPU consumption across all of its threads (which is what you think causes the bottleneck). This is very strange behavior, and the reason for it probably lies outside Veeam, because we do not throttle the CPU consumption of our data movers. Most likely, though, this match is just a coincidence and the bottleneck is somewhere else.

There are no known limitations of backup repositories. Basically, there are two separate unexpected behaviors in your case (abnormally high CPU load on restore, plus an unexpected restore process CPU consumption limit across the cores), so it would be best to open a support case for deeper investigation.

Thanks!
tsightler
VP, Product Management
Posts: 6011
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Full VM restore through a backup proxy is single threade

Post by tsightler »

And I cannot confirm this in my lab with v7. I've included a screenshot of a restore that I ran just a couple of days ago specifically for this thread. I did have a bottleneck on the source VMFS volume which limited the restore to around 115MB/s, but that's reasonably close. My proxy is virtual, and my processor is of very similar performance, at least generationally (it's an L5520, which also runs at 2.27GHz). I did see ~20% total CPU load, so indeed, if that were pinned to one processor it would be an issue; however, the load was spread almost perfectly evenly across the processors. I've attached a screenshot below, and you can easily see when the restore stream started.

I'm curious whether you could post a screenshot and perhaps tell me a little more about your repository configs; perhaps there's some subtle difference there that's causing the issue. In the case below, the restore was coming from a Linux repository, and the screenshot is from the proxy itself, which was simply receiving the restore stream and writing via hot add.

[Screenshot: proxy CPU usage during the restore, with the load spread across all cores]
tsightler
VP, Product Management
Posts: 6011
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Full VM restore through a backup proxy is single threade

Post by tsightler » 3 people like this post

jbsengineer wrote:I was just able to confirm with Veeam that the rehydration process is single threaded.
I would like to follow up with this person and understand the basis of that statement. Was it someone in support? I'm concerned they may be answering your question of "single threaded" as in not "parallel processing", i.e. we don't restore multiple VMDKs at the same time, but that doesn't mean the data mover process itself is single threaded, and I think that's what you're implying with the "pinned to CPU" statements.

In case it wasn't clear, I am a Veeam employee, specifically a solution architect dedicated to B&R, so I have a lot of interest in this case, and I've always seen Veeam use multiple CPUs in both backup and restore. I have seen cases where restore load is not properly distributed, and in those cases so far I've seen one of two issues:

1. RSS was not enabled on the vNIC in the virtual machine. This was really common with 2008 R2, but I think it is enabled by default in newer versions.
2. Some type of anti-virus or security solution that forces all inbound network traffic to be scanned via a filter and uses a high amount of CPU.

Neither of these issues would impact backups, unless you were using NBD mode, but they can have a significant impact on restores.
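
To rule out the first item, a quick check on a Windows Server 2012 R2 proxy could look something like this (a minimal sketch; the adapter name is just an example):

    # Per-adapter RSS status: Enabled should be True on the vmxnet3 vNIC
    Get-NetAdapterRss | Format-Table Name, Enabled, MaxProcessors, NumberOfReceiveQueues

    # Global TCP stack view: "Receive-Side Scaling State" should read "enabled"
    netsh int tcp show global

    # If RSS turns out to be off on the vNIC, it can be switched on, e.g.:
    # Enable-NetAdapterRss -Name "Ethernet0"

That only takes a minute and removes one variable from the picture.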

I just spent a few minutes with Process Explorer during a restore, and it was easy to see the threads and their "preferred CPU" as assigned by the Windows scheduler; the threads were fairly equal. Yes, there was one worker thread that got more work than any other, but it was still only about 5% on average, with peaks to 10%, and there were quite a number of other threads that had 1-3% load on average and were being scheduled across the other CPUs. I suppose it's possible that this single thread could eventually grow to use an entire CPU, but my restore was running at 110MB/s and this thread was nowhere near saturating a single CPU, so it seems unlikely you would hit that at only 150MB/s. I've seen much faster restores.
jbsengineer
Enthusiast
Posts: 25
Liked: 3 times
Joined: Nov 10, 2009 2:45 pm
Contact:

Re: Full VM restore through a backup proxy is single threade

Post by jbsengineer »

So I have done some more testing and cannot point at the CPU. It appears from some tests I ran that I am not able to get more than ~130MB/s out of the veeamagent. Let me explain how I eliminated the other suspects, and please feel free to comment on what I may have missed:

Not CPU: I ran a restore on a 2.27GHz multi-core proxy and a 3.33GHz multi-core proxy. There was no difference in disk throughput; both were writing down to disk at a max rate of 130MB/s.

Not network: I used a send/receive utility from Microsoft called NTttcp and was able to saturate the 10Gb link between the repository and the test proxy. Also, while restoring a machine with two disks, I saw a 100MB/s network stream coming into the proxy and 130MB/s being written down for the first disk. On the second disk there was a 35MB/s stream coming into the proxy and a 130MB/s stream going to the disk.

Not disk: First I swapped a backend repository disk for an SSD, then tested restoring the VM to an SSD datastore. Still no change. To make sure, I set up IOMeter on the proxy and added a disk to the proxy on the exact same datastore I am restoring the virtual machine to. The IOMeter profile was set up to do 1MB random writes (I know Veeam writes 1MB blocks; I chose random writes as the worst-case scenario). The results made it obvious that the disk wasn't the issue: I was writing 700MB/s consistently.

I have verified there are no limits set on network throughput in Veeam.
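
For reference, a typical NTttcp send/receive pair for this kind of point-to-point check looks roughly like the following (thread count, duration and the receiver address are placeholders):

    # On the repository (receiver); 192.0.2.10 is the receiver's own IP in this example
    ntttcp.exe -r -m 8,*,192.0.2.10 -t 60

    # On the proxy (sender), pointed at the same receiver address
    ntttcp.exe -s -m 8,*,192.0.2.10 -t 60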

What am I missing?
jbsengineer
Enthusiast
Posts: 25
Liked: 3 times
Joined: Nov 10, 2009 2:45 pm
Contact:

Re: Full VM restore through a backup proxy is single threade

Post by jbsengineer »

I should add that the job settings were no compression, no inline dedupe, Local Target. If someone could explain why the network ingest rate is lower than the disk write rate, that would remove a question I have: will Veeam actively write out empty ("white") blocks?
tsightler
VP, Product Management
Posts: 6011
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Full VM restore through a backup proxy is single threade

Post by tsightler »

Any chance you could do a couple of additional tests? Ideally, can you perform two restores at the same time to the very same VMFS datastore and report the speed? Then to two different datastores? Second, is there any chance you could do a "VM files restore" to a volume attached directly to the Veeam server and formatted with NTFS? I'm guessing that will be much, much faster. I know that's a lot of testing, but it will help us identify possible causes. My guess is that it's simply the normal performance issues seen when restoring to VMFS, due to the semantics of the filesystem.

One of the more common causes is the impact of "zeroing". During normal operation this isn't a big deal, but when doing a restore there's a call to zero the disk for each new segment. With a VAAI-capable storage system the host uses "Write Same" SCSI commands to zero the blocks; however, some storage systems are actually slower at this when doing large streaming writes. For example, the customer in this post found that his restore performance more than doubled just from disabling the VAAI zeroing. I've had other customers see similar results. I think the problem is simply the latency of the request: when sending a "Write Same" the host has to wait for the request to complete, but if it just sends zeros they are buffered like all other writes. Just a guess though.
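
If you want to experiment with that, the host-side switch is the DataMover.HardwareAcceleratedInit advanced setting. Here is a minimal PowerCLI sketch (server and host names are examples, and the change affects the whole host, so treat it as a lab test):

    # Requires VMware PowerCLI and a connection to vCenter or the host
    Connect-VIServer -Server vcenter.example.local

    # Show the current "Block Zeroing" (WRITE SAME) offload setting per host: 1 = enabled, 0 = disabled
    Get-VMHost | Get-AdvancedSetting -Name "DataMover.HardwareAcceleratedInit" |
        Select-Object Entity, Name, Value

    # Trial-disable VAAI zeroing on a single test host
    Get-VMHost -Name "esx01.example.local" |
        Get-AdvancedSetting -Name "DataMover.HardwareAcceleratedInit" |
        Set-AdvancedSetting -Value 0 -Confirm:$false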

Other things that come into play are the VMFS locking semantics: each and every time the disk "grows", which happens a lot during the restore of a thin-provisioned VM, the system has to grab a global lock, since that requires VMFS metadata updates. This is why restores to NFS volumes are usually much faster. But if you can possibly run the tests above, it might help us make sure that's really where the issue lies.
jbsengineer
Enthusiast
Posts: 25
Liked: 3 times
Joined: Nov 10, 2009 2:45 pm
Contact:

Re: Full VM restore through a backup proxy is single threade

Post by jbsengineer » 2 people like this post

tsightler wrote:Any chance you could do a couple of additional tests? [...] My guess is that it's simply the normal performance issues seen when restoring to VMFS, due to the semantics of the filesystem.
Hey Tom,

You actually hit on something I was starting to suspect: a VMFS zeroing "tax", if it exists. I have done some very extensive testing since my last post, and my conclusion is this. Outside of Veeam, if I create and then write 20GB of data into an eager-zeroed thick disk, I can do it in 1 minute and 31 seconds (30 seconds to create the disk, 61 to write the data). If I create a lazy-zeroed thick disk and then write 20GB of data into it, it takes 2 minutes 57 seconds. The exact same amount of I/O for zeroing and writing real data, yet almost double the time. I replicated this with and without Write Same turned on and recorded everything from esxtop. I was able to squeeze out another 10-15% performance on the zeroing end with Write Same turned on, but the results were similar when zeroing "on the fly". Now obviously, once a block has been initialized you are good to go. Unfortunately, Veeam has no way to restore as eager-zeroed thick, which could in theory double the performance.
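
For anyone who wants to reproduce the disk-creation half of that comparison, here is a rough PowerCLI sketch (the VM name is a placeholder, and the 20GB in-guest write still needs a separate tool):

    # Requires VMware PowerCLI and an existing Connect-VIServer session
    $vm = Get-VM -Name "ZeroTestVM"

    # Eager-zeroed thick: the zeroing cost is paid up front, at creation time
    Measure-Command { New-HardDisk -VM $vm -CapacityGB 20 -StorageFormat EagerZeroedThick }

    # Lazy-zeroed thick: creation is near-instant, zeroing happens later on first write to each block
    Measure-Command { New-HardDisk -VM $vm -CapacityGB 20 -StorageFormat Thick }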

The reason my original IOMeter test was flawed is that IOMeter preps the disk (without telling you) and initializes the blocks, so it had masked this from me.

I could not correlate anything to VMFS locking, as the lock counters were not increasing once the restore started. But I cannot rule out some type of metadata updating as a possible bottleneck alongside the zeroing.

I have a call open with VMware and expect to have more information shortly. I could see myself putting in a feature request for the option to restore as "eager-zeroed thick" (the default, restoring the disk the same as the source, does not work for this). Maybe Veeam 8 fixes that already.

After I find the core issue, maybe the thread can be renamed with approval of the OP?

Thanks,
Josh
juerg.schneebeli
Influencer
Posts: 12
Liked: 1 time
Joined: Apr 12, 2019 8:33 am
Full Name: Jürg Schneebeli, Veeam, Switzerland
Contact:

Re: Full VM restore through a backup proxy is single threaded

Post by juerg.schneebeli » 1 person likes this post

Some generic information about high CPU usage with any network-related traffic within vSphere environments (equally valid for any hypervisor).

Network traffic creates CPU interrupts, and those interrupts are a source of high CPU usage. It doesn't matter whether it is a virtual machine or a physical server; the principles are the same. By default, when a network packet arrives at the physical or virtual NIC, a CPU interrupt is generated, and those interrupts are by default handled by only one physical or virtual CPU.

Especially in environments with >=10Gbps networking, CPU load can show 100% usage because many network packets arrive and are always handled by one pCPU (or vCPU in the case of VMs). You may also never achieve bandwidth higher than about 4.5Gbps, because the single pCPU/vCPU handling the interrupts is 100% loaded; the CPU is exhausted.

That's why many users think the application concerned (in this case the VBR proxy) is the source of the high CPU load and conclude that the application is single threaded, but this is not the case.

This problem has been addressed with modern NICs, especially those at 10Gbps and above. With <10Gbps NICs there is no solution. Intel VT-d is the basis for solving the high CPU usage, and other network card vendors support similar principles. Modern network cards are divided into up to 256 virtual device queues, and every device queue is assigned to a pCPU in round-robin fashion during ESXi boot. Therefore, a modern NIC has not just one interrupt handler but up to 256, so network interrupts can potentially be handled by 256 pCPUs per NIC. Only physical cores are considered, not hyper-threaded ones.

The next question is how network packets arriving at the physical NIC can be distributed across those different NIC device queues and therefore use all pCPUs/vCPUs in parallel. Two load distribution mechanisms exist:

• VMDQ
• RSS

Either of the two methods can be used. It is mandatory that the NIC supports device queues (practically every 10Gbps NIC does; check the hardware compatibility list).
The RSS or VMDQ function needs to be activated within the virtual machine / ESXi host. See the referenced articles below for how to configure RSS (which is the preferred option for a Veeam proxy).

VMDQ (Virtual Machine Device Queues)
Statically assigns every vNIC MAC address to a NIC device queue on the pNIC, so only one interrupt handler per VM exists. If 100 VMs exist, they are certainly handled by different pCPUs, but a single VM may not get throughput higher than about 4.5Gbps because it is statically assigned to one CPU.

RSS (Receive Side Scaling)
RSS is a mechanism which allows the network driver to spread incoming TCP traffic across multiple CPUs, resulting in increased multi-core efficiency and processor cache utilization. If the driver or the operating system is not capable of using RSS, or if RSS is disabled, all incoming network traffic is handled by only one CPU. In this situation, a single CPU can be the bottleneck for the network while other CPUs might remain idle.
Note: To make use of the RSS mechanism, the hardware version of the virtual machine must be 7 or higher, the virtual network card must be set to VMXNET3, and the guest operating system must be capable and configured properly. On some systems it has to be enabled manually. These operating systems are capable of using RSS:

Windows 2003 SP2 (enabled by default)
Windows 2008 (enabled by default)
Windows 2008 R2 (enabled by default)
Windows Server 2012 (enabled by default)
Linux 2.6.37 and newer (enabled by default)

For further information on Microsoft RSS, see Receive Side Scaling (RSS). For more information on Linux Receive Side Scaling, see RSS and multiqueue support in Linux driver for VMXNET3 (2020567).
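
A quick way to verify the VM-side prerequisites (virtual hardware version and vmxnet3) is PowerCLI; here is a minimal sketch with the proxy VM name as a placeholder:

    # Requires VMware PowerCLI and an existing Connect-VIServer session
    # Virtual hardware version must be 7 or newer (older PowerCLI exposes this as the Version property)
    Get-VM -Name "veeam-proxy01" | Select-Object Name, HardwareVersion

    # The vNIC type should be Vmxnet3 rather than E1000
    Get-VM -Name "veeam-proxy01" | Get-NetworkAdapter | Select-Object Name, Type

Whether RSS is actually active inside the guest can then be checked with Get-NetAdapterRss, as shown earlier in the thread.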

Articles as References:
https://www.youtube.com/watch?v=qfGAAcNTd_Q
https://www.vmware.com/content/dam/digi ... -paper.pdf
https://www.intel.com/content/dam/www/p ... -brief.pdf
https://nielshagoort.com/2017/08/16/vxl ... explained/
https://www.youtube.com/watch?v=YhIMswT6K2k
https://www.linkedin.com/pulse/virtual- ... chneebeli/ (German)