Host-based backup of VMware vSphere VMs.
matteu
Veeam Legend
Posts: 725
Liked: 118 times
Joined: May 11, 2018 8:42 am
Contact:

Sizing and task number

Post by matteu »

Hello,

I would like to know how to correctly size my customer's infrastructure.

Total VMs: around 400

There are 2 different rooms (1 and 2).
In room 1 there is VBR (on a VM) + 3 ESX hosts to back up.
In room 2 there are 2 ESX hosts + 2 other ESX hosts.

There are 2 vCenters: one with 3 hosts (cluster1) + 2 remote hosts (cluster B), and another vCenter for the 2 other hosts.

The 5 hosts in the same vCenter use the same storage array.
The 2 other ESX hosts are on dedicated storage (DataCore).

Room 1 is backed up to a hardened Linux repository with 12 tasks.
Room 2 is backed up to a second hardened Linux repository with 12 tasks.

I use NBD as the transport mode because the network is 10Gb.
I installed 1 proxy per ESX host, with 4 tasks each.

On each backup, my bottleneck is the source (99%), followed by the target at 60%.

The backup processing rate is around 400MB/s, but sometimes it's only 70MB/s, probably because some process external to the backup is running on the VMs.

I have backup copy jobs with mirroring, but they don't start until my backup job finishes.

I would like to know:
How could I "improve" performance with the task numbers on the proxies and repositories? I chose them randomly here.
Do I have to use different schedules for all my backup jobs? Currently, they all start at the same time.
When Veeam is backing up my vCenter, it can't back up any other VM because (I suppose) it can't connect to vCenter to process them. I get a licence error ([Error Unhandled exception was thrown during licensing process]), but only for a few VMs; all the others then work perfectly.

I use per vm backup job.

I can give all the details you need; I would just like to understand how I can maximize performance for this customer.
Thanks for your help.

[Image: diagram]
soncscy
Veteran
Posts: 643
Liked: 312 times
Joined: Aug 04, 2019 2:57 pm
Full Name: Harvey
Contact:

Re: Sizing and task number

Post by soncscy » 1 person likes this post

Hey matteu,

> On each backup, my bottleneck is the source (99%) and then my target with 60%.

So the host can't handle the load, which is a little surprising for me (4 tasks is not that much in my opinion!) and I think you maybe want to triple check performance. Remember, NBD has some limits and if these NICs are shared for other VMware management operations, even with the improvements in 7.0u1 for vSphere, you'll be gated a bit. 400 MB/s is pretty good for NBD, but it can stream faster with dedicated backup networks, which I'm guessing you have.
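To put the figures from this thread next to the raw link speed, here is a rough Python sketch. It assumes "10Gb" means 10 gigabits per second and "MB/s" means 10**6 bytes per second, and it ignores protocol overhead, so treat the percentages as an upper-level estimate only. The function name is made up for the example.

```python
# Assumption: "10Gb" is 10 gigabits per second, "MB/s" is 10**6 bytes per second.
LINK_GBIT_PER_S = 10


def link_utilization(rate_mb_per_s: float,
                     link_gbit_per_s: float = LINK_GBIT_PER_S) -> float:
    """Return the fraction of raw link capacity used by a given throughput."""
    link_mb_per_s = link_gbit_per_s * 1000 / 8  # 10 Gbit/s -> 1250 MB/s
    return rate_mb_per_s / link_mb_per_s


# Fast and slow processing rates reported in the first post.
for rate in (400, 70):
    print(f"{rate} MB/s uses {link_utilization(rate):.1%} of the raw link")
```

Under these assumptions, even the fast runs sit well below what a single 10Gb link can carry, which fits with the bottleneck being reported on the source rather than on the network.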

I'd start with checking for usual things like open snapshots in the environment, but maybe check the number of tasks running in Veeam when the performance dips.

Also, you might get even better performance by switching to hotadd and putting the hotadd proxy on the same 10Gbit network if you can.

Re: Sizing and task number

Post by matteu »

Hello,

Thanks for your answer.
I was at 460MB/s. The NIC is normally dedicated to management, and VMware is 6.5 or 6.7 (I don't remember, but I will check the version and usage tomorrow).

There is a total of 5 proxies with 4 tasks each on the same storage = 20 tasks.
There are 2 proxies with 4 tasks each on the second storage (DataCore) = 8 tasks.
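For reference, the task arithmetic in this thread can be sketched in a few lines of Python. This is only a rough illustration: the `Site` class and its field names are made up for the example, and it assumes that every proxy in a room writes to that room's repository and that each concurrent task needs one proxy slot and one repository slot.

```python
from dataclasses import dataclass

TASKS_PER_PROXY = 4  # value from the thread


@dataclass
class Site:
    """Hypothetical container for one room's proxy and repository settings."""
    name: str
    proxies: int
    repository_tasks: int

    @property
    def proxy_tasks(self) -> int:
        # Total concurrent tasks the proxies in this room can run.
        return self.proxies * TASKS_PER_PROXY

    @property
    def effective_tasks(self) -> int:
        # Assumption: a task needs both a proxy slot and a repository slot,
        # so the smaller of the two limits concurrency.
        return min(self.proxy_tasks, self.repository_tasks)


# One proxy per ESX host: 3 hosts in room 1, 2 + 2 hosts in room 2.
sites = [
    Site("room 1", proxies=3, repository_tasks=12),
    Site("room 2", proxies=4, repository_tasks=12),
]

# Proxy tasks grouped by source storage, as counted in the post.
tasks_per_storage = {
    "shared array": 5 * TASKS_PER_PROXY,
    "DataCore": 2 * TASKS_PER_PROXY,
}

for site in sites:
    print(f"{site.name}: {site.proxy_tasks} proxy tasks, "
          f"{site.repository_tasks} repository tasks, "
          f"{site.effective_tasks} effective")
for storage, tasks in tasks_per_storage.items():
    print(f"{storage}: {tasks} proxy tasks")
```

With these numbers, the proxies in room 2 could offer 16 task slots while the repository accepts 12, so under the stated assumptions the repository setting would be the limit there.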

400MB/s would be excellent, but I need to see why it goes down to 80MB/s... I will try to check vSphere component usage (CPU / disk), but without Veeam ONE it will not be easy :)

I don't think there are snapshots on the VMs, but I'll check!
The task number is the same whether performance is good or not :/

Hot add has caused a lot of issues for many of my customers, with virtual disks staying attached to the proxy, and it's a lot of wasted phone time to help the customer and solve it.
I didn't choose how this should be implemented, it was the pre-sales company, but I understand their choice perfectly.

Finally, how do you "manage" scheduling when you have backup + backup copy x 2? I'm not sure I made the right choice here.
jamcool
Enthusiast
Posts: 67
Liked: 11 times
Joined: Feb 02, 2018 7:56 pm
Full Name: Jason Mount
Contact:

Re: Sizing and task number

Post by jamcool »

Nice diagram. :) I have a similar setup with about 30 ESXi 7U2 hosts in each datacenter, with about 700 TB of datastores in each datacenter too.

I also use NBD, with 10 Gbps networking all around, and get maybe the 600 MBps range if nothing else is going on. When it gets down to just a few VMs backing up (maybe fewer than 10), performance can drop to under 50 MBps. Those are just the limitations of what ESXi will allow using NBD (as mentioned). There are some TIDs about multi-threading NBD you may want to Google. I tried it on a test server but did not see any real improvement.

As far as proxies go, we do 4 vCPU and set the threads to 8. I have about 1 proxy (virtual/Windows) per 3 ESXi servers. With NBD, it does not use that much CPU. If you were doing virtual (hot-add), it would consume a lot more CPU. We moved to NBD because hot-add would not always release the disk, and we found that with NBD and doing just incrementals it is much faster.

You may already be doing this as part of best practices, but make sure your repository is ReFS (Windows) or XFS (Linux) and you are doing a Synthetic Full on a weekly basis. This way you only have to do incrementals from VMware, with less impact on your VMware servers.

For VMware, make sure DRS is set up to move VMs around to load balance the cluster.

One tip: for those first backups that will be full backups of large systems (TBs), use virtual transport on the proxy if you can, as it gives great performance for large amounts of storage, and when done, switch the proxy servers back to NBD.

Re: Sizing and task number

Post by soncscy »

matteu wrote: Oct 17, 2021 8:44 pm
> There is a total of 5 proxies with 4 tasks each on the same storage = 20 tasks.
> There are 2 proxies with 4 tasks each on the second storage (DataCore) = 8 tasks.

Looks fine, but just keep in mind that you have the max snapshots per datastore that Veeam sets (configurable with registry values), so some of that throughput issue might be snapshots gating you. But increasing concurrent snapshots is not always the answer! In fact, it can make things slower if you're not careful. But it is good to test and see. If the concurrent tasks from Veeam don't seem to be any different between fast and slow runs, then it's likely that the host itself is busy, and you'll want to try to establish some time frame when you see slow runs, and then see what was going on in vSphere at the time.
matteu wrote: Oct 17, 2021 8:44 pm
> I don't think there are snapshots on the VMs, but I'll check!
> The task number is the same whether performance is good or not :/

Keep in mind, the vSphere Client might not show "orphaned" snapshot files, so there are times the client shows no snapshots when there really are some. I learned a trick checking Veeam logs: if you grep the Task logs in Veeam jobs for the pattern '00000[2-9].vmdk', it's a very reliable indicator of machines stuck on snapshots. The job should only make a single snapshot enumerated 000001, so if the number is higher, it's a high-confidence indicator that the VM referenced in the log file name is running on snapshots.
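If grep isn't handy, the same check can be sketched in Python. Treat this as a rough sketch under assumptions: the function name is made up, the log directory is whatever you pass in, and the `Task*.log` glob is an assumption about how the task logs are named, so adjust it to match your files. Like the grep pattern, it only catches snapshot numbers 000002 through 000009.

```python
import re
from pathlib import Path

# Same pattern as the grep above, with the dot escaped so it only matches ".vmdk".
EXTRA_SNAPSHOT_DISK = re.compile(r"00000[2-9]\.vmdk")


def logs_with_extra_snapshots(log_dir):
    """Return the names of task logs that mention a disk numbered 000002 to 000009.

    log_dir is the folder holding the job logs; the "Task*.log" naming
    is an assumption and may need adjusting.
    """
    flagged = []
    for log_file in sorted(Path(log_dir).rglob("Task*.log")):
        # Replace undecodable bytes instead of failing on odd encodings.
        text = log_file.read_text(encoding="utf-8", errors="replace")
        if EXTRA_SNAPSHOT_DISK.search(text):
            flagged.append(log_file.name)
    return flagged
```

Calling `logs_with_extra_snapshots(...)` with the job's log folder returns the log file names worth a closer look; the VM name is then read from the file name, as described above.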
matteu wrote: Oct 17, 2021 8:44 pm
> Finally, how do you "manage" scheduling when you have backup + backup copy x 2? I'm not sure I made the right choice here.

I switched most of my clients to Immediate Copy mode and didn't look back. Intervals are too confusing (I'm not even sure __I__ understand it right, and for sure my clients don't get it). We set copy windows for the jobs when the client has resource concerns or bandwidth concerns. It's a little "black box" for some of my clients who are more used to having to configure extremely granular schedules, but I find that usually this preference is just their legacy interest from older software and also a misunderstanding of how the Backup Copy works anyways (i.e., people think it's an rsync clone, but it's not!). It's a difficult talk sometimes, but for our usual setup for clients, we push Immediate Copy and use Backup Windows to manage resource conflicts.

Re: Sizing and task number

Post by matteu »

This morning I got an emergency call from my customer because the infrastructure was down...
One ESX host purple-screened and HA did its job. The two other ESX hosts were not enough to support the load and purple-screened too...

He was running ESX 6 U2 and hit this bug: https://kb.vmware.com/s/article/2145071
He was in this situation : BlueScreen: VERIFY bora/vmkernel/net/vmxnet3_vmkdev.c:10474

We rebooted the servers this morning and I changed the maximum tasks from 4 to 2...

We patched all the ESX hosts from U2 to U3 EP25 and I hope there will be no more issues tonight...

There are some snapshots in the environment, but only on template VMs. I use RVTools to check for snapshots.

Thanks for the backup copy scheduling advice! I will not change anything here!

Re: Sizing and task number

Post by matteu »

Results this morning:

Backup from the old storage array to the Linux repository: around 240 MB/s processing rate, but throughput over the last 5 min is sometimes up to 400MB/s!
Bottleneck: source at 99%. I don't think I can do much better. The storage is an old VNX...
Backup copy from Linux to Linux: 1.5 GB/s O_o extremely fast!
Bottleneck: source at 91%, but with this bandwidth it's not really an issue :)