Increase in Hot Add backup time after installing Update 3a

willjohnson · Jul 11, 2018 2:51 pm

[UPDATE] August 24, 2018
The solution is to install this hotfix: KB2711

Hi,

Is anyone else experiencing a huge increase in backup or replication times since updating to U3a? (In addition to the SQL problem).

E.g., a backup job that normally takes 3.5 hrs has been taking over 21 hrs since updating.

After the update, the logs now have additional [ViProxyEnvironment] lines, including "The proxy has NBD mode", while the Veeam GUI still says HDDs are being backed up using [hotadd].

The logs also have new [ViProxyEnvironment] entries: "The proxy cannot be used for write" and "The proxy has not SAN mode".

Before updating, logs had no mention of NBD mode etc. or [ViProxyEnvironment].

Case # 03096521

Cheers,

Will

Post by **foggy** » Jul 12, 2018 8:53 am

Hi Will, these messages seem to be vSphere 6.7 related. Please continue investigating with your support engineer.

Post by **willjohnson** » Jul 12, 2018 8:56 am

Hi foggy,

The messages weren't in the logs before the U3a update, regardless of vSphere 6.7.

Cheers.

Post by **foggy** » Jul 12, 2018 8:58 am

Yes, because vSphere 6.7 support was added in the U3a update.

Post by **willjohnson** » Jul 12, 2018 9:00 am

We're on 6.5.

Waiting for support to respond.

Cheers

Post by **Aron.Stocker** » Jul 16, 2018 5:05 am

Hi, we have the same issue (much longer backups and replications).
We're interested too, so please post any reply you receive from support.
Thanks

Post by **foggy** » Jul 16, 2018 12:18 pm

I would recommend contacting support directly, since the reasons for that might be different and depend on a particular environment.

Post by **Gostev** » Jul 16, 2018 1:07 pm

It looks like the issue is caused by the hot add process taking too long (20-30 min), and it appears to stem from some change in the latest VDDK version that Update 3a uses: in the original case, the workaround was to replace the VDDK libraries with the ones used in Update 3, which took hot add times back down to normal (1-2 min).

We also already know that the issue does not affect every environment, so there's something else to it.

We will be opening a support case with VMware on this.

Post by **omegagx** » Jul 16, 2018 1:59 pm

OK, please keep us updated on this. We will hold off on installing Update 3a until this is resolved.

Post by **humbertoz** » Jul 16, 2018 11:46 pm

I know a lot of people have been fighting with the SQL issues that "appeared" with Update 3a. In reality, it is just how Veeam decided to change how the SQL backups run. The way they are doing it now is better, but everyone who got caught not having the exact "documentation" permissions in their SQL servers had their jobs break when they were running fine in the past.

Unfortunately, we ran into a big issue with one customer who has a couple of VMs with lots of drives (as in 8+ each). We have not observed this issue with any of our other customers, but everybody else has VMs with 4 drives max. When their jobs run, all lower disk count VMs process perfectly as they always have. Once one of the VMs with high disk counts starts to process, all currently processing VMs take from 30 minutes to two hours to connect each hotadd disk. Even disks that are already connected and transferring data pause when a disk hotadd is in its limbo state. Basically, the entire job pauses during the disk hotadd timeout. Once the high disk VM has finally finished backing up, all remaining VMs in the job run as expected. This has caused their incremental jobs to go from 25 minutes to 2-4 hours.

We have a ticket with Veeam and they have confirmed a bug with Update 3a and high disk count VMs. They do not see the issue in debug against the VDDK, but when the software runs, obviously the issue is there. They have confirmed several other tickets coming in with the same symptoms. They are not seeing this issue with jobs configured with NBD (network mode).

This customer runs on a very fast, three-server cluster with 20Gb vSAN. It runs incredibly well and hotadd is perfect in their scenario. NBD is much slower for them since their VMs are on several different networks/DMZs and their firewalls are 1Gb. Their internet is 1Gb also, so they do not have a huge need for 10Gb or 20Gb firewalls, which would speed up their jobs if they were to switch to NBD. Hotadd allows the drives to be ripped at 20Gb regardless of network design/complexity, which is beautiful.

Just a warning for everybody out there. Since it only hits VMs with lots of drives (we do not know the exact number, but more than four), luckily it will not affect a huge number of people, but for the ones it does affect, it will be painful.

FYI, Veeam ticket # 03095693

Post by **Gostev** » Jul 17, 2018 12:03 am

Thanks for the hint on VMs with a large number of drives being the culprit. If this is confirmed and the issue is indeed with the latest VDDK version, a simple temporary hotfix could be for us to automatically force Network transport for VMs with more than X disks. We will be trying to reproduce the issue internally now.

Post by **humbertoz** » Jul 17, 2018 12:39 am

My bad... I looked around everywhere for hotadd and high disk count posts, but never found this one since the title was worded a little bit differently. Luckily, my post was merged into this one to help out everybody already here.

As additional info, the customer affected by this was on vCenter and ESXi 6.5. After upgrading to 3a, they were hit by these issues. Since they were on 3a now, we were able to upgrade their vCenter and ESXi from 6.5 to 6.7. Unfortunately, their problems persist, so the issue is in the 6.7 VDDK from VMware that Veeam is using in 3a to be compatible with vCenter and ESXi 6.7. Since the customer was previously on vCenter and ESXi 6.5, it would seem that the VMware 6.7 VDDK also talks the same bad language to 6.5.

I'm not sure if the root of this issue is in how Veeam interacts with the VDDK, or in how the VDDK interacts with vCenter and the hosts after getting all its instructions from Veeam. If the VDDK just needs to be instructed differently by Veeam, then Veeam will be able to fix this with another patch. If the VDDK needs to interact with vCenter and the hosts differently, then the fix will have to come from VMware as a patch to the VDDK, and then Veeam will have to integrate the newly patched VDDK into a patch 3b for all of us.

Veeam support, please correct any of this, or provide any further insight into how we will be able to resolve this issue. Unfortunately, failover for high disk count VMs to NBD would not be a solution for this customer, because then backups would "slow" down to 1Gb. The additional time to process the VMs versus the "broke" hotadd would probably be the same or more for their monthly full backups. For incrementals, it would take longer than before ("working" hotadd), but I'm positive it would still be faster than the current "broke" hotadd. Our support rep mentioned that this is basically the biggest issue they are working on right now coming out of 3a. They said it is in tier 3 and R&D, so hopefully it is getting a lot of traction over there.

Luckily, out of all our customers, only one has VMs with lots of drives and got hit by this bug.

Post by **Gostev** » Jul 17, 2018 10:56 am

First tests are done and we could not confirm an issue using multiple VMs with 11 disks. To be continued...

Post by **spiritie** » Jul 17, 2018 12:53 pm

We've also seen a huge increase in backup time, and are not able to meet our backup window anymore.

Busy adding another proxy to see if it helps.

Running VMware 6.5, and all proxies are VMs using hotadd.

We don't have many VMs with many disks, though, but we do have some large jobs.

EDIT:
Just did an inventory of the VMs in one of our vCenters. Out of around 400-500 VMs,
only 9 VMs have more than 3 disks (ranging from 4 to 7 disks).

Around 70-75% of the VMs have 1 disk, 20-25% have 2 disks, and the rest have more.

Post by **humbertoz** » Jul 17, 2018 1:54 pm

Gostev wrote:First tests are done and we could not confirm an issue using multiple VMs with 11 disks. To be continued...

Gostev,

There has to be something to this. You can look at our ticket, which will have logs uploaded. This customer has three backup jobs with 10, 13 and 26 VMs and two replication jobs with 9 and 10 VMs. There are only two VMs that go into the hotadd "pausing job" issue. One is a SQL server in job #1 with 10 VMs. The other is a backend Exchange server in job #2 with 13 VMs. The third job with 26 VMs is not affected. Neither are the two replication jobs, but as I mentioned, neither the two replication jobs nor the third backup job have those two VMs in them (or any other VMs with high disk counts). No matter where we move those two VMs in their jobs (first, middle, last), the jobs tank as soon as they reach those two VMs.

The issue has to be related to drive quantity. Could it maybe be total drive size? Those two VMs are their largest VMs. If you add up all 8/9 drives, the SQL server is 2.2TB and the Exchange server is 3.5TB, but this is all spread out over 100GB, 200GB and 500GB drives for both VMs. They have three other VMs that are file servers with hundreds of thousands of files, and they back up fine. They are 2.0TB and 1.7TB in size. Those three have only two drives each: a small 60GB OS drive and then the 1.7TB to 2.0TB drive. I'll list the things I can think of that might influence this.

#1 - VMs with a high quantity of disks.
#2 - All their VMs, including the two affected ones, use the VMware Paravirtual SCSI controller. Not sure if you tested this in conjunction with a high quantity of disks.
#3 - VMs with large disks. Those two VMs are their biggest ones, well over 2TB in total when the disks are added up. Their next largest VMs are exactly 2.0TB and down. Not sure if there is a magic threshold of 2TB for this issue.
#4 - Application-aware processing. All their VMs run with this on. The two affected VMs are a SQL server and a backend Exchange server that have it enabled for obvious transaction log reasons. Again, not sure if you tested this in conjunction with a high quantity of disks.
#5 - The current alignment of planets and moon causing this. I'm at a loss with this issue. No matter what we do to the backup jobs, when they hit those two VMs, they blow up.

Hopefully you can get some insight in your testing. If you need any specific changes tested on our end, just let me know and we'll try it. Everybody just wants to get to the source of this evil.

Post by **humbertoz** » Jul 17, 2018 2:05 pm

spiritie wrote:We've also seen a huge increase in backup time, and are not able to meet our backup window anymore.

Busy adding another proxy to see if it helps.

Running VMware 6.5, and all proxies are VMs using hotadd.

We don't have many VMs with many disks, though, but we do have some large jobs.

EDIT:
Just did an inventory of the VMs in one of our vCenters. Out of around 400-500 VMs,
only 9 VMs have more than 3 disks (ranging from 4 to 7 disks).

Around 70-75% of the VMs have 1 disk, 20-25% have 2 disks, and the rest have more.

Spiritie,

Can you look at your jobs and see where the slowdown is, to maybe find some commonalities with our setup? When we look at our jobs, we can easily see the issue. All VMs process correctly (with low incremental backup times of 5-10 minutes) until they hit those two "bad" VMs. Then the backup time for those VMs and any other VMs processing at the same time blows up to 2-4 hours. After those two VMs are processed, all remaining VMs are processed correctly again (5-10 minutes). You can literally see it in the backup summary report without even digging deeper. The "duration" column will be good, good, good, bad, bad, bad, then good, good again as soon as the "bad" VMs are done processing (good VMs are also affected while the "bad" VMs are processing concurrently, but everything goes back to normal as soon as the bad VMs are done). This customer only has two VMs with high disk counts out of 60 VMs, but those two are causing a real headache. Hopefully, when you find which VMs are causing an issue, you can share specs on them.
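
For anyone who would rather not eyeball the "duration" column, the good/bad pattern described above can be picked out with a few lines of Python. This is only a minimal sketch: `flag_slow_vms`, the factor of 5 and the sample numbers are all made up for illustration, and it assumes you have already copied the per-VM durations (in minutes) out of the job report by hand.

```python
from statistics import median


def flag_slow_vms(durations, factor=5.0):
    """Return the VM names whose duration exceeds factor x the job's median duration.

    durations: list of (vm_name, minutes) pairs taken from the job report.
    factor: hypothetical cut-off; tune it to your own environment.
    """
    if not durations:
        return []
    typical = median(minutes for _, minutes in durations)
    return [name for name, minutes in durations if minutes > factor * typical]


# Made-up sample data, not taken from any real job report.
sample = [("web01", 6), ("web02", 8), ("sql01", 150), ("exch01", 200), ("file01", 7)]
print(flag_slow_vms(sample))
```

The median is used instead of the average so that a couple of multi-hour outliers do not drag the baseline up. It stops being useful if more than half of the VMs in a job are slow, since the median itself is then a "bad" value.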

Post by **Gostev** » Jul 17, 2018 2:08 pm

@Gert for those on 6.5, the quick and confirmed fix is to roll VDDK back to the version used in Update 3 - feel free to contact our support for assistance with this.

Post by **spiritie** » Jul 18, 2018 7:17 am

Gostev wrote:@Gert for those on 6.5, the quick and confirmed fix is to roll VDDK back to the version used in Update 3 - feel free to contact our support for assistance with this.

Hi Gostev, we are upgrading vCenter to 6.7 tomorrow (not ESXi). Is this bug in vCenter or ESXi?

Post by **ottl05** » Jul 18, 2018 7:20 am

We have the same issue:
VMs on 6.5, hotadd, and an increase in backup time.

What's the best way to solve this?

Post by **mcz** » Jul 18, 2018 7:24 am

ottl05 wrote:We have the same issue:
VMs on 6.5, hotadd, and an increase in backup time.

What's the best way to solve this?

I think rolling back VDDK, as Anton Gostev wrote:

@Gert for those on 6.5, the quick and confirmed fix is to roll VDDK back to the version used in Update 3 - feel free to contact our support for assistance with this.

Post by **spiritie** » Jul 18, 2018 7:37 am

humbertoz wrote: Spiritie,

Can you look at your jobs and see where the slowdown is, to maybe find some commonalities with our setup? When we look at our jobs, we can easily see the issue. All VMs process correctly (with low incremental backup times of 5-10 minutes) until they hit those two "bad" VMs. Then the backup time for those VMs and any other VMs processing at the same time blows up to 2-4 hours. After those two VMs are processed, all remaining VMs are processed correctly again (5-10 minutes). You can literally see it in the backup summary report without even digging deeper. The "duration" column will be good, good, good, bad, bad, bad, then good, good again as soon as the "bad" VMs are done processing (good VMs are also affected while the "bad" VMs are processing concurrently, but everything goes back to normal as soon as the bad VMs are done). This customer only has two VMs with high disk counts out of 60 VMs, but those two are causing a real headache. Hopefully, when you find which VMs are causing an issue, you can share specs on them.

I've looked through them, but cannot find any common thread among the "bad" VMs. I have 1 job with 27 VMs in it,
and the only 2 VMs that were slow (a bit over 1 hour each) were tiny VMs that only had 40 GB and 1 disk each.

But this seems to be random, because I have a lot of VMs that worked fine yesterday but "failed" last night.

All the slow VMs look like this in the log. They get stuck on hot adding the disk, then fail over to NBD mode and are processed fairly quickly:

17-07-2018 21:35:24 :: Using backup proxy VMware Backup Proxy for disk Hard disk 1 [hotadd]
17-07-2018 22:47:36 :: Unable to hot add source disk, failing over to network mode...
17-07-2018 22:47:38 :: Hard disk 1 (40,0 GB) 1,1 GB read at 22 MB/s [CBT]
17-07-2018 22:48:47 :: Removing VM snapshot
17-07-2018 22:49:08 :: Finalizing

Gostev has reported that there is a fix for VMware vSphere 6.5
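
For anyone who wants to measure the hot add wait from their own job logs rather than read it off by eye, here is a minimal Python sketch. It assumes the `DD-MM-YYYY HH:MM:SS :: message` layout shown in the excerpt above, and it assumes that the first line logged after the `[hotadd]` line marks the end of the wait. The function names are made up for illustration, and it only tracks one disk at a time, so treat it as a starting point rather than a tested tool.

```python
import re
from datetime import datetime

# Timestamp layout taken from the excerpt above: DD-MM-YYYY HH:MM:SS
TS_FORMAT = "%d-%m-%Y %H:%M:%S"
HOTADD_START = re.compile(r"^Using backup proxy .+ for disk (?P<disk>.+) \[hotadd\]$")


def parse_line(line):
    """Split 'timestamp :: message' into (datetime, message), or None if it does not fit."""
    stamp, sep, message = line.strip().partition(" :: ")
    if not sep:
        return None
    try:
        return datetime.strptime(stamp, TS_FORMAT), message
    except ValueError:
        return None


def hot_add_waits(lines):
    """Return (disk, seconds_waited, failed_over) for each hot add attempt.

    Assumption: the first line logged after the [hotadd] line ends the wait.
    """
    results = []
    pending = None  # (disk, start time) of the attempt still waiting
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        when, message = parsed
        match = HOTADD_START.match(message)
        if match:
            pending = (match.group("disk"), when)
        elif pending is not None:
            disk, started = pending
            failed_over = message.startswith("Unable to hot add")
            results.append((disk, (when - started).total_seconds(), failed_over))
            pending = None
    return results


excerpt = [
    "17-07-2018 21:35:24 :: Using backup proxy VMware Backup Proxy for disk Hard disk 1 [hotadd]",
    "17-07-2018 22:47:36 :: Unable to hot add source disk, failing over to network mode...",
    "17-07-2018 22:47:38 :: Hard disk 1 (40,0 GB) 1,1 GB read at 22 MB/s [CBT]",
]
for disk, seconds, failed_over in hot_add_waits(excerpt):
    print(f"{disk}: waited {seconds / 60:.0f} min, failed over: {failed_over}")
```

Run against the three lines above, this reports a wait of roughly 72 minutes before the failover to network mode, which matches the "a bit over 1 hour" seen in the job.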

Post by **mcz** » Jul 18, 2018 7:42 am

I had such issues in the past when a proxy's bios.uuid wasn't unique within the same vCenter. This would be the case if, e.g., someone replicated proxies to another host but still within the same vCenter. Has anybody checked whether this is the case?
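
A quick way to run that check, sketched in Python: given a list of (VM name, bios.uuid) pairs, report every UUID that more than one VM shares. How you collect the pairs is up to you (for example, an export from your vCenter inventory); the function name and the sample values below are hypothetical.

```python
from collections import defaultdict


def find_duplicate_uuids(inventory):
    """Map each BIOS UUID that appears more than once to the VM names sharing it.

    inventory: iterable of (vm_name, bios_uuid) pairs.
    """
    by_uuid = defaultdict(list)
    for name, uuid in inventory:
        # Normalise case and whitespace so the same UUID always compares equal.
        by_uuid[uuid.strip().lower()].append(name)
    return {uuid: sorted(names) for uuid, names in by_uuid.items() if len(names) > 1}


# Made-up sample: proxy02 is a replica of proxy01 and kept its UUID.
sample = [
    ("proxy01", "4207A1B2-0000-0000-0000-000000000001"),
    ("proxy02", "4207a1b2-0000-0000-0000-000000000001"),
    ("proxy03", "4207a1b2-0000-0000-0000-000000000003"),
]
print(find_duplicate_uuids(sample))
```

An empty result means every UUID in the list is unique, so this particular cause can be ruled out.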

Post by **humbertoz** » Jul 18, 2018 8:19 am

Spiritie,

Thanks for the info. Unfortunately, it looks like your issue is a little different from our customer's. Your hotadd takes a long time to process, like theirs, but in the end it fails and switches over to NBD (network) mode. Our customer's two "bad" VMs take a long time for the hotadd process, just like you are seeing, but the hotadd never fails. It finally completes, and then the VM begins the drive read and backup process until it pauses again. Their "bad" VMs are never random. Yours seem to be somewhat random. I do not see anything in common between your issue and ours except that hotadd is having problems. Hopefully, Gostev is able to track some issues down and find a common source for all our ailments.

ottl05 · Jul 20, 2018 3:22 pm

I have an open ticket (#03106424) about rolling back the VDDK version, but nothing is happening.

Post by **ottl05** » Jul 23, 2018 6:20 am

ottl05 wrote:I have an open ticket (#03106424) about rolling back the VDDK version, but nothing is happening.

I replaced the VDDK files and now the backup times are back to "normal".

thanks.

Post by **DominikM** » Jul 23, 2018 7:25 am

I've been having the same problem since I upgraded to 9.5 U3a. Since we're still on vSphere 6.5, I've requested a downgrade of the VDDK version.

Post by **omegagx** » Jul 26, 2018 9:34 pm

So this issue doesn't occur on ESX 6.0 U2?

Post by **AlexWhit** » Jul 30, 2018 7:33 am

Hi, I have seen this and was told by Veeam support that it is normal.

Jul 30, 2018 8:33 am

omegagx wrote:So this issue doesn't occur on ESX 6.0 U2?

Hi,

The issue is not within the hypervisor; it is within the VMware Virtual Disk Development Kit (VDDK) that backup vendors need to add to their products.
For vSphere 6.7 compatibility reasons, we upgraded to VDDK 6.7, which includes the issue.

Customers who have not upgraded to vSphere 6.7 can stay on our Update 3 (older VDDK) or contact support to downgrade the VDDK under Update 3a. Let's hope that VMware fixes the VDDK soon.

ChuckS42 · Jul 30, 2018 9:39 pm

Has VMware acknowledged the issue and started working on a fix?
