Discussions specific to the VMware vSphere hypervisor
ashman70
Expert
Posts: 203
Liked: 12 times
Joined: Dec 04, 2012 2:18 pm
Full Name: Both
Contact:

Backup job stuck at removing snapshot 99% for 7hrs

Post by ashman70 » Aug 13, 2017 12:23 am

I have a backup job running under Veeam 9.0.0.1715 that has been stuck at 99%, removing the snapshot, for 7 hours now. There was a pending replication job for the same VM, which I stopped. I'm not sure what to do: leave it a while longer or kill the job? I checked and there are no other snapshots visible for the VM.
What should I do?

ashman70
Expert
Posts: 203
Liked: 12 times
Joined: Dec 04, 2012 2:18 pm
Full Name: Both
Contact:

Re: Backup job stuck at removing snapshot 99% for 7hrs

Post by ashman70 » Aug 13, 2017 3:36 am 1 person likes this post

Patience won out in the end: it eventually removed the snapshot and all is well.

mwvme
Veeam Software
Posts: 112
Liked: 20 times
Joined: Dec 05, 2015 10:19 pm
Full Name: Michael White
Location: Calgary, Alberta Canada
Contact:

Re: Backup job stuck at removing snapshot 99% for 7hrs

Post by mwvme » Aug 16, 2017 8:49 pm 1 person likes this post

For what it is worth, VMware often shortens the time needed to clear or consolidate snapshots with each release. I hope that this issue never hits you again, but if it does, I hope you are on a newer version of VMware that minimizes the time.

Michael
Michael White
Field Product Manager
https://notesfrommwhite.net
@mwVme

stevenrodenburg1
Expert
Posts: 125
Liked: 19 times
Joined: May 31, 2011 9:11 am
Full Name: Steven Rodenburg
Location: Switzerland
Contact:

Re: Backup job stuck at removing snapshot 99% for 7hrs

Post by stevenrodenburg1 » Dec 21, 2017 3:47 am

Sorry, Mr. White, but there is no way that VMware needs 7 hours or more to delete a backup snapshot (assuming it's not a gigantic snapshot and the storage is not totally overloaded).

I may be looking at a similar situation right now at a customer's site. All the VMs in the job are completely finished, except for one VM stuck at 99%. It finished reading all its disks a while ago (it has a single VMDK) but simply does not enter the "removing snapshot" phase. The API call to vCenter has not been sent yet (I checked). So there is no way that this is a problem on VMware's side.

This just happens every now and then. I've gotten used to it. One just needs to wait long enough (many hours sometimes) and at some point Veeam finally issues the API call to vCenter to delete the snapshot (which goes fast). There is nothing else going on in Veeam, so I have no clue what on earth it's waiting for.
With all the other VMs, right after the last VMDK was read, it issued the delete-snapshot command to vCenter and was done. With this VM, I have no idea. The last line in its log says that it read the VM, and the line saying that it's deleting the snapshot still has not appeared after 3 hours.

stevenrodenburg1
Expert
Posts: 125
Liked: 19 times
Joined: May 31, 2011 9:11 am
Full Name: Steven Rodenburg
Location: Switzerland
Contact:

Re: Backup job stuck at removing snapshot 99% for 7hrs

Post by stevenrodenburg1 » Dec 21, 2017 4:19 am

Update: it finally went to the Delete snapshot stage (which only took 9 seconds to do).

foggy
Veeam Software
Posts: 16702
Liked: 1343 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: Backup job stuck at removing snapshot 99% for 7hrs

Post by foggy » Dec 21, 2017 4:34 pm

Steven, I strongly recommend asking support to review this behavior to find the cause, since it's not normal.

stevenrodenburg1
Expert
Posts: 125
Liked: 19 times
Joined: May 31, 2011 9:11 am
Full Name: Steven Rodenburg
Location: Switzerland
Contact:

Re: Backup job stuck at removing snapshot 99% for 7hrs

Post by stevenrodenburg1 » Dec 21, 2017 4:50 pm

Yeah, I know, but this is a financial institution and this environment is isolated. Nothing goes in or out. No Internet, no nutt'n. Pain In The Butt. It happens every once in a while and always at night. I see it the next morning, when the daily backup job has suddenly taken waaaay longer than normal and one VM was waiting forever to get started with snapshot deletion. But it always solves itself.
I suspect it has to do with the backup proxy having trouble letting go of the VMDK it attached to itself (hot-add). When I get impatient and nuke the job, I always end up with a VMDK from the stuck VM still attached to the proxy. If I then detach it myself, I can consolidate that "stuck" VM.
I gave it some more thought, and by now I strongly believe that when it happens, the proxy cannot let go for some freak reason. Only once it detaches the VMDK does the job suddenly move on with the snapshot-delete command towards vCenter, and all is well.

Maybe it's an idea to have Veeam not try to detach a VMDK for all eternity, but just give up after, say, 5 or 10 minutes. If it gives up, at least the main job can go on, the stuck VM's part of the job gets aborted, and a mail is sent saying so. When such detach problems happen, they hold up the entire job and tie up a slot on the proxy, and for what? For one stupid VM that is as stubborn as a donkey? No, just break off after 5 or 10 minutes, show an error, and move on. Whole backup windows go to h*ll because of one VMDK detach issue with some VM. The backup administrator can then manually detach that stuck VMDK from the proxy in a controlled manner and start looking for the cause.
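
To make the suggestion concrete, here is a minimal sketch of the "give up after N minutes" logic in Python. It is not how Veeam works internally and it uses no Veeam or vSphere API: `wait_for_detach` and the `is_detached` callable are hypothetical names standing in for whatever check tells you the VMDK is no longer attached to the proxy.

```python
import time

def wait_for_detach(is_detached, timeout_s=600, poll_s=5,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll is_detached() until it returns True or timeout_s has elapsed.

    is_detached is a hypothetical callable supplied by the caller; it
    should return True once the disk is no longer attached to the proxy.
    Returns True on success, False if the timeout expired first.
    """
    deadline = clock() + timeout_s
    while True:
        if is_detached():
            return True
        if clock() >= deadline:
            return False  # give up instead of blocking the whole job
        sleep(poll_s)

# Stand-in check that never succeeds, to show the give-up path quickly.
if not wait_for_detach(lambda: False, timeout_s=0.2, poll_s=0.05):
    print("detach timed out: abort this VM's task, alert the admin, continue the job")
```

The default of 600 seconds mirrors the "5 or 10 minutes" proposed above. Using a monotonic clock means a system clock adjustment during the backup window cannot stretch or shorten the wait.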

cby
Expert
Posts: 108
Liked: 6 times
Joined: Feb 24, 2009 5:02 pm
Contact:

Re: Backup job stuck at removing snapshot 99% for 7hrs

Post by cby » Dec 22, 2017 10:24 am

Steven

As an aside, and not directly associated with the issue you report (yes, I've also encountered your frustrating scenario in the past): deleting a snapshot after a backup can be a time-consuming affair, particularly where there are multiple old snapshots in the chain.

A couple of years ago I found the best way to monitor the hanging 99% was to hop onto the ESX host where the VM resides, then change directories to get to the VM in question (e.g. cd /vmfs/volumes/xxx-xxx-xxx.../VM-name) and run something along the lines of:

watch -n 10 -d 'ls -luth | grep -E "delta|flat|sesparse"' (this is a cutdown Linux remember!)

to get a regular update of snapshot deletion progress. As I said, this is particularly useful with multiple snapshots, which was the case here.

As an added bonus this also allowed me to calculate a snapshot deletion/merge completion time to within a surprisingly accurate degree.
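
The post does not say how the completion time was calculated, so the following is only one plausible way to do it, sketched in Python. It assumes that merge time is roughly proportional to the size of each snapshot file and that you have noted, from the `watch` output, how long the files merged so far took. The function name and the sample sizes are made up for illustration.

```python
from datetime import timedelta

def estimate_remaining(sizes_bytes, merged_count, elapsed_seconds):
    """Extrapolate the time left for a snapshot chain merge.

    sizes_bytes     sizes of the snapshot files, in merge order
    merged_count    how many of them have been fully merged so far
    elapsed_seconds how long those merges took in total
    Assumes merge time is proportional to file size (a simplification).
    """
    if not 0 < merged_count <= len(sizes_bytes):
        raise ValueError("merged_count must be between 1 and the number of files")
    done = sum(sizes_bytes[:merged_count])
    if done <= 0 or elapsed_seconds <= 0:
        raise ValueError("need some merged data and elapsed time to extrapolate")
    left = sum(sizes_bytes[merged_count:])
    rate = done / elapsed_seconds  # observed bytes merged per second
    return timedelta(seconds=left / rate)

GIB = 1024 ** 3
# Hypothetical chain of three snapshot files, in merge order.
sizes = [40 * GIB, 10 * GIB, 30 * GIB]
# Suppose the first file took 60 minutes to merge.
print(estimate_remaining(sizes, merged_count=1, elapsed_seconds=3600))
```

With these sample numbers, 40 GiB remain and the observed rate is 40 GiB per hour, so the estimate comes out at about one more hour. A busy VM or loaded storage will change the rate mid-merge, so it is worth re-running the estimate after each file completes.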

May help someone in the same situation.

Gostev
Veeam Software
Posts: 22813
Liked: 2807 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Backup job stuck at removing snapshot 99% for 7hrs

Post by Gostev » Dec 27, 2017 3:20 pm

stevenrodenburg1 wrote: Sorry mr. White but no way that VMware needs 7 hours or more to delete a backup-snapshot (assuming it's not a gigantic snapshot and the storage is not totally overloaded)
I would actually disagree here, as I've seen numerous reports over the past years where it took over 24 hours. No Veeam in the picture, just admins chatting on VMware Communities, worrying about a "stuck" snapshot removal task in vCenter.

It is important to realize that this is not always about the size of the snapshots. It can also be caused by overloaded primary storage combined with a busy VM, which leaves the snapshot removal process unable to catch up with the new writes (certain snapshot size and storage performance thresholds must be met before the hypervisor will allow the VM stun for that final aux snapshot commit, to ensure the stun time remains acceptable).
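
A toy model in Python shows why a busy VM matters more than raw snapshot size. This is a deliberate simplification and not the hypervisor's actual algorithm: it assumes the backlog shrinks at a constant merge rate while the VM adds new writes at a constant rate, and it ignores the stun thresholds mentioned above. All rates and sizes are invented for illustration.

```python
def consolidation_seconds(snapshot_bytes, merge_rate, write_rate):
    """Toy model: backlog shrinks at (merge_rate - write_rate) bytes/second.

    Returns the seconds needed to drain the backlog, or None when the
    merge cannot outpace the VM's new writes.
    """
    net = merge_rate - write_rate
    if net <= 0:
        return None  # removal never catches up in this model
    return snapshot_bytes / net

MIB = 1024 ** 2
GIB = 1024 ** 3

# Same 20 GiB snapshot, same storage merge rate, three different write loads.
quiet = consolidation_seconds(20 * GIB, merge_rate=100 * MIB, write_rate=5 * MIB)
busy = consolidation_seconds(20 * GIB, merge_rate=100 * MIB, write_rate=95 * MIB)
stuck = consolidation_seconds(20 * GIB, merge_rate=100 * MIB, write_rate=120 * MIB)
print(quiet, busy, stuck)
```

With these made-up numbers, the quiet VM finishes in under four minutes, the busy one needs over an hour for the same 20 GiB, and the third case never converges at all, which is the "unable to catch up" situation described above.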

Luckily, most of that stuff is a thing of the past now that VMware has completely re-architected snapshot removal in the latest vSphere releases by reusing the code from Storage vMotion, which allows new writes to go straight into the base VMDK instead of a snapshot file.
