Host-based backup of VMware vSphere VMs.
mpasaa
Enthusiast
Posts: 36
Liked: 2 times
Joined: Sep 08, 2009 3:28 pm
Full Name: Mike Audet

VM seems to be "hung" and unresponsive

Post by mpasaa »

We are trying to troubleshoot a very intermittent issue that has been showing up on VMs lately. It's rarely the same VM, it occurs on both desktop and server VMs located in different subnets, nothing is logged, and vCenter shows no issues with the VM at first glance.

It's only when you try to RDP to it, console in, or run scans with various management tools that you notice something isn't quite right about the connectivity of the VM in question. At first, we suspected latency on this network as a possible cause, or even the security scanning that runs on this network, which sometimes causes issues itself, but now I am not so sure. The biggest problem is that nothing is logged in the Windows event logs, or on the VM appliances, some of which run various Linux flavors, so it's NOT an OS-specific problem and it is random. The only "event" entries I could find are the normal Veeam items showing when the VM is being processed by Veeam and then the snapshot being removed, but nothing else.

Again, there is never anything obvious, and we literally have to stumble across a problematic server or get a complaint from a VDI user before we realize something isn't working. For example, we also use Shavlik netchk for Windows patching, and whenever this issue occurs the scans show the server or workstation as "unresponsive" or "not reachable" even though it appears to be powered on. If I go to the VM itself, the one thing we notice is that when you right-click the VM and open the Power options menu to reboot the guest, the only options available are Power Off, Suspend & Reset, while the other options are grayed out. The ONLY time I ever see any menu options grayed out is during snapshots or when other VM reconfigurations are taking place, as the VM is locked for that brief time.

That's why I am starting to suspect a Veeam 8 backup issue related to snapshots or the backup process. Something is intermittently and randomly causing VMs to become unreachable, but without killing them altogether or logging potential causes aside from the normal Veeam tasks.

Thoughts? Our current build is 8.0.0.817 Standard version
Shestakov
Veteran
Posts: 7328
Liked: 781 times
Joined: May 21, 2014 11:03 am
Full Name: Nikita Shestakov
Location: Prague
Contact:

Re: VM seems to be "hung" and unresponsive

Post by Shestakov »

Hello Mike,

Could you elaborate on that:
"It's only when you try to RDP to it or console in or run scans with various management tools that you noticed something isn't quite right about the connectivity of the VM in question."?
So when you try to connect to those VMs, they become unresponsive? Is that happening while backup/replication jobs are working with those VMs? Have you analyzed the performance metrics of those VMs?
It's quite difficult to say what the problem is from the description (at least for me). Have you considered contacting technical support for a detailed investigation of the problem?

Thanks.
mpasaa

Re: VM seems to be "hung" and unresponsive

Post by mpasaa »

What I meant was that this VM condition ONLY becomes apparent when we try to RDP to a server or desktop VM, or run one of our management tools like Shavlik for patching and it sends back an error about the VM being unreachable. THEN, when we go to vSphere and try to reboot the GUEST, the only options available are the ones I mentioned (POWER OFF, SUSPEND AND RESET) and the others are grayed out, so we cannot gracefully shut down the VM. That said, resetting these VMs doesn't cause Windows to think it was improperly shut down, so something is intermittently putting our VMs in a state where they are neither fully connected nor completely offline. They are still running but not really FULLY ONLINE until we reset.

I do have a ticket open just in case this isn't the proper forum for this and I only wanted to find out if ANYONE else was experiencing similar issues. This could require more technical support than general input and I am well aware of that.

We've even got 2 different vCenter servers running in this environment, and this issue occurs on servers, VDI running under Horizon View, static desktops, etc. We are still investigating this issue too, and I am NOT saying Veeam is the cause, BUT it DOES interact with vmtools and does snapshots, which CAN cause and HAS caused issues in past versions, so that's why I posted this to see if anyone had any possible causes for this.

It could very well be something else on this network, as we've had issues with security scanning killing our connections and causing problems because, we think, their scanning wasn't tuned properly... There are many potential causes in our environment, but we can only check the ones we manage. Thanks for the input.
Shestakov

Re: VM seems to be "hung" and unresponsive

Post by Shestakov »

Thank you for the explanation, Mike.
Could you provide the support case # for us to follow the situation (assuming you've contacted Veeam technical support)?
Thanks!
mpasaa

Re: VM seems to be "hung" and unresponsive

Post by mpasaa »

Sure thing. Case # 00703124

I have already uploaded logs and they didn't see any errors which is why I do not believe Veeam to be the issue but I have to check before pointing fingers at other security groups in this environment. You know how that goes :-)
Vitaliy S.
VP, Product Management
Posts: 27055
Liked: 2710 times
Joined: Mar 30, 2009 9:13 am
Full Name: Vitaliy Safarov
Contact:

Re: VM seems to be "hung" and unresponsive

Post by Vitaliy S. »

mpasaa wrote:The ONLY time I ever see any menu options grayed out is during snapshots or when other VM reconfigurations are taking place as it is locked for that brief time.
Looks like a similar issue that is described over here > Snapshot removal issues of a large VM

That's a pretty large topic to read, so as a short summary: VM stun can happen during the snapshot commit operation. You can reproduce it easily without any Veeam server activity. Just create a snapshot, give it a bit of time to grow, and then start the commit operation. You will find a couple of tips in the topic I've referenced above that might be useful.
daveyrand
Novice
Posts: 5
Liked: 1 time
Joined: Apr 20, 2015 8:15 am
Full Name: Dave Randall
Contact:

Re: VM seems to be "hung" and unresponsive

Post by daveyrand »

Hi Mike,

This is very interesting. We appear to be having exactly the same issue, I thought we were all alone!

Here are the symptoms we experience when a VM is affected...

- We are unable to connect to the VM via RDP
- We can access the VM's console, but attempts to unlock it with CTRL-ALT-DEL time out after four or five minutes
- The server responds to pings, so our monitoring solution reports it as "Up"
- We can access some MMC snap-ins remotely (e.g. Services), but others just time out (e.g. Event Viewer)
- All required services on the VM are in a running state
- We cannot access things like shared folders
- Performing a hard reset on the VM will return it to normal, until the problem next occurs

...does that look like the same issue you were having?
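Since ping keeps reporting "Up" while RDP and shared folders are dead, a monitoring check that goes one layer higher than ICMP may catch this state sooner. Here is a minimal sketch in Python, not something from our actual monitoring setup: the function names are made up for illustration, and the only assumption about the servers is that RDP and SMB listen on their standard ports (3389 and 445). A successful TCP connect is still a shallow test, so a half-hung guest could pass it.

```python
import socket
from typing import Dict


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def service_health(host: str, ports: Dict[str, int],
                   timeout: float = 3.0) -> Dict[str, bool]:
    """Probe each named port on host and report which ones accept connections."""
    return {name: tcp_port_open(host, port, timeout)
            for name, port in ports.items()}
```

Calling `service_health("fileserver01", {"rdp": 3389, "smb": 445})` (the host name is a placeholder) returns a dict of booleans that a monitoring job could alert on when any value is False.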

If so, were you able to track down the issue as being caused by snapshot removal? That does seem to be the case for us, as the problem only seems to occur overnight, and only on nights when Veeam has backed up that particular server.

Cheers,

Dave
foggy
Veeam Software
Posts: 21069
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: VM seems to be "hung" and unresponsive

Post by foggy »

Dave, are you able to reproduce similar behavior by creating a snapshot of this VM manually?
daveyrand

Re: VM seems to be "hung" and unresponsive

Post by daveyrand »

Hi Foggy,

I gave it a quick go yesterday and it didn't trigger the problem, but one test is probably too small a sample set to draw any conclusions.

I'll do a more intensive test today and see what happens.
veremin
Product Manager
Posts: 20270
Liked: 2252 times
Joined: Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin
Contact:

Re: VM seems to be "hung" and unresponsive

Post by veremin »

Speaking about your issue, does it occur as soon as a backup job starts (snapshot is created), after some time, or closer to the job's end (snapshot is removed)? In the test you're going to perform by creating a snapshot manually, try to reproduce a similar time pattern. Thanks.
brupnick
Expert
Posts: 196
Liked: 13 times
Joined: Feb 05, 2011 5:09 pm
Full Name: Brian Rupnick
Location: New York, USA
Contact:

Re: VM seems to be "hung" and unresponsive

Post by brupnick » 2 people like this post

We had a similar experience in our environment, particularly with database servers that were generating a large number of changes. Because of the high change rate, VMware was never able to catch up with the snapshot removal process, so it would stun the VM for as long as was necessary to finish the operation. VMware has a pretty good summary of what happens during this time as well as some workarounds here: http://kb.vmware.com/kb/2039754

One thing that we learned when investigating this was that the size of the snapshot isn't really the issue, it's more the change rate. For example, if you have a 200 GB snapshot on a VM that is relatively quiet during the snapshot removal process, there's a good chance that you won't experience this behavior. However, if you have a 2 GB snapshot on a VM that is constantly changing at a high rate while trying to remove a snapshot, VMware will probably stun the machine after the 10th iteration (by default) in order to get off the snapshot.
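To illustrate why change rate matters more than snapshot size, here is a toy model in Python. It is a sketch under my own simplifying assumptions, not VMware's actual algorithm: each pass commits the current delta at a fixed rate while the VM keeps writing new changes at a fixed rate, and after a maximum number of passes (10, matching the default mentioned above) the VM is stunned until whatever is left has been committed. The function name, the rates, and the 16 MB "small enough" threshold are all made up for illustration.

```python
def simulate_consolidation(snapshot_gb, change_rate_mb_s, commit_rate_mb_s,
                           max_iterations=10, stun_threshold_mb=16.0):
    """Toy model of iterative snapshot removal.

    Returns a dict with the number of passes run, whether the iteration
    limit was hit with a large delta still outstanding ("forced"), and the
    length of the final stun in seconds.
    """
    remaining_mb = snapshot_gb * 1024.0
    for iteration in range(1, max_iterations + 1):
        if remaining_mb <= stun_threshold_mb:
            # Delta is small enough to finish under a very short stun.
            return {"iterations": iteration - 1,
                    "forced": False,
                    "final_stun_s": remaining_mb / commit_rate_mb_s}
        # Committing the current delta takes this long ...
        pass_seconds = remaining_mb / commit_rate_mb_s
        # ... and meanwhile the running VM produces a new delta.
        remaining_mb = change_rate_mb_s * pass_seconds
    # Iteration limit reached: stun the VM until the rest is committed.
    return {"iterations": max_iterations,
            "forced": remaining_mb > stun_threshold_mb,
            "final_stun_s": remaining_mb / commit_rate_mb_s}


# Hypothetical figures: a large snapshot on a quiet VM versus a small
# snapshot on a VM that writes faster than the storage can commit.
quiet_vm = simulate_consolidation(200, change_rate_mb_s=0.5, commit_rate_mb_s=100)
busy_vm = simulate_consolidation(2, change_rate_mb_s=120, commit_rate_mb_s=100)
```

In this model the quiet 200 GB case converges after a couple of passes, while the busy 2 GB case never shrinks and ends in a forced, much longer stun, which is the pattern described above.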
daveyrand

Re: VM seems to be "hung" and unresponsive

Post by daveyrand »

v.Eremin wrote: Speaking about your issue, does it occur as soon as a backup job starts (snapshot is created), after some time, or closer to the job's end (snapshot is removed)? In the test you're going to perform by creating a snapshot manually, try to reproduce a similar time pattern. Thanks.
It's difficult to say with absolute certainty, as nothing is logged to suggest when the problem emerged. However, it looks closer to the end (snapshot removal) than the beginning.

I spent yesterday creating and removing snapshots on one of the problem servers, but the issue did not reoccur. I left each snapshot open for at least an hour to ensure that the time pattern was the same as or longer than the backup duration. I even left one snapshot open overnight, but even when I removed that this morning it went okay.

I'm going to switch tactics today and constantly run full backups on one of the problem servers to see if that does the trick.
daveyrand

Re: VM seems to be "hung" and unresponsive

Post by daveyrand »

brupnick wrote:We had a similar experience in our environment, particularly with database servers that were generating a large number of changes. Because of the high change rate, VMware was never able to catch up with the snapshot removal process, so it would stun the VM for as long as was necessary to finish the operation. VMware has a pretty good summary of what happens during this time as well as some workarounds here: http://kb.vmware.com/kb/2039754

One thing that we learned when investigating this was that the size of the snapshot isn't really the issue, it's more the change rate. For example, if you have a 200 GB snapshot on a VM that is relatively quiet during the snapshot removal process, there's a good chance that you won't experience this behavior. However, if you have a 2 GB snapshot on a VM that is constantly changing at a high rate while trying to remove a snapshot, VMware will probably stun the machine after the 10th iteration (by default) in order to get off the snapshot.
Bingo! For us it's always high-transaction stuff like Exchange servers, SQL servers and the like. No impact on DCs or file servers.

I'll be sure to check out the linked KB.
daveyrand

Re: VM seems to be "hung" and unresponsive

Post by daveyrand » 1 person likes this post

v.Eremin wrote: Speaking about your issue, does it occur as soon as a backup job starts (snapshot is created), after some time, or closer to the job's end (snapshot is removed)? In the test you're going to perform by creating a snapshot manually, try to reproduce a similar time pattern. Thanks.
Okay, while I've not been able to reproduce this during production hours, we were lucky enough for the problem to occur last night on a low-priority file server. I say lucky because we run a little PowerShell script against all our file servers that sends out an email notification if it is unable to remotely open a particular text file, a check which it performs every five minutes. Last night we received a string of notifications from the script, the first of which was raised at 19:45.
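For anyone who wants a similar canary, the idea is roughly the following. This is a minimal Python sketch of the same check, not our actual PowerShell script: the function names, the share paths, and the `notify` callback (which would send the email) are all placeholders. It would need to be scheduled every few minutes by cron or Task Scheduler, and a hung share may block the `open()` call for a while before it fails.

```python
from typing import Callable, Dict, List


def can_read_file(path: str) -> bool:
    """Return True if the file at path can be opened and read from."""
    try:
        with open(path, "rb") as handle:
            handle.read(1)
        return True
    except OSError:
        return False


def check_servers(canary_files: Dict[str, str],
                  notify: Callable[[str, str], None]) -> List[str]:
    """Probe one canary file per server and call notify() for each failure.

    canary_files maps a server name to the path of its canary file.
    Returns the list of servers whose file could not be read.
    """
    failed = []
    for server, path in canary_files.items():
        if not can_read_file(path):
            notify(server, path)
            failed.append(server)
    return failed
```

A typical call would pass something like `{"fs01": r"\\fs01\share\canary.txt"}` (a made-up path) together with a function that sends the alert email.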

On the VM in question we only back up the system disk, which is 40 GB. From the history in Backup & Replication I can see that the VM processing started at 19:35 and ended at (drumroll please)... 19:49.
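To spell out the timing: the check runs every five minutes and first failed at 19:45, so the fault began somewhere between 19:40 and 19:45, which sits inside the 19:35 to 19:49 processing window. A minimal Python sketch of that comparison, with function names that are purely illustrative and an assumption that all times fall on the same day:

```python
def to_minutes(hhmm: str) -> int:
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def onset_overlaps_job(first_alert: str, job_start: str, job_end: str,
                       probe_interval_min: int = 5) -> bool:
    """Return True if the fault could have started while the job was running.

    The fault began at some point between the last successful probe
    (first_alert minus the probe interval) and the first failed probe.
    """
    alert = to_minutes(first_alert)
    earliest_onset = alert - probe_interval_min
    return earliest_onset <= to_minutes(job_end) and alert >= to_minutes(job_start)


# Times taken from this post.
overlaps = onset_overlaps_job("19:45", job_start="19:35", job_end="19:49")
```

Running the same comparison across several nights of alerts and job history would show whether the overlap is consistent or a one-off coincidence.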

After one of my colleagues got in this morning at 07:40, he confirmed that it was the usual problem (unable to RDP, can open console but not log in, etc.) and then reset the server in vSphere. As normal, the server was absolutely fine once it came back up, and the notification emails stopped.

This seems to suggest to me that the issue is being triggered by snapshot creation or removal. However, unlike in brupnick's case, the issue continues to occur even after the snapshot removal has completed. This is a server with extremely low activity, and we're only backing up a system partition. Not only that, but it's Windows Server 2008 R2 Server Core, with no applications or roles other than the basic file services stuff.

TL;DR: it looks like our problem is that snapshot creation and/or removal is occasionally stunning a VM, and then not unstunning it on completion.
whippersnapper
Novice
Posts: 5
Liked: 1 time
Joined: Mar 02, 2015 3:15 pm
Full Name: Tone Loke
Contact:

Re: VM seems to be "hung" and unresponsive

Post by whippersnapper »

Did you ever find a resolution to this problem? We are experiencing the same thing in our environment.
Gostev
Chief Product Officer
Posts: 31460
Liked: 6648 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: VM seems to be "hung" and unresponsive

Post by Gostev »

Typically, the reason for such extended stuns is using hot add transport mode with NFS storage, when backup proxy VM and processed VM are running on different hosts.
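If you want to check your own inventory for that combination, the condition can be written down as a simple predicate. This is a minimal Python sketch with made-up names (`VmPlacement`, `hot_add_nfs_risk`) and made-up string labels for the transport mode and datastore type. It does not query vCenter or Veeam, so the placement data would have to come from your own inventory export.

```python
from dataclasses import dataclass


@dataclass
class VmPlacement:
    """Where a processed VM currently lives (values supplied by the caller)."""
    name: str
    host: str
    datastore_type: str  # e.g. "NFS" or "VMFS"


def hot_add_nfs_risk(vm: VmPlacement, proxy_host: str,
                     transport_mode: str) -> bool:
    """Return True when all three conditions from the post above hold:
    hot add transport, NFS storage, and proxy VM on a different host."""
    return (transport_mode.lower() == "hotadd"
            and vm.datastore_type.upper() == "NFS"
            and vm.host != proxy_host)
```

For example, `hot_add_nfs_risk(VmPlacement("fs01", "esx-a", "NFS"), proxy_host="esx-b", transport_mode="hotadd")` flags the VM, while the same VM with the proxy on `esx-a` does not.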