- 
				ashman70
- Expert
- Posts: 203
- Liked: 12 times
- Joined: Dec 04, 2012 2:18 pm
- Full Name: Both
- Contact:
Strange issue with VM replication
I had a weird thing occur this morning with a VM at a customer site that is replicated every 30 minutes. After a replication job ran at 4:30 AM, I began receiving alerts that the VM was down: it was unreachable from the monitoring software, no ping, nothing. I didn't get around to investigating the problem until 7:30, by which time the VM had been down for 3 hours. Fortunately no one uses it during these hours, but at 7:30 the client is open for business and they needed to use their critical application running on this VM. Upon investigating, I found that the VM was up. I got onto the console through the vSphere Client and logged into the guest OS; I couldn't ping anything from the guest OS, nor could the VM be pinged from anywhere else on the LAN. I rebooted the VM, but there was no change: still down and unreachable. A replication job had also started at 7:30, but I canceled it and it stopped. All the replication jobs that had run since 4:30, every half hour, had completed successfully. I contacted Veeam support; they connected remotely and could not see anything wrong. The VM looked normal, no stuck snapshot, nothing. At this point I had to get the customer up and running, so I chose to fail over to a replica snapshot from 4 AM, and it failed over and came up with no issues. The customer had to run all day on the replica, and I am in the process of failing back to production as I type this.
Any ideas what could have caused the VM's networking to fail? I opened a case with VMware and they mentioned that Veeam 'stuns' a VM before taking a snapshot and that they have seen it sometimes cause 'issues' like this.
Has anyone ever experienced anything like this? I have to say I never had issues like this under Veeam 6.5, but I wasn't replicating as much with that version. Since upgrading to 8 I've had a number of issues.
I haven't installed patch 1 yet, but I intend to as soon as I can.
ESXi version is 5.1.0 build 1483097
I have an open case #00724501
			
			
									
						
										
- 
				ashman70
- Expert
- Posts: 203
- Liked: 12 times
- Joined: Dec 04, 2012 2:18 pm
- Full Name: Both
- Contact:
Re: Strange issue with VM replication
Completed the failback to production, and the VM has the same problem: the network card shows 'disconnected'. I tried disabling and re-enabling it, with no change. Then we tried undoing the failback to production, which produced an error and then reported that it had succeeded. Both the production and replica VMs are off at the moment and I am waiting for senior-level support to get in touch with me.
			
			
									
						
										
						- 
				ashman70
- Expert
- Posts: 203
- Liked: 12 times
- Joined: Dec 04, 2012 2:18 pm
- Full Name: Both
- Contact:
Re: Strange issue with VM replication
I'm on the phone with a senior support engineer. There is definitely a problem with the networking on this VM: we have removed the E1000 network card and added it back, with no change. The engineer is now saying something is wrong with VMware, but it seems too coincidental that the network card started experiencing problems right after the 4:30 AM replication job. I only have standard support from VMware, which is during business hours. I understand how Veeam works with snapshots, but something happened to just this VM after the replication job ran at 4:30, and it was broken thereafter.
Still on the phone with the engineer, but not a happy camper.
			
			
									
						
										
- 
				ashman70
- Expert
- Posts: 203
- Liked: 12 times
- Joined: Dec 04, 2012 2:18 pm
- Full Name: Both
- Contact:
Re: Strange issue with VM replication
I guess I overreacted a bit. It turns out it was a memory leak in the vSwitch; apparently it's an issue with VMware 5.1 patch 2 when using E1000 network adapters. My apologies.
			
			
									
						
										
						- 
				Vitaliy S.
- VP, Product Management
- Posts: 27692
- Liked: 2907 times
- Joined: Mar 30, 2009 9:13 am
- Full Name: Vitaliy Safarov
- Contact:
Re: Strange issue with VM replication
Is there any KB from VMware regarding this issue? It would be useful to post it for future readers.
			
			
									
						
										
						- 
				ashman70
- Expert
- Posts: 203
- Liked: 12 times
- Joined: Dec 04, 2012 2:18 pm
- Full Name: Both
- Contact:
Re: Strange issue with VM replication
Yes, here is a link to the KB article.
http://kb.vmware.com/selfservice/micros ... Id=2072694
I just wonder if Veeam might have contributed to the problem in any way, since this particular VM replicates every 30 minutes and was the only one having the issue. Other VMs on this host were unaffected, but they do not replicate anywhere near as often.
			
			
									
						
										
- 
				Vitaliy S.
- VP, Product Management
- Posts: 27692
- Liked: 2907 times
- Joined: Mar 30, 2009 9:13 am
- Full Name: Vitaliy Safarov
- Contact:
Re: Strange issue with VM replication
Not sure. Is it a highly transactional application? I think it should be possible to see more details about the issue in the VMware VM logs located in the datastore. Thanks for the link!
			
			
									
						
										
						- 
				ashman70
- Expert
- Posts: 203
- Liked: 12 times
- Joined: Dec 04, 2012 2:18 pm
- Full Name: Both
- Contact:
Re: Strange issue with VM replication
It's running a Pervasive SQL database, but there are maybe about 20 people using it at the same time, and certainly no one using it between 12 and 7 AM.
			
			
									
						
										
						- 
				ashman70
- Expert
- Posts: 203
- Liked: 12 times
- Joined: Dec 04, 2012 2:18 pm
- Full Name: Both
- Contact:
Re: Strange issue with VM replication
OK, so today a replication job ran on an Exchange VM on this same host and the same thing happened: the network card went down with the exact same symptoms as on the other virtual machine. I fixed it the same way: removed the E1000 network card, added a VMXNET 2 adapter, and it worked. I am suspicious, though, because everything was fine with this VM until the replication job ran. I am not saying the replication job was necessarily responsible for the issue, but could it have contributed in some way to exhausting the memory on the vSwitch and the port this VM is connected to? I am going to be applying patch 4 for this version of ESXi this weekend, but I just want to bring this to your attention. I am a little uncertain whether I should continue with my other replication jobs for other VMs on this host or hold off until I apply the patch.
			
			
									
						
										
						- 
				ashman70
- Expert
- Posts: 203
- Liked: 12 times
- Joined: Dec 04, 2012 2:18 pm
- Full Name: Both
- Contact:
Re: Strange issue with VM replication
I ran a replication job on the second affected virtual machine and it ran fine.
			
			
									
						
										
						- 
				ashman70
- Expert
- Posts: 203
- Liked: 12 times
- Joined: Dec 04, 2012 2:18 pm
- Full Name: Both
- Contact:
Re: Strange issue with VM replication
A backup job ran fine on the first affected VM after we implemented the workaround. What, if anything, is different about a replication job versus a backup job that might be contributing to this issue? I have suspended my other replication jobs for the time being.
			
			
									
						
										
						- 
				Vitaliy S.
- VP, Product Management
- Posts: 27692
- Liked: 2907 times
- Joined: Mar 30, 2009 9:13 am
- Full Name: Vitaliy Safarov
- Contact:
Re: Strange issue with VM replication
Was the duration of the backup and replication jobs the same in both runs? What about the network transfer (amount of data) through that switch, was it the same?
			
			
									
						
										
						- 
				ashman70
- Expert
- Posts: 203
- Liked: 12 times
- Joined: Dec 04, 2012 2:18 pm
- Full Name: Both
- Contact:
Re: Strange issue with VM replication
It depends on the VM. Some replication jobs run more frequently than others and so replicate less per job than those that only run once a day. I think the issue here is that the more replication jobs run, the more traffic they send through the vSwitch, and since I am suffering from this bug, the traffic exacerbates the out-of-memory condition for the port the VM is connected to on the vSwitch, causing the port to go offline. It happened to the customer's terminal server at 4:30 AM yesterday morning, and when I replaced the network adapter on the terminal server VM it killed terminal services: no one could connect via RDP. I wound up failing over to their replica so they could work, and I replaced the network adapters on their two remaining VMs, which are domain controllers. This weekend I am going to apply update 3 for the version of ESXi I am running, which fixes this issue.
			
			
									
						
										
						- 
				Vitaliy S.
- VP, Product Management
- Posts: 27692
- Liked: 2907 times
- Joined: Mar 30, 2009 9:13 am
- Full Name: Vitaliy Safarov
- Contact:
Re: Strange issue with VM replication
Thanks for the further information, and please keep us updated on how the patching goes.
			
			
									
						
										