Issues with redo log corruption - hotadd?

Post by **tgiphil** » Dec 20, 2011 11:49 pm

Is anyone experiencing issues with corrupt virtual machine disks after snapshot removal during a backup job?

We lost three VMs in the past week after moving Veeam v6 into production. VMware reported failed consolidation events after Veeam removed the snapshot at the end of the backup, followed shortly thereafter by redo log .vmdk corruption messages. The VMs required full restores to bring them back into production.

We opened a case with VMware support for the first VM that failed. Support researched the issue and stated that the corruption was caused by Veeam and BackupExec running at the same time. Okay, so we stopped doing that.

Then yesterday and last Thursday, two other VMs that were not associated with BackupExec failed for similar reasons. We are researching the issue and have tickets open with Veeam and VMware support. The odd thing about the last failure was that the backup proxy VM also stopped: VMware turned it off immediately when it reported corruption in the redo log, which belonged to the other VM’s disk. That other VM actually stayed up while Windows filled the event log with disk errors.

Based on the events, I suspect the issue is related to the HotAdd feature used by the virtual appliance transport mode.

While I sincerely hope it’s not a Veeam issue (I really like the product), I’d like to figure out what is going on, as we can’t have VMs being corrupted.

Post by **vmexpert** » Dec 21, 2011 12:04 am

I had a similar issue many moons ago, and it came down to a storage problem. Snapshot consolidation is a very I/O-intensive operation, and apparently the extra load was overwhelming a faulty controller, making it go crazy.

Post by **Gostev** » Dec 21, 2011 12:20 am

I doubt this has anything to do with hot add mode specifically. The Veeam job issues the snapshot removal API call only once the hot-added disks have been successfully dismounted from the backup proxy VM. This ordering is required for other reasons: doing it in the wrong order leaves hidden undeleted snapshots, because locked virtual disk files prevent the consolidation process.

So, at the point where you are experiencing the issue, the operation is identical to removing a snapshot from any VM with the vSphere Client (both Veeam B&R and the vSphere Client even use the same vSphere API call to initiate the removal).
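The ordering described above can be illustrated with a minimal Python sketch. It is not Veeam's implementation: `finish_hotadd_backup`, `detach_disk` and `remove_snapshot` are hypothetical names, and the two callbacks are stand-ins for whatever the backup software actually does. The sketch only shows the rule that snapshot removal is attempted after every hot-added disk has been detached from the proxy.

```python
from typing import Callable, Iterable


def finish_hotadd_backup(
    hot_added_disks: Iterable[str],
    detach_disk: Callable[[str], bool],
    remove_snapshot: Callable[[], None],
) -> None:
    """Detach every hot-added disk from the proxy, then remove the snapshot.

    detach_disk and remove_snapshot are hypothetical callbacks supplied by
    the caller; they are assumptions made for this illustration.
    """
    # Try to detach all disks first and remember any that refuse to detach.
    still_attached = [disk for disk in hot_added_disks if not detach_disk(disk)]
    if still_attached:
        # A disk still attached to the proxy keeps its files locked, so the
        # snapshot removal is not attempted at all.
        raise RuntimeError(f"disks still attached to proxy: {still_attached}")
    remove_snapshot()


# Record the order of operations with stand-in callbacks.
events = []


def fake_detach(disk):
    events.append(f"detach {disk}")
    return True


def fake_remove():
    events.append("remove snapshot")


finish_hotadd_backup(["disk1.vmdk", "disk2.vmdk"], fake_detach, fake_remove)
print(events)
```

In this sketch a disk that fails to detach aborts the sequence, so the snapshot is left in place and visible rather than being removed while its files are still locked.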

What vSphere version are you running? And please include your Veeam support case ID with any technical issue you report on forums.

Post by **tgiphil** » Dec 21, 2011 11:55 am

We are using vSphere 5. Support case is 5162090.

There may be other factors involved, and the backup job may just happen to trigger it. Everything is on the table at the moment. Hopefully, VMware's detailed review of the logs will narrow it down to the root cause.

If anyone has had a similar experience, we would like to know as it may help resolve our issue.

Note: VMware released patches for 5.0 on 12/15. One is for the VAAI Thin Provisioning Block Space Reclamation issue, as described in VMware KB article 2007427. It may affect snapshots. I don't know whether this is related, but I've asked VMware to consider it.

Post by **tgiphil** » Dec 22, 2011 7:32 pm

Not much to update, except that VMware's escalation engineers are actively involved. For the time being, we have suspended "hot-add" and reverted to "network" transport until the root cause has been determined.

One thing is for sure: Veeam is not detecting failed snapshot removals and reports them as successful. vSphere, however, is logging the failure.

Post by **Gostev** » Dec 22, 2011 8:04 pm

This only means that the RemoveSnapshot() call returns "success" to us when in fact the process did not succeed, which makes it a bug in the VMware vSphere 5 API (or it may be intended behavior for some reason).
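Because the API result cannot be trusted in this situation, one independent check is to look at the VM's files after the removal. The Python sketch below is only an illustration under stated assumptions: `leftover_redo_logs` is a hypothetical helper, the file list and snapshot count must be gathered by the caller, and the pattern relies on the usual VMFS naming of redo logs (`<disk>-000001.vmdk` and `<disk>-000001-delta.vmdk`). The file names are made up.

```python
import re

# Snapshot redo logs on VMFS are conventionally named <disk>-000001.vmdk
# (descriptor) and <disk>-000001-delta.vmdk (data extent).
_REDO_LOG = re.compile(r"-\d{6}(-delta)?\.vmdk$", re.IGNORECASE)


def leftover_redo_logs(vm_files, registered_snapshots):
    """Return redo-log files that remain although no snapshot is registered.

    vm_files is a list of file names from the VM's folder, and
    registered_snapshots is the number of snapshots the VM reports.
    Both are inputs the caller has to collect; this helper is hypothetical.
    """
    if registered_snapshots > 0:
        return []  # redo logs are expected while snapshots exist
    return sorted(name for name in vm_files if _REDO_LOG.search(name))


# Made-up example: the removal "succeeded" but redo logs are still on disk.
files = [
    "dc01.vmx",
    "dc01.vmdk",
    "dc01-flat.vmdk",
    "dc01-000001.vmdk",
    "dc01-000001-delta.vmdk",
]
print(leftover_redo_logs(files, registered_snapshots=0))
```

A non-empty result after a removal that reported success suggests the consolidation did not complete and the VM should be checked before the next backup run.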

Post by **GabesVirtualWorld** » Jan 12, 2012 3:28 pm

Exactly the same issue here:
vSphere 5 (2x ESXi 5 hosts) + vCenter 5 virtual appliance
Veeam v6

In the past two weeks we had about 5 VMs that got corrupted this way. I have just filed a support request with VMware and will file one with Veeam too. I even lost my domain controller because of this, and when I tried to restore it using Veeam, it wouldn't work because the Veeam server couldn't authenticate to the SQL Server. Of course not, without a domain controller.

I had to install a new VM without a domain dependency, install Veeam on it, and then restore from the .vbk files.

Gabrie

Post by **GabesVirtualWorld** » Jan 14, 2012 9:10 am

Does anyone have an update on this? I lost another VM this week, and even lost the Veeam server and had to do some weird tricks to get it back.
(Support ticket with Veeam: 5166467 )

Post by **Gostev** » Jan 15, 2012 10:15 pm

Hi Gabrie,

The posts from other users above suggest this is likely to be a storage issue or some bug in VMware vSphere 5. I have provided a detailed explanation of all related processes in my previous posts, and cannot really add any new updates to that.

It looks like the original poster (Phil) stopped responding to our support a few weeks ago. As far as I understand from the last support case record, this looks to be a vSphere 5 snapshot bug affecting certain VMs. I can see that our support engineer was able to reproduce the issue by creating a snapshot on this VM manually, if I am interpreting the last case note correctly (there are few comments).

Thanks.

Post by **tgiphil** » Jan 18, 2012 9:56 pm

Gabrie, sorry I missed your post.

We have had no additional failures since we turned off hotadd in favor of network transport backups AND turned off the SAN UNMAP acceleration feature (via VMware's December patch).

We have an active support case with VMware to determine the root cause. They have told us they were able to recreate the corruption when UNMAP acceleration is turned on, but not when it is turned off. It may still be related to UNMAP alone or to the combination of hotadd + UNMAP, but we don't know yet. I'll post more when I have more information to share.

Open a ticket with VMware about this too. They could use additional logs and samples of corrupt VMs to help narrow down this serious "bug".

Either way, I would not use hotadd or UNMAP anymore.

Post by **GabesVirtualWorld** » Jan 31, 2012 3:05 pm

Maybe this VMware KB has something to do with it: http://kb.vmware.com/selfservice/micros ... Id=2007427
