Host-based backup of VMware vSphere VMs.
pvz
Influencer
Posts: 18
Liked: 3 times
Joined: May 28, 2011 10:12 am
Full Name: Per von Zweigbergk
Contact:

Data loss bug - Instant VM Recovery with redirected writes

Post by pvz »

I'd just like to warn users out there about a bug in Veeam Backup & Replication which will eat your production data in a specific restore scenario if you are not careful.

I have a ticket open for this case (00674402), and I'm taking this to the forums because I feel information about the bug needs to come out, so that people can work around it and avoid data loss in a restore scenario.

To trigger this bug you have to:

1. Perform an Instant VM Recovery of your virtual machine. When configuring the Instant VM Recovery job, enable redirection of disk updates to the datastore you're planning on migrating the production VM to.

2. Power on the VM. At this point, the machine will be "live" and accepting user data. The machine will also not be covered by any backup, unless you take steps to ensure that it will.

3. At an appropriate time, perform a "Migrate to production" on the VM. Choose the same datastore as in step 1. (I'm not sure whether the two datastores actually need to be the same; I have not tested with two different production datastores.) Ensure that VMware Storage VMotion is used (not Quick Migration). There is a check box at the end called "Delete source VM files upon successful quick migration (does not apply to vMotion)". Set this check box however you like; it makes no difference.

4. Kiss your production data goodbye. Any data written between steps 2 and 3, a period that could potentially last several hours or even days while you wait for an appropriate service window, is gone. You'll of course still have the original backup that you spun up the Instant VM Recovery from, but anything after that is irretrievable. What happens is that Veeam triggers a Storage VMotion of the machine. For some reason, perhaps because the redo logs are already on the destination datastore, it decides that the Storage VMotion is done after only a few seconds, even though the data is still on the vPower NFS datastore. At this point, Veeam decides to DELETE your instant recovery VM because the Instant Recovery job is "done". That is most definitely not the desired behaviour for anyone in any scenario.

Right now, I'm a bit wary of offering workarounds. I suggest that anybody planning to use this feature test it in their own environment and make sure that anybody in their organization who might do an Instant VM Recovery knows about this bug, until Veeam releases a patch for it. Some possible workarounds (again, I take no responsibility for these; you will have to test them yourself to see if they work in your environment):

1. Don't use Instant VM Recovery, instead do a regular VM restore.
2. If you have to use Instant VM Recovery, do not redirect virtual disk updates.
3. If you have to redirect virtual disk updates - try using a different datastore for your migration destination and your disk updates. (UNTESTED)
4. If you have to put your virtual disk updates on the same datastore you plan on migrating to, use Veeam Quick Migration rather than Storage vMotion.

If your VMware installation doesn't have a license for Storage vMotion, you will not be bitten by this bug, because it only happens when using Storage vMotion.

Now, I didn't experience any real data loss myself, because I happened to find this bug while demoing the software to a colleague who was preparing documentation for VM recovery procedures in our organization, so all I lost was some test data. But it might as well have been real data loss.

Still, I'm disappointed that Veeam Support has not been taking this bug seriously. The last response I got from Veeam is this:
As we discussed with engineers it is not actually a bug from Veeam side, it is more by design behavior. Because all steps from Veeam were done correctly according to the settings set for the jobs. We thing about warning message to notify user about consequences of these steps. In the next patches of Veeam we are gonna to add this notification.
In other words: Veeam will eat your data. By design. :roll:

Is it just me having too high expectations, or does anyone else find this kind of stance about this kind of bug... strange? Makes me wonder what other "design behaviours" are lurking below...
Gostev
Chief Product Officer
Posts: 31814
Liked: 7302 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Data loss bug - Instant VM Recovery with redirected writ

Post by Gostev »

Hello. If this bug is real, then I'm also disappointed that this particular Veeam support engineer is not taking it seriously. However, I also find it strange that you are the first user to run into this bug after 5 years of the Instant VM Recovery feature's existence, with tens of thousands of users using it for years for production recoveries. So, let me get more details on this, and on what exactly it takes to run into this bug. Thanks!
m_zolkin
Veeam Software
Posts: 37
Liked: 17 times
Joined: Aug 26, 2009 1:13 pm
Full Name: Mike Zolkin
Location: St. Petersburg, Florida
Contact:

Re: Data loss bug - Instant VM Recovery with redirected writ

Post by m_zolkin »

Hi Per,

We are reviewing the conversation between you and the Veeam technical representative. It looks like the bug was reported to support management on 12th Nov, once the technician had reproduced the problem in our environment. I am still trying to understand why it got stuck there, but in the meantime our QA team is working on it.

Once QA has collected all the necessary data, and if they confirm this behavior, we'll get in touch with VMware support and report the problem to them.

Once again, please accept our apologies for any inconvenience caused by this incident.
VP, WW Customer Technical Support
Gostev
Chief Product Officer
Posts: 31814
Liked: 7302 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Data loss bug - Instant VM Recovery with redirected writ

Post by Gostev »

So, this turned out to be a critical bug in VMware Storage VMotion logic.

Steps to reproduce without Veeam in the picture:
1. Create a test VM with virtual disk on Datastore 1.
2. Use the workingDir and snapshot.redoNotWithParent VMX parameters to move the snapshot file location to Datastore 2.
3. Create a VM snapshot.
4. Perform Storage VMotion from Datastore 1 to Datastore 2.
5. Storage VMotion operation reports success, however no VM files are actually moved anywhere.

Redirecting snapshots to the same datastore that will be the Storage VMotion target is required to trigger the issue. If you use any other datastore, you will not run into this bug.
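
As a rough illustration of step 2 above, here is a minimal Python sketch that reads VMX text and reports which datastore the redo logs are redirected to. The workingDir and snapshot.redoNotWithParent parameters come from the steps above; the sample VMX content, the assumed /vmfs/volumes/<datastore>/... layout of workingDir, and all function names are hypothetical and only meant to show the idea.

```python
import re
from typing import Dict, Optional

# Matches simple VMX lines of the form: key = "value"
VMX_LINE = re.compile(r'^\s*([^#\s=]+)\s*=\s*"(.*)"\s*$')


def parse_vmx(text: str) -> Dict[str, str]:
    """Parse key = "value" lines into a dict with lowercased keys."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        match = VMX_LINE.match(line)
        if match:
            settings[match.group(1).lower()] = match.group(2)
    return settings


def redirected_snapshot_datastore(vmx_text: str) -> Optional[str]:
    """Return the datastore named in workingDir when redo logs are
    redirected away from the parent disk, otherwise None.

    Assumption: workingDir looks like /vmfs/volumes/<datastore>/<folder>.
    """
    settings = parse_vmx(vmx_text)
    if settings.get("snapshot.redonotwithparent", "").lower() != "true":
        return None
    match = re.match(r"^/vmfs/volumes/([^/]+)", settings.get("workingdir", ""))
    return match.group(1) if match else None


# Hypothetical VMX fragment for the test VM described in the steps above.
SAMPLE_VMX = '''
displayName = "testvm"
workingDir = "/vmfs/volumes/Datastore2/testvm-snapshots"
snapshot.redoNotWithParent = "TRUE"
'''

if __name__ == "__main__":
    print(redirected_snapshot_datastore(SAMPLE_VMX))
```

If the value returned here equals the datastore you intend to use as the Storage VMotion target, the VM matches the setup described in this post.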

We will notify VMware about the issue.

To work around the issue, select the "Force Veeam quick migration" check box when migrating an instantly recovered VM to production storage.
m_zolkin
Veeam Software
Posts: 37
Liked: 17 times
Joined: Aug 26, 2009 1:13 pm
Full Name: Mike Zolkin
Location: St. Petersburg, Florida
Contact:

Re: Data loss bug - Instant VM Recovery with redirected writ

Post by m_zolkin »

Gostev wrote: However, I also find it strange that you are the first user to run into this bug after 5 years of Instant VM Recovery feature existence, and tens of thousands of users using it for years for production recoveries. So, let me get more details on this, and what exactly it takes to run into this bug. Thanks!
It turned out that only vSphere 5.5 is affected; the scenario worked fine on vSphere 5.1. That explains why we didn't see such issues before.
We have submitted a ticket with VMware SDK support, and Veeam is working on a solution to bypass the issue.
VP, WW Customer Technical Support
pvz
Influencer
Posts: 18
Liked: 3 times
Joined: May 28, 2011 10:12 am
Full Name: Per von Zweigbergk
Contact:

Re: Data loss bug - Instant VM Recovery with redirected writ

Post by pvz »

I would suggest adding a check to Veeam that verifies the files are *actually* on the target datastore, with no dependencies left on the vPower NFS datastore, before nuking the datastore, even if VMware reports success. That should protect any of your customers who might run this in the future on what is now the current version of vSphere.

Paranoia is never a bad policy when it comes to a backup product. :-)
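
A minimal Python sketch of the kind of check suggested above, assuming the caller has already obtained the VM's file list as datastore paths of the form "[datastore] folder/file". How that list is obtained is left out, and the datastore names and function names are hypothetical; this is not how Veeam implements anything.

```python
import re
from typing import Iterable, List

# Datastore paths look like: [DatastoreName] folder/file.vmdk
DATASTORE_PATH = re.compile(r"^\[(?P<datastore>[^\]]+)\]\s*(?P<path>.*)$")


def datastore_of(file_path: str) -> str:
    """Extract the datastore name from a '[datastore] path' string."""
    match = DATASTORE_PATH.match(file_path.strip())
    if match is None:
        raise ValueError(f"not a datastore path: {file_path!r}")
    return match.group("datastore")


def files_outside_target(file_paths: Iterable[str], target_datastore: str) -> List[str]:
    """Return every file that does not live on the target datastore."""
    return [p for p in file_paths if datastore_of(p) != target_datastore]


def safe_to_delete_source(file_paths: Iterable[str], target_datastore: str) -> bool:
    """Only allow cleanup when the VM has files and all of them are on the
    target datastore, regardless of what the migration task reported."""
    paths = list(file_paths)
    return bool(paths) and not files_outside_target(paths, target_datastore)


if __name__ == "__main__":
    # Hypothetical state after a migration that "succeeded" without moving the base disk.
    after_migration = [
        "[VeeamBackup_srv01] vm/vm.vmdk",
        "[Datastore2] vm/vm-000001.vmdk",
    ]
    print(files_outside_target(after_migration, "Datastore2"))
    print(safe_to_delete_source(after_migration, "Datastore2"))
```

An empty file list is treated as unsafe on purpose: with a backup product, refusing to delete is the cheaper mistake.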
Gostev
Chief Product Officer
Posts: 31814
Liked: 7302 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Data loss bug - Instant VM Recovery with redirected writ

Post by Gostev »

For a quick fix to include in Patch #1, we will probably just force the use of the native quick migration engine when we detect such a setup, instead of relying on VMware Storage VMotion.
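
The decision described above could be sketched roughly as follows in Python. This is an assumption about the shape of the logic, not Veeam's actual code; the enum, the function name, and its parameters are all hypothetical.

```python
from enum import Enum
from typing import Optional


class MigrationMethod(Enum):
    STORAGE_VMOTION = "storage_vmotion"
    QUICK_MIGRATION = "quick_migration"


def choose_migration_method(
    redirect_datastore: Optional[str],
    target_datastore: str,
    storage_vmotion_licensed: bool = True,
) -> MigrationMethod:
    """Pick the migration engine for an instantly recovered VM.

    redirect_datastore is the datastore that receives redirected disk
    updates, or None when updates are not redirected.
    """
    # Without a Storage vMotion license there is nothing to choose.
    if not storage_vmotion_licensed:
        return MigrationMethod.QUICK_MIGRATION
    # The risky setup: redo logs already sit on the migration target.
    if redirect_datastore is not None and redirect_datastore == target_datastore:
        return MigrationMethod.QUICK_MIGRATION
    return MigrationMethod.STORAGE_VMOTION


if __name__ == "__main__":
    print(choose_migration_method("Datastore2", "Datastore2"))
    print(choose_migration_method("Datastore3", "Datastore2"))
```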
dutch123
Lurker
Posts: 2
Liked: never
Joined: Dec 27, 2012 12:13 pm
Contact:

Re: Data loss bug - Instant VM Recovery with redirected writ

Post by dutch123 »

Will this also affect Veeam v7 in combination with vSphere 5.5?
Vitaliy S.
VP, Product Management
Posts: 27377
Liked: 2800 times
Joined: Mar 30, 2009 9:13 am
Full Name: Vitaliy Safarov
Contact:

Re: Data loss bug - Instant VM Recovery with redirected writ

Post by Vitaliy S. »

It doesn't matter which version of Veeam B&R is used, since the issue sits in the Storage VMotion engine of vSphere 5.5.
Meyercord
Enthusiast
Posts: 36
Liked: 6 times
Joined: Jul 14, 2014 4:31 pm
Full Name: AJ Meyercord
Contact:

Re: Data loss bug - Instant VM Recovery with redirected writ

Post by Meyercord »

Has VMware acknowledged this bug in their product?
Gostev
Chief Product Officer
Posts: 31814
Liked: 7302 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Data loss bug - Instant VM Recovery with redirected writ

Post by Gostev »

Our support orgs are working with each other. Last I heard, our engineer was able to reproduce the issue for them, and they have collected all the required logs.
dnrc
Influencer
Posts: 12
Liked: 1 time
Joined: Apr 21, 2015 8:19 am
Full Name: Daniel Caine
Contact:

Re: Data loss bug - Instant VM Recovery with redirected writ

Post by dnrc »

Hi, is this bug something I should still be concerned about?

I did exactly the process described here today to restore an Exchange server.

Now I need to migrate to production, but having read this I am concerned about doing so.

running veeam 8.0.0.204
esxi 5.5.0 1331820
vcenter 5.5.0 2442329

What can I do to keep the data safe?

From the posts above, it seems that using a quick migration will be OK. I just want to confirm that this is the case.

thanks
foggy
Veeam Software
Posts: 21139
Liked: 2141 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: Data loss bug - Instant VM Recovery with redirected writ

Post by foggy »

Daniel, right: since the issue is in the Storage vMotion engine, using Quick Migration is safe. Actually, the issue was addressed in the first Update for Veeam B&R v8 (Quick Migration is forced in such a scenario). However, to be completely on the safe side, you can select the "Force Veeam quick migration" check box and clear the "Delete source VM files upon successful quick migration" one.
dnrc
Influencer
Posts: 12
Liked: 1 time
Joined: Apr 21, 2015 8:19 am
Full Name: Daniel Caine
Contact:

Re: Data loss bug - Instant VM Recovery with redirected writ

Post by dnrc »

ok foggy, thanks for that.

that was what i gleaned from the other posts but in this case (production exchange server) i wanted to be sure

i'm still going to take a separate image before doing anything as well though.
veremin
Product Manager
Posts: 20413
Liked: 2302 times
Joined: Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin
Contact:

Re: Data loss bug - Instant VM Recovery with redirected writ

Post by veremin »

VeeamZIP might come in quite handy in this case. Thanks.