Data loss bug - Instant VM Recovery with redirected writes

Post by **pvz** » Nov 17, 2014 9:09 am

I'd just like to warn users about a bug in Veeam Backup & Replication that will eat your production data in a specific restore scenario if you are not careful.

I have a ticket open for this case (00674402), and I'm taking this to the forums because I feel information about the bug needs to come out, so that people can work around it and avoid data loss in a restore scenario.

To trigger this bug you have to:

1. Perform an Instant VM Recovery of your virtual machine. When configuring the Instant VM Recovery job, enable redirection of disk updates to the datastore you're planning on migrating the production VM to.

2. Power on the VM. At this point, the machine will be "live" and accepting user data. The machine will also not be covered by any backup, unless you take steps to ensure that it will.

3. At an appropriate time, perform a "Migrate to production" on the VM. Choose the same datastore as in step 1. (Actually, I'm not sure whether the two datastores need to be the same; I have not tested using two different production datastores.) Ensure that VMware Storage vMotion is used (not Quick Migration). There is a check box at the end called "Delete source VM files upon successful quick migration (does not apply to vMotion)". Set this check box however you like; it makes no difference.

4. Kiss your production data goodbye. Any data written between steps 2 and 3, which could span several hours or even days while waiting for an appropriate service window, is gone. You'll of course still have the original backup that you spun up the Instant VM Recovery from, but anything after that is irretrievable. What happens is that Veeam triggers a Storage vMotion of the machine. For some reason, perhaps because the redo logs are already on the destination datastore, it decides that the Storage vMotion is done after only a few seconds, even though the data is still on the vPower NFS datastore. At this point, Veeam decides to DELETE your instant recovery VM because the Instant Recovery job is "done". That is most definitely not the desired behaviour for anyone in any scenario.

Right now, I'm a bit wary of offering workarounds. I suggest that anybody planning to use this feature test it in their own environment and make sure that anybody in their organization who might do an Instant VM Recovery knows about this bug, until Veeam releases a patch for it. Some possible workarounds (again, I take no responsibility for these; you will have to test them yourself to see if they work in your environment):

1. Don't use Instant VM Recovery, instead do a regular VM restore.
2. If you have to use Instant VM Recovery, do not redirect virtual disk updates.
3. If you have to redirect virtual disk updates - try using a different datastore for your migration destination and your disk updates. (UNTESTED)
4. If you have to put your virtual disk updates on the same datastore you plan on migrating to, use Veeam Quick Migration rather than Storage vMotion.

If your VMware installation doesn't have a license for Storage vMotion, you will not be bitten by this bug, because it only happens when using Storage vMotion.

Now, I didn't experience any real data loss myself, because I happened to find this bug while demoing the software to a colleague who was preparing documentation for VM recovery procedures in our organization, so all I lost was some test data. But it could just as easily have been real data.

Still, I'm disappointed that Veeam Support has not been taking this bug seriously. The last response I got from Veeam is this:

As we discussed with engineers it is not actually a bug from Veeam side, it is more by design behavior. Because all steps from Veeam were done correctly according to the settings set for the jobs. We thing about warning message to notify user about consequences of these steps. In the next patches of Veeam we are gonna to add this notification.

In other words: Veeam will eat your data. By design.

Is it just me having too high expectations, or does anyone else find this kind of stance about this kind of bug... strange? Makes me wonder what other "design behaviours" are lurking below...

Post by **Gostev** » Nov 17, 2014 1:50 pm

Hello. If this bug is real, then I'm also disappointed that this particular Veeam support engineer is not taking it seriously. However, I also find it strange that you are the first user to run into this bug after 5 years of the Instant VM Recovery feature's existence, with tens of thousands of users relying on it for years for production recoveries. So, let me get more details on this, and on what exactly it takes to run into this bug. Thanks!

Post by **m_zolkin** » Nov 17, 2014 2:59 pm

Hi Per,

We are reviewing the conversation between you and the tech representative from the Veeam side. It looks like the bug was reported to support management on 12th Nov, once the technician had reproduced the problem in our environment. I am still trying to understand why it got stuck there, but in the meantime our QA team is working on it.

Once QA has collected all the necessary data, and if they confirm this behavior, we'll get in touch with VMware support and report the problem to them.

Once again, please accept our apologies for any inconvenience caused by this incident.

Nov 18, 2014 11:59 am

So, this turned out to be a critical bug in the VMware Storage vMotion logic.

Steps to reproduce without Veeam in the picture:
1. Create a test VM with a virtual disk on Datastore 1.
2. Use the workingDir and snapshot.redoNotWithParent VMX parameters to move the snapshot file location to Datastore 2.
3. Create a VM snapshot.
4. Perform a Storage vMotion from Datastore 1 to Datastore 2.
5. The Storage vMotion operation reports success; however, no VM files are actually moved anywhere.

Redirecting the snapshot to the same datastore that will be the Storage vMotion target is required to trigger the issue. If you use any other datastore, you will not run into this bug.
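For anyone who wants to check an existing VM for this setup, here is a minimal sketch of how the trigger condition could be detected from the VM's VMX settings. It assumes that `workingDir` is written as a `/vmfs/volumes/<datastore>/...` path; the function names `parse_vmx` and `redirects_to_target` are hypothetical and not part of any Veeam or VMware tool.

```python
import re

def parse_vmx(text):
    """Parse simple `key = "value"` lines from VMX-style text into a dict."""
    settings = {}
    for line in text.splitlines():
        match = re.match(r'\s*([^#=\s][^=]*?)\s*=\s*"(.*)"\s*$', line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings

def redirects_to_target(settings, target_datastore):
    """Return True if snapshot files are redirected to the migration target."""
    working_dir = settings.get("workingDir", "")
    redo_elsewhere = settings.get("snapshot.redoNotWithParent", "").lower() == "true"
    # Assumption: workingDir looks like /vmfs/volumes/<datastore>/<folder>
    match = re.match(r"/vmfs/volumes/([^/]+)", working_dir)
    return bool(match) and redo_elsewhere and match.group(1) == target_datastore
```

If `redirects_to_target(parse_vmx(vmx_text), "Datastore2")` returns True for the datastore you intend to migrate to, that would be the risky combination described above.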

We will notify VMware about the issue.

To work around the issue, select the "Force Veeam quick migration" check box when migrating an instantly recovered VM to production storage.

Post by **m_zolkin** » Nov 18, 2014 12:07 pm

Gostev wrote: However, I also find it strange that you are the first user to run into this bug after 5 years of the Instant VM Recovery feature's existence, with tens of thousands of users relying on it for years for production recoveries. So, let me get more details on this, and on what exactly it takes to run into this bug. Thanks!

It turned out that only vSphere 5.5 is affected; the scenario worked fine on vSphere 5.1. That explains why we didn't see such issues before.
We have submitted a ticket with VMware SDK support, and Veeam is working on a solution to bypass the issue.

Post by **pvz** » Nov 18, 2014 2:06 pm

I would suggest adding a check to Veeam to verify that the files are *actually* on the target datastore, with no dependencies left on the vPower NFS datastore, before nuking the datastore, even if VMware reports success. That should protect any of your customers who might be running this in the future on what is now the current version of vSphere.

Paranoia is never a bad policy when it comes to a backup product.
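A rough sketch of the kind of check suggested here, assuming the VM's file list has already been fetched from vCenter by some other means and uses the `[datastore] folder/file` path notation. All function names and the datastore names in the usage note are hypothetical illustrations, not Veeam's actual implementation.

```python
import re

DATASTORE_PATH = re.compile(r"^\[(?P<datastore>[^\]]+)\]\s*(?P<path>.*)$")

def datastore_of(path):
    """Extract the datastore name from a '[datastore] folder/file' path."""
    match = DATASTORE_PATH.match(path)
    if match is None:
        raise ValueError(f"not a datastore path: {path!r}")
    return match.group("datastore")

def files_left_behind(vm_file_paths, target_datastore):
    """Return the files that are NOT on the target datastore."""
    return [p for p in vm_file_paths if datastore_of(p) != target_datastore]

def safe_to_delete_source(vm_file_paths, target_datastore):
    """Only allow cleanup when the VM has files and all are on the target."""
    return bool(vm_file_paths) and not files_left_behind(vm_file_paths, target_datastore)
```

With a file list such as `["[vPowerNFS] vm/vm.vmdk", "[Datastore2] vm/vm-000001.vmdk"]` and target `Datastore2`, `safe_to_delete_source` returns False, because the base disk is still on the other datastore regardless of what the migration task reported.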

Post by **Gostev** » Nov 18, 2014 4:23 pm

For a quick fix to include in Patch #1, we will probably just force the use of the native quick migration engine when such a setup is detected, instead of relying on VMware Storage vMotion.
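The decision described here might look roughly like the following. This is only an illustrative sketch: the function name, its parameters, and the returned strings are hypothetical, and the actual logic in the patch is not shown anywhere in this thread.

```python
def choose_migration_engine(redirect_datastore, target_datastore,
                            storage_vmotion_licensed):
    """Pick 'quick_migration' or 'storage_vmotion' for Migrate to production.

    redirect_datastore is None when disk updates are not redirected.
    """
    if not storage_vmotion_licensed:
        # Without a Storage vMotion license there is nothing else to use.
        return "quick_migration"
    if redirect_datastore is not None and redirect_datastore == target_datastore:
        # The risky setup from this thread: redo logs already sit on the target.
        return "quick_migration"
    return "storage_vmotion"
```

The point of the design is to fall back to the engine that does not depend on the affected Storage vMotion behaviour whenever the redirect and target datastores match.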

Post by **dutch123** » Nov 24, 2014 8:53 am

Will this also affect Veeam v7 in combination with vSphere 5.5?

Post by **Vitaliy S.** » Nov 24, 2014 8:57 am

It doesn't matter which version of Veeam B&R is used, since the issue sits in the Storage VMotion engine of vSphere 5.5.

Post by **Meyercord** » Dec 01, 2014 5:16 pm

Has VMware acknowledged this bug in their product?

Post by **Gostev** » Dec 02, 2014 12:17 am

Our support orgs are working with each other. The last I heard, our engineer was able to reproduce the issue for them, and they have collected all the required logs.

Post by **dnrc** » Nov 05, 2015 12:58 pm

Hi, is this bug something i should still be concerned about?

i have done exactly the process described here today to restore an exchange server.

now i need to migrate to production but having read this I am now concerned about doing so.

running veeam 8.0.0.204
esxi 5.5.0 1331820
vcenter 5.5.0 2442329

what can i do to keep the data safe?

reading above it seems to read that using a quick migration will be ok, i just want to confirm if that is the case.

thanks

Post by **foggy** » Nov 05, 2015 2:29 pm

Daniel, right: since the issue is in the Storage vMotion engine, using Quick Migration is safe. Actually, the issue was addressed in the first Update for Veeam B&R v8 (Quick Migration is forced in such a scenario); however, to be completely on the safe side, you can select the "Force Veeam quick migration" check box and clear the "Delete source VM files upon successful quick migration" one.

Post by **dnrc** » Nov 05, 2015 6:11 pm

ok foggy, thanks for that.

that was what i gleaned from the other posts but in this case (production exchange server) i wanted to be sure

i'm still going to take a separate image before doing anything as well though.

Post by **veremin** » Nov 09, 2015 8:23 am

VeeamZIP might come in quite handy in this case. Thanks.
