vSphere CBT bug with QueryChangedDiskAreas("*")

kyle.manel · Apr 02, 2018 2:52 pm

An alert provided by Gostev, March 26 - April 1, 2018, stipulated that CBT (Changed Block Tracking) has resulted in a bug being identified with (potentially) particular storage and/or virtual volumes.

As this is a bug which can affect many platforms, I would like to know how to determine if a system is affected by this (Hard Drive model, virtual volumes etc) once it is better understood.

Regards,
Kyle

Referenced message;
=-=-=-=-=-=-=-=

Code: Select all

Veeam Community Forums DigestMarch 26 - April 1, 2018

THE WORD FROM GOSTEV
What can be worse than a new vSphere changed block tracking (CBT) bug? The CBT bug that even an Active Full backup does not help against! And unfortunately, we have just confirmed such bug to exist. Now, I recognize this may look like an April Fool's joke, so to be clear – this is NOT one. We've already demonstrated this bug to VMware support using naked API, so everyone is at risk no matter what vSphere backup product they are using. However, there's a hope that the bug is isolated to the particular storage model and/or Virtual Volumes (VVols) only – otherwise, we'd probably have way more customers reporting failed recoveries.

It all started from a support case coming from a fresh vSphere 6.0 deployment running VMs with thin disks hosted on VVols backed by Nimble storage. The customer was experiencing classic data corruption issues after full VM restores – the restored VMs had Windows firing up chkdsk, and checkdb was reporting corruptions in Microsoft SQL Server databases. This normally points at storage corruption – but production VMs did not have these issues, while backup files' content was matching the checksums. Which made CBT the next suspect - but the following troubleshooting steps revealed that the corruption could occur even when restoring from an active full backup! On the other hand, the issue would not reproduce when CBT was disabled completely. Magic, eh?

But our genius support folks did come up with a way to nail this problem down. They first changed permissions on the vCenter Server account used by Veeam Backup & Replication to make it unable to delete working snapshots created by a backup job. Then, after reproducing the issue again, they cloned the VM from the corresponding working snapshot, mounted VMDK of the clone and VMDK from the full backup file to a Linux box, and did a binary compare – which not surprisingly showed a mismatch in some disk areas. And finally, by referring to the debug log of the corresponding job run, they found the differences were in the disk areas that were NOT returned by QueryChangedDiskAreas() function call with changeID * parameter.

Now, let me step back and explain what this vSphere API function does. It is actually the cornerstone CBT function that is used to query used and changed VMDK blocks. During an incremental run, a backup job passes this function changeID of the snapshot created by the previous run, and thus gets all blocks changed since the last backup – a very simple concept. While during an initial run aka full backup, when there's no previous backup run to reference yet – the special * parameter is passed to this function, which makes it return allocated VMDK blocks only. This dramatically accelerates full backups due to not having to read though TBs of unallocated (and thus obviously empty) VMDK blocks. But even if a backup vendor chooses not to use this functionality for full backups, this query will still be issued by an ESXi itself when CBT is first initialized on a VM – meaning, there's no way to avoid one.

Bear with me, we're almost there now! There's one key difference in QueryChangedDiskAreas() logic with the two scenarios I explained above. Using a changeID belonging to a previous VM snapshot will return all changed blocks that were tracked by the ESXi host itself in the CTK file. This functionality had its own share of bugs in the past years, but all of them were fixed and by now we can be fairly confident that modern ESXi versions track changes reliably respecting disk resizes, VMotions and so on (you know, all those bugs we've been through). However, when using that special changeID * parameter, this function returns allocated blocks based on the data provided by the storage itself. And at least in this particular case, it appears that the storage provides an invalid allocation data.

According to the latest update, VMware is now working on a tool that should help to confirm this bug is indeed with the particular storage. I will keep you posted as we learn more – meanwhile, as always, remember to test your backups! And a big shout out to many VMware teams – SDK support, VADP team, VVOL to name a few! We had an absolutely incredible collaboration working with them on this issue, receiving very prompt responses and seeing great involvement with what in the end appears to most likely be a 3rd party vendor bug. A very refreshing experience, for sure!

[UPDATES]
April 15: VMware support has ruled out any issues with Nimble storage specifically (which appeared to be returning allocated data blocks correctly).

staskorz · Apr 03, 2018 9:10 am

Hi Kyle,

We are the customer Mr. Gostev is talking about in his post.

TL;DR:
It's probably a general issue with the current VMware CBT implementation. It might be specific to VVOLs, but I'm in doubt it has anything to do with the underlying Nimble storage.

As Mr. Gostev mentioned, Veeam calls VMware API method QueryChangedDiskAreas("*") for the first (or actually any Active Full) backup. This is how Veeam works when CBT is enabled for a backup job (which is the default). This way Veeam only reads the blocks data was ever written to. So if you have a 300GB disk with only 100GB of data ever written to it, Veeam would only read those 100GB, skipping the never-used 200GB - a great efficiency feature by Veeam. To sum it up, calling QueryChangedDiskAreas("*") should return a complete list of all blocks ever written to, later this list is used by the backup software to selectively read only the blocks containing data, skipping the empty ones.

But here is when it gets hairy: according to a VMware VDDK support representative, calling QueryChangedDiskAreas("*") is known to have issues, even though it's officially supported at the moment. Also according to the VMware VDDK team representative, the recommended way of calling QueryChangedDiskAreas() is with changeID, not with "*" (although VMware didn't document this recommendation). Furthermore, this functionality (calling QueryChangedDiskAreas() with "*"), although currently being officially supported, is going to be deprecated (and superseded by VixDiskLib_QueryAllocatedBlocks()) in VDDK 6.7 release, scheduled for mid April (~2 weeks from now).

The strange thing is, VMware didn't bother to to communicate neither their findings (the very fact calling QueryChangedDiskAreas() with "*" is known to have issues), nor their deprecation intentions to their partners. Not even in their Beta VDDK Program documentation, even though the release is scheduled for ~2 weeks from now...

In our case, calling QueryChangedDiskAreas("*") returns an incomplete list of blocks, thus Veeam, relying on inaccurate data returned by VMware, produces a corrupt backup.

A clarification: we've experienced backup corruptions not only with the Active Full backups, but also with the Incremental ones. There were cases when an Active Full backup would be fine (say passed the DBCC CHECKDB test), but the following Incremental backups were corrupt. This is still under investigation, VMware suspects both issues might have the same root cause, so resolving one might also resolve the other.

So as suggested by Mr. Gostev, make sure to test your backups.

Best Regards,

Stas

lando_uk · Apr 03, 2018 9:55 am

Let's just hope it's only VVOLs.

Apr 04, 2018 6:21 pm

staskorz wrote:The strange thing is, VMware didn't bother to to communicate neither their findings (the very fact calling QueryChangedDiskAreas() with "*" is known to have issues), nor their deprecation intentions to their partners. Not even in their Beta VDDK Program documentation, even though the release is scheduled for ~2 weeks from now...

Actually, this "deprecation" is not as bad as this word make it sound

QueryChangedDiskAreas("*") will continue to work with all vSphere versions, and remains fully supported. It is marked as "deprecated" in ESXi 6.7 only because VMware introduced the new, dedicated API function for querying allocated blocks which is VixDiskLib_QueryAllocatedBlocks() in VDDK 6.7.

They are making this change because QueryChangedDiskAreas("*") is not related to CBT functionality in the implementation - but more importantly, because they want the capability of getting all allocated disk areas to be available regardless of whether CBT enabled or not. This decoupling actually makes perfect sense to me.

Hope this helps!

staskorz · Post by **staskorz** » Apr 04, 2018 8:15 pm this post

You're right, the deprecation itself is not a bad thing at all. That's how software evolves.

The thing is, the moment the VMware VDDK team member heard QueryChangedDiskAreas("*") is used, he immediately said it's known to have issues. It felt like the guy was aware of that fact for ages.

So VMware is aware this functionality has issues and decided to deprecate it (and replace with a better one) instead of fixing. No problem with that, but... I think it's reasonable to expect VMware to notify (at least) their partners as soon as they are aware of such issues. Considering the possible consequences - technically, a silent data loss...

Right, it's customer's responsibility to test the backups, but testing the backups is easier said than done. The only reason we were able to discover the corruption is because the MS SQL DB we've restored for testing purposes was corrupt, and fortunately MS SQL has a data consistency verification functionality. Imagine it was a file server, filled with millions of flat files. How do you test that???

Ok, back to the databases. We have about 30 of them. Three of the most important ones are close to 1TB combined. Running DBCC CHECKDB on those 3 databases takes about 15 hours. Can you do it for every backup?

No? And what about the other, less important ones?

That's why I'm so uncomfortable with the fact VMware didn't immediately go public with their findings about the QueryChangedDiskAreas("*") issues. Because every time it runs incorrectly, a silently-corrupt backup is produced...

Apr 05, 2018 2:46 pm

staskorz wrote:The thing is, the moment the VMware VDDK team member heard QueryChangedDiskAreas("*") is used, he immediately said it's known to have issues. It felt like the guy was aware of that fact for ages.

I am sure he did not mean data corruption issues.

In general, the issue with this call is indeed well know to us, and have been discussed on these forums before (especially in the early days, when storage and fabric was much slower than today). The issue being this call starting to return all blocks, instead of allocated only - which makes full backup take too much time. And the fix was pretty complex woodoo type sequence of actions involving VM power offs, disabling and enabling CBT, etc.

So that's probably what he meant. In any case, we can be absolutely sure this call works reliably in general, as this kind of data corruption bugs is not something that can possibly go unnoticed for 10 years in over 1 million environments (I included free edition users, since VeeamZIP uses this API call too).

frakka · Post by **frakka** » Apr 09, 2018 8:02 am this post

So, by now, there is no need to disable vSphere CBT in Veeam's jobs (waiting for a new stable mechanism from vmware) and switch to Veeam’s proprietary filtering mechanism?

Morat · Apr 09, 2018 8:21 am

And that's the question we _really_ need the answer to!
Is CBT safe or should we disable it pending future word from Veeam and VMWare?

Post by **Gostev** » Apr 09, 2018 8:28 am this post

Since the scope of this issue is not yet confirmed by VMware, you should test your backups recoverability and go based on the results of your testing. For now, all we know is that any environments using vVols and/or Nimble storage are potentially at higher risk - just because the bug is consistently reproduceable in one such environment.

So testing restores is the key for making the call on whether or not to disable CBT. If your environment is affected, then the issue will be very easy to see with the most basic restore test - as it corrupts NTFS volumes and causes chkdsk to fire up at boot time. Even if you don't use SureBackup (which we highly recommend using regardless of this particular issue), then you can just do a simple full or instant VM restore tests for a few VMs without connecting them to the production network.

By the way, just for the reference and since I was already asked a few times privately - here are the support case IDs for all vendors involved.

VMware:
SR# 17638553811 (aka Service Request)
PR# 2034816 (aka Problem Report)

Nimble: 02175471

Veeam: 02260663

mc1903 · Post by **mc1903** » Apr 09, 2018 9:19 am this post

Gostev wrote:Since the scope of this issue is not yet confirmed by VMware, you should test your backups recoverability and go based on the results of your testing. For now, all we know is that any environments using vVols and/or Nimble storage are potentially at higher risk - just because the bug is consistently reproduceable in one such environment....

Hi Gostev,

It's really concerning me that we are a week on from your initial disclosure and "the scope of this issue is not yet confirmed by VMware".

I appreciate this is not an issue that Veeam can directly fix, but I would expect some official guidance from Veeam on mitigation.

1. If after restore/SureBackup testing we find that we are affected, please be clear what steps we should take. i.e. Should we disable CBT within the jobs and run new Full Actives?

2. How best to deal with the 'tainted' backups?

Also, what additional testing/qualification has Veeam been doing?

3. Do you think this issue is specific to a given VDDK release?

4. Which Nimble array model/NimbleOS (firmware) version was this reported on?

5. Will Veeam be testing or requiring storage vendors to re-test for this issue on other storage arrays

6. What is your confidence percentage that this is only affecting VVols?

Thank you for the vendor support ID's.

Martin.

Apr 09, 2018 10:02 am

1. Disable CBT and run new Active Full backup
2. Keep them for file level recovery purposes (most guest files should still be recoverable)
3. QueryChangedDiskAreas API is not a part of VDDK
4. Let's ask @staskorz to answer this one
5. That would be something that VMware may have to do based on their conclusion
6. In my personal opinion, the probability of root cause being vVols and/or Nimble storage is quite high

staskorz · Apr 09, 2018 10:25 am

Which Nimble array model/NimbleOS (firmware) version was this reported on?

Model: AF5000
OS Version: 3.7.0.0-444264-opt (I know it's not the latest one, that said, I went through the change logs of all the later releases and found no fixes related to our issue).

One thing I have to repeat: at least in our case, this issue not only affects Active Fulls, but also incremental backups. The only way to constantly achieve non-corrupt backups we found so far was by disabling CBT.

In some cases, an Active Full run would produce a working backup, the next few daily incremental ones would also be fine, but, say after 10 days, a daily incremental run would produce a corrupt backup - and then every following daily incremental backup would also be corrupt (which is obvious, as it's based on a previous corrupt backup copy). Then we would initiate an Active Full run, and it would all start over: a working backup produced by an Active Full run, a few good backups produced by daily incremental runs, until one of the daily incremental runs would produce a corrupt backup. Disabling CBT would constantly produce a working backup.

Only at a later stage, we've noticed some Active Full runs producing corrupt backups. Same as before, disabling CBT would produce a working backup.

mc1903 · Post by **mc1903** » Apr 09, 2018 10:46 am this post

Thank you @Gostev & @staskorz.

@staskorz - Do you have VMFS datastores on this Nimble array or are you only using VVols? If you do have VMFS datastores, what VMFS version(s) and have you restore tested to confirm data integrity of the VM's that reside on VMFS?

Thanks
M

staskorz · Apr 09, 2018 10:48 am

@mc1903: only VVOLs.

staskorz · Post by **staskorz** » Apr 09, 2018 2:07 pm this post

Just submitted a proposal for a method detecting such issues:

https://forums.veeam.com/veeam-backup-r ... 50110.html

mwiltermood · Post by **mwiltermood** » Apr 09, 2018 4:55 pm this post

Is this issue tied to a specific version of ESXi?

staskorz · Post by **staskorz** » Apr 09, 2018 5:00 pm this post

@mwiltermood: We are running ESXi v6.0 with the latest updates. VMware support didn't suggest upgrading to 6.5, so probably this issue affects both.

Post by **Gostev** » Apr 09, 2018 5:08 pm this post

Yeah, I don't think it is. But we won't know until VMware finishes their research and fully understands the issue.

wRx7M · Apr 09, 2018 8:16 pm

I am following this topic closely. I am running ESXi 6.0 U3, with vSphere 6.5, but am not using VVOLS or nimble. All my Veeam jobs (v9.5) use CBT and I test-restored 75% of our VMs using instant recovery and selected incrementals that were performed at the furthest point from the previous active full, prior to the next active full. All systems booted without any issues, like the checkdesk running, etc.

staskorz · Apr 10, 2018 7:14 am

@wRx7M: in the VM we have an QueryChangedDiskAreas("*") with, the corrupt block appears at offset of 498,615,713,792. So you'll only notice it after reading almost half a terabyte. The VM boots fine (without CHKDSK, etc.), MS SQL starts fine, the DB mounts fine and can even serve queries. But running DBCC CHECKDB reveals the corruption - after running for over 2 hours.

I'm writing all this because telling for sure your backups are REALLY fine is almost impossible. Indeed, sometimes we got restored VMs not booting cleanly, running CHKDSK, etc. That's an obvious corruption to catch. But sometimes the VM looks perfectly fine and you bump into the corruption only after trying the more sophisticated stuff - say rolling the Transaction Logs forward or running a consistency check tool (such as DBCC CHECKDB).

To sum it up, it all depends on the location of the corrupt block - if it's in some critical part of the filesystem, the restored VM won't boot cleanly, will run CHKDSK, etc. If it's in some critical portion of a DB, the DB won't mount. But if it's somewhere else, you'll have to try very hard to catch it.

That's why I've proposed the following detection method: https://forums.veeam.com/veeam-backup-r ... 50110.html. I think it's a very practical approach to ensure the backup is identical to the source.

Tommi0815 · Post by **Tommi0815** » Apr 10, 2018 1:43 pm this post

Hi together,

we are running VEEAM 9.5 under VSphere 6.5 with an Nimble Storage. (Typ CS235 Nimble OS 5.0.2 with no VVOLs active)
We tested our Backups with SureBackup. The result is 100% success for the validation test, so it seems that we are not
affected by this CBT-Bug. We also tested a restore of an entire VM with no problems.

My question to the VEEAM-Team is if the testing results with SureBackup giving us a guarantee that our backups are in a usable
status? Or is it better to disable CBT ?

Best regards
Thomas

staskorz · Post by **staskorz** » Apr 15, 2018 12:19 pm this post

@Tommi0815:

I'm not from Veeam, but will try to answer anyway.

SureBackup basically just automates the stuff you could do manually. Suppose your use the following testing procedure for a given VM:

Spin up a VM from backup using "Instant VM Recovery"
Wait till it's up and running
Make sure it listens on port 80
Shutdown and remove the restored VM

SureBackup allows you to automate the steps above and schedule them to run after backup job completion. You could even write some custom logic to, say, make sure the web page returned by your web server indeed looks as expected, providing even higher degree of confidence. SureBackup could also start your VMs in groups, providing a way to test interdependent services.

The question is to what degree such a test could be trusted, be it manual or automatic.

There are several indications for a backup to be considered successful:

The backup job completes successfully
The VM spins up successfully from the backup
The booted-up VM passes some tests (say has some processes running, listens on some ports, returns expected web pages, etc.) - those tests could be manual or automated using SureBackup

The thing is, this entire case is all about the fact that if CBT is enabled, #1 could not be trusted.

When CBT is enabled, Veeam only reads the blocks reported by CBT as changed and reports a successful backup completion. It's important to understand Veeam would report a successful job completion regardless of whether all the required blocks were copied or not - cause it completely trusts the list returned by VMware CBT.

In our case, a VM could be successfully restored from a backup, it would boot successfully and its MS SQL instance would come up and even serve some queries. Only running "DBCC CHECKDB" would reveal a corruption. For that DB the verification runs for over 2 hours, but we have another, not less important DB, for which "DBCC CHECKDB" runs for over 10 hours. Of course, you could write some custom verification logic for a SureBackup job to run those "DBCC CHECKDB" commands, the question is whether you can afford a backup verification for a single VM to span for over 10 hours.

Fortunately, that VM is running MS SQL, which has a built-in data consistency verification feature (DBCC CHECKDB), which eventually revealed the corruption. But what about VMs running applications without data consistency verification tools? What about VMs acting as file servers? There is simply no way to verify each and every file hosted on such server, neither manually nor automatically.

So the moment a successful backup job completion status could not be trusted, neither is SureBackup. I mean it could be trusted, but only to some degree. It all depends on the location of the corrupt block(s) - so if during the tests those locations weren't read from, the corruption just won't be noticed.

If, on the other hand, you disable CBT, then the successful job completion status would really mean "Veeam successfully copied the entire content for a given VM" - which indeed could be trusted to a very high degree. So when you test such a backup by booting a VM from it and the VM passes some basic tests (say starts listening on a given port, etc.), then chances are the backup is really fine.

As for us, for now we have completely disabled CBT for our production VMs in order for Veeam be able to produce working backups.

Post by **Gostev** » Apr 15, 2018 8:17 pm this post

Quick update - VMware support has ruled out any issues with Nimble storage specifically (which appeared to be returning allocated data blocks correctly), and are now working on another support tool that should help to pinpoint the issue in the actual vSphere/VADP code.

rollergirl · Apr 16, 2018 1:24 am

Trusty Nimble! Been using it with Veeam for 2 years now flawlessly! Glad to hear the update!

hyvokar · Apr 16, 2018 5:44 am

Not sure, if this is related, probably not.

I recently backed up my esxi5.5 host to iomeaga nas-device to re-partition the hard drives.

After re-partitioning I restored the VMs.
After restoring VMs were powered on, but backing up two VMs started instantly failing.
Found out that two VMDKs were corrupted.

Tommi0815 · Apr 16, 2018 7:14 am

@ staskorz

Thank you very much for your for your detailed explanation. This helps a lot.
We decided this morning to disable "CBT"

Best regards
Thomas

staskorz · Apr 16, 2018 10:07 am

Just had to post it here: HUGE shout out to Nimble Storage for having such an unbelievably great support. Here's the story: we had a remote support session with Veeam and VMware. At some point VMware support representative said we need Nimble support's assistance. Now listen to that: while on the phone with Veeam and VMware, I've opened a NEW support case with Nimble and got their senior support representative joining the remote support session within minutes!!! How many companies do you guys know that would do that? I mean no "send me the logs", no "let me check", no "we will get back to you" - but just a simple "sure, just give me the link to the support session and I'll join NOW".

Thank you and Nimble!!!!! Thumbs UP!!!

Post by **Gostev** » Apr 16, 2018 10:59 am this post

Yes, they are treating this issue very seriously.

Post by **mloeckle** » Apr 17, 2018 5:33 am this post

When was the last time CBT was reset on the affected VMs in this environment? Could it be that CBT has been corrupt for a while due to a previously fixed CBT bug?

staskorz · Post by **staskorz** » Apr 17, 2018 6:51 am this post

We have performed a CBT reset for all the VMs in our environment as one of the troubleshooting steps.

R&D Forums

vSphere CBT bug with QueryChangedDiskAreas("*")

Re: Relevant system characteristics affected by vSphere CBT

Re: Relevant system characteristics affected by vSphere CBT

Re: Relevant system characteristics affected by vSphere CBT

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Re: vSphere CBT bug with QueryChangedDiskAreas("*")

Who is online