Comprehensive data protection for all workloads
MGT1981
Enthusiast
Posts: 39
Liked: 6 times
Joined: Nov 21, 2014 12:30 am

Potential flaw in Veeam's backups, or am I doing something wrong?

Post by MGT1981 »

Hi All,

I have been a loyal fan of Veeam for several years and use the product at many of my clients. However, this week I ran into a situation that, unless I am missing something, has the potential to be a major problem with snapshot-based (or, for that matter, block-level) backups in general.

Here is the scenario:

I have a client with a large SQL VM that was carved out into different VMDKs for OS, LOGS, DB, etc. Each VMDK sits on a separate LUN. Long story short, we recently had a situation where the DB volume showed up as uninitialized at the guest OS level. After fighting with it for some time, we determined that the DB drive was completely corrupt at the guest NTFS level and needed to be restored.

We went back to last night's backup and found that we could not do any guest OS level restores; the same was true going back about 7 days. We spent several hours on the phone with Veeam support and finally determined that it was corruption that just slowly got worse and worse until the volume finally gave out.

Luckily, we also had SQL maintenance plans doing backups at the SQL level, so we were able to get everything back. It should also be noted that when we restored the previous night's VMDK, we were able to mount it, and it was "good enough" that we could run chkdsk on it and recover some of the data.

So the question is this: is this just a flaw in snapshot-based backups? I understand that we could verify to some extent with a SureBackup job, but I don't know that corruption at this level would have been detected unless some custom script ran to report on all drive sizes, or something to that effect. I feel like this is the one scenario where a legacy-style file-based backup would work better, because it would fail on the corrupted files and alert you that there was a failure. However, I am hoping I am just missing something.
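[Editor's note: for illustration, a custom verification script of the sort described above might look like the sketch below. Everything here is an assumption for the example: the drive letters, the use of Get-Volume, and the placement - SureBackup test scripts run on the backup server, so in practice this would be wrapped in PowerShell remoting against the lab VM.]

```
# Sketch of a guest-side volume sanity check. Drive letters are placeholders
# for the OS / LOGS / DB layout described above. Exit code 0 = all volumes
# present and readable; non-zero = something is wrong.
$expectedDrives = @('C', 'E', 'F')

foreach ($letter in $expectedDrives) {
    $volume = Get-Volume -DriveLetter $letter -ErrorAction SilentlyContinue
    if (-not $volume -or $volume.Size -eq 0) {
        Write-Output "Volume ${letter}: missing or unreadable"
        exit 1
    }
    $sizeGb = [math]::Round($volume.Size / 1GB, 1)
    $freeGb = [math]::Round($volume.SizeRemaining / 1GB, 1)
    Write-Output "Volume ${letter}: $sizeGb GB total, $freeGb GB free"
}
exit 0
```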
Gostev
Chief Product Officer
Posts: 31455
Liked: 6646 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Re: Potential flaw in Veeam's backups, or am I doing something wrong?

Post by Gostev »

Hi,

It sounds like your client does not have an active SQL Maintenance Plan that runs the Check Database Integrity task periodically, which is designed to catch exactly this kind of issue whether or not the particular SQL Server is even being backed up. Is that the case?
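[Editor's note: the heart of such a task is just DBCC CHECKDB run on a schedule. A minimal sketch of an equivalent standalone check follows - the instance name is a placeholder, and the SqlServer PowerShell module with Invoke-Sqlcmd is assumed to be available.]

```
# Periodic integrity check, independent of any backup product.
Import-Module SqlServer

$instance = 'SQLVM01'   # placeholder instance name

# Check every user database (database_id 1-4 are the system databases).
$databases = Invoke-Sqlcmd -ServerInstance $instance `
    -Query 'SELECT name FROM sys.databases WHERE database_id > 4'

foreach ($db in $databases) {
    # DBCC CHECKDB validates both logical and physical integrity;
    # NO_INFOMSGS limits the output to actual errors, which Invoke-Sqlcmd
    # surfaces as terminating errors with -ErrorAction Stop.
    Invoke-Sqlcmd -ServerInstance $instance -QueryTimeout 0 -ErrorAction Stop `
        -Query "DBCC CHECKDB ([$($db.name)]) WITH NO_INFOMSGS"
    Write-Output "CHECKDB passed for $($db.name)"
}
```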

Now, if we are discussing a situation where the client shifts the responsibility for monitoring database consistency to a backup application, then my answer will be long, because the issue is much more complex than the way you put it.

But first, to answer your question - yes, this particular issue your client faced can happen with any block-level incremental backup in general (not just VM snapshot-based backups). But even with a file-based backup, a file system level issue will only surface if the backup process copies changed files in their entirety, i.e. by reading the whole file - an approach that is usually impossible today from a backup window perspective (at least with the amounts of data most people need to back up these days). However, as soon as you start reading only changed blocks - which is actually what even "legacy style" file-based backup tools do these days to fit the backup window - the backup process no longer implicitly "validates" the backed up files for consistency either.

However, all of this does not even matter much, because the whole issue is much bigger once you keep in mind that in addition to file system level corruption (physical integrity), there is also application data corruption (logical integrity). The latter is actually the most common corruption type for our backup files, due to "bit rot". In the case of application data corruption, even a legacy-style file-based backup that DOES read the entire changed file will not detect the corruption - because from the file system perspective, the file is perfect (while its content can be complete rubbish from the application perspective).

In fact, the above issue is EXACTLY the reason why we recommend so strongly against using storage-based replication to get Veeam backups off site - this process also simply copies the entire backup files (thus implicitly validating their physical integrity), but it does not validate their content for consistency from the application perspective (logical integrity). So you can potentially end up with an off-site backup that is just as unrecoverable as the primary backup, due to bad payload.

So, there is only one real solution that addresses all corruption types - and that is to run an application-specific data test that reads the entire data pool used by the application and validates its logical integrity using application-specific methods. This test cannot complete successfully without physical integrity, so you don't even have to worry about the latter. Going back to my Veeam backup files example, such a logical integrity test is done inline by the Backup Copy job (as it reads the content of the copied restore points), and also by the storage-level corruption guard (for data at rest).

Now, what about production applications being backed up, like SQL Server? Well, again - if you decide to move the responsibility for application data consistency monitoring to a backup application (which may not be a good idea to start with), then you can only perform such a test during the backup itself (which is impossible these days from a backup window perspective), or after the backup has completed. And Veeam actually makes the latter possible with SureBackup - which can automatically spin up any VM directly from a backup and easily run any test against the application. For example, for SQL Server specifically, it can be a test script that runs a DBCC CHECKDB query against the database in a SureBackup VM.
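[Editor's note: a SureBackup test script along those lines could look roughly like the sketch below. Assumptions: the lab VM's address is passed in as the script argument (e.g. via SureBackup's %vm_ip% variable), the database name is a placeholder, and a non-zero exit code is what marks the test as failed.]

```
# Hypothetical SureBackup test script: runs DBCC CHECKDB against the
# database inside the lab VM. Exit code 0 reports success; non-zero
# marks the verification as failed.
param([string]$VmIp)

Import-Module SqlServer

try {
    # Database name is a placeholder - point this at the real database.
    Invoke-Sqlcmd -ServerInstance $VmIp -QueryTimeout 0 -ErrorAction Stop `
        -Query 'DBCC CHECKDB ([ProductionDB]) WITH NO_INFOMSGS'
    exit 0
}
catch {
    Write-Output "DBCC CHECKDB failed: $_"
    exit 1
}
```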

Does this answer your question?

Thanks!
MGT1981
Enthusiast
Posts: 39
Liked: 6 times
Joined: Nov 21, 2014 12:30 am

Re: Potential flaw in Veeam's backups, or am I doing something wrong?

Post by MGT1981 »

To an extent, yes. Also, to be clear, I was not saying it was Veeam's responsibility - it was more a topic for discussion.

That said, my specific situation was SQL related; however, replace SQL with, for example, a file server. In the older days, if I ran something along the lines of Backup Exec and any "corrupt data" was encountered, I would get alerted in the job log. I know in certain situations this would even give false positives for items like PST files if they were mounted when the backup ran. In theory, with a block-level backup you have no real method of verification, so you could end up with a bad-data-in/bad-data-out situation. Again, I don't know that this is really Veeam's problem, but it would be nice to have some level of verification at the file level, even if it just threw some kind of flag in the job log that caused a "warning" instead of a failure.

I wonder, could this be accomplished if guest indexing were enabled? In theory, if you have a drive that can't be read, you can't index it, which would in turn throw some kind of error, no?
Gostev
Chief Product Officer
Posts: 31455
Liked: 6646 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Re: Potential flaw in Veeam's backups, or am I doing something wrong?

Post by Gostev »

Yes, the following two features in Veeam can catch certain NTFS file system level corruption issues:
1. Just as you noted above, guest file system indexing (however, it is not enabled by default).
2. BitLooker (deleted file blocks processing; this one is enabled by default).

Both features analyze MFT contents and should error out on certain issues, but I would not rely on them, as this is really a side effect and not something they are designed to do reliably and consistently.
MGT1981
Enthusiast
Posts: 39
Liked: 6 times
Joined: Nov 21, 2014 12:30 am

Re: Potential flaw in Veeam's backups, or am I doing something wrong?

Post by MGT1981 »

Thanks, Gostev.

A few things:

1) BitLooker is only enabled by default on new v9 installs, correct? In other words, if jobs were migrated from previous versions, it will need to be enabled on a per-job basis, correct?
2) Regarding indexing, in my experience it slows the job down quite a bit, and in certain instances the virtual guest as well.

With that, as a feature request for future versions, would it be possible to have a "Run guest file system check" option added to the job options? Maybe even an option to run a quick chkdsk, or something like it, in read-only mode on the volumes?

To be blunt, I have not fully vetted the potential downsides in my head; it is just an idea to kick around.
veremin
Product Manager
Posts: 20270
Liked: 2252 times
Joined: Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin

Re: Potential flaw in Veeam's backups, or am I doing something wrong?

Post by veremin »

"BitLooker is only enabled by default on new v9 installs, correct?"
Correct.

"In other words, if jobs were migrated from previous versions, it will need to be enabled on a per-job basis, correct?"
Correct. You can use this script to enable it on all existing jobs automatically.
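[Editor's note: the linked script is not reproduced here, but the general approach through Veeam's PowerShell snap-in looks roughly like the sketch below. DirtyBlocksNullingEnabled is assumed to be the BitLooker option name - verify against your Veeam version before running anything.]

```
# Rough sketch: enable BitLooker on every existing job via the Veeam
# PowerShell snap-in. DirtyBlocksNullingEnabled is assumed to be the
# BitLooker (deleted file blocks) option.
Add-PSSnapin VeeamPSSnapin

foreach ($job in Get-VBRJob) {
    $options = $job.GetOptions()
    $options.ViSourceOptions.DirtyBlocksNullingEnabled = $true
    Set-VBRJobOptions -Job $job -Options $options
    Write-Output "BitLooker enabled for job: $($job.Name)"
}
```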
"Maybe even an option to run a quick chkdsk, or something like it, in read-only mode on the volumes?"
Have you thought about leveraging a pre-freeze script to do exactly that? For example:
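[Editor's note: a pre-freeze script along those lines might look like the sketch below. The drive letters are placeholders; chkdsk without /f runs read-only, and the assumption here is that a non-zero exit code from a pre-freeze script surfaces as a job error - which is exactly the flag being asked for.]

```
# Hypothetical pre-freeze script: runs a read-only chkdsk on each listed
# volume before the snapshot is taken. Drive letters are placeholders.
$drives = @('C:', 'E:', 'F:')

foreach ($drive in $drives) {
    chkdsk $drive | Out-Null
    # chkdsk without /f is read-only; exit code 0 means no errors were
    # found, anything else means problems or an unreadable volume.
    if ($LASTEXITCODE -ne 0) {
        Write-Output "Read-only chkdsk reported issues on $drive"
        exit 1   # non-zero exit should surface as a job error
    }
}
exit 0
```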

Thanks,