Question on backup integrity

Post by **rawtaz** » Jan 27, 2012 5:38 pm

Hi,

I spoke to a technician at Veeam a couple of days ago and asked him what happens if the backup storage experiences bit rot or similar corruption. In short, I asked him how Veeam deals with the situation where, say, you have a very big VMDK for a file server that is backed up, and there is some unnoticed corruption in the data areas of that file (the backup copy of the VMDK, i.e. Veeam's files).

More specifically, let's say the VM has a 2 TB VMDK with an NTFS volume in it. The volume holds a lot of data, as it belongs to a file server. Now let's assume that somewhere in this backed-up VMDK (i.e. in the backup storage) some corruption occurs: a byte or two are corrupted at the "place" of an important file on the file server. What means does one have to detect this when using Veeam to back up this VMDK? I am not sure whether Veeam has any checksumming in its backup storage.

The technician said the following: What you have to work with is essentially the SureBackup, or as we specifically talked about, the Instant Restore way of testing it. He summed it up as "You can fire up the VM right from the backup storage, and what we do/Veeam does is to simply check if all parts of the VM mounts and can be used successfully. You can also run some additional checks inside the VM via scripting". This is all to my understanding as well, so far nothing weird.

However, when I asked him to confirm whether Veeam "scans" or checksums the entire VMDK in the backup storage to find potential block/bit corruption further down the VMDK (i.e. in the pure data areas, which no ESXi/VM OS/application/whatever reads until the VM actually runs and someone asks for the file that is stored at that place in the VMDK/filesystem), he didn't give a clear answer. He kept insisting that "if we can mount it and it starts up successfully, then everything is alright", which I think sounds very off.

I don't know the internals of VMDK or how Veeam works, but isn't it true that just firing up the VM using, for example, Instant VM Recovery and seeing that it runs is not an indicator that there isn't any corruption in the data areas of the corresponding VMDKs? I fail to see how, without checksumming and scanning the entire backed-up data, Veeam would be able to determine that the backups are indeed intact.

The indirect question, apart from the above, is of course what means there are to make sure that Veeam backups are indeed healthy. Is there something one can do using Veeam, or is it entirely up to features in the underlying storage to detect corruption (or, alternatively, doing a full check on the data inside the VM when doing test restores)? One could always run it atop ZFS, but there are definitely a lot of people who don't do that.

Please let me know if I need to clarify. Thanks!

Post by **rawtaz** » Jan 27, 2012 7:07 pm

I should add that I've read the docs about the backup job setting "Enable automatic backup integrity checks", but it's not clear exactly what it does. It says it detects, for example, when something cannot be read; I'm not sure whether that implies checksumming or just read failures. What I'm mostly thinking about here is the silent type of corruption that could happen on disk.
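Whatever that job setting does, silent on-disk corruption of a backup file can also be caught outside the backup software by recording a cryptographic hash of each file once it has been written and re-checking it later. Below is a minimal sketch in Python. It assumes the backup files are not modified in place after hashing (a reverse incremental chain rewrites its full backup file on every run, so the manifest would have to be rebuilt after each job). The function names and the manifest layout are made up for illustration and are not part of any Veeam tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so very large backup files never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Record a hash for every regular file under the backup directory."""
    root = Path(directory)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(directory, manifest):
    """Return the names of files that are missing or whose content changed."""
    root = Path(directory)
    damaged = []
    for name, expected in manifest.items():
        path = root / name
        if not path.is_file() or sha256_of(path) != expected:
            damaged.append(name)
    return damaged
```

A file that still reads without I/O errors but has a flipped byte shows up in the list returned by `verify_manifest`, which is the distinction between read failures and checksumming raised above.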

Jan 27, 2012 9:45 pm

So my answer is pretty simple: certainly, if your storage experienced "bit rot", then there is the potential that a file might have corruption that Veeam is completely unaware of. If you feel this is a significant risk, then you must store your backups on media that is likely to be safe from this.

Note that this is not unique to Veeam, or really to any backup software. Backups to tape have long experienced "bit rot" during storage. Disks are typically much more resilient to such issues than tape has ever been. Of course, since we are backing up at the VMDK level, it's also possible that your disk might already have silent file system corruption. For example, many years ago I lost a directory containing 30-40 documents that were backed up using traditional backups to tape. These were archived test reports that were several years old when it was discovered that they couldn't be opened. We had 6 months' worth of backups on tape, and we could restore the backups, but the files on the backups were still corrupt. It was obvious that the corruption had happened many months previously (perhaps years) but had simply gone undiscovered.

The best strategies to avoid this are the same as they pretty much always have been: basically, having more than one copy of your backups. Of course, you can script SureBackup to run things like CHKDSK on such volumes and report any filesystem corruption at the MFT level, but this is still unlikely to detect "rotten bits" in the backup. This is also why we typically suggest running a "real" full backup at least every few months to help protect against this possibility. You can always perform this more often.

That being said, "bit rot" is much less likely with modern storage systems. All modern drives have ECC capability and will remap blocks when there are failures, most RAID systems have background scan capabilities, and RAID 6 provides added protection against single-bit errors since it keeps two independent parity blocks per stripe. Honestly, saving backups to reliable disk storage is likely to be far safer than tapes have ever been.

Post by **rawtaz** » Jan 27, 2012 9:52 pm

Thanks Tom, good summary there.

Indeed it's not unique to Veeam in any way. There is no checksumming going on in Veeam backups then, as I understand you.

I was mostly puzzled by what the technician said as I didn't think it made much sense, and I guess it didn't. Somehow we apparently failed to communicate.

Thanks again for clarifying and commenting.

Post by **chrisrd** » Jun 03, 2013 1:17 am

Per the thread above, as of early 2012 Veeam provided no method to ensure the integrity of your backups, e.g. checksums of vbk, vbr files etc.

Can anyone confirm that this is still the case, e.g. in V7?

And, if this facility is still not provided, are there any plans to provide some method to guarantee backup integrity?

Post by **foggy** » Jun 03, 2013 11:41 am

Chris, no changes in this regard so far. You can also review some considerations regarding that here. To be 100% sure that your backups work, please use SureBackup functionality.

Btw, the "Enable automatic backup integrity checks" setting referred to above ensures physical data integrity of the full backup file.

Post by **veremin** » Jun 10, 2013 12:59 pm

Additionally, if for some reason you can’t perform test restores, you can use a small utility called backup validator to check that the content of the backup file is unchanged.

Thanks.

Post by **rawtaz** » Jun 11, 2013 4:21 pm

v.Eremin wrote: Additionally, if for some reason you can’t perform test restores, you can use a small utility called backup validator to check that the content of the backup file is unchanged.

If I click the above link I arrive at a message from this forum software saying "You are not authorised to read this forum". Is this expected? I'm curious about the utility, sounds useful.

Thanks!

Post by **veremin** » Jun 11, 2013 4:30 pm

rawtaz wrote: Is this expected?

Fixed it already. Below is a short description of this tool that appeared in the forum digest several months ago:

If you cannot perform test restores (for example, there is no infrastructure where offsite media is stored), you can use the backup validator support tool instead (included in the 6.5 installation directory). This tool merely reads all blocks from the backup file and ensures that each block's content matches the corresponding block's CRC, which we include so that backup file modification or corruption is detected during restore. While the backup validator is a very basic tool, and does not perform full-blown recoverability testing like the SureBackup functionality, it may still be useful in scenarios where you simply want to ensure that your backup file's contents are unchanged. For example, consider using this tool after transferring the backup files over a WAN, or after a storage disaster involving malfunctioning RAID controllers.
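The idea of reading every block and comparing it with a stored CRC can be illustrated with a toy container. The sketch below is not the Veeam backup file layout, which this thread does not describe. It assumes fixed-size blocks with a CRC-32 kept alongside each one, and every name in it is made up for illustration.

```python
import zlib

BLOCK_SIZE = 4096  # illustrative block size, not Veeam's


def write_blocks(data):
    """Split data into blocks and store each one together with its CRC-32."""
    blocks = []
    for offset in range(0, len(data), BLOCK_SIZE):
        payload = data[offset:offset + BLOCK_SIZE]
        blocks.append((zlib.crc32(payload), payload))
    return blocks


def validate_blocks(blocks):
    """Re-read every block and return the indexes whose CRC no longer matches."""
    return [
        index
        for index, (stored_crc, payload) in enumerate(blocks)
        if zlib.crc32(payload) != stored_crc
    ]
```

A check of this kind touches every block, including data areas that a boot test never reads, so it can flag damage that a successful VM start would not reveal.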

Hope this helps.
Thanks.

Post by **rawtaz** » Jun 11, 2013 4:34 pm

Now it works, thanks man!

Post by **veremin** » Jun 11, 2013 4:35 pm

You’re welcome. Should any additional help be needed, feel free to contact us. Thanks.

Post by **dr.Koen** » Feb 10, 2014 2:09 pm

Hi

We use Veeam B&R 7.0.0.746 on a VMware cluster. One of the backup jobs, which runs daily, has 20 VMs and uses a reverse incremental scheme. This job has been running for months without any errors. I have a SureBackup job that I run manually now and then to verify whether there are any problems; no errors there either. I can initiate an instant recovery or restore individual guest files without problems.

Recently I wanted to do a full restore of one of these VMs after an update that went wrong on that particular VM. To my surprise, the restore was not possible due to “Client error: Failed to decompress LZ4 block: Incorrect decompression result or length”. I tried several restore points but all of them failed. Finally, I deployed a completely new VM and recovered what I needed from the Backup Browser (which was luckily not so difficult in this particular case).

After further investigation it turns out that none of the VMs in this backup job can be restored. There is definitely something wrong with the backup files. The backup repository is only a few months old (HP StoreEasy) and there are absolutely no indications of storage errors.
What I find very worrying here is that there was no indication at all that something was wrong with this backup set. I know that I should do an active full backup regularly, and I will start doing that, but that is not enough to take my worries away. Surely Veeam should be able to signal a problem like this.

Is there something that I can improve in our setup to avoid situations like this in the future?

Thanks in advance

Koen

Post by **Vitaliy S.** » Feb 11, 2014 11:22 am

Hi Koen,

You may want to enable data block verification of the backup files in the SureBackup job wizard; this will detect this kind of backup file corruption. This is a new feature we added in v7.

Thanks!

Post by **dr.Koen** » Feb 11, 2014 12:50 pm

Thanks Vitaliy. I must have overlooked this.

Post by **Rumple** » Mar 03, 2014 10:40 pm

I have a serious concern with the reliability of the backup chains at the moment.
I have an Exchange server with about 1.5TB of data on it doing a nightly backup. Backups have all been completing successfully for months. I keep about 9 restore points or so (basically 2 fulls and then some).
I wanted to use the backups as a seed for a new replica but kept getting errors, so I manually tried doing a restore. Unfortunately, the OS disk (0:0) is corrupt, with LZ4 errors when I try to restore. I can do an FLR no problem, but I can't restore the VMDK. The other 14 disks seem to be fine.

THAT'S A PROBLEM (case # 00523694), especially when the request from support is to perform an active full. Yeah... I either spend about 20 hours doing 1 drive at a time each time I run the backup, or I leave the system in snapshot mode for 1 week+. You can imagine how that's going to go.

I can understand my backups failing with an LZ4 error if something has gone wrong... but how the hell are the backups completing successfully and yet corrupt? If nothing else, the next FULL backup should have started the chain of errors.
How many of my other backups are corrupt, with no indication until I need them?
The underlying storage checkdisk comes back fine... and I am OK if it's a storage problem... but I should find that out during the next backup cycle, not during the next restore.

Post by **veremin** » Mar 04, 2014 10:27 am

I'm wondering what type of device you're using as the backup repository. The underlying storage might have experienced the notorious "bit rot" problem, which resulted in backup data corruption and went completely unnoticed by VB&R.

The best way to protect against such issues is to have more than one copy of your backups and to test the recoverability of the backup data. For instance, SureBackup should be able to detect such problems.

Anyway, please keep working with the support team. They will be able to shed more light on the root cause of this behavior.

Thanks.

Post by **Rumple** » Mar 08, 2014 4:43 am

The backup server is a Dell server with a PERC 6 RAID controller. There are no issues with the drives according to chkdsk.
I've worked with support and they found the backup validator for me. I am running it against all my jobs now, and so far it appears I have multiple jobs with bad backups.
I am OK with the explanation that something on the storage is doing it... I really am... but someone wrote a validator tool, so why are you not using it as part of the backup process?

However, that still doesn't fix the problem that Veeam is happily doing synthetic rollups of my jobs, happily completing each backup job, and essentially doing it on useless backups.

I'm sorry, but that's unacceptable. Sure, I can spend my time doing restores every night of every job to make sure they are working (which is how I found the issue, doing a monthly test), but shouldn't the program be able to pick that up at some point?

Multiple copies of my backups wouldn't have helped, now would they, since all I would have done is replicate the error to my other copy (garbage in, garbage out).

The fact of the matter is...everyone should be very concerned about the state of their backups if you can only find the problem during a restore or a manual validation of each job.

Post by **Vitaliy S.** » Mar 10, 2014 3:53 pm

I agree that multiple backup copies might not help here, but you can use SureBackup and enable data block verification of the backup files in the SureBackup job wizard; this will detect this kind of backup file corruption. This is a new feature we added in v7.

Post by **davidb1234** » Mar 10, 2014 8:23 pm

I hate to say it but I recently had the same issue.

The forever reverse incremental backup of our production SQL VM was reporting success. However, when we booted up this VM from the backup file, one of the databases was corrupted. This database was fine in production and only corrupted in the backup file.

Running a full active backup with CBT enabled still produced a corrupted backup file. It wasn't until we disabled CBT and did a backup that the database was not corrupted in the backup file. We ultimately reset CBT data and now appear to be fine.

However, it is EXTREMELY SCARY that Veeam can go for weeks or months thinking it is backing up fine while there is corruption in the file, and the only way to notice is to literally validate the data somehow. Sometimes this is easy (DBCC CHECKDB), sometimes this is much harder (file system or Exchange).

Post by **davidb1234** » Mar 10, 2014 8:26 pm

Vitaliy S. wrote: I agree that multiple backup copies might not help here, but you can use SureBackup and enable data block verification of the backup files in the SureBackup job wizard; this will detect this kind of backup file corruption. This is a new feature we added in v7.

This feature did not detect the corruption in our backup.

Running an active full also did not create a valid backup file.

The only thing that resolved our corruption was disabling CBT in the backup job or resetting CBT data.

At no point was Veeam able to tell us that the file was corrupt or unusable. We had to stumble upon the issue when we needed to restore the data and found it to be bad.

We have very good Fibre Channel storage end to end, so our storage cannot be to blame.

Mar 10, 2014 9:03 pm

It sounds like CBT was to blame in this case. I've certainly seen cases where it appears that CBT data doesn't change from one day to the next, which results in corruption of the backup. This isn't really something Veeam could detect, because the problem is that the CBT mechanism is not returning the complete/correct list of blocks to be backed up. I'm not sure what exactly leads to this issue, but I've seen it in two different environments.

However, SureBackup can indeed be used to detect this, but it would require creating a custom script that validates the databases as part of the verification process. By default it simply checks that the DB starts correctly. The same could be done with Exchange or, perhaps to a limited extent, files.

Post by **Gostev** » Mar 10, 2014 10:38 pm

SureBackup script can validate just about anything. Some customers have shared really cool scripts with us, I believe we are planning to enhance the default application verification scripts in v8 based on that (at least for SQL), but of course you don't have to wait, just create your own script that validates whatever you feel is necessary.

Post by **tsightler** » Mar 11, 2014 12:28 am

Any product can back up garbage if the OS from which the data is being read doesn't provide good data. Many years ago we had an issue where some important files were discovered to be corrupt on a file server. Unfortunately, while these files were important, they were not accessed particularly often, and when it was discovered they were unreadable we found that our file-level backups going back months were still damaged. We were able to recover some of the oldest files from archive tapes, but most of the newer files that were only on monthly backups were still corrupt. It appeared to be corruption within the NTFS filesystem, as even the metadata on the files was damaged; however, the OS would copy the files, bad metadata and all, so the backup software happily backed up the corrupt files and metadata. Verifying integrity of files is quite difficult.

In the case of Veeam, we are trusting that VMware provides us valid information about the blocks to be backed up. If for some reason this isn't the case, which certainly sounds like the problem here if resetting CBT fixed the issue, then it would be very difficult for Veeam to detect this. Even a backup validation wouldn't check this, and it would impact any product that uses the VMware APIs for backup. SureBackup technology with custom scripting may very well be the only solution on the market that offers any method of detecting this type of issue.
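A toy model makes this failure mode concrete. In the sketch below, the source disk and the backup are plain lists of blocks, and `reported_changed` stands in for the changed-block list returned by the hypervisor. None of these names correspond to a real VMware or Veeam API. When one changed block is missing from that list, the incremental run copies everything it was told to copy, every stored block still matches its own checksum, and yet the backup no longer matches the source.

```python
import zlib


def incremental_backup(source, backup, reported_changed):
    """Copy only the blocks that the change tracker reported as modified."""
    for index in reported_changed:
        backup[index] = (zlib.crc32(source[index]), source[index])


def backup_is_self_consistent(backup):
    """Block-level validation: every stored block matches its stored CRC."""
    return all(zlib.crc32(payload) == crc for crc, payload in backup)


def blocks_differing_from_source(source, backup):
    """Ground truth that is only available while the source is still readable."""
    return [i for i, (_, payload) in enumerate(backup) if payload != source[i]]


# Full backup of a four-block disk.
source = [b"boot", b"mft", b"db-page-1", b"db-page-2"]
backup = [(zlib.crc32(block), block) for block in source]

# Blocks 2 and 3 change, but the tracker only reports block 2.
source[2] = b"db-page-1-v2"
source[3] = b"db-page-2-v2"
incremental_backup(source, backup, reported_changed=[2])

print(backup_is_self_consistent(backup))             # True
print(blocks_differing_from_source(source, backup))  # [3]
```

The stale block is a perfectly valid copy of old data, so only a check that understands the content, such as a database consistency check run inside the restored VM, has a chance of noticing the mismatch.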

Post by **davidb1234** » Mar 14, 2014 3:44 pm

Gostev wrote: SureBackup script can validate just about anything. Some customers have shared really cool scripts with us, I believe we are planning to enhance the default application verification scripts in v8 based on that (at least for SQL), but of course you don't have to wait, just create your own script that validates whatever you feel is necessary.

Support provided me a Sqlchecker script. Unfortunately, it would not have caught these issues, as the database was corrupted but still online/mounted. Only a manual DBCC CHECKDB caught the corruption.

The SQLchecker script that Veeam support is handing out as a fix only checks that the databases are all online and mounted, not whether there is block corruption in them. The only way to do that, as far as I know, is manually.
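For what it is worth, DBCC CHECKDB can be driven from a script rather than run by hand. The sketch below is not the script Veeam support hands out. It assumes that the `sqlcmd` utility is installed on the machine running the check, that Windows authentication works against the restored instance, and that the `-b` option turns the errors raised by CHECKDB into a non-zero exit code. The function names are hypothetical, and the server and database names are placeholders.

```python
import subprocess


def build_checkdb_command(server, database):
    """Build a sqlcmd invocation that runs DBCC CHECKDB on one database."""
    query = f"DBCC CHECKDB ('{database}') WITH NO_INFOMSGS, ALL_ERRORMSGS"
    # -S: server, -E: Windows authentication,
    # -b: exit with a non-zero code when the batch raises an error.
    return ["sqlcmd", "-S", server, "-E", "-b", "-Q", query]


def database_is_consistent(server, database, runner=subprocess.run):
    """Return True when sqlcmd exits with code 0, i.e. no errors were raised."""
    result = runner(
        build_checkdb_command(server, database),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

A verification script could call `database_is_consistent("SQL01", "Sales")` for each database and exit with a failure code as soon as one call returns False. Keep in mind that a full CHECKDB on a large database takes far longer than an "is it online" test.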

It appears that CBT is more prone to corruption than taking file-level backups, so identifying these bad blocks and replacing/repairing them is very important. Veeam/VMware need to start working on this. My perception is that this is happening more often than people realize; they just don't notice it until they need the data on those backups and find out that CBT was corrupted and therefore the data is corrupted.

I've never run into corrupted backups in my life until I started using products that bank on CBT, like Veeam. Now it comes up from time to time and it can really ruin your day.

Post by **larry** » Mar 14, 2014 8:04 pm

Vitaliy S. wrote: I agree that multiple backup copies might not help here, but you can use SureBackup and enable data block verification of the backup files in the SureBackup job wizard; this will detect this kind of backup file corruption. This is a new feature we added in v7.

Is this the checkbox "Enable automatic backup integrity checks"?

Post by **veremin** » Mar 15, 2014 9:52 am

I believe Vitaliy was talking about the "Validate consistency of virtual machines' backup files" option that can be found in the settings of the SureBackup job (SureBackup Job -> Settings -> Job validation). Thanks.
