Corrupted backup copy

Post by **mkretzer** » Nov 29, 2017 5:51 am

Hello,

Case 02404420

Yesterday we had an outage of the connection to our remote backup copy target. The outage happened while some copy jobs were copying data and others were merging. Also, the day before, we had a defective disk (in a RAID 6) in the storage system at one of the remote locations.
There was no loss of power and no interruption of the connection between the copy target host and its storage system. Only the WAN link failed.

Today the two health checks that ran both showed corrupted backups.

How likely is it that the corruption comes from the WAN link failure?

Markus

Post by **veremin** » Nov 29, 2017 11:13 am

An abrupt stop of a merge operation might leave backups in an inconsistent state. The support team should be able to provide the precise root cause after investigating the debug logs. Thanks.

Nov 29, 2017 11:24 am

The next job run will repair corrupt restore points and finalize the merge correctly; this will happen automatically.

Post by **mkretzer** » Nov 29, 2017 8:48 pm

@foggy:
That's what I thought. In the last 4 years we have had crashes and copy jobs disabled in the middle of a health check, merge or transfer, but our monthly health check never showed an inconsistency.
We checked more backups, and every backup file on the copy storage target shows inconsistencies for some VMs - even after merges.

Right now we are health-checking another backup copy on the same target repo server but on a different backend storage (also a different brand).

Post by **foggy** » Nov 30, 2017 10:43 am

Then please check with support.

Post by **mkretzer** » Nov 30, 2017 5:46 pm

"whether it was caused by network issues - No, it wasn't"

Sounds like the backend storage is the problem. That would be nice, because that way we can continue to trust Veeam backup copies...

Dec 03, 2017 11:29 pm

An inconsistent restore point due to a failed or aborted merge is not a problem, as it will be repaired automatically by the next run (and previous restore points will not be affected regardless). As long as the backup storage itself is solid and not causing the corruption, there's nothing to worry about. This is why it is important to determine the root cause with our support. Thanks!

Post by **mkretzer** » Dec 04, 2017 5:36 am

Will do. We also suspect the storage now.

The problem is that it is a copy target, and support still has not told us how we can run the Validator without a full Veeam installation...

Dec 04, 2017 7:52 am

If you suspect the storage, then let us know what storage you are using and its firmware level. Someone out there may be able to throw some light on the matter.

Post by **mkretzer** » Dec 05, 2017 8:32 am

It was an MD3600f from Dell, but that storage has no data corruption protection. We have now migrated to a HUS110 from Hitachi, which has all the protections. Still, the same backup shows as corrupted (we re-seeded it from the main site).

The main site backup was also checked and shows no errors.

Post by **foggy** » Dec 05, 2017 1:01 pm

So you've copied the verified backups from the source, used them to seed the job, and they are shown as corrupt on the target?

Post by **mkretzer** » Dec 05, 2017 2:02 pm

No, not immediately. The first check was ok.

Here is what happened exactly:

- A backup copy job mirrors backups from the primary site (last active full ~4 weeks ago, health check yesterday) to a remote site. The remote site had a Dell MD3600f (RAID 6 config) as backup storage.
- Last week this MD3600f showed a strange warning that a disk was in a pre-failure state.
- My co-worker sadly replaced the wrong disk in the same RAID 6. We waited for the RAID rebuild to finish before we replaced the disk that was actually defective. My theory is that the first rebuild was done with data from the partly dead disk.
- After all the rebuilds we saw corruption in nearly all health checks we ran on the remote site.
- Since we had a free Hitachi HUS110 on the main site, we used temporary backup copy jobs to create a new backup seed on that storage.
- We then brought that storage system to the remote site, created & rescanned the repo, disabled the copy jobs, removed the old, corrupt backup copies from the configuration, pointed the copy jobs at the new repo and mapped the backups.
- All copy jobs did an initial health check without errors.
- After two days (and the first merge) another health check again showed corruption in the same VM file as on the old storage.
- A health check on the primary site showed no issues with the primary site backups.

The only thing I can think of is that Veeam tried to "heal" the corrupt blocks on the remote site after the backup target was replaced (we kept the same jobs).

Post by **foggy** » Dec 05, 2017 4:36 pm

Looks strange, please ask support to investigate.

Dec 05, 2017 4:40 pm

They are on it. Case 02409456 now.

Post by **mkretzer** » Dec 06, 2017 7:25 am

One more thing: How should the "auto-healing" of the copy job data work? I do not have the feeling it is doing its job!

Post by **Gostev** » Dec 06, 2017 8:43 am

mkretzer wrote: How should the "auto-healing" of the copy job data work?

https://helpcenter.veeam.com/docs/backu ... tml?ver=95

Post by **mkretzer** » Dec 06, 2017 9:00 am

So until the merge of that point is done, the health check can still find corrupted data?

Post by **Gostev** » Dec 06, 2017 9:02 am

Not sure I get the question... but health check verifies the latest restore point regardless of its placement in the backup files.

Post by **mkretzer** » Dec 06, 2017 9:55 am

But it always shows the error in the vbk:
06.12.2017 09:54:42 :: Disk wsus_1-flat.vmdk of VM wsus.sw.buhl-data.com is corrupted, possible reason: Storage I/O issue. Corrupted data is located in the following backup files: wsus.vm-781264D2017-12-04T132056.vbk

If there is an incremental point after that (which fixes the "chain" but not the vbk until the merge) and another health check runs, will Veeam still show the corruption in that health check?

Post by **DGrinev** » Dec 06, 2017 11:55 am

Hi,

No. Since we read the backup chain starting from the latest increment (which consists of healthy data blocks), the corrupted data blocks are not needed anymore. Thanks!
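
To illustrate the idea, here is a minimal sketch in Python, assuming a heavily simplified model in which each backup file is just a map from block index to data and the newest file that holds a block wins. The dict layout and the name `resolve_latest` are hypothetical illustrations, not Veeam's on-disk format or API.

```python
from typing import Dict, List

# Hypothetical model: one dict per backup file, block index -> block data.
BackupFile = Dict[int, bytes]


def resolve_latest(chain: List[BackupFile]) -> Dict[int, bytes]:
    """Return the blocks that make up the latest restore point.

    The chain is ordered oldest (full) to newest (last increment).
    Reading starts at the newest file, so an older copy of a block is
    only used when no newer file contains that block.
    """
    resolved: Dict[int, bytes] = {}
    for backup_file in reversed(chain):
        for index, data in backup_file.items():
            # setdefault keeps the first (newest) copy that was seen
            resolved.setdefault(index, data)
    return resolved


full = {0: b"boot", 1: b"CORRUPT", 2: b"data"}  # block 1 is damaged in the full
increment = {1: b"good"}                         # healthy copy of block 1 sent later
latest = resolve_latest([full, increment])
```

In this toy model the damaged copy of block 1 still sits in the full backup file, but the latest restore point never reads it.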

Post by **mkretzer** » Dec 06, 2017 3:31 pm

Then this does not seem to work. In today's copy job a corruption of a smaller VM was detected and the new increment was transferred, but the validator still shows the corruption. I really have the feeling that the validator checks only the vbk, or at least starts with the vbk.

I will try to get a health check running and see if the result differs.

Post by **foggy** » Dec 06, 2017 3:39 pm

Veeam Backup Validator is a different thing: unlike the health check, it recalculates checksums for the entire backup chain.
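
A rough sketch of that difference, again in a toy model and not based on how either tool is actually implemented: each stored block carries the checksum recorded when it was written, `check_latest_point` looks only at the blocks the latest restore point would read, and `check_whole_chain` recalculates the checksum of every block in every file. All names here are hypothetical, and CRC32 is only a stand-in for whatever checksum the product really uses.

```python
import zlib
from typing import Dict, List, Tuple

# Hypothetical model: block index -> (data, checksum stored at write time).
Block = Tuple[bytes, int]
BackupFile = Dict[int, Block]


def is_intact(block: Block) -> bool:
    data, stored_crc = block
    return zlib.crc32(data) == stored_crc


def check_latest_point(chain: List[BackupFile]) -> List[int]:
    """Check only the blocks that the latest restore point reads."""
    seen = set()
    bad = []
    for backup_file in reversed(chain):  # newest file first
        for index, block in backup_file.items():
            if index in seen:
                continue  # superseded by a newer copy, never read
            seen.add(index)
            if not is_intact(block):
                bad.append(index)
    return sorted(bad)


def check_whole_chain(chain: List[BackupFile]) -> List[Tuple[int, int]]:
    """Recalculate the checksum of every block in every file."""
    return [
        (file_no, index)
        for file_no, backup_file in enumerate(chain)
        for index, block in backup_file.items()
        if not is_intact(block)
    ]


def stored(data: bytes) -> Block:
    return data, zlib.crc32(data)


# Block 1 in the full was damaged after it was written; the increment
# carries a healthy copy of the same block.
full = {0: stored(b"boot"), 1: (b"garbage", zlib.crc32(b"orig"))}
increment = {1: stored(b"orig")}
chain = [full, increment]
```

With this chain, `check_latest_point(chain)` reports nothing, while `check_whole_chain(chain)` still flags block 1 in the full. That is consistent with the validator reporting the vbk until the merge replaces the damaged block.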

Post by **mkretzer** » Dec 06, 2017 5:05 pm

So in that case, after the backup with the correction is merged, the validator should show no errors anymore?

Post by **DGrinev** » Dec 07, 2017 1:52 pm

After a brief discussion with the QA team: no errors are expected once the merge of healthy blocks is done. Thanks!

Post by **mkretzer** » Dec 07, 2017 4:11 pm

Strange: according to support, the health check might not really fix these issues:
"About the HealthCheck - we can't be sure that it will be able to heal this particular corruption, and probably BackupCopyJob chain should be recreated."

Post by **DGrinev** » Dec 08, 2017 2:48 pm

Unfortunately, there are several unique conditions that lead to an inconsistency of the backup chain which the Health Check cannot fix.
I assume the support team found those conditions during the investigation and reported the negative results to you. Thanks!

Post by **mkretzer** » Dec 08, 2017 7:22 pm

OK, that leads to the question: what can the health check fix and what not? From what I understand there was just one backup block which was too short after decompression. Why should this not be fixable?

Post by **DGrinev** » Dec 09, 2017 4:36 pm

That's a good question. I will try to find out next week and give you an example. Thanks!

Dec 11, 2017 4:23 pm

The answer to this question is in the name of the feature, which is "storage-level corruption guard". And it does just that, nothing more and nothing less: detects silent data corruptions in your backup storage, and attempts to make the latest restore point restorable by copying all corrupted blocks over again from the source.
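
As a minimal sketch of that detect-and-re-copy idea, assuming a toy model where every target block stores the checksum recorded at write time and the source still holds a healthy copy: the function name `recopy_corrupted` and the data layout are hypothetical, CRC32 is a stand-in checksum, and where the product actually stores the re-copied blocks is not modelled here.

```python
import zlib
from typing import Dict, Tuple

Block = Tuple[bytes, int]  # (data, checksum recorded when the block was written)


def recopy_corrupted(target: Dict[int, Block],
                     source: Dict[int, bytes]) -> Dict[int, Block]:
    """Find target blocks whose data no longer matches the stored checksum
    and fetch fresh copies of exactly those blocks from the source."""
    repaired: Dict[int, Block] = {}
    for index, (data, stored_crc) in target.items():
        if zlib.crc32(data) != stored_crc:
            fresh = source[index]
            repaired[index] = (fresh, zlib.crc32(fresh))
    return repaired


source = {0: b"boot", 1: b"orig"}
target = {
    0: (b"boot", zlib.crc32(b"boot")),     # healthy
    1: (b"garbage", zlib.crc32(b"orig")),  # silently corrupted on the target storage
}
repaired = recopy_corrupted(target, source)
```

Only block 1 is fetched again; healthy blocks are left alone. The sketch also shows the limit of the approach: it can only help when the damage is detectable as a checksum mismatch and the source can still deliver the block.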

Dec 11, 2017 7:45 pm

And it works

The last restore point shows clean health even without an active full.
