woodman
Novice
Posts: 4
Liked: never
Joined: May 06, 2021 6:47 pm
Full Name: Michael Wood
Contact:

Recovery after ReFS Corruption

Post by woodman »

Hi

I am currently recovering from corruption to our main repository (case ref #04792071). I just want to double-check that I am doing the correct thing before resuming backups. I am using a 30-day forever forward incremental chain and don't want to be surprised in 30 days' time when the full backup can't merge with the incrementals, or further corruption arises.

Brief timeline of events:
  • Backup completes OK; no errors reported.
  • Linked copy job fails for a single VM and points to corruption in the latest incremental VIB file. System events correlate with a ReFS event 133 error pointing to the same file.
  • Logged a ticket with support and disabled the backup and copy jobs.
  • Advised to run the verification tool against the backup job. This completes OK when run against the whole job, but fails when specifying both job and VM (strange?); the latest VIB was used in both runs.
  • Advised to rename the latest VIB file, rescan, then forget it. Re-run verification and then enable the backup and copy jobs again.
So far I have removed the VIB, re-run the verification, and enabled the backup job again; this completed fine just now. My only concern is around CBT, which might just be because I don't fully understand how it works and how it handles tracking after the data from the last successful backup has been removed. I read the following: "Veeam Backup & Replication queries CBT through VADP and gets the list of blocks that have changed since the last job session." If I have removed the data from the last backup session (by deleting the VIB), how does it reconcile this?

I don't want to enable the copy job just yet in case there is an issue with how I am doing this which results in me corrupting the offsite copy. I have a feeling that I should be kicking off an active full backup to protect my chain. I was concerned this might result in large amounts of data being sent to the offsite repository by the copy job, but after speaking to technical support it seems both chains are separate: the backups are mounted at both ends and only changed blocks are copied over. So essentially no more data than usual would be sent across. Is that correct?

Thanks
HannesK
Product Manager
Posts: 14322
Liked: 2890 times
Joined: Sep 01, 2014 11:46 am
Full Name: Hannes Kasparick
Location: Austria
Contact:

Re: Recovery after ReFS Corruption

Post by HannesK »

Hello,
and welcome to the forums.

CBT is nothing to worry about if the repository has an issue. post39946.html#p39946 covers different scenarios, but long story short: it's fine.
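One way to see why removing a VIB is safe: CBT is tracked on the hypervisor side against snapshot change IDs, and the next job session queries changes relative to the change ID recorded in the most recent restore point that still exists. A toy model in Python (class and method names are illustrative only, not the actual VADP API) shows that querying against an older change ID simply returns a superset of changed blocks, and an unknown change ID forces a full read rather than silently losing data:

```python
# Toy model of CBT (Changed Block Tracking) reconciliation.
# Names and behavior are illustrative; this is not the VADP API.

class ToyDisk:
    def __init__(self, nblocks):
        self.blocks = [0] * nblocks   # block contents
        self.change_log = {}          # change_id -> blocks dirtied since that snapshot
        self.current_id = 0

    def write(self, idx, value):
        """Write a block and record it as dirty relative to every known snapshot."""
        self.blocks[idx] = value
        for dirty in self.change_log.values():
            dirty.add(idx)

    def snapshot(self):
        """Take a snapshot and return its change ID (stored in backup metadata)."""
        self.current_id += 1
        self.change_log[self.current_id] = set()
        return self.current_id

    def changed_since(self, change_id):
        """QueryChangedDiskAreas analogue: blocks dirtied since change_id,
        or None if the reference point is unknown (forcing a full read)."""
        if change_id not in self.change_log:
            return None
        return sorted(self.change_log[change_id])
```

If the newest restore point is deleted, the next incremental is queried against the previous restore point's change ID, so it is merely larger than usual, never wrong.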

Yes, the backup copy job is forever incremental by default. A new full backup doesn't influence the amount of data it transfers.
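The reason an active full on the source doesn't inflate copy traffic is that the copy compares data at block level rather than copying files: only blocks whose content differs from what the target already holds are sent. A minimal sketch of that comparison (hashing granularity and function names are assumptions for illustration):

```python
# Sketch of block-level copy: only blocks that differ between the newest
# source restore point and the copy target are transferred.
import hashlib

def block_digests(data, block_size=4):
    """Digest of each fixed-size block of the data."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

def blocks_to_send(source, target, block_size=4):
    """Indices of source blocks missing from or different on the target."""
    src = block_digests(source, block_size)
    tgt = block_digests(target, block_size)
    return [i for i, d in enumerate(src) if i >= len(tgt) or tgt[i] != d]
```

So even if the source chain starts over with an active full, identical blocks hash the same and are skipped by the copy.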

Best regards,
Hannes
woodman
Novice
Posts: 4
Liked: never
Joined: May 06, 2021 6:47 pm
Full Name: Michael Wood
Contact:

Re: Recovery after ReFS Corruption

Post by woodman »

Hi Hannes,

Thanks for the reply and advice. Is it worth doing an active full backup just in case?

Thanks
Mike
HannesK
Product Manager
Posts: 14322
Liked: 2890 times
Joined: Sep 01, 2014 11:46 am
Full Name: Hannes Kasparick
Location: Austria
Contact:

Re: Recovery after ReFS Corruption

Post by HannesK »

I'm not sure whether I would like to continue storing any data on a corrupted filesystem. I recommend finding the root cause.
woodman
Novice
Posts: 4
Liked: never
Joined: May 06, 2021 6:47 pm
Full Name: Michael Wood
Contact:

Re: Recovery after ReFS Corruption

Post by woodman »

Thanks Hannes. The only reasons I can think of for this happening are a hardware issue or a firmware bug?

Coincidentally, earlier in the day (before the corrupt VIB was generated) we received an alert for a cabling issue with the cache battery backing the array card. After getting vendor support to go over all the logs and hardware, no trace of the event could be found and all hardware reports as healthy. To be on the safe side and to rule out firmware issues, I updated all the firmware yesterday.

Other than that I cannot find any root cause for the issue, and it looks like I have limited options for checking file integrity without enabling "integrity streams", which have a performance impact?

I have re-run verification using the integrity checker against the backups, which completed OK, and I assume any additional corruption (133 events) on the volume/files would have been reported when the files were read. The corrupt file that was initially reported was also removed, so I assume the corruption is gone?
HannesK
Product Manager
Posts: 14322
Liked: 2890 times
Joined: Sep 01, 2014 11:46 am
Full Name: Hannes Kasparick
Location: Austria
Contact:

Re: Recovery after ReFS Corruption

Post by HannesK »

Hello,
ah, RAID controller firmware bugs are one of the top causes of data loss. That's probably the reason, yes.

Integrity checks are built into the software (the health check checkbox in the advanced settings of a job).
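Conceptually, a health check of this kind stores a checksum per data block at write time and later re-reads the blocks to compare. A toy sketch of the idea in Python (block size and function names are assumptions, not Veeam's implementation):

```python
# Toy health check: record a CRC per block at write time,
# re-read the file later and report any blocks that no longer match.
import zlib

BLOCK = 4096  # assumed block size for this sketch

def write_with_checksums(path, data):
    """Write data to path and return the CRC32 of every block."""
    sums = []
    with open(path, "wb") as f:
        for i in range(0, len(data), BLOCK):
            chunk = data[i:i + BLOCK]
            f.write(chunk)
            sums.append(zlib.crc32(chunk))
    return sums

def verify(path, sums):
    """Re-read the file and return indices of blocks whose CRC has changed."""
    bad = []
    with open(path, "rb") as f:
        for i, expected in enumerate(sums):
            chunk = f.read(BLOCK)
            if zlib.crc32(chunk) != expected:
                bad.append(i)
    return bad
```

This is also why silent corruption (e.g. from a controller cache fault) is caught only when the blocks are actually read back, which is what the health check forces.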

As for ReFS integrity streams, it's hard to say how much performance impact they really have. post404522.html#p404522 is an interesting read.

Best regards,
Hannes
woodman
Novice
Posts: 4
Liked: never
Joined: May 06, 2021 6:47 pm
Full Name: Michael Wood
Contact:

Re: Recovery after ReFS Corruption

Post by woodman »

Thanks for pointing me to this post and all your help and advice!
