woodman
Novice
Posts: 4
Liked: never
Joined: May 06, 2021 6:47 pm
Full Name: Michael Wood
Contact:

Recovery after ReFS Corruption

Post by woodman »

Hi

I am currently recovering from corruption to our main repository (case ref #04792071). I just want to double-check that I am doing the correct thing before resuming backups. I am using a 30-day forever forward incremental chain and don't want to be surprised in 30 days' time when the full backup can't merge with the incrementals, or further corruption arises.

Brief timeline of events:
  • Backup completes OK; no errors reported.
  • Linked copy job fails for a single VM and points to corruption in the latest incremental VIB file. System events correlate with a ReFS event 133 error pointing to the same file.
  • Logged a ticket with support and disabled the backup and copy jobs.
  • Advised to run the verification tool against the backup job. This completes OK when run against the whole job, but fails when specifying both job and VM (strange?); the latest VIB was used in both runs.
  • Advised to rename the latest VIB file, rescan, then forget it. Re-run verification and then enable the backup and copy jobs again.
So far I have removed the VIB, re-run the verification, and enabled the backup job again; this completed fine just now. My only concern is around CBT, which might just be because I don't fully understand how it works and how it handles tracking after the data from the last successful backup has been removed. I read the following: "Veeam Backup & Replication queries CBT through VADP and gets the list of blocks that have changed since the last job session." If I have removed the data from the last backup session (by deleting the VIB), how does it reconcile this?

I don't want to enable the copy job just yet in case there is an issue with how I am doing this which results in me corrupting the offsite copy. I have a feeling that I should be kicking off an active full backup to protect my chain. I was concerned this might result in large amounts of data being sent to the offsite repository by the copy job, but after speaking to technical support it seems both chains are separate: the backups are mounted at both ends and only changed blocks are copied over. So essentially no more data than usual would be sent across. Is that correct?

Thanks
HannesK
Product Manager
Posts: 14322
Liked: 2890 times
Joined: Sep 01, 2014 11:46 am
Full Name: Hannes Kasparick
Location: Austria
Contact:

Re: Recovery after ReFS Corruption

Post by HannesK »

Hello,
and welcome to the forums.

CBT is nothing to worry about if the repository has an issue. post39946.html#p39946 covers different scenarios, but long story short: it's fine.
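One way to see why removing a VIB is safe: CBT is tracked on the hypervisor side against snapshot change IDs, and the next job session queries changes relative to the change ID recorded in the most recent restore point that still exists. A toy model in Python (class and method names are illustrative only, not the actual VADP API) shows that querying against an older change ID simply returns a superset of changed blocks, and an unknown change ID forces a full read rather than silently losing data:

```python
# Toy model of CBT (Changed Block Tracking) reconciliation.
# Names and behavior are illustrative; this is not the VADP API.

class ToyDisk:
    def __init__(self, nblocks):
        self.blocks = [0] * nblocks   # block contents
        self.change_log = {}          # change_id -> blocks dirtied since that snapshot
        self.current_id = 0

    def write(self, idx, value):
        """Write a block and record it as dirty relative to every known snapshot."""
        self.blocks[idx] = value
        for dirty in self.change_log.values():
            dirty.add(idx)

    def snapshot(self):
        """Take a snapshot and return its change ID (stored in backup metadata)."""
        self.current_id += 1
        self.change_log[self.current_id] = set()
        return self.current_id

    def changed_since(self, change_id):
        """QueryChangedDiskAreas analogue: blocks dirtied since change_id,
        or None if the reference point is unknown (forcing a full read)."""
        if change_id not in self.change_log:
            return None
        return sorted(self.change_log[change_id])
```

If the newest restore point is deleted, the next incremental is queried against the previous restore point's change ID, so it is merely larger than usual, never wrong.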

Yes, the backup copy job is forever incremental by default. A new full backup doesn't influence the amount of data it transfers.
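The reason an active full on the source doesn't inflate copy traffic is that the copy compares data at block level rather than copying files: only blocks whose content differs from what the target already holds are sent. A minimal sketch of that comparison (hashing granularity and function names are assumptions for illustration):

```python
# Sketch of block-level copy: only blocks that differ between the newest
# source restore point and the copy target are transferred.
import hashlib

def block_digests(data, block_size=4):
    """Digest of each fixed-size block of the data."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

def blocks_to_send(source, target, block_size=4):
    """Indices of source blocks missing from or different on the target."""
    src = block_digests(source, block_size)
    tgt = block_digests(target, block_size)
    return [i for i, d in enumerate(src) if i >= len(tgt) or tgt[i] != d]
```

So even if the source chain starts over with an active full, identical blocks hash the same and are skipped by the copy.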

Best regards,
Hannes
woodman
Novice
Posts: 4
Liked: never
Joined: May 06, 2021 6:47 pm
Full Name: Michael Wood
Contact:

Re: Recovery after ReFS Corruption

Post by woodman »

Hi Hannes,

Thanks for the reply and advice. Is it worth doing an active full backup just in case?

Thanks
Mike
HannesK
Product Manager
Posts: 14322
Liked: 2890 times
Joined: Sep 01, 2014 11:46 am
Full Name: Hannes Kasparick
Location: Austria
Contact:

Re: Recovery after ReFS Corruption

Post by HannesK »

I'm not sure whether I would like to continue storing any data on a corrupted filesystem. I recommend finding the root cause.
woodman
Novice
Posts: 4
Liked: never
Joined: May 06, 2021 6:47 pm
Full Name: Michael Wood
Contact:

Re: Recovery after ReFS Corruption

Post by woodman »

Thanks Hannes. The only reasons I can think of for this happening are a hardware issue or a firmware bug?

Coincidentally, earlier in the day (before the corrupt VIB was generated) we received an alert for a cabling issue with the cache battery backing the array card. After getting vendor support to go over all the logs and hardware, no trace of the event could be found and all hardware reports as healthy. To be on the safe side and to rule out firmware issues, I updated all the firmware yesterday.

Other than that I cannot find any root cause for the issue, and it looks like I have limited options for checking file integrity without enabling "integrity streams", which have a performance impact?

I have re-run verification using the integrity checker against the backups, which completed OK, and I assume any additional corruption (133 events) on the volume/files would have been reported when the files were read. The corrupt file that was initially reported was also removed, so I assume the corruption is gone?
HannesK
Product Manager
Posts: 14322
Liked: 2890 times
Joined: Sep 01, 2014 11:46 am
Full Name: Hannes Kasparick
Location: Austria
Contact:

Re: Recovery after ReFS Corruption

Post by HannesK »

Hello,
ah, RAID controller firmware bugs are one of the top causes of data loss. That's probably the reason, yes.

Integrity checks are built into the software (the health check checkbox in the advanced settings of a job).
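Conceptually, a health check of this kind stores a checksum per data block at write time and later re-reads the blocks to compare. A toy sketch of the idea in Python (block size and function names are assumptions, not Veeam's implementation):

```python
# Toy health check: record a CRC per block at write time,
# re-read the file later and report any blocks that no longer match.
import zlib

BLOCK = 4096  # assumed block size for this sketch

def write_with_checksums(path, data):
    """Write data to path and return the CRC32 of every block."""
    sums = []
    with open(path, "wb") as f:
        for i in range(0, len(data), BLOCK):
            chunk = data[i:i + BLOCK]
            f.write(chunk)
            sums.append(zlib.crc32(chunk))
    return sums

def verify(path, sums):
    """Re-read the file and return indices of blocks whose CRC has changed."""
    bad = []
    with open(path, "rb") as f:
        for i, expected in enumerate(sums):
            chunk = f.read(BLOCK)
            if zlib.crc32(chunk) != expected:
                bad.append(i)
    return bad
```

This is also why silent corruption (e.g. from a controller cache fault) is caught only when the blocks are actually read back, which is what the health check forces.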

As for ReFS integrity streams, it's hard to say how much performance impact they really have. post404522.html#p404522 is an interesting read.

Best regards,
Hannes
woodman
Novice
Posts: 4
Liked: never
Joined: May 06, 2021 6:47 pm
Full Name: Michael Wood
Contact:

Re: Recovery after ReFS Corruption

Post by woodman »

Thanks for pointing me to this post and all your help and advice!
