-
- Novice
- Posts: 5
- Liked: 1 time
- Joined: Nov 05, 2021 6:16 pm
- Full Name: Henry Frazier
- Contact:
Veeam B&R Health Check - feature request
Below is some background information on our Backup Data Integrity efforts. I apologize for the length, but hopefully it provides a better understanding of the feature request that follows.
Requirements:
1. Implement adequate measures to identify and mitigate corruption in restore points
2. Maintain relevant RTO and RPO commitments established in SLA agreements
3. Mitigation solutions deployed must operate effectively at scale with acceptable operational and financial impact
Acknowledgements/Constraints:
1. All verification processes add I/O and processing demands to the backup infrastructure
2. Insufficient resources or bandwidth available to continuously validate all historical recovery points for all workloads against all possible forms of corruption
Environment Description:
1. Two sites (DC1 and DC2)
2. B&R 11 is installed at each site
3. Backup jobs use their local site repository as their target
a. Weekly synthetic fulls scheduled on different days to stagger the I/O load temporally
4. Backup Copy (“BC”) jobs use the remote site repository as their target
Selected Methodology:
As a baseline standard of care for all jobs, we implemented weekly active integrity checks of each backup job’s current backup chain and weekly active fulls from source to the secondary repository on BC jobs. This would normally be accomplished with weekly Health Checks on backup jobs (one read pass) and weekly active fulls on BC jobs (one read pass), for a total of two read passes.
To accomplish both tasks concurrently with one read operation, we configured our BC jobs to “read the entire restore point from source backup”. In addition, we configured the BC jobs to use a data reduction / compression setting other than “Auto”; in our case we used “Dedup-friendly”. A setting of “Auto” instructs the job to leave the data exactly as it is on the source. By selecting a different compression setting, the BC job will validate the checksum on the metadata file and the hash on each LZ4 block after decompressing it. I believe this is similar to the methods employed by a Health Check.
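For illustration only, here is a minimal sketch of the decompress-and-compare idea described above, assuming each stored block has an expected hash recorded in metadata. The block layout, field shapes, and use of SHA-256 are assumptions for the example and are not Veeam's actual backup file format.

```python
import hashlib
import lz4.frame  # third-party "lz4" package

def verify_block(compressed_block: bytes, expected_hash: str) -> bool:
    """Decompress one stored block and compare its hash with the value
    recorded in the metadata (the decompress-and-compare idea above)."""
    data = lz4.frame.decompress(compressed_block)
    return hashlib.sha256(data).hexdigest() == expected_hash

def find_corrupted_blocks(blocks: list[bytes], expected_hashes: list[str]) -> list[int]:
    """Return indexes of blocks whose hash does not match the metadata.
    An empty list means the restore point verified cleanly."""
    return [i for i, (blk, h) in enumerate(zip(blocks, expected_hashes))
            if not verify_block(blk, h)]
```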
If all the blocks are corruption-free, the BC job completes successfully. If corruption is found, the job fails. In the event of a failure, the steps below are performed (a minimal orchestration sketch follows the note):
1. Quick backup of the VM to establish a new chain on the backup job repository
2. Active full of the BC job
3. Optional - Remove the corrupted restore points from the repository and instruct B&R to forget them
Note: Beyond this baseline standard of care, additional tools such as SureBackup should be used for critical workloads/systems.
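A minimal orchestration sketch of the three failure-handling steps above. The helper functions are hypothetical placeholders for whatever automation interface is available; they are not actual Veeam cmdlets or API calls.

```python
# Hypothetical placeholders -- stand-ins for whatever automation interface
# is available; not actual Veeam cmdlets or API calls.
def quick_backup(vm: str) -> None:
    print(f"quick backup of {vm} to start a new chain on the backup repository")

def bc_active_full(vm: str) -> None:
    print(f"active full of the backup copy job for {vm}")

def forget_restore_points(vm: str, points: list[str]) -> None:
    print(f"removing and forgetting {len(points)} corrupted restore points for {vm}")

def remediate(vm: str, corrupted_points: list[str], remove_corrupted: bool = False) -> None:
    """Mirror the three remediation steps listed above for one affected VM."""
    quick_backup(vm)         # 1. establish a new chain on the backup job repository
    bc_active_full(vm)       # 2. active full of the BC job
    if remove_corrupted:     # 3. optional clean-up of the corrupted restore points
        forget_restore_points(vm, corrupted_points)
```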
Feature requests:
Health Check is currently an option on BC jobs, but only validates the target repository restore points.
Consider adding an option to validate the source restore points during the transfer, similar to the configuration described above. It would be available only if “read the entire restore point from source backup” is selected. If corruption is detected, HC could use its self-heal feature to remediate the source chain, mark the restore point as corrupted, and perform the BC active full on just the affected VM.
-
- Product Manager
- Posts: 14715
- Liked: 1702 times
- Joined: Feb 04, 2013 2:07 pm
- Full Name: Dmitry Popov
- Location: Prague
- Contact:
Re: Veeam B&R Health Check - feature request
Hello Henry,
Thank you for the detailed description of your feature request: sounds interesting! We will discuss this request with the team.
-
- Novice
- Posts: 5
- Liked: 1 time
- Joined: Nov 05, 2021 6:16 pm
- Full Name: Henry Frazier
- Contact:
Re: Veeam B&R Health Check - feature request
There may be other opportunities to eliminate additional read passes and expand the coverage of best practices through automation. Consider the following (a synthetic full integrated with in-line HC, GFS archiving & BC GFS archiving options). When all three options are employed, I/O is minimized by reducing the read passes required by traditional methods by a factor of three (a read-pass accounting sketch follows the numbered list below).
Synthetic Full creation options
In-line Health Check
In-line GFS archive creation
In-line BC GFS archive creation
This integration would combine the best practice of synthetic fulls with best practices of periodic HC for backup job current chains, active fulls for GFS archives and active fulls for BC GFS archives. Self-healing for synthetic fulls, GFS archiving and BC GFS archiving could be integrated to provide end-to-end automation of best practices.
During the periodic synthetic full on backup jobs, provide an option to use the read pass to:
1. Validate the metadata file checksum and the hash of each LZ4 block to verify the current backup chain in-line (In-line Health Check)
2. Build the synthetic full on the backup job repository
3. If selected, create a new full GFS archive and BC GFS archive (In-line GFS archive & In-line BC GFS archive)
a. This process uses the best practice of periodic active fulls for BC jobs (same as “read the entire restore point from source backup”)
b. Note: on ExaGrid repositories the newly created synthetic full & GFS archive are identical, so they could be implemented with a single write pass to build the synthetic full followed by a pointer operation to point the GFS archive at it. This would happen eventually during the deduplication process anyway.
4. If corruption is found, automate self-healing for VMs with corrupted restore points in the current backup chain after the synthetic full process completes
a. Mark corrupted restore points in backup job chain as corrupted
b. Granular active full of VM from production source when corruption is found
c. Perform a single read operation from the source backup repository using the new VM active full from the step above, validating the metadata file checksum and LZ4 blocks in-line
d. Create synthetic full, GFS archive & BC GFS archives
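To make the factor-of-three read-pass reduction mentioned above concrete, here is a small accounting sketch; the counts are illustrative weekly read passes over the source backup chain based on the description in this thread, not measured figures.

```python
# Illustrative weekly read passes over the source backup chain.
separate_passes = {
    "Health Check on the current backup chain": 1,
    "Active full for the GFS archive": 1,
    "Active full for the BC GFS archive": 1,
}
integrated_passes = {
    "Synthetic full with in-line HC, GFS and BC GFS creation": 1,
}

total_separate = sum(separate_passes.values())      # 3 read passes
total_integrated = sum(integrated_passes.values())  # 1 read pass
print(f"separate methods: {total_separate} read passes per week")
print(f"integrated option: {total_integrated} read pass per week")
print(f"reduction factor: {total_separate // total_integrated}x")
```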
-
- Novice
- Posts: 5
- Liked: 1 time
- Joined: Nov 05, 2021 6:16 pm
- Full Name: Henry Frazier
- Contact:
Re: Veeam B&R Health Check - feature request
Additional comments:
[Follow on - Synthetic fulls feature requests]
If synthetic fulls are created with the in-line HC option, there is no requirement to perform a separate active full for the backup job GFS archive. In this case, the existing full in the chain could become the backup job’s latest GFS archive.
If the suggested options for synthetic fulls are utilized on a backup job, they would negate/replace the BC job’s GFS archive definition and scheduling configuration. The only remaining items processed by the BC job would be the backup job’s backup chain restore points used as a source for the BC job’s backup chain.
[Follow on - BC job feature requests:]
It is unclear whether the BC job’s transfers of the backup job’s restore points from the primary job’s repository to the BC job’s backup chain also include metadata file checksum comparison and LZ4 block hash verification when the BC job’s data reduction / compression setting is set to something other than “Auto”. If not, this could also be a sub-option (In-line source incremental Health Check) for configuring HC on BC jobs. Such a verification process would provide the most frequent, granular, periodic identification and self-healing opportunity, as it would be performed on each incremental restore point transferred from the backup job’s backup chain on the source repository to the BC job’s backup chain on the target repository.
-
- Novice
- Posts: 5
- Liked: 1 time
- Joined: Nov 05, 2021 6:16 pm
- Full Name: Henry Frazier
- Contact:
Re: Veeam B&R Health Check - feature request
[Follow on - BC job backup chain merge operation]
This may be one more process that could have an option (In-line Merge Health Check) where corruption detection could be performed without incurring additional I/O, although it would involve additional compute resources. Self-healing could be optional but, as with all scenarios, would involve additional I/O.
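A minimal sketch of what an in-line merge check might look like, assuming the merge already reads every block it consolidates and each block has an expected hash in metadata; the only added cost is the decompress-and-hash step (compute, not extra read I/O). As before, the hash choice and block layout are assumptions for the example.

```python
import hashlib
import lz4.frame  # third-party "lz4" package

def merge_with_inline_check(blocks: list[bytes], expected_hashes: list[str]):
    """Consolidate blocks into a new full while verifying each one in-line.

    Returns (merged_blocks, corrupted_indexes). The blocks are read for the
    merge anyway, so verification adds CPU work but no additional read I/O."""
    merged, corrupted = [], []
    for i, (blk, expected) in enumerate(zip(blocks, expected_hashes)):
        data = lz4.frame.decompress(blk)
        if hashlib.sha256(data).hexdigest() != expected:
            corrupted.append(i)   # flag the block for optional self-healing
            continue
        merged.append(blk)        # verified block is kept in the merged full
    return merged, corrupted
```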
-
- Novice
- Posts: 5
- Liked: 1 time
- Joined: Nov 05, 2021 6:16 pm
- Full Name: Henry Frazier
- Contact:
Re: Veeam B&R Health Check - feature request
Consider the scenarios below, which build on the previously discussed methods and propose extending the functionality of Health Check.
Backup Job and Backup Copy Job – backup chain creation scenario
I. Example of backup copy immediate transfer of new backup job restore points with corruption detection and self-healing
(This configuration would eliminate the 1 read pass (a 2x reduction) required for a separate Health Check on the backup job restore points)
1. Backup job writes all the VMs’ new incremental restore points to its repository.
2. Backup copy job with “Immediate Copy” mode starts the transfer of the new restore points for all VMs in the backup job as soon as it completes
3. During the transfer the backup copy job detects corruption in a new restore point for one or more VMs in the backup job
4. After the backup copy job finishes transferring the validated, corruption-free restore points for the other VMs using the decompression method (a minimal sketch of this self-healing flow follows step c):
a. Health Check would mark the corrupted restore points “Corrupt” in the Backup and Replication database
b. Health Check would then initiate an active full from the production environment for only the VMs where it discovered corruption (possibly through the Veeam vSphere plug-in)
c. Once the granular active fulls complete, the backup copy job would immediately transfer them to the target proxy/repository to create new active fulls from the source again with corruption detection and self-healing
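A minimal sketch of the scenario I flow, showing that healing is deferred until the copy job has finished the healthy VMs. All four helpers are hypothetical placeholders; no real Veeam API or cmdlet is implied.

```python
# Hypothetical placeholders for whatever automation interface is available.
def transfer_restore_point(vm: str) -> bool:
    """Copy the VM's newest restore point; return False if corruption is found."""
    return True

def mark_corrupt(vm: str) -> None: ...
def granular_active_full(vm: str) -> None: ...
def copy_to_target(vm: str) -> None: ...

def immediate_copy_with_self_heal(vms: list[str]) -> None:
    # Steps 2-3: copy every VM's new restore point, noting the corrupted ones.
    corrupted = [vm for vm in vms if not transfer_restore_point(vm)]

    # Step 4: heal only the affected VMs once the healthy ones are finished.
    for vm in corrupted:
        mark_corrupt(vm)           # 4a. mark the restore point "Corrupt" in the database
        granular_active_full(vm)   # 4b. new full from production for this VM only
        copy_to_target(vm)         # 4c. immediately copy the fresh full to the target
```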
Backup Copy Active Full GFS Archive creation scenarios
II. Example of backup copy Active Full GFS Archives with optional Health Check
(This configuration would eliminate the 1 read pass (a 2x reduction) required for a separate Health Check on the backup job restore points)
1. Backup copy job validates metadata file checksum for the backup job
2. Backup copy job begins to read the backup chain’s full and incremental restore points for all VMs in the job from the backup job’s source repository
3. Backup copy job decompresses the LZ4 block and compares the hash with the expected hash for the block that was stored in the metadata file for the backup job (enabled by setting compression to something other than “Auto”)
4. If the block is not corrupted, the backup copy job compresses the block and transfers it to the target proxy/repository to build a new weekly Active Full GFS Archive
5. If corruption is detected for an LZ4 block, the backup copy job stops processing that VM, removes the new partial Active Full GFS Archive for the affected VM, and continues processing the remaining VMs in the backup copy job
6. Once the backup copy job finishes processing all VMs
a. Health Check would mark the corrupted restore points “Corrupt” in the Backup and Replication database
b. Health Check would then initiate an active full from the production environment for only the VMs where it discovered corruption (possibly through the Veeam vSphere plug-in)
c. Once the granular active fulls complete, the backup copy job would immediately transfer them to the target proxy/repository to create new active fulls from the source again with corruption detection and self-healing
III. Example of backup job weekly synthetic full integrated with creation of backup copy Active Full GFS Archives with optional Health Check
(This configuration would eliminate the 2 separate read passes (a 3x reduction) required for the Health Check on the backup job restore points and for the backup copy Active Full GFS Archive copied from source; a single-read fan-out sketch follows the steps)
1. Backup job validates metadata file checksum for the backup job
2. Backup job begins to read the backup chain’s full and incremental restore points for all VMs in the job from its primary repository
3. Backup job decompresses the LZ4 block and compares the hash with the expected hash for the block that was stored in the metadata file for the backup job
4. If the block is not corrupted, the backup job writes the block to the VM’s new synthetic full and transfers the same block to the target proxy/repository to build a new weekly Active Full GFS Archive
5. If corruption is detected for the LZ4 block, the backup job stops processing that VM, removes the new partial synthetic full and Active Full GFS Archive for the affected VM, and continues with processing the remaining VMs in the backup job
6. Once the backup job finishes processing all VMs
a. Health Check would mark the corrupted restore points “Corrupt” in the Backup and Replication database
b. Health Check would then initiate an active full from the production environment for only the VMs where it discovered corruption (possibly through the Veeam vSphere plug-in)
c. Once the granular active fulls complete, the backup copy job would immediately transfer them to the target proxy/repository to create new active fulls from the source, again with corruption detection and self-healing
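A minimal sketch of scenario III's single read pass fanning out to two destinations; the write targets are plain callables here rather than any real repository API, and the hash/compression details follow the earlier assumptions.

```python
import hashlib
import lz4.frame  # third-party "lz4" package
from typing import Callable, Iterable

def synthetic_full_with_fanout(
    blocks: Iterable[bytes],                        # compressed blocks of the current chain
    expected_hashes: Iterable[str],                 # per-block hashes from the metadata file
    write_synthetic_full: Callable[[bytes], None],  # local repository writer
    write_gfs_archive: Callable[[bytes], None],     # remote BC GFS archive writer
) -> list[int]:
    """Steps 1-5 of scenario III as one loop: verify each block on the single
    read pass, then write it to both targets. Returns indexes of corrupted
    blocks so the caller can run self-healing (step 6) for the affected VM."""
    corrupted = []
    for i, (blk, expected) in enumerate(zip(blocks, expected_hashes)):
        data = lz4.frame.decompress(blk)
        if hashlib.sha256(data).hexdigest() != expected:
            corrupted.append(i)
            continue
        write_synthetic_full(blk)  # build the new synthetic full locally
        write_gfs_archive(blk)     # the same block also feeds the BC GFS archive
    return corrupted
```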
Combining scenarios I & II provides a robust, automated, end-to-end implementation of several best practices:
1. Immediate copy of new backup copy incrementals
2. Integrated immediate corruption detection for backup job metadata file and incremental restore points
3. Eliminates read pass of separate Health Check
4. Weekly backup copy Active Full GFS Archive
5. Weekly integrated corruption detection on entire current backup job chain
6. Granular and comprehensive Health Check self-healing for current backup job full chain
Combining scenarios I & III adds further I/O efficiency, but the tradeoff is that no in-line corruption detection will be performed on the synthetic fulls. This could be mitigated to a degree by using periodic active fulls on the backup job. When using this mitigation, carefully consider the impacts on production resources and scalability.