Comprehensive data protection for all workloads
DanielJ
Service Provider
Posts: 200
Liked: 32 times
Joined: Jun 10, 2019 12:19 pm
Full Name: Daniel Johansson
Contact:

Thoughts about block-based repository

Post by DanielJ »

I just wanted to throw out this idea, feel free to comment/shoot it down. :wink:

Today a repository is a space on which to build backup files, full and incremental. What if we never built those files, but only saved the CBT blocks as they were read from the source? One block in each file (which could be named for its hash value). A block which is already in the repository would not be saved again. A block no longer in use would be deleted (by some cleanup job). No need for forward/reverse or indeed any transformations at all. Metadata would be saved on the side and VMDKs could be reconstructed from blocks if needed.

I know CBT blocks have variable size but even today we have to re-read the whole disk when the size changes, so that wouldn't be much different. If a backup was interrupted and must be retried, it would benefit from the blocks from the previous run already being in the repository. Any unused blocks would be found and deleted by the cleanup job.

This repository would also replace the SOBR, since it wouldn't be limited to a single server. The database keeps track of what is where. A job would be able to write to multiple servers in the same repository - a nice performance bonus. For added resilience we could define policies that each block must be saved in two or three copies (on different servers) as long as we had the disk space. That policy could also be configured to sacrifice extra copies rather than letting the repository fill up, and give a warning in the console that it's time to add another server.
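To make the idea concrete, here is a minimal toy sketch of such a content-addressed block store: blocks are stored once under their hash, a backup is just an ordered list of hashes (the metadata "on the side"), and a disk can be reconstructed from that list. All names and the flat-directory layout are my own invention, not anything Veeam actually does:

```python
import hashlib
import os
import tempfile

# Hypothetical repository root: one file per unique block, named by its hash.
BLOCK_DIR = tempfile.mkdtemp()

def store_block(data: bytes) -> str:
    """Save a block under its hash; skip the write if it already exists (dedup)."""
    digest = hashlib.sha256(data).hexdigest()
    path = os.path.join(BLOCK_DIR, digest)
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(data)
    return digest

def backup_disk(blocks):
    """A 'backup' is just the ordered list of block hashes."""
    return [store_block(b) for b in blocks]

def restore_disk(hashes) -> bytes:
    """Reconstruct the disk contents from its hash list."""
    parts = []
    for h in hashes:
        with open(os.path.join(BLOCK_DIR, h), "rb") as f:
            parts.append(f.read())
    return b"".join(parts)

meta = backup_disk([b"aaaa", b"bbbb", b"aaaa"])  # duplicate block is stored only once
print(len(os.listdir(BLOCK_DIR)))                # 2 unique blocks on disk
print(restore_disk(meta) == b"aaaabbbbaaaa")     # True
```

An interrupted backup retried later would indeed get the skip-if-exists benefit for free here, since `store_block` is idempotent.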

Pros:
No transformations - increased performance and less risk of corruption.
No giant vbk files to handle.
"Automatic" block deduplication over the whole repository (even if only for same-sized blocks).
Multiple copies without copy jobs.
Easy scaling!

Cons:
Larger database to keep track of all blocks.
Loss of the database means loss of the backup data (but this is also true for some other backup products).
Impossible to just copy a vbk out of a repository and import it somewhere else.
Compressing/encrypting all blocks separately will be less efficient.
Things like instant recovery would have to work differently, but perhaps not much.

How about that?
PetrM
Veeam Software
Posts: 3264
Liked: 528 times
Joined: Aug 28, 2013 8:23 am
Full Name: Petr Makarov
Location: Prague, Czech Republic
Contact:

Re: Thoughts about block-based repository

Post by PetrM »

Hello Daniel!

Thanks for sharing this idea!

In fact, it's really interesting; however, I have some considerations regarding the possible advantages:
1) "No transformations - increased performance and less risk of corruption"
A transformation or periodic full run will still be required in order to comply with the retention policy settings. Otherwise, it's not clear which blocks can be removed, and when.

2) "No giant vbk files to handle"
Instead, a restore would need to scan the whole repository to find each necessary block, and the SQL database would contain an entry for every single block: overall performance would be dramatically affected.

3) "'Automatic' block deduplication over the whole repository (even if only for same-sized blocks)"
You may consider using dedupe appliances if you'd like to achieve a higher deduplication ratio on the repository.

4) "Easy scaling!"
I think SOBR with Capacity Tier can cover this use case.
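Petr's first point above (knowing which blocks can be removed, and when) could in principle be handled by the cleanup job with reference tracking across restore points rather than transformations: drop the expired points, then delete any block no surviving point references. A toy mark-and-sweep sketch, with all names invented for illustration:

```python
def sweep(restore_points, retention):
    """Keep only the newest `retention` restore points; return the kept points
    and the set of block hashes no longer referenced by any of them.

    restore_points: list of (timestamp, [block_hashes]), oldest first.
    """
    kept = restore_points[-retention:]
    needed = {h for _, hashes in kept for h in hashes}
    stored = {h for _, hashes in restore_points for h in hashes}
    return kept, stored - needed

points = [
    (1, ["a", "b"]),
    (2, ["b", "c"]),
    (3, ["c", "d"]),
]
kept, deletable = sweep(points, retention=2)
print(deletable)  # {'a'} -- 'b' survives because point 2 still references it
```

The catch, as Petr notes, is that at real scale this reference set lives in a database with an entry per block, which is exactly where the performance concern comes from.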

Thanks!
Gostev
Chief Product Officer
Posts: 31561
Liked: 6725 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Thoughts about block-based repository

Post by Gostev »

Shorter answer: one of the top reasons our customers choose Veeam over the competition is self-contained backup files.

This is because, outside of operational and performance reasons, many got burned with other backup solutions (working in the suggested way) after losing the metadata that actually makes sense of those millions of individual blocks :D which are completely useless without it.

And to add to the previous response:

1) This has already been addressed by our advanced ReFS (and, in v10, XFS) integrations. The same goes for point 3, but only to some extent: 90% of the benefit is there with dedupe across GFS backups in the same job.

2) You should check out the system requirements of vendors who implement this approach ;) one of our customers moved to Veeam instead of implementing a competing solution despite already having purchased the required hardware (3 mega servers with tons of compute and RAM to take care of that block-based storage). We were only able to repurpose one of those servers, since Veeam literally did not need the rest. There's just no way global dedupe would have covered those compute and power savings.

Having said all that: we actually do work with object storage in the way you described above. However, object storage is obviously a totally different beast, as it's not just storage but also a NoSQL database built for pretty much infinite scale - which solves the "SQL database" issue mentioned by Petr.
Mgamerz
Expert
Posts: 159
Liked: 28 times
Joined: Sep 29, 2017 8:07 pm
Contact:

Re: Thoughts about block-based repository

Post by Mgamerz » 1 person likes this post

We used to use Druva inSync, which does "block" storage at the file level for each small block. As the repository grew, restore times kept getting worse as block lookups got slower and slower. It made the product work on any file system (well, mostly), but it had a huge performance impact, and near the end it was practically unusable.
DanielJ
Service Provider
Posts: 200
Liked: 32 times
Joined: Jun 10, 2019 12:19 pm
Full Name: Daniel Johansson
Contact:

Re: Thoughts about block-based repository

Post by DanielJ » 1 person likes this post

Thank you for your comments. This is probably something that could be fun to try to implement on a hobby level, rather than being practically usable. ;)
