Discussions related to using object storage as a backup target.
KarmaKuma
Enthusiast
Posts: 53
Liked: 6 times
Joined: Feb 05, 2022 11:16 am
Contact:

Dedupe rate on daily DB-dumps

Post by KarmaKuma »

Let's assume the following: I back up VM snaps or in-guest agent-based backups (both variants with a forever-forward 30-day chain, no synthetic or active fulls) direct-to-object (not capacity tier), either as pbj or bcj.

Some of these VMs contain daily database dumps on a dedicated drive/disk (uncompressed, unencrypted native full database backup/dump to, let's say, z:\dbbackup).

As long as the databases don't have a lot of daily change, each DB dump is 99.9% identical to the last one (I verified this on a Pure FlashArray with 10 daily dumps from two very similar DB systems, SAP prod and test: 25 TB slimmed down to ~300 GB).

How would object storage handle this? Most parts of each daily incremental backup will be identical to an existing object/chunk from the backup chain in the object store and thus be deduplicated, right?
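
Just to make the idea concrete, here is a rough Python sketch of what I have in mind: fixed-size chunks fingerprinted with SHA-256 (purely my assumption, I don't know how the repo really chunks or fingerprints data). Two nearly identical dumps would share almost all of their chunks:

# Rough illustration only -- fixed-size chunking with SHA-256 fingerprints.
# This is NOT how the backup format really works, just a stand-in to show why
# two nearly identical dump files share almost all of their chunks.
import hashlib

CHUNK = 1 * 1024 * 1024  # assumed 1 MB chunk size, see my guess below

def chunk_hashes(path):
    """Return the list of SHA-256 digests of all fixed-size chunks in a file."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            block = f.read(CHUNK)
            if not block:
                break
            hashes.append(hashlib.sha256(block).hexdigest())
    return hashes

day1 = chunk_hashes("dump_monday.bak")   # hypothetical dump files
day2 = chunk_hashes("dump_tuesday.bak")

shared = len(set(day1) & set(day2))
print(f"{shared} of {len(day2)} chunks from Tuesday already exist -> only the rest is new data")

(Of course fixed-size chunking only lines up this nicely as long as the dump layout doesn't shift; an insertion near the start of the file would change the alignment of everything after it.)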

Much more efficient than what is consumed on an SMB repo, correct? More in line with a ReFS repo with scheduled dedupe and compression, although not as good/efficient, since the dedupe granularity is probably 1 MB/4 MB (the object/chunk size) versus a block size of a few KB on ReFS...
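
A quick back-of-the-envelope on that granularity point (all numbers are made up, the 8 KB page size and change count are just placeholders):

# Back-of-the-envelope: what a handful of scattered small changes "costs" at
# different dedupe granularities. All numbers are invented for illustration.
dump_size = 500 * 1024**3   # a 500 GB dump
changes   = 2000            # 2000 scattered 8 KB page writes per day
page      = 8 * 1024

for unit in (4 * 1024, 1024**2, 4 * 1024**2):   # 4 KB (ReFS-like) vs 1 MB / 4 MB chunks
    dirty = changes * max(unit, page)           # worst case: each change dirties a whole chunk
    print(f"granularity {unit // 1024:>5} KB -> up to {dirty / 1024**2:,.0f} MB new data "
          f"(~{dirty / dump_size:.3%} of the dump)")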

Further:
If I am not wrong, two identical VMs will not dedupe against each other in direct-to-object repos, just like per-VM chains on traditional repos. Dedupe/compression only takes place inside each VM's own chain (or the specific object store/collection that represents that chain in the case of object storage repos).

Now, if I were to collect all DB dumps from "mostly-same-database" VMs on a central file server and back up this VM via snapshot or in-guest agent (again, forever forward with a 30-day chain), then all these DB dumps would end up in the same backup chain (or rather the same object store) and be deduped against each other inside that chain in direct-to-object.

The result being that I would most probably achieve a very high dedupe rate, correct? Much better than on an SMB repo without any means of host dedupe/compression. Again, more in line with a ReFS repo with scheduled dedupe and compression...

Still under the assumption that a) the daily change rate on the DB dumps is low and b) the collected source DBs are mostly identical (like a recent sandbox copy of prod, a recent quality-control copy of prod, etc.).

Correct?

Is there any difference between performance tier and capacity tier in this regard?

Thanks a lot for any insight ;)
david.domask
Veeam Software
Posts: 2711
Liked: 627 times
Joined: Jun 28, 2016 12:12 pm
Contact:

Re: Dedupe rate on daily DB-dumps

Post by david.domask »

Hi KarmaKuma,

Your general idea is correct, and I think in your case the in-line deduplication and compression will probably be a bigger source of space reduction than block reuse. While both will contribute, the main point is, as you noted, that the block reuse Veeam can apply is limited to each chain; it is not global deduplication.

The "dump and sweep" method you're discussing here will benefit from the in-flight deduplication and compression, and I think that is going to be the best source for reduction here. Performance and Capacity tier will differ naturally depending on what is backing them, but just at first blush, I would assume you'll see most of the savings during backup rather than with external space savings mechanism.
David Domask | Product Management: Principal Analyst
KarmaKuma
Enthusiast
Posts: 53
Liked: 6 times
Joined: Feb 05, 2022 11:16 am
Contact:

Re: Dedupe rate on daily DB-dumps

Post by KarmaKuma »

Hi David

What do you think will give the better dedupe rate: CBT-based incrementals or in-guest agent-based file system backups? I think CBT-based might have "problems" with Windows' own block reuse (new files are written to blocks that are unlinked in the MFT, following a "longest not used" policy -> delete file "a" and recreate it, and it will occupy different blocks on disk than the original file "a" -> for MFT recoverability)... So agent-based file system backup might have the edge?
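
To illustrate my concern, here's a toy model in Python (obviously nothing like a real NTFS volume or real CBT, just block hashes over a pretend disk):

# Toy model of the concern above: identical file content, but the rewritten file
# happens to land on different clusters, so a block-level view sees change while
# a file-level view does not. Nothing like real NTFS/CBT, purely illustrative.
import hashlib

BLOCK = 4096
disk = bytearray(64 * BLOCK)               # tiny pretend disk (64 blocks)
file_data = b"same dump content " * 2000   # ~36 KB "file"

def place(data, start_block):
    disk[start_block * BLOCK : start_block * BLOCK + len(data)] = data

def block_hashes():
    return [hashlib.sha256(disk[i:i + BLOCK]).digest() for i in range(0, len(disk), BLOCK)]

place(file_data, 0)                        # day 1: file lives at the start of the disk
day1 = block_hashes()

disk[0:len(file_data)] = bytes(len(file_data))   # "delete" (freed blocks zeroed here for simplicity)
place(file_data, 32)                       # day 2: same content recreated on other clusters
day2 = block_hashes()

changed = sum(a != b for a, b in zip(day1, day2))
print(f"block view: {changed} of {len(day1)} blocks differ from yesterday")
print(f"file view : content unchanged, hash {hashlib.sha256(file_data).hexdigest()[:16]}...")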
david.domask
Veeam Software
Posts: 2711
Liked: 627 times
Joined: Jun 28, 2016 12:12 pm
Contact:

Re: Dedupe rate on daily DB-dumps

Post by david.domask »

Hi KarmaKuma,

From my perspective, there are too many other variables to give an answer I'd be comfortable with one way or the other, as it would likely be pretty specific to the environment. I think you are correct that agent-based backups will probably have the edge over CBT-based ones, but I'm not confident it will be a game-changing amount. I would advise just testing, maybe with a "dummy" machine and a copy of one of the same files: copy it a few times and see which approach gives the best returns for your workload + storage.
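
If it helps, something along these lines is enough for a dry run even before you have an object storage playground: take one real dump plus a few copies or day-old versions of it and see how many fixed-size chunks are actually unique (chunk size and hashing here are placeholders, not what the product does internally):

# Dry-run helper for the test above: feed it one real dump plus a few copies or
# day-old versions and see how many fixed-size chunks are actually unique.
# Chunk size and hashing are assumptions, not what the product does internally.
import hashlib
import sys

CHUNK = 1 * 1024 * 1024

def unique_chunks(paths):
    seen, total = set(), 0
    for path in paths:
        with open(path, "rb") as f:
            while block := f.read(CHUNK):
                total += 1
                seen.add(hashlib.sha256(block).digest())
    return len(seen), total

# usage: python dedupe_test.py dump_day1.bak dump_day2.bak dump_day3.bak
uniq, total = unique_chunks(sys.argv[1:])
print(f"{uniq} unique of {total} chunks -> effective ratio {total / uniq:.1f}:1")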
David Domask | Product Management: Principal Analyst
KarmaKuma
Enthusiast
Posts: 53
Liked: 6 times
Joined: Feb 05, 2022 11:16 am
Contact:

Re: Dedupe rate on daily DB-dumps

Post by KarmaKuma »

Hi David

Thanks a lot for your feedback. I will give it a try once we have an object storage playground somewhere ;) (we do not have any at this time - yet)

Best regards