Host-based backup of Microsoft Hyper-V VMs.
LMS
Influencer
Posts: 24
Liked: never
Joined: May 29, 2017 5:13 am
Full Name: MS Sunil
Contact:

How De-duplication and Compression works

Post by LMS »

Hi

We are new to Veeam. We have just completed the implementation and started taking Hyper-V VM backups (on-host backups are configured). We are looking for some clarity on how deduplication and compression work on VM backups.

Consider one server with a 40 GB used-space disk, with an active full backup scheduled every Saturday, a synthetic full every Wednesday, and incrementals on weekdays, keeping 60 restore points. According to the Veeam documentation, source-side deduplication ensures that only unique data blocks not already present in the previous restore point are transferred across the network, and target-side deduplication checks the received blocks against other virtual machine (VM) blocks already stored in the backup file. When we check the backup file size, it is around 22 GB for every weekly full. What we expected is that, since the data from one full backup is already present on the SAN, the other full backups should not transfer and store the full data again, given that we keep multiple restore points. This causes issues with SAN space utilization, so we changed many jobs to run as forever forward incremental, but that is not accepted by our organization. We just want to know whether VBR is working as expected, or whether we need to make some changes to the schedule.
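
To put rough numbers on our concern (the 22 GB full size is what we observe, the incremental size is only a guess), the retention math looks something like this:

# Rough repository-usage estimate for the schedule above.
# FULL_GB is what we observe; INC_GB is only an assumed average.
FULL_GB = 22          # observed size of one full backup file
INC_GB = 2            # assumed incremental size (illustrative only)
FULLS_PER_WEEK = 2    # active full on Saturday + synthetic full on Wednesday
POINTS_PER_WEEK = 7   # 2 fulls + 5 weekday incrementals
RESTORE_POINTS = 60   # configured retention

weeks = RESTORE_POINTS / POINTS_PER_WEEK        # ~8.6 weeks retained
fulls = round(weeks * FULLS_PER_WEEK)           # ~17 full files on disk
incs = RESTORE_POINTS - fulls                   # ~43 incremental files

print(f"~{fulls} fulls x {FULL_GB} GB + ~{incs} incs x {INC_GB} GB "
      f"= ~{fulls * FULL_GB + incs * INC_GB} GB on the SAN")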

One more point needs clarification. At present we have scheduled jobs configured for each individual VM. If we instead create a single job with multiple VMs sharing a similar retention and schedule, will this improve the deduplication ratio? (The documentation says target-side deduplication checks the received blocks against other virtual machine (VM) blocks already stored in the backup file, thus providing global deduplication across all VMs included in the backup job.) We tried this for a few VMs, but found that it still creates separate files for each VM included in the job.

Looking for clarification and best practices to follow.

Thanks in advance
Mike Resseler
Product Manager
Posts: 8044
Liked: 1263 times
Joined: Feb 08, 2013 3:08 pm
Full Name: Mike Resseler
Location: Belgium
Contact:

Re: How De-duplication and Compression works

Post by Mike Resseler »

Hi,

First: welcome to the forums.

I am not sure I understand everything you are asking, but I will give it a try. Feel free to tell me if I am wrong :-)

1. Our deduplication only works per job, which means that one VM per job will not give you much benefit.
2. When you moved multiple VMs into one single job, was the per-VM backup files option enabled? Because if it was, you can't take advantage of having multiple VMs in one single backup file (see here: https://helpcenter.veeam.com/docs/backu ... tml?ver=95)

Cheers
Mike
foggy
Veeam Software
Posts: 21069
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: How De-duplication and Compression works

Post by foggy »

Moreover, Veeam B&R deduplication works within a backup file, so all full backups for the given job will have comparable size, since data is not deduplicated between them.
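
As a toy illustration of that scope (simple block hashing as a stand-in, not the actual VBK format), two fulls of the same VM each dedupe internally, but nothing is shared between the two files:

import hashlib

def dedup_store(blocks):
    # Toy in-file dedup: identical blocks are stored once *within* one file.
    stored = {}
    for b in blocks:
        stored.setdefault(hashlib.sha256(b).hexdigest(), b)
    return stored

vm_disk = [b"OS" * 512, b"DATA" * 256, b"OS" * 512]   # one repeated block

full_saturday = dedup_store(vm_disk)   # this week's full backup file
full_next_week = dedup_store(vm_disk)  # next week's full backup file

# Each file holds 2 unique blocks instead of 3, but the repository
# still stores both files in full -- nothing is shared between them.
print(len(full_saturday), len(full_next_week))   # 2 2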
LMS
Influencer
Posts: 24
Liked: never
Joined: May 29, 2017 5:13 am
Full Name: MS Sunil
Contact:

Re: How De-duplication and Compression works

Post by LMS »

Thanks a lot, all. I will explain the current configuration in detail and what I understood from your replies.

We are using a Hyper-V 2012 R2 environment only, with a single physical VBR 9.5 server, on-host backups, and inline deduplication.

We were confused by the Veeam statement "source-side deduplication ensures only unique data blocks not already present in the previous restore point are transferred across the network, and target-side deduplication checks the received blocks against other virtual machine (VM) blocks already stored in the backup file". Mike also mentioned "our deduplication only works per job" and foggy mentioned "deduplication works within a backup file". So my understanding of the statement "source-side deduplication ensures only unique data blocks not already present in the previous restore point are transferred across the network" is now: if data blocks are already there in previous restore points, they won't be transferred over the network. For example, a full backup won't transfer the full data if the data blocks are present in the previous full backup, but the new full backup file will still be the same size as (or larger than) the old full backup file that is already there, since multiple restore points are kept. Am I right?

(We thought that once a full backup exists, later full backups would be smaller than the initial full because of deduplication; now I understand how deduplication actually works.)

As per the best practice recommendation, the "per-VM backup files" option is enabled; I read the link you provided. So should we disable this option for better deduplication and configure jobs with multiple VMs? If we configure backups with multiple VMs in a single file, how many VMs should be selected per job, and how can we calculate the number of VMs to be processed at a time, given that we are using on-host backup?

Thanks a lot
Mike Resseler
Product Manager
Posts: 8044
Liked: 1263 times
Joined: Feb 08, 2013 3:08 pm
Full Name: Mike Resseler
Location: Belgium
Contact:

Re: How De-duplication and Compression works

Post by Mike Resseler »

It looks like you are correct. Source-side dedup lowers the traffic across the network, while target-side dedup is responsible for saving storage.
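
If it helps, here is a minimal sketch of the two stages, assuming a simple hash-set model (just a mental picture, not how the VBK format really works):

import hashlib

def h(block):
    return hashlib.sha256(block).hexdigest()

# Hashes the previous restore point already holds on the target.
previous_point = {h(b"system"), h(b"appdata")}
current_disk = [b"system", b"appdata", b"newfile"]

# Source side: skip blocks the previous restore point already has,
# so only b"newfile" crosses the network.
to_send = [b for b in current_disk if h(b) not in previous_point]

# Target side: dedupe received blocks against blocks already written
# into the same backup file (e.g. from other VMs in the job).
backup_file = {h(b): b for b in to_send}
print(len(to_send), len(backup_file))   # 1 1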

The deduplication is indeed per backup file. I apologize, it's still an old habit to say it like that, but per-VM backup files indeed creates multiple backup files per job, and the deduplication will be per file.

Per-VM backup files has its advantages, one of them (as an example) being that it is great when you use Windows Server 2016 Deduplication on your repository. But if your storage is not a dedupe appliance and does not run software dedup, then it might be more interesting to run a few jobs with multiple VMs in them.
foggy
Veeam Software
Posts: 21069
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: How De-duplication and Compression works

Post by foggy »

LMS wrote: So my understanding of the statement "source-side deduplication ensures only unique data blocks not already present in the previous restore point are transferred across the network" is now: if data blocks are already there in previous restore points, they won't be transferred over the network. For example, a full backup won't transfer the full data if the data blocks are present in the previous full backup, but the new full backup file will still be the same size as (or larger than) the old full backup file that is already there. Am I right?
Full backup resets the backup chain and is a self-contained backup file, where data is not deduped against previous restore points (only within the processed VM disk).
LMS
Influencer
Posts: 24
Liked: never
Joined: May 29, 2017 5:13 am
Full Name: MS Sunil
Contact:

Re: How De-duplication and Compression works

Post by LMS »

Thank you all.

We tried both options on the repository (per-VM backup files enabled and disabled) against a set of VMs, but it didn't make any difference in size. So we will go with the per-VM backup files option.
sg_sc
Enthusiast
Posts: 61
Liked: 8 times
Joined: Mar 29, 2016 4:22 pm
Full Name: sg_sc
Contact:

Re: How De-duplication and Compression works

Post by sg_sc »

Full backup files (VBK) will always take up the full space (unless you are on ReFS), no matter whether previous full backup files are still present.

Veeam does the magic on the source side, using changed block tracking (or the Hyper-V equivalent) to transfer only changed blocks, and on the target side the in-file deduplication saves storage space when you have multiple VMs with the same blocks of data. For instance, 10 Windows 2012 R2 VMs will definitely have a lot of OS-file blocks in common, and those will be deduped within the backup file.
If you enable per-VM backup files you lose that last benefit, and likewise if you create a job per VM, as the sketch below shows.
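
A toy sketch of the difference, with hashing standing in for Veeam's block matching and all the block counts made up:

import hashlib

os_blocks = [f"win2012r2-{i}".encode() for i in range(100)]  # shared OS data

def vm_disk(name):
    return os_blocks + [f"{name}-unique-{i}".encode() for i in range(10)]

vms = [vm_disk(f"vm{n}") for n in range(10)]

# One job, one backup file: the shared OS blocks are stored only once.
single_file = {hashlib.sha256(b).hexdigest() for vm in vms for b in vm}

# Per-VM backup files: every file keeps its own copy of the OS blocks.
per_vm_files = [{hashlib.sha256(b).hexdigest() for b in vm} for vm in vms]

print(len(single_file))                   # 200 unique blocks in one file
print(sum(len(f) for f in per_vm_files))  # 1100 blocks across 10 files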

If you want huge space savings without the need for special deduplication processes or appliances, you should look into ReFS 3.1 with a 64K cluster size and synthetic fulls.
As a test, I have 9 TB of backup copies (GFS: quarterly, monthly, weekly synthetic full VBK files) on a 2 TB disk, thanks to the ReFS and Veeam fast block clone magic.
The downside is that ReFS needs a beefy server (lots of RAM) if you intend to put a lot of TBs on it, and remember it must be ReFS 3.1 (Windows Server 2016) with a 64K block size, otherwise things will not go smoothly.
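
Rough back-of-the-envelope accounting for why that works (the change rate and sizes below are purely made-up assumptions, real savings depend on your data):

# Toy accounting for ReFS block cloning with synthetic fulls.
FULL_TB = 0.9        # logical size of one synthetic full (example value)
CHANGE_RATE = 0.05   # assumed fraction of blocks changed between fulls
FULLS = 10           # retained GFS fulls

logical = FULLS * FULL_TB                  # what the VBK files report
physical = FULL_TB + (FULLS - 1) * FULL_TB * CHANGE_RATE  # clones are free
print(f"logical {logical:.1f} TB, physical ~{physical:.1f} TB")
# -> logical 9.0 TB, physical ~1.3 TB (fits on a 2 TB disk)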
LMS
Influencer
Posts: 24
Liked: never
Joined: May 29, 2017 5:13 am
Full Name: MS Sunil
Contact:

Re: How De-duplication and Compression works

Post by LMS »

Thanks sg.

As I mentioned before, we created a job that includes 4 VMs, both with and without the per-VM backup files option, but it didn't save a single bit when we compared the two options. The forums and the Veeam documentation all mention disabling per-VM backup files for better dedup, so we will open a case to check this.

Regards
foggy
Veeam Software
Posts: 21069
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: How De-duplication and Compression works

Post by foggy »

What kind of VMs are they? VMs created from a single template would be more likely to have blocks in common. Also, have you made sure the per-VM option actually took effect (i.e. that there were separate backup chains for each VM in the repository)? If the setting is changed for an existing job, it takes effect only after an active full backup.
BartP
Veeam Software
Posts: 230
Liked: 62 times
Joined: Aug 31, 2015 8:24 am
Full Name: Bart Pellegrino
Location: Netherlands
Contact:

Re: How De-duplication and Compression works

Post by BartP »

Keep in mind that Deduplication often works best on (active) Full Backups.
Incremental backups use CBT, and the changed blocks are, more often than not, unique blocks.
This is especially true when backing up a file server or mail server: only low dedupe values can be achieved there.
Bart Pellegrino,
Technical Account Manager - EMEA
LMS
Influencer
Posts: 24
Liked: never
Joined: May 29, 2017 5:13 am
Full Name: MS Sunil
Contact:

Re: How De-duplication and Compression works

Post by LMS »

Hi

The VMs are Windows 2012 R2 servers with SQL databases (the 4 servers we tested backing up use shared disks / are VMs in a cluster). We tried both options, and with the per-VM option it creates separate files for each VM. All jobs were created fresh, meaning we tried only active full backups.