lars@norstat.no
Expert
Posts: 110
Liked: 14 times
Joined: Nov 01, 2011 1:44 pm
Full Name: Lars Skjønberg
Contact:

A question about deduplication

Post by lars@norstat.no »

What if I have a reverse incremental backup job that is set to use Local target for storage optimization, and I decide I want more deduplication and change it to WAN target? Will the master file (the first full backup) be deduplicated as well, or only the following restore points?

In other words, do I have to delete the whole backup and start again to get the full benefit of the deduplication?
tsightler
VP, Product Management
Posts: 6009
Liked: 2842 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: A question about deduplication

Post by tsightler »

You have to run a new full backup before any changes take effect at all. Be aware that choosing WAN target will increase the memory requirements of the Veeam agents by 4x. It's strongly suggested that the entire size of all VMs in a backup job using WAN target is limited to 2TB or less.
lars@norstat.no
Expert
Posts: 110
Liked: 14 times
Joined: Nov 01, 2011 1:44 pm
Full Name: Lars Skjønberg
Contact:

Re: A question about deduplication

Post by lars@norstat.no »

OK, how big is the difference in deduplication between the options, typically?

I guess there is a FAQ somewhere that will tell me...

The server is local and the size will be well over 2TB. The backup server has 18 GB of RAM...
tsightler
VP, Product Management
Posts: 6009
Liked: 2842 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: A question about deduplication

Post by tsightler »

For the most part, you're unlikely to see a significant increase for file-server type workloads, which typically have a limited change rate. It's primarily useful for transactional workloads like Exchange/SQL, which make many small-block (8-32KB) changes across a large range of the disk. In those cases the savings can be significant.

The memory in the server isn't really the limiting factor. VeeamAgent.exe is a 32-bit process, even on 64-bit platforms. Because of this, its memory usage is limited to ~1.7GB of RAM on Windows. Once the process's working set grows to this point, performance of the backup process goes downhill quickly, and it can even lead to memory allocation errors that cause backups to fail. The recommendation to keep backup sizes to <2TB when using WAN target is based on observed memory usage in real-world deployments, and it does provide some leeway for future growth of the job.

In general the recommended sizes are <2TB for WAN target, <4TB for LAN target, <8TB for Local target. This provides room for future growth of the job going forward. With 6.5 we have introduced a new "local" mode for very large servers (>16TB).
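To put rough numbers on why the block size matters here, a back-of-the-envelope sketch (illustrative arithmetic only, not Veeam internals; block sizes per later posts in this thread: WAN = 256KB, Local = 1MB, with LAN's 512KB implied by the 2x factor discussed below):

Code: Select all

# Illustrative only: dedupe-hash (block) count per TB of VM data for each
# storage-optimization target; the count scales inversely with block size.
TB = 1024 ** 4
block_sizes = {
    "WAN target (256KB)": 256 * 1024,
    "LAN target (512KB)": 512 * 1024,
    "Local target (1MB)": 1024 * 1024,
}
local_blocks = TB // block_sizes["Local target (1MB)"]
for target, size in block_sizes.items():
    blocks = TB // size
    print(f"{target}: {blocks:,} blocks/TB ({blocks // local_blocks}x Local)")

Four times the blocks means roughly four times the hashes the agents have to track, which is where the 4x memory figure for WAN target comes from.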
rawtaz
Expert
Posts: 100
Liked: 15 times
Joined: Jan 27, 2012 4:42 pm
Contact:

Re: A question about deduplication

Post by rawtaz »

tsightler wrote:You have to run a new full backup before any changes take effect at all. Be aware that choosing WAN target will increase the memory requirements of the Veeam agents by 4x. It's strongly suggested that the entire size of all VMs in a backup job using WAN target is limited to 2TB or less.
Is it really true that if you change from LAN to WAN target (or the reverse) on an existing (previously run) reverse incremental job, the changed setting will not take effect until you run a new full backup of this job? If yes, is it the same for the compression setting?

The user guide does not mention this in detail, it just states that "Changing the compression level and deduplication settings in an existing job will not have any effect on previously created backup files. It will affect only those backups that will be created after you set the new settings." - Perhaps it would be a good idea to adjust the User Guide so that it is clearer on this point.

I'd also like to ask, in case the changed deduplication setting does take effect without a full backup run, will any of the previous deduplication data (that is not of the same block size as the new setting) have any effect, or will the subsequent runs only dedupe using the data backed up from that point in time (i.e. the previously backed up data being useless in terms of deduplication)?

Thanks!
foggy
Veeam Software
Posts: 21069
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: A question about deduplication

Post by foggy »

rawtaz wrote:Is it really true that if you change from LAN to WAN target (or the reverse) on an existing (previously run) reverse incremental job, the changed setting will not take effect until you run a new full backup of this job? If yes, is it the same for the compression setting?
Yes, changing the compression and storage optimization settings will not have any effect on the existing VBK files, since they are already compressed and deduplicated using the current settings. You need to run a new full backup for these changes to take effect.
rawtaz wrote:The user guide does not mention this in detail, it just states that "Changing the compression level and deduplication settings in an existing job will not have any effect on previously created backup files. It will affect only those backups that will be created after you set the new settings." - Perhaps it would be a good idea to adjust the User Guide so that it is clearer on this point.
I'll pass this feedback to our tech writing team, thanks!
rawtaz
Expert
Posts: 100
Liked: 15 times
Joined: Jan 27, 2012 4:42 pm
Contact:

Re: A question about deduplication

Post by rawtaz »

foggy wrote: Yes, changing the compression and storage optimization settings will not have any effect on the existing VBK files, since they are already compressed and deduplicated using the current settings. You need to run a new full backup for these changes to take effect.
I understand that it won't affect the existing backup files (such as "re-deduplicating" them), as "refactoring" them would take a lot of processing. But can you clarify what happens with the subsequent ones? For example:

- If just continuing to run the job as usual (not a full backup), will the "new"/next .VBK be compressed and deduplicated with the same settings as before (e.g. as LAN target if you changed from LAN target to WAN target, as that's the best it can do)?
- Or will the "new"/next .VBK not be deduplicated at all (as the new/current job setting doesn't match the existing files' deduplication data, which therefore cannot be used)? This is probably not the case, as it would be insane :)
foggy
Veeam Software
Posts: 21069
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: A question about deduplication

Post by foggy »

rawtaz wrote:- If just continuing to run the job as usual (not a full backup), will the "new"/next .VBK be compressed and deduplicated with the same settings as before (e.g. as LAN target if you changed from LAN target to WAN target, as that's the best it can do)?
Yes, the reverse incremental job will keep using its previous settings (unlike the forward incremental backup mode).
rawtaz
Expert
Posts: 100
Liked: 15 times
Joined: Jan 27, 2012 4:42 pm
Contact:

Re: A question about deduplication

Post by rawtaz »

One more related question: when restoring a VM from backup, is the deduplication level relevant as well? I presume that the data transferred from the backup repository (a Linux server) to the target (Veeam running on the host being restored to) will be the actual deduplicated data, and that the LAN vs WAN deduplication level will make a difference here as well (not just when backing up).
rawtaz
Expert
Posts: 100
Liked: 15 times
Joined: Jan 27, 2012 4:42 pm
Contact:

Re: A question about deduplication

Post by rawtaz »

tsightler wrote:In general the recommended sizes are <2TB for WAN target, <4TB for LAN target, <8TB for Local target. This provides room for future growth of the job going forward. With 6.5 we have introduced a new "local" mode for very large servers (>16TB).
I'd be happy if someone clarified what "size" refers to here. Is it the combined size of all the VM disks that are part of the job, or is it the size of individual VMs, or something else?

EDIT: Is there any way to find out how much memory is used by the proxy to keep track of the deduplication metadata? I'd like to know how much room I have to play with when deciding between LAN and WAN target for an offsite job.
veremin
Product Manager
Posts: 20270
Liked: 2252 times
Joined: Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin
Contact:

Re: A question about deduplication

Post by veremin »

I'd be happy if someone clarified what "size" refers to here. Is it the combined size of all the VM disks that are part of the job, or is it the size of individual VMs, or something else?
The number given refers to the total size of the backup data. In other words, it's recommended that for WAN target the size of all VMs in the job doesn't exceed the 2TB limit, 4TB for LAN target, etc.

Thanks.
foggy
Veeam Software
Posts: 21069
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: A question about deduplication

Post by foggy »

rawtaz wrote:EDIT: Is there any way to find out how much memory is used by the proxy to keep track of the deduplication metadata? I'd like to know how much room I have to play with when deciding between LAN and WAN target for an offsite job.
For 1MB block (Local target), you can take 1GB of memory for each 1TB of data to store dedupe hashes as a rough estimation. Selecting WAN target (256KB block) will increase memory consumption by 4x. This rate is the same for backup and restore operations.
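As a sketch of this rule of thumb (the 1GB-per-1TB rate, the per-target factors, and the ~1.7GB ceiling are the figures quoted in this thread, not an official sizing formula):

Code: Select all

# Rough estimator for dedupe-hash memory, per the rule of thumb above:
# ~1GB per 1TB of job data at Local target (1MB block), scaled by the
# block-size factor. Thread figures, not an official Veeam formula.
FACTOR = {"local": 1, "lan": 2, "wan": 4}    # relative to the 1MB block
AGENT_LIMIT_GB = 1.7                         # 32-bit agent ceiling (Windows)

def hash_memory_gb(job_size_tb: float, target: str) -> float:
    return job_size_tb * 1.0 * FACTOR[target]   # 1.0 GB per TB at Local

for target in ("local", "lan", "wan"):
    est = hash_memory_gb(2.0, target)           # example: a 2TB job
    verdict = "over" if est > AGENT_LIMIT_GB else "under"
    print(f"2TB job, {target:5} target: ~{est:.1f} GB ({verdict} ~1.7GB)")

By this (deliberately conservative, as Tom notes below) estimate, a 2TB job on WAN target would need ~8GB for hashes, which is the tension discussed in the rest of this thread.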
rawtaz
Expert
Posts: 100
Liked: 15 times
Joined: Jan 27, 2012 4:42 pm
Contact:

Re: A question about deduplication

Post by rawtaz »

foggy wrote: For 1MB block (Local target), you can take 1GB of memory for each 1TB of data to store dedupe hashes as a rough estimation. Selecting WAN target (256KB block) will increase memory consumption by 4x. This rate is the same for backup and restore operations.
Thank you for that pointer. I take it there's no way to monitor the real/actual size of a job's dedupe hashes.

However, what happens when the dedupe hashes exceed what the agent can handle? Say, for example, that you back up 3TB of data with a job set to LAN target. One would then calculate around 3GB (at 1GB per 1TB of data) * 2 (for LAN target), meaning dedupe hashes of around 6GB. You said earlier that VeeamAgent.exe is a 32-bit process (even on 64-bit platforms), so it can deal with about 1.7GB of memory. How does this process/Veeam handle things when all the dedupe data doesn't fit into memory, as in the above calculation?

By the way, VeeamAgent.exe is the one that, on the source side, does the processing, compression, and deduplication of the data to be backed up, right? Also, why not make it 64-bit?
tsightler
VP, Product Management
Posts: 6009
Liked: 2842 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: A question about deduplication

Post by tsightler »

rawtaz wrote:However, what happens when the dedupe hashes exceed what the agent can handle? Say, for example, that you back up 3TB of data with a job set to LAN target. One would then calculate around 3GB (at 1GB per 1TB of data) * 2 (for LAN target), meaning dedupe hashes of around 6GB. You said earlier that VeeamAgent.exe is a 32-bit process (even on 64-bit platforms), so it can deal with about 1.7GB of memory. How does this process/Veeam handle things when all the dedupe data doesn't fit into memory, as in the above calculation?
Actually, 1GB per 1TB is a pretty high number; actual usage is generally a good bit less than that. But that's why the best practice guide recommends a maximum job size of no more than 4TB of VM data for LAN target. You will definitely exceed 1.7GB of RAM if you start pushing those limits or have very long backup chains, and from that point job failures are likely. It's actually far better to keep jobs smaller and run more of them in parallel.

The VeeamAgent on the source side actually only keeps hashes for the disk it's actively backing up; the repository-side VeeamAgent is normally the one that grows the largest, since it dedupes across the entire job. There will be a 64-bit VeeamAgent in V7. Of course, that simply means you'll need to allocate more memory to your proxies and repositories.
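To illustrate where that memory accumulates, a toy contrast (purely illustrative arithmetic using the thread's rough 1GB/TB rate at the 4x WAN factor; the disk sizes are hypothetical, and this is not Veeam's actual internals):

Code: Select all

# Toy contrast: the source agent hashes one disk at a time, while the
# repository agent keeps hashes for the whole job, so it grows largest.
GB_PER_TB, WAN_FACTOR = 1.0, 4

def source_peak_gb(disks_tb):     # one disk's hashes in memory at a time
    return max(disks_tb) * GB_PER_TB * WAN_FACTOR

def repository_gb(disks_tb):      # dedupes across the entire job
    return sum(disks_tb) * GB_PER_TB * WAN_FACTOR

disks = [1.0, 0.5, 0.3, 0.2]      # hypothetical 2TB job with four disks
print(f"source-side peak: ~{source_peak_gb(disks):.1f} GB")
print(f"repository-side : ~{repository_gb(disks):.1f} GB")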
rawtaz
Expert
Posts: 100
Liked: 15 times
Joined: Jan 27, 2012 4:42 pm
Contact:

Re: A question about deduplication

Post by rawtaz »

tsightler wrote: Actually, 1GB per 1TB is a pretty high number; actual usage is generally a good bit less than that. But that's why the best practice guide recommends a maximum job size of no more than 4TB of VM data for LAN target. You will definitely exceed 1.7GB of RAM if you start pushing those limits or have very long backup chains, and from that point job failures are likely. It's actually far better to keep jobs smaller and run more of them in parallel.

The VeeamAgent on the source side actually only keeps hashes for the disk it's actively backing up; the repository-side VeeamAgent is normally the one that grows the largest, since it dedupes across the entire job. There will be a 64-bit VeeamAgent in V7. Of course, that simply means you'll need to allocate more memory to your proxies and repositories.
So if I understand you correctly, the repository-side agent will hit the memory limit first. At that point, what will happen? Will it swap? When will it (or the run) fail?

There is a very big difference between what the recommendations state (<2TB for WAN target and <4TB for LAN target) and what the numbers/calculations give. I currently need to deploy an offsite Linux backup repository (64-bit), and the biggest job can grow to just below those 2TB (four Windows VMs, of which the biggest is currently 1TB but might grow to 1.8TB or so, though in at least a year's time). Let's take that as an example:

- The recommendation says that I can back these 1.8-2TB up using WAN target dedupe.
- The 32-bit discussion says that the agent on the repository side can only handle 1.7GB dedupe hashes.
- The numbers/calculations (based on 1GB per 1TB) give that, with 1.7GB of memory, the repository agent can handle dedupe hashes for just ~430GB of data when using WAN target (1.7GB / (1GB/TB * 4) ≈ 0.43TB) and ~850GB of data when using LAN target (1.7GB / (1GB/TB * 2) = 0.85TB).
- Generously change the base to 0.5GB of memory per 1TB of data to be deduplicated and we get a limit of ~860GB for WAN target and ~1.7TB for LAN target. This is still far from the recommendation, which is said to be based on observed memory usage in real-world deployments: 860GB vs 2TB, and 1.7TB vs 4TB. (The sketch below reproduces these numbers.)
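A sketch reproducing the arithmetic in the list above (inverting the rule of thumb; the rates and the 1.7GB budget are this thread's figures, and small differences from the numbers above are rounding):

Code: Select all

# Largest job whose dedupe hashes fit the ~1.7GB 32-bit agent budget,
# for each assumed rate (GB of hashes per TB of data) and target factor.
AGENT_LIMIT_GB = 1.7

def max_job_size_gb(rate_gb_per_tb: float, factor: int) -> float:
    return AGENT_LIMIT_GB / (rate_gb_per_tb * factor) * 1000   # TB -> GB

for rate in (1.0, 0.5):                     # 1GB/TB, then the generous half
    for target, factor in (("WAN", 4), ("LAN", 2)):
        print(f"{rate} GB/TB, {target} target: "
              f"~{max_job_size_gb(rate, factor):.0f} GB max")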

I am not trying to question the recommendation, just illustrating why I still don't know whether I dare run WAN target for the <2TB example job described above. As you can see, there's quite a gap between what the recommendation says (that I can use WAN target without worry) and what the numbers say (that there's no way I can use WAN target).

Unfortunately I need to deploy an offsite server ASAP and cannot wait for v7; that's why I'm trying to decide between WAN and LAN target. Which dedupe level would you use in this situation, with the above job? Perhaps I should go with WAN target for now after all, since it's highly likely that we'll be on v7 by the time the VMs reach the 2TB "limit", and then the repository-side agent shouldn't have a problem, nor the source-side one given a little more memory. Does that make sense?

If you are "sure" that following the best practice will be successful (assuming the Veeam agents get their 1.7GB memory) then I'm all ears since I would prefer to dedupe as much as possible (i.e. use WAN target), it's just that the numbers are so way off that recommendation, even with 50% margins included.

Thanks a lot for following up!
tsightler
VP, Product Management
Posts: 6009
Liked: 2842 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: A question about deduplication

Post by tsightler »

I actually wrote the best practice guide, and those numbers are based on what I've seen in the "real world" regarding memory utilization. Honestly, I'm not sure where Alexander came up with the formula he provided; by his estimate you could only have about 2TB of data in a backup job even using Local target, and I know this simply isn't the case. However, there's certainly nothing wrong with being cautious/conservative with the numbers. In the end, it's really based more on the number of hashes than anything, so the total number of backups in the chain also matters, since each one adds new hashes; that's why it's hard to put an exact number on it. But I deploy Veeam in large environments for a living, so I'll stand behind my numbers in the best practice guide for now.

That being said, you do have a little bit of headroom since your target is Linux. 32-bit processes on Linux can address almost the entire 4GB of memory when running on a 64-bit OS, and even on a 32-bit OS can access 2.7GB or 3.5GB (depending on kernel options), quite a bit more than their Windows counterparts.

As far as "what will happen", as I said earlier you will begin to see job failures, likely with error messages like "unable to allocate memory for array" or some such. My suggestion would be to also "err on the side of caution", and use LAN target since you are so close to the 2TB maximum size. Honestly, the difference between the two is likely to be minimal and you'll be 100% safe no matter what using LAN target and when V7 comes out you'll be even better off.
Andreas Neufert
VP, Product Management
Posts: 6707
Liked: 1401 times
Joined: May 04, 2011 8:36 am
Full Name: Andreas Neufert
Location: Germany
Contact:

Re: A question about deduplication

Post by Andreas Neufert »

To avoid problems at VM restore...
If you want to change the block size from LAN to WAN, please update to the latest version and patch level first.
In any case, run an "Active Full" backup.
Andreas Neufert
VP, Product Management
Posts: 6707
Liked: 1401 times
Joined: May 04, 2011 8:36 am
Full Name: Andreas Neufert
Location: Germany
Contact:

Re: A question about deduplication

Post by Andreas Neufert »

If you are looking at the WAN setting because you back up over small WAN links, it might be a good idea to look at the new WAN optimization feature in v7. It will help you more than this setting. For all customers on the Enterprise version, this option is "free" (the Enterprise Plus upgrade is cost-free).
rawtaz
Expert
Posts: 100
Liked: 15 times
Joined: Jan 27, 2012 4:42 pm
Contact:

Re: A question about deduplication

Post by rawtaz »

Andreas Neufert wrote:If you are looking at the WAN setting because you back up over small WAN links, it might be a good idea to look at the new WAN optimization feature in v7. It will help you more than this setting. For all customers on the Enterprise version, this option is "free" (the Enterprise Plus upgrade is cost-free).
Not an option in this case. It's unfortunate that the WAN optimization feature is in the highest level/version of the product offerings.

Also, the backup copy job seems to copy an existing backup, which would be more error-prone than having two separate backup jobs. That said, in the context of Veeam, backups can fail too (there have been a bunch of threads in this forum, and the conclusion is that even though Veeam is solid, it isn't 100% rock solid, as no one from Veeam seems willing to say that Veeam will for sure be able to detect corruption in backup files). For that reason it feels safer to have two separate backups going rather than just one that is copied to a second place.
Dima P.
Product Manager
Posts: 14396
Liked: 1568 times
Joined: Feb 04, 2013 2:07 pm
Full Name: Dmitry Popov
Location: Prague
Contact:

Re: A question about deduplication

Post by Dima P. »

rawtaz,
the backup copy job seems to copy an existing backup
That is correct. As for backup integrity: you can run a SureBackup job before running the copy job over a slow link with WAN optimization, and have the existing backup checked at any time, so the question "is my backup recoverable?" won't arise.
Andreas Neufert
VP, Product Management
Posts: 6707
Liked: 1401 times
Joined: May 04, 2011 8:36 am
Full Name: Andreas Neufert
Location: Germany
Contact:

Re: A question about deduplication

Post by Andreas Neufert »

You can also do two local backups and replicate one of them to the second site with the new WAN option.
Yes, I know that in your scenario this doubles the space used at the first site.
SureBackup plus the automatic self-checking of WAN optimization is a good, safe way around having to trust only one chain.

If WAN is such a limiting factor, then before you buy expensive WAN optimization hardware, it would be a good option to use the one integrated into Veeam.
rawtaz
Expert
Posts: 100
Liked: 15 times
Joined: Jan 27, 2012 4:42 pm
Contact:

Re: A question about deduplication

Post by rawtaz »

d.popov wrote: That is correct. As for backup integrity: you can run a SureBackup job before running the copy job over a slow link with WAN optimization, and have the existing backup checked at any time, so the question "is my backup recoverable?" won't arise.
I'm sorry, but SureBackup will (AFAIK) not check the integrity of the entire VM and all of its data, unless you set up additional tests for the SureBackup job to verify all of the data in the VM. In summary, that's quite a different task/beast than the regular integrity check of the backup files on the repository disk that I'm referring to.

Also, running this automatically is something you need the Enterprise edition for, and the WAN acceleration is an Enterprise Plus feature. Combined, these effectively mean that what you suggest is not a viable option for verifying that backups are indeed 100% good, especially when you're on a Standard license.

For that reason (i.e. without a major SureBackup configuration and an Enterprise license), I think it would be safer to run two separate backup jobs, so that even if one of them somehow became corrupt, the other would hopefully be okay. I'm open to other opinions on this, but based on all the threads I've read regarding integrity verification, this is the only conclusion I can draw. I've done quite a bit of reading in the forum on this topic.
Kernel Panic
Lurker
Posts: 1
Liked: never
Joined: Jun 04, 2013 3:51 pm
Contact:

Re: A question about deduplication

Post by Kernel Panic »

Do the guidelines for the deduplication storage optimization settings apply to replication the same as they do to backups? I have a 6TB VM I'd like to replicate off-site, and the idea of a smaller block size is appealing to reduce WAN utilization. I'd like to do it right the first time, since we'll have to bring the remote equipment on-site to seed the replication.
foggy
Veeam Software
Posts: 21069
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: A question about deduplication

Post by foggy »

Since each VM is replicated and stored separately (in VMware native format), the effect of this setting on replicas is much less significant.
rawtaz
Expert
Posts: 100
Liked: 15 times
Joined: Jan 27, 2012 4:42 pm
Contact:

Re: A question about deduplication

Post by rawtaz »

tsightler wrote:I actually wrote the best practice guide, and those numbers are based on what I've seen in the "real world" regarding memory utilization. Honestly, I'm not sure where Alexander came up with the formula he provided; by his estimate you could only have about 2TB of data in a backup job even using Local target, and I know this simply isn't the case. However, there's certainly nothing wrong with being cautious/conservative with the numbers. In the end, it's really based more on the number of hashes than anything, so the total number of backups in the chain also matters, since each one adds new hashes; that's why it's hard to put an exact number on it. But I deploy Veeam in large environments for a living, so I'll stand behind my numbers in the best practice guide for now.

That being said, you do have a little bit of headroom since your target is Linux. 32-bit processes on Linux can address almost the entire 4GB of memory when running on a 64-bit OS, and even on a 32-bit OS can access 2.7GB or 3.5GB (depending on kernel options), quite a bit more than their Windows counterparts.

As far as "what will happen": as I said earlier, you will begin to see job failures, likely with error messages like "unable to allocate memory for array" or some such. My suggestion would be to "err on the side of caution" and use LAN target, since you are so close to the 2TB maximum size. Honestly, the difference between the two is likely to be minimal; you'll be 100% safe no matter what using LAN target, and when V7 comes out you'll be even better off.
Just to follow up and wrap up this discussion: I ended up going with WAN target after all. The main reason is that I watched the VeeamAgent process on the proxy/server VM, and its memory usage was quite low, far from 1.7GB. That, combined with the fact that the agents should be 64-bit in the near future, and everything else that has been said here, makes me pretty sure this won't be a problem.

Thank you, tsightler and others, for your valuable opinions on this matter. The documentation is great, but there are still a lot of technical details that need vetting to fully understand the relevant aspects.
foggy
Veeam Software
Posts: 21069
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: A question about deduplication

Post by foggy »

Thanks for sharing your findings; real-world experiences are always much appreciated.
kte
Expert
Posts: 179
Liked: 8 times
Joined: Jul 02, 2013 7:48 pm
Full Name: Koen Teugels
Contact:

Re: A question about deduplication

Post by kte »

Is the Veeam agent still 32-bit in version 7?

K
foggy
Veeam Software
Posts: 21069
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: A question about deduplication

Post by foggy »

Veeam B&R v7 introduces 64-bit backup repository agents. However, the agents running on the proxy servers are still 32-bit.
lars@norstat.no
Expert
Posts: 110
Liked: 14 times
Joined: Nov 01, 2011 1:44 pm
Full Name: Lars Skjønberg
Contact:

Re: A question about deduplication

Post by lars@norstat.no »

Why?
tsightler
VP, Product Management
Posts: 6009
Liked: 2842 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: A question about deduplication

Post by tsightler »

Because the proxy component is the piece that integrates with the VDDK, switching to a 64-bit agent on the proxy also means switching to the 64-bit VDDK, which introduces the potential for new bugs and thus requires even more testing, while the 32-bit agent and VDDK were already a well-proven and tested combination used in every previous version. Also, the proxy agent doesn't really benefit much from 64-bit code, since it doesn't use as much memory: it only does per-disk dedupe, while the repository requires enough memory to store hashes for the entire job.

Of course, this will change with the release of support for vSphere 5.5, since that version no longer includes a 32-bit VDDK, likely due to the requirement to support >2TB VMDKs. Based on previous history, I wouldn't be at all surprised if the 7.0 patch that adds 5.5 support only uses the 64-bit proxy/VDDK when it detects vSphere 5.5. That would limit the exposure to new code to those running the newest version of vSphere, avoiding issues for more conservative users.