Host-based backup of VMware vSphere VMs.
lars@norstat.no
Expert
Posts: 110
Liked: 14 times
Joined: Nov 01, 2011 1:44 pm
Full Name: Lars Skjønberg
Contact:

A question about deduplication

Post by lars@norstat.no »

What if I have a reverse incremental backup job that is set to use Local target for storage optimization, and I decide that I want more deduplication and change it to WAN target? Will that also deduplicate the master file (the first backup), or only the restore points that follow?

In other words, do I have to delete the whole backup and start again to get the full benefit from the deduplication?
tsightler
VP, Product Management
Posts: 6035
Liked: 2860 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: A question about deduplication

Post by tsightler »

You have to run a new full backup before any changes take effect at all. Be aware that choosing WAN target will increase the memory requirements of the Veeam agents by 4x. It's strongly suggested that the entire size of all VMs in a backup job using WAN target is limited to 2TB or less.
lars@norstat.no
Expert
Posts: 110
Liked: 14 times
Joined: Nov 01, 2011 1:44 pm
Full Name: Lars Skjønberg
Contact:

Re: A question about deduplication

Post by lars@norstat.no »

OK, how big is the difference in deduplication between the options, typically?

I guess there is a FAQ somewhere that will tell me...

The server is local and the size will be well over 2TB. The backup server has 18 GB of RAM.
tsightler
VP, Product Management
Posts: 6035
Liked: 2860 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: A question about deduplication

Post by tsightler »

For the most part you're unlikely to see a significant increase for fileserver type workloads that are likely to already have limited change rate. It's primarily useful for transactional workloads like Exchange/SQL which make many small block (8-32KB) changes across a large range of the disk. In those cases the savings can be significant.
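
To make the small-block argument concrete, here is a minimal sketch of a simplified model (not Veeam's actual implementation) that counts how much data has to be stored when a few small writes land in different places on a disk. The function name and the sample offsets are made up for illustration; the block sizes are the ones mentioned in this thread for Local target (1MB) and WAN target (256KB).

```python
KB = 1024
MB = 1024 * KB

def changed_bytes(write_offsets, write_size, block_size):
    """Bytes that must be stored if every touched block is saved whole."""
    touched = set()
    for offset in write_offsets:
        first = offset // block_size
        last = (offset + write_size - 1) // block_size
        touched.update(range(first, last + 1))
    return len(touched) * block_size

# Three hypothetical 8KB database page writes scattered across a disk.
writes = [0, 300 * KB, 5 * MB]

local_target = changed_bytes(writes, 8 * KB, 1 * MB)    # 1MB blocks
wan_target = changed_bytes(writes, 8 * KB, 256 * KB)    # 256KB blocks
print(local_target // KB, "KB vs", wan_target // KB, "KB")
```

In this toy case the same 24KB of real changes costs 2048KB with 1MB blocks and 768KB with 256KB blocks. A fileserver that rewrites large contiguous ranges would show a much smaller gap.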

The memory in the server isn't really the limiting factor. The VeeamAgent.exe file is a 32-bit process, even on 64-bit platforms. Because of this, its memory usage is limited to ~1.7GB of RAM on Windows. Once the process's working set grows to this point, performance of the backup process goes downhill quickly, and it can even lead to memory allocation errors that cause backups to fail. The recommendation to keep backup sizes to <2TB when using WAN target is based on observed memory usage in real world deployments and does provide some leeway for future growth of the job.

In general the recommended sizes are <2TB for WAN target, <4TB for LAN target, <8TB for Local target. This provides room for future growth of the job going forward. With 6.5 we have introduced a new "local" mode for very large servers (>16TB).
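
As a quick way to apply these recommendations, the sketch below picks the setting with the smallest block size whose recommended limit still covers a job. The function and variable names are hypothetical, and only the three limits quoted above are encoded; this is a rule-of-thumb lookup, not anything the product itself does.

```python
# Recommended upper bounds (TB of VM data per job) as quoted in this thread,
# ordered from the smallest dedupe block (WAN) to the largest (Local).
RECOMMENDED_MAX_TB = [("WAN target", 2), ("LAN target", 4), ("Local target", 8)]

def most_aggressive_target(total_vm_size_tb):
    """Return the smallest-block setting whose recommended limit covers the job."""
    for target, limit_tb in RECOMMENDED_MAX_TB:
        if total_vm_size_tb < limit_tb:
            return target
    return None  # larger than every recommendation listed here

for size_tb in (1.8, 3.0, 6.5, 12.0):
    print(size_tb, "TB ->", most_aggressive_target(size_tb))
```

A result of None means the job is larger than any of the three limits and should probably be split into smaller jobs.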
rawtaz
Expert
Posts: 100
Liked: 15 times
Joined: Jan 27, 2012 4:42 pm
Contact:

Re: A question about deduplication

Post by rawtaz »

tsightler wrote:You have to run a new full backup before any changes take effect at all. Be aware that choosing WAN target will increase the memory requirements of the Veeam agents by 4x. It's strongly suggested that the entire size of all VMs in a backup job using WAN target is limited to 2TB or less.
Is it really true that if you change from LAN to WAN target (or the reverse) on an existing (previously run) reverse incremental job, the changed setting will not take effect until you run a new full backup of this job? If yes, is it the same for the compression setting?

The user guide does not mention this in detail, it just states that "Changing the compression level and deduplication settings in an existing job will not have any effect on previously created backup files. It will affect only those backups that will be created after you set the new settings." - Perhaps it would be a good idea to adjust the User Guide so that it is clearer on this point.

I'd also like to ask, in case the changed deduplication setting does take effect without a full backup run, will any of the previous deduplication data (that is not of the same block size as the new setting) have any effect, or will the subsequent runs only dedupe using the data backed up from that point in time (i.e. the previously backed up data being useless in terms of deduplication)?

Thanks!
foggy
Veeam Software
Posts: 21139
Liked: 2141 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: A question about deduplication

Post by foggy »

rawtaz wrote:Is it really true that if you change from LAN to WAN target (or the reverse) on an existing (previously run) reverse incremental job, the changed setting will not take effect until you run a new full backup of this job? If yes, is it the same for the compression setting?
Yes, changing compression and storage optimization settings will not have any effect on the existing VBK files, since they are already compressed and deduped using the settings that were in effect when they were created. You need to run a new full backup for these changes to take effect.
rawtaz wrote:The user guide does not mention this in detail, it just states that "Changing the compression level and deduplication settings in an existing job will not have any effect on previously created backup files. It will affect only those backups that will be created after you set the new settings." - Perhaps it would be a good idea to adjust the User Guide so that it is clearer on this point.
I'll pass this feedback to our tech writing team, thanks!
rawtaz
Expert
Posts: 100
Liked: 15 times
Joined: Jan 27, 2012 4:42 pm
Contact:

Re: A question about deduplication

Post by rawtaz »

foggy wrote: Yes, changing compression and storage optimization settings will not have any effect on the existing VBK files, since they are already compressed and deduped using the settings that were in effect when they were created. You need to run a new full backup for these changes to take effect.
I understand that it won't affect the existing backup files (such as "re-deduplicating" them), as "refactoring" them would take a lot of processing. But can you clarify what happens with the subsequent ones? For example:

- If just continuing to run the job as usual (not full backup) will the "new"/next .VBK be compressed and deduplicated with the same settings as before (e.g. as LAN target if you changed from LAN target to WAN target, as that's the best it can do)?
- Or will the "new"/next .VBK not be deduplicated at all (as the job's new/current setting doesn't match the existing files' deduplication data, which therefore cannot be used)? This is probably not the case as it would be insane :)
foggy
Veeam Software
Posts: 21139
Liked: 2141 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: A question about deduplication

Post by foggy »

rawtaz wrote:- If just continuing to run the job as usual (not full backup) will the "new"/next .VBK be compressed and deduplicated with the same settings as before (e.g. as LAN target if you changed from LAN target to WAN target, as that's the best it can do)?
Yes, the reverse incremental job will keep using its previous settings (unlike the forward incremental backup mode).
rawtaz
Expert
Posts: 100
Liked: 15 times
Joined: Jan 27, 2012 4:42 pm
Contact:

Re: A question about deduplication

Post by rawtaz »

One more related question: When restoring a VM from backup, is the deduplication level relevant as well? I presume that the data transferred from the backup repository (being a Linux server) to the target (being Veeam running on the host to which it is restoring) will be the actual deduplicated data, and that the LAN vs WAN deduplication level will make a difference here as well (not just when backing up).
rawtaz
Expert
Posts: 100
Liked: 15 times
Joined: Jan 27, 2012 4:42 pm
Contact:

Re: A question about deduplication

Post by rawtaz »

tsightler wrote:In general the recommended sizes are <2TB for WAN target, <4TB for LAN target, <8TB for Local target. This provides room for future growth of the job going forward. With 6.5 we have introduced a new "local" mode for very large servers (>16TB).
I'd be happy if someone clarified what "size" refers to here. Is it the combined size of all the VM disks that are part of the job, or is it the size of individual VMs, or something else?

EDIT: Is there any way to find out how much memory is used by the proxy to keep track of the deduplication meta data? I'd like to know how much room I have to play with when deciding LAN vs WAN target for an offsite job.
veremin
Product Manager
Posts: 20413
Liked: 2302 times
Joined: Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin
Contact:

Re: A question about deduplication

Post by veremin »

I'd be happy if someone clarified what "size" refers to here. Is it the combined size of all the VM disks that are part of the job, or is it the size of individual VMs, or something else?
The number refers to the total size of the backup data. In other words, it's recommended that the combined size of all VMs in a job doesn't exceed 2TB for WAN target, 4TB for LAN target, and so on.

Thanks.
foggy
Veeam Software
Posts: 21139
Liked: 2141 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: A question about deduplication

Post by foggy »

rawtaz wrote:EDIT: Is there any way to find out how much memory is used by the proxy to keep track of the deduplication meta data? I'd like to know how much room I have to play with when deciding LAN vs WAN target for an offsite job.
For 1MB block (Local target), you can take 1GB of memory for each 1TB of data to store dedupe hashes as a rough estimation. Selecting WAN target (256KB block) will increase memory consumption by 4x. This rate is the same for backup and restore operations.
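
The rule of thumb above can be written down as a small calculation. This is only a sketch of the rough estimate, with hypothetical names; the 512KB block size for LAN target is an assumption (it is not stated in this post), and the scaling simply assumes memory grows in proportion to the number of blocks.

```python
# Block sizes in KB. Local (1MB) and WAN (256KB) are from the post above;
# the LAN value of 512KB is an assumption.
BLOCK_KB = {"local": 1024, "lan": 512, "wan": 256}

# Rough estimate from the post: 1GB of memory per 1TB of data at 1MB blocks.
GB_PER_TB_AT_1MB = 1.0

def estimate_hash_memory_gb(data_tb, target):
    """Rough memory needed for dedupe hashes, scaled by block count."""
    scale = BLOCK_KB["local"] / BLOCK_KB[target]
    return data_tb * GB_PER_TB_AT_1MB * scale

for target in ("local", "lan", "wan"):
    print(target, estimate_hash_memory_gb(2, target), "GB for 2TB of data")
```

Treat the result as an upper-end estimate for planning, not as a measurement of what a running agent actually uses.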
rawtaz
Expert
Posts: 100
Liked: 15 times
Joined: Jan 27, 2012 4:42 pm
Contact:

Re: A question about deduplication

Post by rawtaz »

foggy wrote: For 1MB block (Local target), you can take 1GB of memory for each 1TB of data to store dedupe hashes as a rough estimation. Selecting WAN target (256KB block) will increase memory consumption by 4x. This rate is the same for backup and restore operations.
Thank you for that pointer. I take it there's no way to monitor the real/actual size of a job's dedupe hashes.

However, what happens when the dedupe hashes exceed what the agent can handle? Say for example that you back up 3TB data with a job set to LAN target. One would then calculate around 3GB (for 1GB per 1TB data) * 2 (for LAN target), meaning dedupe hashes around 6GB. You said earlier that the VeeamAgent.exe file is a 32-bit process (even on 64-bit platforms), so it can deal with about 1.7 GB memory. How does this process/Veeam handle things when all dedupe data doesn't fit into memory, like in the above calculation?

By the way, VeeamAgent.exe is the one that on the source side does the processing, compression and deduplication of the data to be backed up, right? Also, why not make it 64 bit?
tsightler
VP, Product Management
Posts: 6035
Liked: 2860 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: A question about deduplication

Post by tsightler »

rawtaz wrote:However, what happens when the dedupe hashes exceed what the agent can handle? Say for example that you back up 3TB data with a job set to LAN target. One would then calculate around 3GB (for 1GB per 1TB data) * 2 (for LAN target), meaning dedupe hashes around 6GB. You said earlier that the VeeamAgent.exe file is a 32-bit process (even on 64-bit platforms), so it can deal with about 1.7 GB memory. How does this process/Veeam handle things when all dedupe data doesn't fit into memory, like in the above calculation?
Actually, the 1GB per 1TB is a pretty high number; actual usage is generally a good bit less than that. But that's why there's a recommended maximum job size in the best practice guide of no more than 4TB of VM data in a job for LAN target. You will definitely exceed 1.7GB of RAM if you start pushing those limits or have very long backup chains, and from that point, job failures are likely. It's actually far better to keep the jobs smaller and run more in parallel.

The VeeamAgent on the source side actually only keeps hashes for the disk that it's actively backing up, the repository side VeeamAgent is normally the one that grows the largest since it dedupes across the entire job, and there will be a 64-bit VeeamAgent in V7. Of course, that simply means you'll need to allocate more memory to your proxies and repositories.
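
One way to act on the "smaller jobs, more in parallel" advice is to group VMs so that each job stays under the recommended size. The sketch below uses a plain first-fit-decreasing heuristic; it is not a Veeam feature, and the VM names and sizes are invented for illustration.

```python
def split_into_jobs(vm_sizes_gb, max_job_gb):
    """First-fit decreasing: place each VM in the first job that stays under the limit."""
    jobs = []  # each job is a list of (vm_name, size_gb) tuples
    ordered = sorted(vm_sizes_gb.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in ordered:
        if size >= max_job_gb:
            raise ValueError(f"{name} alone reaches the job limit")
        for job in jobs:
            if sum(s for _, s in job) + size < max_job_gb:
                job.append((name, size))
                break
        else:
            # No existing job has room, so start a new one.
            jobs.append([(name, size)])
    return jobs

# Hypothetical VMs (sizes in GB), grouped under the <4TB LAN target guideline.
vms = {"exchange": 1500, "sql": 1200, "files": 2500, "web": 300, "dc": 200}
jobs = split_into_jobs(vms, max_job_gb=4000)
for number, job in enumerate(jobs, start=1):
    print("job", number, [name for name, _ in job], sum(s for _, s in job), "GB")
```

Grouping by size alone ignores which VMs dedupe well against each other, so it is only a starting point.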
rawtaz
Expert
Posts: 100
Liked: 15 times
Joined: Jan 27, 2012 4:42 pm
Contact:

Re: A question about deduplication

Post by rawtaz »

tsightler wrote: Actually, the 1GB per 1TB is a pretty high number; actual usage is generally a good bit less than that. But that's why there's a recommended maximum job size in the best practice guide of no more than 4TB of VM data in a job for LAN target. You will definitely exceed 1.7GB of RAM if you start pushing those limits or have very long backup chains, and from that point, job failures are likely. It's actually far better to keep the jobs smaller and run more in parallel.

The VeeamAgent on the source side actually only keeps hashes for the disk that it's actively backing up, the repository side VeeamAgent is normally the one that grows the largest since it dedupes across the entire job, and there will be a 64-bit VeeamAgent in V7. Of course, that simply means you'll need to allocate more memory to your proxies and repositories.
So if I understand you correctly, the repository side agent will hit the memory limit first. At that point, what will happen? Will it swap? When will it/the run fail?

There is a very big difference between what the recommendations (<2TB for WAN target and <4TB for LAN target) state and what the numbers/calculation give. I currently need to deploy an offsite Linux backup repository (64-bit), and the biggest job can reach just below those 2TB (four Windows VMs, of which the biggest is currently 1TB but might grow to 1.8TB or so, though not for at least a year). Let's take that as an example:

- The recommendation says that I can back these 1.8-2TB up using WAN target dedupe.
- The 32-bit discussion says that the agent on the repository side can only handle 1.7GB dedupe hashes.
- The numbers/calculations (based on 1GB per 1TB) give that with 1.7GB memory the repository agent can handle dedupe hashes for just 430GB of data when using WAN target (2TB * 1.7GB/(2TB*1GB*4)) and 850GB of data when using LAN target (2TB * 1.7GB/(2TB*1GB*2)).
- Generously change the base to be 0.5GB memory per 1TB data to be deduplicated and we get a limit of 860GB for WAN target and 1700GB for LAN target. This is still way different from the recommendation that is said to be based on observed memory usage in real world deployments. It's 860GB vs 2TB and 1.7TB vs 4TB.
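
The arithmetic in the list above can be checked with a few lines. This is just the rule of thumb turned around, with hypothetical names: given a memory budget, how much data fits? The 4x (WAN) and 2x (LAN) factors and the 1.7GB budget are the figures used in this thread.

```python
def max_data_tb(memory_budget_gb, gb_per_tb_at_1mb, block_factor):
    """Data (TB) whose dedupe hashes fit in the given memory budget."""
    return memory_budget_gb / (gb_per_tb_at_1mb * block_factor)

BUDGET_GB = 1.7  # approximate limit of a 32-bit Windows process, per this thread

for base in (1.0, 0.5):  # GB of memory per TB of data at 1MB blocks
    wan = max_data_tb(BUDGET_GB, base, 4)
    lan = max_data_tb(BUDGET_GB, base, 2)
    print(f"base {base} GB/TB: WAN {wan:.3f} TB, LAN {lan:.3f} TB")
```

It gives roughly 0.43TB and 0.85TB at 1GB per TB, and roughly 0.85TB and 1.7TB at 0.5GB per TB, which is the gap to the 2TB and 4TB recommendations described above.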

I am not trying to question the recommendation, just illustrating why I still don't know whether I dare run WAN target for the <2TB example job I described above. As you can see, there's quite a gap between what the recommendation says (that I can use WAN target without worry) and what the numbers say (that there's no way I can use WAN target).

Unfortunately I need to deploy an offsite server ASAP and cannot wait for v7, that's why I'm trying to determine WAN vs LAN target. Which dedupe level would you use in this situation/with the above job? Perhaps after all I should go with WAN target for now, since it's highly likely that we'll be on v7 by the time the VMs reach the 2TB "limit", and then the repository side agent shouldn't have a problem, nor the source side one given a little more memory. Does that make sense?

If you are "sure" that following the best practice will be successful (assuming the Veeam agents get their 1.7GB memory) then I'm all ears since I would prefer to dedupe as much as possible (i.e. use WAN target), it's just that the numbers are so way off that recommendation, even with 50% margins included.

Thanks a lot for following up!
tsightler
VP, Product Management
Posts: 6035
Liked: 2860 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: A question about deduplication

Post by tsightler »

I actually wrote the best practice guide, and those numbers are based on what I've seen in the "real world" regarding memory utilization, etc. Honestly I'm not sure where Alexander came up with the formula he provided, but by his estimate you could only have about 2TB of data in a backup job even using local target, and I know this simply isn't the case. However, there's certainly nothing wrong with being cautious/conservative with the numbers. In the end, it's really based more on the number of hashes than anything, so the total number of backups in the chain also matters since each one adds new hashes. That's why it's hard to put an exact number on it, but I deploy Veeam in large environments for a living so I'll stand behind my numbers in the best practice guide for now.

That being said, you do have a little bit of headroom since your target is Linux. Linux 32-bit processes can address almost the entire 4GB of memory when running on a 64-bit OS, and even on a 32-bit OS can access 2.7GB or 3.5GB (based on kernel options), quite a bit more than their Windows counterparts.

As far as "what will happen", as I said earlier you will begin to see job failures, likely with error messages like "unable to allocate memory for array" or some such. My suggestion would be to also "err on the side of caution", and use LAN target since you are so close to the 2TB maximum size. Honestly, the difference between the two is likely to be minimal and you'll be 100% safe no matter what using LAN target and when V7 comes out you'll be even better off.
Andreas Neufert
VP, Product Management
Posts: 7081
Liked: 1511 times
Joined: May 04, 2011 8:36 am
Full Name: Andreas Neufert
Location: Germany
Contact:

Re: A question about deduplication

Post by Andreas Neufert »

To avoid problems at VM restore...
If you want to change the block size from LAN to WAN, please update to the latest version and patch level first.
In any case, run an "Active" full backup.
Andreas Neufert
VP, Product Management
Posts: 7081
Liked: 1511 times
Joined: May 04, 2011 8:36 am
Full Name: Andreas Neufert
Location: Germany
Contact:

Re: A question about deduplication

Post by Andreas Neufert »

If you are looking at the WAN setting because you back up over small WAN links, it might be a good idea to look at the new WAN optimization feature in v7. This will help you more than this setting. For all customers on the Enterprise edition, this option is "free" (the Enterprise Plus upgrade is free of charge).
rawtaz
Expert
Posts: 100
Liked: 15 times
Joined: Jan 27, 2012 4:42 pm
Contact:

Re: A question about deduplication

Post by rawtaz »

Andreas Neufert wrote:If you are looking at the WAN setting because you back up over small WAN links, it might be a good idea to look at the new WAN optimization feature in v7. This will help you more than this setting. For all customers on the Enterprise edition, this option is "free" (the Enterprise Plus upgrade is free of charge).
Not an option in this case. It's unfortunate that the WAN optimization feature is only in the highest level/edition of the product offerings.

Also, the backup copy job seems to copy an existing backup, which would be more error prone than having two separate backup jobs. I say that because Veeam backups can fail too (there have been a bunch of threads in this forum, and the conclusion is that even though Veeam is solid it isn't 100% rock solid, as no one from Veeam seems willing to say that Veeam will for sure be able to detect corruption in backup files). For that reason it feels safer to have two separate backups going rather than having just one that is being copied to a second place.
Dima P.
Product Manager
Posts: 14726
Liked: 1706 times
Joined: Feb 04, 2013 2:07 pm
Full Name: Dmitry Popov
Location: Prague
Contact:

Re: A question about deduplication

Post by Dima P. »

rawtaz,
the backup copy job seems to copy an existing backup
That is correct. As for backup integrity, you can run a SureBackup job before running the copy job over a slow link using WAN optimization and have the existing backup checked at any time, so the question "is my backup recoverable?" won't arise.
Andreas Neufert
VP, Product Management
Posts: 7081
Liked: 1511 times
Joined: May 04, 2011 8:36 am
Full Name: Andreas Neufert
Location: Germany
Contact:

Re: A question about deduplication

Post by Andreas Neufert »

You can also do two local backups and replicate one of them to the second site with the new WAN option.
Yes, I know that this doubles the space used on the first site in the scenario you mentioned.
SureBackup and the automatic self-checking in WAN optimization are a good, safe way around it if you rely on only one chain.

If WAN is such a limiting factor, it would be a good option to use the one integrated in Veeam before you buy expensive WAN optimization hardware.
rawtaz
Expert
Posts: 100
Liked: 15 times
Joined: Jan 27, 2012 4:42 pm
Contact:

Re: A question about deduplication

Post by rawtaz »

d.popov wrote: That is correct. As for backup integrity, you can run a SureBackup job before running the copy job over a slow link using WAN optimization and have the existing backup checked at any time, so the question "is my backup recoverable?" won't arise.
I'm sorry, but SureBackup will (AFAIK) not check the integrity of the entire VM and all of its data, unless you set up additional tests for the SureBackup job that it would use/run to verify all of the data in the VM. In summary, that is quite a different task/beast from the regular integrity check of the backup files on the repository disk that I'm referring to.

Also, running this automatically is something you need the Enterprise edition for, and the WAN acceleration is an Enterprise+ feature. These things combined effectively mean that what you suggest is not a viable option to verify that backups are indeed 100% good, especially when you're on a Standard license.

For that reason (i.e. without a major SureBackup configuration and Enterprise license) I think it would be considered safer to run two separate backup jobs so that even if one of them would for some reason become corrupt, the other one would hopefully be okay. I'm open to other opinions on this, but based on all the threads I've read regarding integrity verification this is the only conclusion I see that one can draw. I've done quite a bit of reading in the forum on this topic.
Kernel Panic
Lurker
Posts: 1
Liked: never
Joined: Jun 04, 2013 3:51 pm
Contact:

Re: A question about deduplication

Post by Kernel Panic »

Do the guidelines for deduplication storage optimization settings apply the same for replication as they do for backups? I have a 6TB VM I'd like to replicate off site and the idea of smaller block size is appealing to reduce WAN utilization. I'd like to do it right the first time since we'll have to bring the remote equipment on-site to seed the replication.
foggy
Veeam Software
Posts: 21139
Liked: 2141 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: A question about deduplication

Post by foggy »

Since each VM is replicated and stored separately (in VMware native format), the effect of this setting for replicas is not that essential.
rawtaz
Expert
Posts: 100
Liked: 15 times
Joined: Jan 27, 2012 4:42 pm
Contact:

Re: A question about deduplication

Post by rawtaz »

tsightler wrote:I actually wrote the best practice guide, and those numbers are based on what I've seen in the "real world" regarding memory utilization, etc. Honestly I'm not sure where Alexander came up with the formula he provided, but by his estimate you could only have about 2TB of data in a backup job even using local target, and I know this simply isn't the case. However, there's certainly nothing wrong with being cautious/conservative with the numbers. In the end, it's really based more on the number of hashes than anything, so the total number of backups in the chain also matters since each one adds new hashes. That's why it's hard to put an exact number on it, but I deploy Veeam in large environments for a living so I'll stand behind my numbers in the best practice guide for now.

That being said, you do have a little bit of headroom since your target is Linux. Linux 32-bit processes can address almost the entire 4GB of memory when running on a 64-bit OS, and even on a 32-bit OS can access 2.7GB or 3.5GB (based on kernel options), quite a bit more than their Windows counterparts.

As far as "what will happen", as I said earlier you will begin to see job failures, likely with error messages like "unable to allocate memory for array" or some such. My suggestion would be to also "err on the side of caution", and use LAN target since you are so close to the 2TB maximum size. Honestly, the difference between the two is likely to be minimal and you'll be 100% safe no matter what using LAN target and when V7 comes out you'll be even better off.
Just to follow/wrap up on this discussion, I ended up going with WAN target after all. The main reason is that I watched the VeeamAgent process on the proxy/server VM, and its memory usage was quite low, far from 1.7GB. That, combined with the fact that the agents should be 64-bit in the near future, and everything else that has been said, makes me pretty sure this won't be a problem.
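
For anyone who wants to do the same kind of check, here is a minimal sketch that reads the memory column from the output of the Windows command `tasklist /FO CSV /FI "IMAGENAME eq VeeamAgent.exe"`. The column names assume an English-language Windows, the sample text and its numbers are made up for illustration, and the helper name is hypothetical.

```python
import csv
import io

def parse_tasklist_csv(text):
    """Return {pid: memory_kb} from `tasklist /FO CSV` output."""
    usage = {}
    for row in csv.DictReader(io.StringIO(text)):
        # "Mem Usage" looks like "412,880 K"; keep only the digits.
        digits = "".join(ch for ch in row["Mem Usage"] if ch.isdigit())
        usage[int(row["PID"])] = int(digits)
    return usage

# Invented sample output; real values will differ.
sample = (
    '"Image Name","PID","Session Name","Session#","Mem Usage"\n'
    '"VeeamAgent.exe","4312","Services","0","412,880 K"\n'
    '"VeeamAgent.exe","5120","Services","0","96,204 K"\n'
)

usage = parse_tasklist_csv(sample)
for pid, kb in usage.items():
    print(pid, round(kb / (1024 * 1024), 2), "GB")
```

Sampling this a few times during a job run gives a rough idea of how close the largest agent process gets to the ~1.7GB ceiling discussed earlier.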

Thank you, tsightler and others, for your valuable opinions on this matter. The documentation is great, but there are still a lot of technical things that need vetting to fully understand relevant aspects.
foggy
Veeam Software
Posts: 21139
Liked: 2141 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: A question about deduplication

Post by foggy »

Thanks for sharing your findings, real-world practices are always much appreciated.
kte
Expert
Posts: 179
Liked: 8 times
Joined: Jul 02, 2013 7:48 pm
Full Name: Koen Teugels
Contact:

Re: A question about deduplication

Post by kte »

Is the Veeam agent still 32-bit in version 7?

K
foggy
Veeam Software
Posts: 21139
Liked: 2141 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: A question about deduplication

Post by foggy »

Veeam B&R v7 introduces 64-bit backup repository agents. However, agents running on the proxy servers are still 32-bit.
lars@norstat.no
Expert
Posts: 110
Liked: 14 times
Joined: Nov 01, 2011 1:44 pm
Full Name: Lars Skjønberg
Contact:

Re: A question about deduplication

Post by lars@norstat.no »

Why?
tsightler
VP, Product Management
Posts: 6035
Liked: 2860 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: A question about deduplication

Post by tsightler »

Because the proxy component is the piece that integrates with the VDDK, switching to a 64-bit agent on the proxy also means switching to the 64-bit VDDK. That introduces the potential for new bugs and thus requires even more testing, while the 32-bit agent and VDDK were already a well-proven and tested combination used in every previous version. Plus, the proxy agent doesn't really benefit much from 64-bit code: it doesn't use as much memory because it only does per-disk dedupe, while the repository requires enough memory to store hashes for the entire job.

Of course, this will change with the release of support for vSphere 5.5 since this version no longer includes a 32-bit version of the VDDK, likely due to the requirement to support >2TB VMDK. Based on previous history I wouldn't be at all surprised if the patch for 7.0 to support 5.5 only uses the 64-bit proxy/VDDK when it detects vSphere 5.5. This limits the exposure to new code only to those running the newest version of vSphere thus not causing issues for those more conservative users.