-
- Enthusiast
- Posts: 48
- Liked: 3 times
- Joined: Apr 28, 2011 5:34 pm
- Full Name: JG
- Contact:
Re: Veeam, DataDomain and Linux NFS share
It depends on what you want. Enable dedup and compression if you want less data going over the wire to the DD box.
Disable them if you want more compression and dedup on the Data Domain box (at the expense of more network traffic).
However, I have not done enough testing to see if the dedup/compression ratio on the Data Domain box improves that much with it off in Veeam. With it on in Veeam I see good compression and dedup rates on the DD box. Enough to make me happy to get less data across the wire.
-
- VP, Product Management
- Posts: 6035
- Liked: 2860 times
- Joined: Jun 05, 2009 12:57 pm
- Full Name: Tom Sightler
- Contact:
Re: Veeam, DataDomain and Linux NFS share
Pretty much all "good" compression will have a significant negative impact on dedupe, and will lower the dedupe ratio of a dedupe appliance significantly. It is generally never suggested to use compression when writing to any dedupe appliance. Effectively, compression is already "dedupe", just of a single stream, working with a smaller dictionary, while dedupe uses a large dictionary and looks at a much larger data set to find duplicated strings. If you have 50 different files which all contain the same string, a dedupe appliance will reduce the number of times that string is stored on disk to 1, and then compress that string as well. If those 50 files are compressed with zlib, those strings will likely all become unique, because each stream will have been compressed with a different dictionary, so they can't easily be "deduped" further. This is a common error when customers implement a dedupe appliance. Honestly, if you are going to compress the data sent to a dedupe appliance, you probably should have saved the money and just purchased regular disk.
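The "compression is dedupe of a single stream" point is easy to see with a few lines of Python. This is an illustrative sketch only, not anything Veeam or a Data Domain actually does:

```python
import zlib

# Roughly 2.8KB of repetitive "file" content.
block = b"the quick brown fox jumps over the lazy dog. " * 64

once  = zlib.compress(block)            # one copy of the data
twice = zlib.compress(block + block)    # the same data repeated in ONE stream

# Within a single stream the compressor spots the repeat almost for free,
# so "twice" is barely larger than "once".
print(len(once), len(twice))

# Two independently compressed streams, however, each carry their own full
# encoding of the block; neither can reference the other's copy, so there
# is nothing left for a downstream dedupe pass to merge.
copy1 = zlib.compress(b"header-A" + block)
copy2 = zlib.compress(b"header-B" + block)
print(len(copy1) + len(copy2))
```

The second number is roughly double the first: separately compressed copies of identical data each pay the full storage cost, which is exactly the duplicate data a dedupe appliance would otherwise have eliminated.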
On the other hand, our dedupe and DD dedupe get along pretty nicely because they target the data in different ways. Veeam dedupe is largely focused on eliminating duplicate blocks within a single backup job, while DD dedupe is focused on eliminating duplicate "bit patterns" within an entire file store. The following is an extremely simple example to help explain the idea:
Assume we have 6 blocks on disk as follows:
12345678 12348765 43215678 56781234 12345678 43215678
So you will notice that blocks 1 & 5 are identical, as well as blocks 3 & 6; Veeam will thus only send the following blocks across the wire:
12345678 12348765 43215678 56781234
When it gets to block 5 it will simply send an indicator saying that this block matches block 1, and that block 6 matches block 3. But notice that there is still plenty of "deduplication" left to do, since several patterns are repeated multiple times:
1234 - 3x
5678 - 3x
8765 - 1x
4321 - 1x
So when the dedupe appliance "dedupes" this data, we will have:
1234 5678 8765 4321
Now we can compress the data. If each number currently represents one byte, then each "deduped" string represents 32 bits. However, there are only 4 total patterns in this simple string, so I can represent every pattern with only 2 bits. I will create a dictionary table as follows:
1234 = 00
5678 = 01
8765 = 10
4321 = 11
Now the data is:
00 01 10 11, which represents the original data: effectively only 1 byte to represent what was originally 48 bytes worth of data.
To be completely fair, I took significant artistic license with this example; it's not 100% technically accurate, but the goal was to outline the concept and show why Veeam dedupe does not interfere with the dedupe on the appliance, although it will slightly decrease the reported dedupe ratio from the appliance's perspective, since we obviously reduce the amount of data sent in the first place. In the example above, the dedupe appliance would likely report a 6:1 dedupe ratio if Veeam dedupe was disabled, but 4:1 if Veeam dedupe was enabled, because we eliminated the duplicate data before it got to the appliance. The final amount of data on storage would be exactly the same.
If you really want to use compression when writing to a dedupe appliance, "Low" compression is probably the best bet. This compression uses a fixed dictionary and is somewhat predictable. It will still lower dedupe, by about 20-30% in testing, but it will reduce the data going to the dedupe appliance, which can make backups faster.
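Tom's two-layer example can be sketched in a few lines of Python. This is a toy model of the idea only; real Veeam/DD chunking and encoding work very differently:

```python
# Toy model of the two dedupe layers in the example above.
blocks = ["12345678", "12348765", "43215678", "56781234",
          "12345678", "43215678"]

# Layer 1 ("Veeam"): suppress duplicate blocks within the job,
# preserving order, so blocks 5 and 6 are never sent.
seen, sent = set(), []
for b in blocks:
    if b not in seen:
        seen.add(b)
        sent.append(b)

# Layer 2 ("appliance"): dedupe the 4-character patterns inside
# whatever arrived, building the dictionary of unique patterns.
patterns = []
for b in sent:
    for i in (0, 4):
        p = b[i:i + 4]
        if p not in patterns:
            patterns.append(p)

# 4 unique patterns can each be addressed with 2 bits.
bits_per_symbol = (len(patterns) - 1).bit_length()

print(sent)             # ['12345678', '12348765', '43215678', '56781234']
print(patterns)         # ['1234', '5678', '8765', '4321']
print(bits_per_symbol)  # 2
```

Running it reproduces the example: 6 blocks reduce to 4 across the wire, the appliance finds only 4 unique patterns, and each pattern reference needs just 2 bits.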
-
- Veteran
- Posts: 367
- Liked: 41 times
- Joined: May 15, 2012 2:21 pm
- Full Name: Arun
- Contact:
Re: Veeam, DataDomain and Linux NFS share
Thanks a lot Tom for the excellent explanation of the dedup with Veeam and DD. Great job!
I always understood and thought that using DD as a Veeam backup repository was normal and what most people do.
According to EMC, if I upgrade the existing 2.5TB DD to a total of 6TB, I would get a dedup ratio of 20x and would effectively be able to store up to 50TB if the data was sent natively to the DD without Veeam's dedup and compression.
Now I have to make a choice between upgrading the DD or using some non-dedup disk, as I am running short of disk space and eventually want to stop backing up to tape.
-
- Veteran
- Posts: 367
- Liked: 41 times
- Joined: May 15, 2012 2:21 pm
- Full Name: Arun
- Contact:
Re: Veeam, DataDomain and Linux NFS share
This is cool. Now when it comes to sizing the storage for the backup repository, I need help doing some serious sizing for backup storage.
The total data which I need Veeam to back up comes to 5-7TB, which I need to back up to the Data Domain, so I need to know how much storage space I need.
For this I know I need to have my retention time for the backups clearly defined.
Let's assume I need the retention time for my backups to be the following:
1) Daily backups = 30 days
2) 1 full monthly backup = 1 year
3) 1 full yearly backup = 10 years
Let's say I am not going to have any tape backups. Based on the data size, how much storage space do I need for the Data Domain to fulfill this retention policy for Veeam backups?
Thanks!
-
- VP, Product Management
- Posts: 6035
- Liked: 2860 times
- Joined: Jun 05, 2009 12:57 pm
- Full Name: Tom Sightler
- Contact:
Re: Veeam, DataDomain and Linux NFS share
The 10 year requirement is very difficult to calculate since it will be based on the amount of unique data that is generated during those 10 years, and things can change a lot in that time. The reality is, for a dedupe appliance, that's pretty much always the question: how much total "unique" data do you have, or will you create? This can vary tremendously based on the type of data you are backing up. For example, if your content consists of many "image" scans, those are usually already compressed and very unique, so dedupe won't help a whole lot.
Since you already have a DD, you should be able to project the requirements for a year based on how much disk space the initial full backups take in the pool, how much the data grows in a day (unique data per day), and how much the pool grows in a month (unique data per month). Once you know these numbers, the total required space for a year would be something like (total used space for full backup) + (unique data per day * 30) + (unique data per month * 12). Obviously your Data Domain team is really the group that should be able to help you with sizing in this case; since Veeam will simply be sending all data uncompressed to the DD, it's completely up to them how much space it will end up taking.
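Plugging made-up numbers into that formula shows the shape of the calculation. The figures below are purely hypothetical; substitute values measured on your own Data Domain:

```python
# Back-of-envelope sizing per the formula above (all figures hypothetical).
full_on_disk     = 500   # GB used by the initial full backups after dedupe
unique_per_day   = 10    # GB of new unique data added per daily backup
unique_per_month = 50    # GB of new unique data added per monthly full

# (full backup) + (unique/day * 30) + (unique/month * 12)
one_year = full_on_disk + unique_per_day * 30 + unique_per_month * 12
print(one_year)  # 500 + 300 + 600 = 1400 GB
```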
I've personally never seen a production DD that gets anywhere near 50x unless it is storing full database dumps every day for months, or something like that. Most ratios I've seen from customers are in the 10-15x range, with a few slightly less than that, and a few slightly higher (the highest I've personally seen is in the upper 20s). One customer told me that he was seeing 60x dedupe with Veeam, but it turned out he was running weekly fulls and had a year of them on his DD, which means he had 52 weeks of full backups to dedupe, which of course bumped up the amount of duplicate data sent to the DD.
-
- VeeaMVP
- Posts: 6166
- Liked: 1971 times
- Joined: Jul 26, 2009 3:39 pm
- Full Name: Luca Dell'Oca
- Location: Varese, Italy
- Contact:
Re: Veeam, DataDomain and Linux NFS share
Also, consider that the yearly backup copies would be simple VBK files without any incremental chains, and so would the 12 monthly backups. So Tom, I think an even better calculation could be (assuming forward incremental):
In 1 month, 5 full backups and 24 daily incrementals
PLUS
Full backup * 22 (12 monthly copies for 1 year and 10 yearly copies) for long term retention
You would also need some scripting or manual intervention to save those long term backup files in a separate directory on the DD, to keep them away from the usual Veeam rotation.
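Luca's counting turns into a simple worst-case capacity estimate. The per-file sizes here are hypothetical placeholders, and no dedupe between the fulls is assumed (which, as noted later in the thread, makes this a deliberate worst case):

```python
# Worst-case capacity, following the counting above (no dedupe assumed).
full_size = 2.5   # TB per full backup (hypothetical)
inc_size  = 0.25  # TB per daily incremental (hypothetical)

monthly_window = 5 * full_size + 24 * inc_size  # 5 fulls + 24 incrementals
long_term      = 22 * full_size                 # 12 monthly + 10 yearly fulls
total = monthly_window + long_term
print(total)  # 18.5 + 55.0 = 73.5 TB
```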
Luca Dell'Oca
Principal EMEA Cloud Architect @ Veeam Software
@dellock6
https://www.virtualtothecore.com/
vExpert 2011 -> 2022
Veeam VMCE #1
-
- VP, Product Management
- Posts: 6035
- Liked: 2860 times
- Joined: Jun 05, 2009 12:57 pm
- Full Name: Tom Sightler
- Contact:
Re: Veeam, DataDomain and Linux NFS share
Hi Luca, your math doesn't include any consideration for dedupe. If you have 12 monthly full backups, there will be significant duplicate data between them, so they won't likely take anywhere near full backup * 12 on the dedupe device. For example, if the first full backup is 200GB and is deduped/compressed 4:1 by the DD, it will take 50GB on disk. When you take the second month's full backup it might be 210GB, but a lot of its data will be the same as the first full backup, so only the "unique" data between the two backups will actually be added to the dedupe pool, likely increasing the entire used space by some minimal amount.
I have scripts that can automatically manage the archive backups on a DD when using a Linux system with NFS as the protocol for the DD. These are simple bash scripts that leverage the Data Domain's "fastcopy" command to make pretty much instant archives in a separate directory. I'm sure an equivalent could be written for Windows in PowerShell, but I wrote these bash scripts months ago and they are in use in quite a number of client environments, so I know they work well.
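Tom's scripts aren't posted in this thread, but the heart of such a script is a single fastcopy invocation sent to the Data Domain over SSH. A minimal sketch follows; the hostname, user and paths are placeholders, and the exact fastcopy syntax should be verified against your DD OS command reference:

```python
import subprocess  # only needed if you uncomment the run() call below
from datetime import date

def fastcopy_command(dd_host: str, user: str, src: str, dst: str) -> list:
    """Build the ssh command asking the Data Domain to fastcopy a backup
    directory into an archive location (near-instant, space-efficient)."""
    remote = f"filesys fastcopy source {src} destination {dst}"
    return ["ssh", f"{user}@{dd_host}", remote]

# Hypothetical paths: archive this month's job directory.
cmd = fastcopy_command(
    "dd01.example.com", "sysadmin",
    "/backup/veeam/job1",
    f"/backup/archive/job1-{date.today():%Y%m}",
)
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually execute it
```

Because fastcopy runs entirely on the appliance, the "copy" lands in the archive directory without rehydrating or re-sending any data, which is what makes the archives nearly instant.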
-
- Expert
- Posts: 152
- Liked: 24 times
- Joined: May 16, 2011 4:00 am
- Full Name: Alex Macaronis
- Location: Brisbane, Australia
- Contact:
Re: Veeam, DataDomain and Linux NFS share
I've just been through EMC's sales process for DataDomain. We're looking at a DD670.
Your sales rep will have a sizing tool for DataDomain that will show you what model of DataDomain you will need based on your backup requirements.
I'd tried to work it out based on the size of my Veeam backups, but really, it's best if you just know how much of each type of VM you have on disk.
e.g.
File Server = 2TB
Exchange = 1TB
General Servers = 4TB
The rep puts it all into his application and spits out the best unit for you.
-
- VP, Product Management
- Posts: 6035
- Liked: 2860 times
- Joined: Jun 05, 2009 12:57 pm
- Full Name: Tom Sightler
- Contact:
Re: Veeam, DataDomain and Linux NFS share
Then take that number and double it, and it will probably be close to what you'll really need, at least based on the customer experiences that I've seen.
-
- Veteran
- Posts: 367
- Liked: 41 times
- Joined: May 15, 2012 2:21 pm
- Full Name: Arun
- Contact:
Re: Veeam, DataDomain and Linux NFS share
I addressed this to EMC again, and this is their opinion:
"You do not store backups to disk for 10 years. This is not backup, but archive. A file "sales.doc" that is 10 years old is not a backup file, it is an archive copy of that file. It is too expensive to keep files like this on primary disk storage for backup.
What happens is that after 6 months or so, there are so many changes within the environment that the 5TB (if no growth) will be "like new" 5TB, and you will not get such high dedup.
Then again, you do not store 10 years of backups on primary backup storage...
We would not recommend it.
We at EMC have a solution for this, but it requires the DD860 with Extended Retention. Extended Retention is cheaper disk storage that takes care of the archive data from backup...
If you really want to do this then you will need about 5 times the capacity of the deduped full, meaning 5*2.5TB = 12.5TB on top of the 2.5TB and the incremental backups"...
-
- Chief Product Officer
- Posts: 31806
- Liked: 7300 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Veeam, DataDomain and Linux NFS share
Now this is interesting. To me, what they are saying is that for long-term archival, there is still no economical replacement for tape backups. Quite contrary to their usual marketing around how Data Domain renders tape obsolete. But it totally makes sense nevertheless... which is exactly why we are working on adding native tape support to Veeam B&R.
Looking at future LTO specs, there is just no way disk-based storage can ever match tape from a pricing perspective.
The future of long-term archival is self-contained, pre-deduped backup files archived to modern tapes. This is exactly why we designed our solution accordingly (with dedupe implemented on per-job basis, instead of globally).
Of course, for smaller backup sets, and depending on policies and requirements around data security, some customers may instead choose cloud archival storage such as Amazon Glacier, where, judging by the RTO Glacier provides, I assume data still ends up on tapes anyway.
-
- VP, Product Management
- Posts: 6035
- Liked: 2860 times
- Joined: Jun 05, 2009 12:57 pm
- Full Name: Tom Sightler
- Contact:
Re: Veeam, DataDomain and Linux NFS share
Absolutely agree with the 10 year issue, although there will still be a lot to dedupe (Word documents have the same words as 10 years ago, Excel spreadsheets have the same numbers, databases have the same tables with the same redundant data), but truly "unique" data will be an issue over that long a period. I'm not sure that paying EMC for even "cheaper" storage is the best option. Over 10 years you'd likely have to move the data somewhere else, as the current technology would likely age out of support at least once, and the cost of maintenance for those disks over 10 years would make them a very cost-inefficient option.
-
- VeeaMVP
- Posts: 6166
- Liked: 1971 times
- Joined: Jul 26, 2009 3:39 pm
- Full Name: Luca Dell'Oca
- Location: Varese, Italy
- Contact:
Re: Veeam, DataDomain and Linux NFS share
Interesting, I was thinking about this thread today, and now back at home I'm reading some of my own thoughts in your posts.
Anyway, I see other problems with a 10 year retention on a deduplication appliance (not only related to Data Domain):
- first, hardware support is usually 3 or 5 years long, and then we all upgrade by replacement. DD is no different. So why care about a storage appliance sized for 10 years of data when we will replace it well before that timeline?
- second, we are going to save VMs on it. Ten years ago we had Windows 2000, and 2003 was in beta. Now we have 2008 R2 and 2012. Thinking about 10 years of backups, how many bits would the oldest backup (imagine it full of Win2000 VMs) and the latest one (full of 2008 R2 VMs) have in common? Not that many, I think, so a lower deduplication ratio.
- third, you would need to keep the old backup software around to read the oldest backups, if at some point the binary files change their format.
Tom, you are right, I did not consider deduplication in my calcs; think of it as a "worst case" scenario. Incidentally, working with a DD competitor (ExaGrid), their sizing procedures usually produce bigger numbers like I did. For sure because they can sell a bigger appliance that way, but also so as not to give a customer a wrong sizing if the deduplication gets poor results. They usually say "we cannot know how much we can deduplicate without actually saving real data on our appliances; we only have estimates based on customer history". At least they can scale out if the sizing was wrong, but this is a different story.
In the end, data older than 2-3 years is better saved on tape. Also because, as time passes, the chances of accessing it become fewer and fewer, so the slow operations offered by tape are not an issue.
Luca Dell'Oca
Principal EMEA Cloud Architect @ Veeam Software
@dellock6
https://www.virtualtothecore.com/
vExpert 2011 -> 2022
Veeam VMCE #1
-
- Veteran
- Posts: 367
- Liked: 41 times
- Joined: May 15, 2012 2:21 pm
- Full Name: Arun
- Contact:
Re: Veeam, DataDomain and Linux NFS share
Firstly, thanks guys for all those really cool suggestions.
So I assume the final conclusion is that I cannot get rid of tape backups.
If this is the case, to make it simple and clear, could I confirm that the best and simplest way to have good backup retention in place is the following:
A sensible strategy would be to keep daily/weekly backups (those with shorter retention) on disk for the first month or two, with monthly and yearly backups kept on tape and held offsite. To provide protection from disaster, the daily and weekly backups on disk should ideally be replicated to a second site.
-
- VeeaMVP
- Posts: 6166
- Liked: 1971 times
- Joined: Jul 26, 2009 3:39 pm
- Full Name: Luca Dell'Oca
- Location: Varese, Italy
- Contact:
Re: Veeam, DataDomain and Linux NFS share
Wait, they are two different media for different goals: as the Data Domain guys told you, you can get rid of tapes for backups, that's for sure. If you need long term archives of those backups, then go for tapes. Don't take it as a hard rule, but I think 2-3 years is the boundary between the two media.
Hopefully in the future there will be other options for archives too; I would like to test AWS Glacier in conjunction with Veeam, stay tuned...
Luca Dell'Oca
Principal EMEA Cloud Architect @ Veeam Software
@dellock6
https://www.virtualtothecore.com/
vExpert 2011 -> 2022
Veeam VMCE #1
-
- Veteran
- Posts: 367
- Liked: 41 times
- Joined: May 15, 2012 2:21 pm
- Full Name: Arun
- Contact:
Re: Veeam, DataDomain and Linux NFS share
Just needed to have a better understanding of this.
Regarding the retention time in Veeam backups, the number of restore points I have specified to keep on disk is 120.
Does this mean each backup file will be overwritten 120 days from the time it was stored on the backup repository?
How would I specify a separate retention of:
daily backups - 30 restore points,
monthly - 1 year,
yearly - 10 years?
Doesn't this mean I need to create three separate jobs? If so, I don't understand the part where I need to specify the backup mode (why would I specify the backup modes in all three jobs backing up the same VMs?)
-
- Veeam Software
- Posts: 21139
- Liked: 2141 times
- Joined: Jul 11, 2011 10:22 am
- Full Name: Alexander Fogelson
- Contact:
Re: Veeam, DataDomain and Linux NFS share
zak2011 wrote: "the number of restore points i have specified to keep on disk is 120. Does this mean each backup file will be overwritten 120 days from the time it was stored on the backup repository?"
As you said, you specify the number of restore points, not days. That means that at any given moment in time Veeam B&R will maintain the specified number of restore points in the repository, regardless of their age. Of course, the retention period in days will equal the number of restore points in the case of an always successful daily job.
zak2011 wrote: "How would i specify a separate retention for daily backups - 30 restore points, monthly - 1 year, yearly - 10 years?"
There is an existing topic regarding this kind of retention, please review.
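The "restore points, not days" behaviour can be modelled in a few lines. This is a toy illustration of the counting, not Veeam's actual retention logic:

```python
from collections import deque

# Retention counts restore points, not days. If the job skips a few days,
# old points simply live longer on disk.
retention_points = 5  # the "restore points to keep on disk" job setting
chain = deque(maxlen=retention_points)

# Simulate 8 successful daily runs; each run adds one restore point and
# the oldest drops off once the limit is exceeded.
for day in range(1, 9):
    chain.append(f"restore-point-day-{day}")

print(list(chain))  # days 4..8 remain: exactly 5 points, regardless of age
```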
-
- Veteran
- Posts: 367
- Liked: 41 times
- Joined: May 15, 2012 2:21 pm
- Full Name: Arun
- Contact:
Re: Veeam, DataDomain and Linux NFS share
Thanks foggy for the reply. So this means that after 120 restore points, or 120 successful days of backups, are reached, at the 121st restore point any of the current VBKs or VIBs/VRBs will be overwritten, irrespective of when they were actually stored on disk.
Is this true?
Regarding the retention policy, I reviewed the post...
As Vitaliy has mentioned, "each job uses its own backup files, so if you create two jobs, you will have two chains of VBK + VRB/VIB files."
So if I have three jobs backing up the same VMs with different retention policies, I end up having three chains of VBK + VRB/VIB files.
I would specify one job to run daily,
one to run monthly,
one to run yearly,
but is there a way not to have the same VIBs/VRBs appear in the monthly and yearly jobs I run, since I already have them in the daily?
Thanks!
-
- Veeam Software
- Posts: 21139
- Liked: 2141 times
- Joined: Jul 11, 2011 10:22 am
- Full Name: Alexander Fogelson
- Contact:
Re: Veeam, DataDomain and Linux NFS share
In case of reversed incremental backup mode, Veeam B&R deletes the earliest restore point as soon as it falls out of the retention policy. Regarding the actual number of restore points on disk for forward incremental mode, please review this thread.
If you are interested in further details, kindly please continue posting in the corresponding topics (do not derail this thread). Thanks.
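For readers following along, the reversed-incremental behavior foggy describes can be pictured with a toy model (purely illustrative; file names and the pruning helper are my own, not Veeam internals): the chain is one full backup (VBK) at the head plus rollback files (VRBs) going back in time, so the earliest restore point can be deleted the moment it falls out of retention, simply by dropping the oldest VRB.

```python
def prune_reversed_chain(chain, keep):
    """chain: backup files newest first, e.g. ['vbk', 'vrb_6', ..., 'vrb_1'].
    Keeping the newest `keep` restore points means truncating the tail."""
    return chain[:keep]

# 7 restore points: today's full plus six rollbacks
chain = ["vbk"] + [f"vrb_{i}" for i in range(6, 0, -1)]
pruned = prune_reversed_chain(chain, keep=5)
assert pruned == ["vbk", "vrb_6", "vrb_5", "vrb_4", "vrb_3"]
```

This is why reversed incremental never overshoots the configured number of points, unlike forward incremental, which has to wait for the next full before an old chain can go.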
-
- Enthusiast
- Posts: 25
- Liked: never
- Joined: Mar 07, 2012 11:54 pm
- Full Name: Edward Iskra
- Contact:
Re: Veeam, DataDomain and Linux NFS share
dellock6 wrote: I would like to test AWS Glacier in conjunction with Veeam, stay tuned...
VERY interested in this.
-
- Veeam ProPartner
- Posts: 64
- Liked: 9 times
- Joined: Apr 26, 2011 10:18 pm
- Full Name: Tomas Olsen
- Contact:
Re: Veeam, DataDomain and Linux NFS share
Will Veeam in-line dedup generate reads against the repository, or is the dedup information stored in the SQL database or in a local cache on the proxy? I suspect reads will be performed.
-
- VP, Product Management
- Posts: 27377
- Liked: 2800 times
- Joined: Mar 30, 2009 9:13 am
- Full Name: Vitaliy Safarov
- Contact:
Re: Veeam, DataDomain and Linux NFS share
Hi Tomas, the hash information is stored in the repositories. Proxy servers do not have a local cache, and no read requests are performed against our SQL Server database.
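Vitaliy's answer can be sketched in a few lines (purely illustrative: the block size, hash algorithm, and hash-table shape below are my assumptions, not how Veeam actually implements its in-line dedup):

```python
import hashlib

BLOCK = 1024 * 1024  # illustrative block size; the real value depends on job settings

def dedupe_stream(data, seen_hashes):
    """Split the stream into fixed blocks and write a block only if its
    hash is new. seen_hashes models the hash table kept alongside the
    backup file in the repository (not on the proxy, not in SQL Server)."""
    stored = []
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            stored.append(block)
    return stored

seen = set()
written = dedupe_stream(b"\x00" * (3 * BLOCK), seen)  # three identical zero blocks
assert len(written) == 1  # only one copy is actually written
```

Since the table lives with the backup file, checking a hash is a lookup in repository-side state rather than a read of previously written backup data, which is consistent with Vitaliy's point that no SQL reads are involved.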
-
- Veeam ProPartner
- Posts: 64
- Liked: 9 times
- Joined: Apr 26, 2011 10:18 pm
- Full Name: Tomas Olsen
- Contact:
Re: Veeam, DataDomain and Linux NFS share
So when running Veeam with several proxies and several jobs at the same time, inline dedup should not be on, since dedupe appliances perform badly with a mix of reads and writes.
All best practice documents dealing with Veeam and Data Domain, both from Veeam and EMC, say no dedup and no compression. Almost all feedback from users in this forum also states that this gives the best performance when doing backup over the LAN. My experience from 15 to 20 customer installations also says that you get good performance when disabling both for backup over the LAN.
But Gostev wrote a FAQ saying that Veeam recommends dedup on and compression off when backing up to dedupe appliances:
http://forums.veeam.com/viewtopic.php?f ... 834#p39473
See the bottom of the first article.
-
- Veteran
- Posts: 1531
- Liked: 226 times
- Joined: Jul 21, 2010 9:47 am
- Full Name: Chris Dearden
- Contact:
Re: Veeam, DataDomain and Linux NFS share
I'd say the hash table is probably small enough to sit in the repository cache, not on the underlying CIFS share, but I could be wrong. From the customer installs I've seen, keeping Veeam dedupe on has had no effect on the processing rate of the DD boxes.
-
- Veeam ProPartner
- Posts: 64
- Liked: 9 times
- Joined: Apr 26, 2011 10:18 pm
- Full Name: Tomas Olsen
- Contact:
Re: Veeam, DataDomain and Linux NFS share
chrisdearden wrote: I'd say the hash table is probably small enough to sit in the repository cache, not on the underlying CIFS share, but I could be wrong. From the customer installs I've seen, keeping Veeam dedupe on has had no effect on the processing rate of the DD boxes.
I suppose that if you enable dedup, each proxy would benefit from more than one or two vCPUs because of the dedup processing. On a full backup or the first run of a job, a Veeam backup server that is acting as a proxy will utilize at least 8 CPUs at 100% if dedup is set to LAN and compression is set to best performance.
Wouldn't this be the case for each proxy as well?
If not, will the CPU restrict the throughput of the proxy?
-
- VeeaMVP
- Posts: 6166
- Liked: 1971 times
- Joined: Jul 26, 2009 3:39 pm
- Full Name: Luca Dell'Oca
- Location: Varese, Italy
- Contact:
Re: Veeam, DataDomain and Linux NFS share
CPU is by far the primary bottleneck of a dedicated proxy. I did several tests with 6.0 when it came out, running the same job in active full mode over and over while changing parameters. With a single job running (thus saving one VM after another) I saw performance improvements when adding up to 4 vCPUs, while memory was no longer an issue after reaching 2 GB. All on Windows 2008 R2 Standard Core.
About DD, yes, the papers say to disable both dedup and compression. If you have no problems with LAN traffic towards the DD, this configuration will certainly be the fastest at executing a given job, since source dedup or compression will obviously add some latency to the job. They can be enabled if your traffic is congested, so you have less data travelling on the LAN.
Luca.
Luca Dell'Oca
Principal EMEA Cloud Architect @ Veeam Software
@dellock6
https://www.virtualtothecore.com/
vExpert 2011 -> 2022
Veeam VMCE #1
-
- Veeam ProPartner
- Posts: 64
- Liked: 9 times
- Joined: Apr 26, 2011 10:18 pm
- Full Name: Tomas Olsen
- Contact:
Re: Veeam, DataDomain and Linux NFS share
chrisdearden wrote: I'd say the hash table is probably small enough to sit in the repository cache, not on the underlying CIFS share, but I could be wrong. From the customer installs I've seen, keeping Veeam dedupe on has had no effect on the processing rate of the DD boxes.
I would say that backup is one thing, but restore is somewhat different. All we have talked about is backup performance; no one has mentioned restore times. Doing backups with dedup and/or compression in Veeam with Data Domain as a repository will probably increase restore times, especially if you are trying to do a SureBackup. If the data has to be rehydrated from two layers of dedup and perhaps decompressed as well, this will increase the recovery time, and SureBackup will probably not run at all. There are already issues doing SureBackup of large VMs with Data Domain WITHOUT dedup and compression enabled.
But if your goal is to do the fastest backup or to use as little storage as possible, you might win the race.
If your goal is also to be able to do fast restores, the story is somewhat different.
-
- VP, Product Management
- Posts: 6035
- Liked: 2860 times
- Joined: Jun 05, 2009 12:57 pm
- Full Name: Tom Sightler
- Contact:
Re: Veeam, DataDomain and Linux NFS share
Actually, enabling Veeam dedupe/compression makes SureBackup faster when running from Data Domain, because fewer blocks have to be rehydrated by the DD, and the DD is far slower at rehydrating individual blocks than we are. It also improves restore times, even for SureBackup, because less data has to be rehydrated from the DD and less is sent across the network. Note that I'm not suggesting this is a good idea; buying a dedupe appliance and sending it compressed data that can't be deduped isn't very practical.
However, as I've explained at length in other posts, our dedupe acts at a completely different level than DD dedupe and adds relatively little overhead, while yet again reducing the amount of data that has to cross the wire, which speeds up both backups and restores. Compression is a completely different matter: using Veeam compression and sending the data to a dedupe appliance is faster, for both backups and restores, but pretty much kills any ability of the DD to provide reasonable dedupe savings, unless you use the "Dedupe Friendly" compression.
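Tom's point about per-stream compression defeating appliance dedupe can be demonstrated with a toy example (a deliberate simplification: the fixed 4 KB block hashing below stands in for the appliance, which in reality uses variable-length chunking, and the file contents are made up):

```python
import hashlib
import zlib

BLOCK = 4096  # illustrative fixed block size for the "appliance"

def block_hashes(data):
    """Hash fixed-size blocks the way a simplified dedupe appliance might."""
    return {hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)}

shared = (b"0123456789abcdef" * 256) * 16   # 64 KB of data common to both streams
file_a = b"A" * BLOCK + shared              # same payload behind different headers
file_b = b"B" * BLOCK + shared

# Uncompressed: the shared region produces identical, aligned blocks,
# so the appliance can store that data once.
raw_common = block_hashes(file_a) & block_hashes(file_b)
assert len(raw_common) >= 1

# Compressed per stream: the two compressed outputs are different byte
# sequences, so block hashing finds nothing in common to dedupe.
zip_common = block_hashes(zlib.compress(file_a)) & block_hashes(zlib.compress(file_b))
assert len(zip_common) == 0
```

The same duplicated payload that deduped perfectly in raw form shares no blocks after compression, which is exactly why the general guidance is to let the appliance do the reduction, or at most use a dedupe-friendly compression mode.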
-
- Novice
- Posts: 5
- Liked: never
- Joined: Sep 06, 2013 5:55 am
- Full Name: René Schärer
- Contact:
Re: Veeam, DataDomain and Linux NFS share
We use a Data Domain DD670 as an NFS target.
We connect the NFS share on the Data Domain through a Linux VM. During the PoC we also tried CIFS, but it was slow; NFS is really fast.
But we can't use "Instant Recovery" or "SureBackup" because the Data Domain is too slow at reading directly.
Is there a tuning guide around?
(We know the whitepapers from Veeam and Data Domain/EMC.)
-
- Veeam Software
- Posts: 21139
- Liked: 2141 times
- Joined: Jul 11, 2011 10:22 am
- Full Name: Alexander Fogelson
- Contact:
Re: Veeam, DataDomain and Linux NFS share
Not sure you can tune anything here. What you could do is use some other storage as the primary, storing the latest backups for operational restores (Instant Recovery) and recoverability checks (SureBackup), while sending backups for archival to the DD using backup copy jobs.