-
- Veeam ProPartner
- Posts: 300
- Liked: 44 times
- Joined: Dec 03, 2015 3:41 pm
- Location: UK
- Contact:
DataDomain replication performance issue
I've been using a pair of EMC DD2500 units with DDBoost since Veeam was installed last year.
Backup Copy jobs are saved to the first device, then DD replication (not Veeam) is used to create another copy on the second DD2500 at a remote site.
This has been working well since installation. There's often a few TB (pre-compression) remaining when I check on a morning, but this is usually cleared by the afternoon.
Occasionally, the replication lag spans >24 hours before clearing.
For over a week now, the size of the replication set has been building up and up - and shows no sign of reducing.
Currently it stands at >400TB (pre-compression).
As the replication is a purely DD -> DD operation, with no Veeam involvement - I wouldn't usually post here.
But the sudden drop in replication performance occurred immediately after our Veeam v8 -> v9 upgrade.
So before opening a support call with EMC, I was just wondering if anyone had noticed similar behaviour after their upgrade.
I've noticed some settings that look a little different in the backup job configurations, e.g.:
"Local Target (legacy 8MB block size)" Legacy storage optimization setting left by upgrade process. Please switch to another setting, and initiate Active Full
Could there be a change in the way the backups are stored on the primary DD device post-v9, which would affect replication to the other?
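To put that number in context: a rough back-of-envelope sketch (Python, with assumed round figures - a notional 10x reduction on the wire and the ~100Mbps sustained WAN throughput mentioned later in the thread) shows why a backlog of this size effectively never drains:

```python
# Back-of-envelope: time to drain a replication backlog.
# Assumed, illustrative numbers - not measurements from this system.
backlog_tb_precomp = 400      # pre-compression backlog (TB)
wire_reduction = 10           # assumed dedupe/compression factor on the wire
wan_mbps = 100                # assumed sustained replication throughput

backlog_bits = backlog_tb_precomp / wire_reduction * 10**12 * 8
days = backlog_bits / (wan_mbps * 10**6) / 86400
print(f"~{days:.0f} days to clear, before counting new daily backups")
# If daily pre-comp growth exceeds what a day of replication can move,
# the backlog rises without bound - which is what the graphs showed.
```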
-
- Veeam ProPartner
- Posts: 300
- Liked: 44 times
- Joined: Dec 03, 2015 3:41 pm
- Location: UK
- Contact:
Re: DataDomain replication performance issue
This graph, from the DD System Manager, shows the jump in size after v9 was installed.
I could understand a one-day jump from the post-upgrade test backups we took, but the increase seems to be consistently bigger.
-
- Enthusiast
- Posts: 63
- Liked: 13 times
- Joined: Jul 21, 2016 5:03 pm
- Full Name: Aaron B
- Contact:
Re: DataDomain replication performance issue
Thought I might throw some hints your way. The first is to check your NIC speeds on the DD. Ensure that you are running at the full 1Gb - or 10Gb if you're a lucky bastard. I found my replication started falling behind to the point it would never have synced, and it was due to the NICs being bonded and stuck at 100Mb.
The other thought is to make sure you are following the DD best practice guide - a few things have changed in Veeam 9. Also check your historical stats and see if you are still getting good dedupe rates and compression. If those are going down, it more than likely means a bad setting in Veeam, which in turn causes more data to be sent.
https://www.veeam.com/kb1956
Lastly, on the settings note, I wonder if using "Use per-VM backup files" may be causing a difference. This makes a backup file for each VM when it's backed up. It shouldn't be an issue, but you might want to experiment with that.
Hope this helps a little.
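To make the "check your historical stats" step concrete, here's a minimal sketch of the comparison, with placeholder numbers; on the DD the pre-comp/post-comp figures come from the System Manager charts (or, if I recall the CLI correctly, `filesys show compression`):

```python
# Minimal sketch: compare overall data reduction before vs. after the upgrade.
# All figures below are placeholders, not values from this environment.
def reduction(pre_comp_tb: float, post_comp_tb: float) -> float:
    """Overall reduction factor the DD achieved (pre-comp / post-comp)."""
    return pre_comp_tb / post_comp_tb

before = reduction(pre_comp_tb=60.0, post_comp_tb=3.0)   # e.g. 20x pre-upgrade
after = reduction(pre_comp_tb=58.0, post_comp_tb=4.5)    # e.g. ~13x afterwards

print(f"before ~{before:.0f}x, after ~{after:.0f}x")
if after < 0.7 * before:
    # A falling ratio usually means Veeam is feeding the DD less dedupable
    # data (compression/encryption on, or a block-size change), so more
    # unique data lands on disk and more has to be replicated.
    print("reduction dropped -> review Veeam job storage settings")
```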
-
- Veeam ProPartner
- Posts: 300
- Liked: 44 times
- Joined: Dec 03, 2015 3:41 pm
- Location: UK
- Contact:
Re: DataDomain replication performance issue
Thanks for the reply.
I'm pretty sure we're following most of the best practice. We haven't switched on the per-VM backup file option yet, although it's something I'd like to do once this issue is resolved.
The only setting I'm not sure of is the number of concurrent tasks for the DD repos.
EMC recommend setting it to half of the maximum your DD model can deliver - but I can't find any published values of what that should be for a DD2500.
NIC speed is set to 1Gbps, but our network team mentioned the DDBoost replication was peaking at around 100Mbps over the WAN - and this amount hasn't changed before or after the backlog appeared.
For info, I've opened support calls with both Veeam and EMC.
Veeam support pointed to the resultant size of the Backup Copy Jobs, which appear not to have changed since the upgrade.
On the DD, the post-compression sizes are also similar either side of the upgrade.
The only thing to have changed is the pre-compression (and replication size), when it's first written to the disk.
One interesting point to note is that EMC only supports DD OS v5.6 - v5.7 with Veeam v9, while Veeam support permits a wider OS range - v5.4 - v5.7.
So our next step is to upgrade the DDs to v5.6, which may align better with the DDBoost v3 compatibility of Veeam v9.
At the very least, it will allow us to receive further EMC support.
-
- Veeam Software
- Posts: 649
- Liked: 170 times
- Joined: Dec 10, 2012 8:44 am
- Full Name: Nikita Efes
- Contact:
Re: DataDomain replication performance issue
Could you please check whether Compact Full is scheduled for your backup copy jobs?
If not, the full backup file will grow over time, thus increasing replication time (this looks like the reason your pre-comp size increased).
Be aware that Compact is a time-consuming operation and puts a certain load on your device, so it should not be scheduled at the same time for all your backup copy jobs.
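A toy model of that growth, with made-up rates, just to illustrate the mechanism (in forever-incremental chains, merges leave dead blocks in the .vbk that only a compact reclaims):

```python
# Toy model: an un-compacted full in a forever-incremental chain.
# Rates are illustrative assumptions, not taken from any real job.
full_tb = 10.0           # logical size of the full backup file
daily_change_tb = 0.5    # data merged into the full each day
dead_fraction = 0.6      # share of merged-out blocks left as dead space

size_tb = full_tb
for day in range(1, 31):
    size_tb += daily_change_tb * dead_fraction   # dead blocks accumulate
    if day % 10 == 0:
        print(f"day {day:2d}: .vbk ~{size_tb:.1f} TB "
              f"(a compact would shrink it back toward ~{full_tb:.0f} TB)")
# A steadily growing full means a steadily growing pre-comp dataset for
# the DD to replicate, even when the daily change rate itself is constant.
```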
-
- Veeam ProPartner
- Posts: 300
- Liked: 44 times
- Joined: Dec 03, 2015 3:41 pm
- Location: UK
- Contact:
Re: DataDomain replication performance issue
Thanks for the reply.
The Compact full backup option is greyed out, with an alert:
"Maintenance is not required when periodic full backups are enabled"
We use a split strategy: 30 restore points of forever-incremental on Tier 1, and GFS (7 days, 4 weeks, 18 months) on the Tier 2 DD.
Going to upgrade the DD OS tomorrow.
Interestingly, while investigating the settings I found a couple of backup copy jobs that had been disabled because of a sync issue with the DD repository.
It recommended doing a manual Repository Rescan, then re-enabling the job.
I've done that, and everything is OK again. I don't know if that would have caused any write issues.
-
- Veteran
- Posts: 354
- Liked: 73 times
- Joined: Jun 30, 2015 6:06 pm
- Contact:
Re: DataDomain replication performance issue
Block size changed a bit with v9 - what size are you writing with? In v8 we noticed cranking it up ("1TB Larger Files") yielded much faster restores; I've left it that way in v9, though I think the block size was halved. I would also recommend enabling per-VM chains, for a lot of reasons. One is that dedupe devices might be able to better dedupe blocks of individual VM files vs. large consolidated files. I've left Veeam dedupe on and set compression to none in my jobs. Overall, in v8 we saw real-world savings of ~86-88% with our DDs and Dell DRs. After v9, once all our v8 restore points had purged away, we've seen ~90% space savings on both devices. I disabled limiting concurrent tasks, since DDs can handle somewhere around 150-300 concurrent connections - you'd have to enable per-VM chains, have a lot of concurrent jobs, and a lot of VMs firing off all at once to saturate that. The default in Veeam I think is four?
Have your network guys monitor and report DD -> DD traffic; it should be saturating your WAN link if it's that far behind. If traffic is slim to none, make sure your DD replication throttling didn't get turned on, your network guys didn't accidentally QoS down your replication traffic, etc.
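On the block-size point, here's a toy sketch (Python, invented write pattern) of why bigger blocks inflate the pre-comp size of an increment: every block containing even one changed sector must be shipped whole. The setting-to-size mapping is, as I recall, the standard Veeam one (v9 halved the 16TB+ size from 8MB to 4MB, hence the "legacy 8MB" label the OP saw):

```python
# Toy model: incremental size vs. Veeam storage optimization block size.
# A block must be written in full if any sector inside it changed.
import random

BLOCK_SIZES = {                       # data block size per setting
    "WAN target": 256 * 1024,
    "LAN target": 512 * 1024,
    "Local target": 1024 * 1024,
    "Local target (16TB+), v9": 4 * 1024 * 1024,
    "Local target (legacy 8MB)": 8 * 1024 * 1024,
}

random.seed(1)
disk_size = 500 * 2**30               # 500 GiB virtual disk
writes = [random.randrange(disk_size) for _ in range(20_000)]  # scattered writes

for name, bs in BLOCK_SIZES.items():
    touched = {offset // bs for offset in writes}    # distinct blocks hit
    print(f"{name:27s} -> ~{len(touched) * bs / 2**30:5.1f} GiB incremental")
# Scattered small writes drag far more unchanged data along with an 8MB
# block than a 256KB one, so pre-comp increment size (and replication
# workload) grows with block size even when the real change is identical.
```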
VMware 6
Veeam B&R v9
Dell DR4100's
EMC DD2200's
EMC DD620's
Dell TL2000 via PE430 (SAS)
-
- Novice
- Posts: 3
- Liked: never
- Joined: Nov 22, 2011 2:52 pm
- Full Name: Matthias Barre
- Contact:
Re: DataDomain replication performance issue
Hello,
What is the solution for this? The OS update for the DD?
Regards barresi
-
- Veteran
- Posts: 354
- Liked: 73 times
- Joined: Jun 30, 2015 6:06 pm
- Contact:
Re: DataDomain replication performance issue
Your graph is showing that you're writing a LOT of dedupable/compressible data to the DD - check your job settings to make sure Veeam is doing some of the dedupe and compression work (Enable inline deduplication, Compression level in Job -> Advanced -> Storage tab). Storage optimization might have an effect, but I set mine to the largest size to help with restores, and it certainly seems to. Smaller block sizes yielded EXCRUCIATINGLY slow restore times; once I cranked it up they became usable. That was v8 - I haven't experimented in v9 to see what difference block size makes. Ours is all still working well: our dedupe devices keep up, compress well, and restores are acceptable, so I try to leave well enough alone.
Try having Veeam take some of the load off the DD if you aren't already. And I can confirm that when two DDs suddenly have data to replicate, they will drive up their link to each other - ours saturates a 500Mb link after a backup job has run.
VMware 6
Veeam B&R v9
Dell DR4100's
EMC DD2200's
EMC DD620's
Dell TL2000 via PE430 (SAS)
-
- Veeam ProPartner
- Posts: 300
- Liked: 44 times
- Joined: Dec 03, 2015 3:41 pm
- Location: UK
- Contact:
Re: DataDomain replication performance issue
Sorry - I've been away for a few weeks and forgot about this thread.
This issue still isn't resolved, and I have quite a bit to post later.
Before then, could someone give a quick answer to the following:
If you have Direct Attached Storage for your Primary Backups, and from there take Backup Copy Jobs to Data Domains - which Storage Optimization is recommended?
I've found advice for 'Local Target 16TB+' for dedupe appliances, and 'Local Target' for DAS - but they both have to be the same for Backup Copy jobs. So which is the preferred option?
-
- Veeam Software
- Posts: 21139
- Liked: 2141 times
- Joined: Jul 11, 2011 10:22 am
- Full Name: Alexander Fogelson
- Contact:
Re: DataDomain replication performance issue
The recommendation in this case is to have the 'Local Target 16TB+' setting in the original job, since the backup copy will then use the same block size to store data on the dedupe device. This is, however, not a "hard" recommendation: a smaller block ('Local target' setting) will result in a slightly slower backup, but some types of restores will be faster, especially in v9.5.
-
- Lurker
- Posts: 2
- Liked: never
- Joined: Oct 27, 2016 4:21 pm
- Full Name: Victor Brown
- Contact:
Re: DataDomain replication performance issue
Hi, I was wondering if a solution was found to this problem, as we have been experiencing exactly the same issues straight after our environment was upgraded from v8 to v9, while also using a pair of EMC DD2500 units + DDBoost.
-
- Veeam ProPartner
- Posts: 300
- Liked: 44 times
- Joined: Dec 03, 2015 3:41 pm
- Location: UK
- Contact:
Re: DataDomain replication performance issue
I'm away from the office this week - but just noticed an e-mail from EMC support attempting to close the original support call - for the second time.
I'm absolutely no further on this.
We've raised the WAN link speed from 10Mbps to 1Gbps - that seemed to provide a temporary speed burst, but the comms link is nowhere near saturated and we're back to having a 100-200TB backlog.
The support call went round a few different EMC engineers and departments - but they kept insisting it was purely a replication issue and suggesting tweaks, when I see the issue as occurring before that, with the replication lag a symptom.
The day after the Veeam v8 to v9 upgrade, the pre-compression data (and resulting compression rate) rocketed up, and it hasn't come down since. Even if I turned off replication altogether, I can't see those values changing.
They've provided me with graphs showing gentle linear increases of various counters over the last year - while consistently ignoring the graph above with the massive one-day change.
I couldn't keep the configuration static for all this time - so in the last few weeks I've started rearranging our Veeam backups and amended some of the configs. We now have per-VM backups, compression, in-line dedupe, a smaller block size, and fewer, higher-density jobs. None seem to have made a difference to the DD performance - they were more to yield better Tier 1 DAS efficiency.
If you find anything out - I'd be grateful if you let me know, as we're no further forward. The Veeam 9 upgrade has been successful in every other way, but the DD2500 performance has worsened dramatically.
-
- Lurker
- Posts: 2
- Liked: never
- Joined: Oct 27, 2016 4:21 pm
- Full Name: Victor Brown
- Contact:
Re: DataDomain replication performance issue
Hi, I have been informed by EMC support that there is a problem with Data Domain replication when using Veeam version 9. There has been a change in Veeam version 9 in the way it keeps base file relationships, which breaks the Data Domain VSR (Virtual Synthetic Replication) capability - it is essentially turned off during Data Domain replication. This means that Veeam backups will take longer to replicate from version 9 onwards (on DD).
I have also been informed that EMC have investigated this behaviour, and their conclusion is that it needs to be corrected on the Veeam software side, not in the DDOS code.
The only recommendation to counter this has been to create additional MTrees, with a replication context for each additional MTree created. I have done this and it has helped, but not completely resolved the problem; we still suffer replication lag, but nowhere near the levels we were experiencing.
Hope this helps.
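A minimal model of why the extra MTrees help, assuming (purely for illustration) that each replication context tops out at a fixed effective rate well below the link:

```python
# Minimal model: aggregate replication rate vs. number of MTree contexts.
# Both caps below are illustrative assumptions, not DD specifications.
WAN_MBPS = 1000          # link capacity
PER_CONTEXT_MBPS = 120   # assumed effective ceiling per replication context

def aggregate_mbps(num_contexts: int) -> float:
    """Total rate: contexts scale out until the WAN itself saturates."""
    return min(WAN_MBPS, num_contexts * PER_CONTEXT_MBPS)

for n in (1, 2, 4, 6, 8, 10):
    print(f"{n:2d} context(s): ~{aggregate_mbps(n):4.0f} Mbps")
# A single context never gets near link speed - matching the observation
# that the WAN stays unsaturated while the backlog keeps growing.
```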
-
- Veeam ProPartner
- Posts: 300
- Liked: 44 times
- Joined: Dec 03, 2015 3:41 pm
- Location: UK
- Contact:
Re: DataDomain replication performance issue
Victor - that's a great help.
Almost none of the above has been passed on to us by EMC support - apart from a comment in the support call close request a couple of days ago, that we might want to consider using additional MTrees to allow more replication streams.
That's very useful information.
Do you know if EMC have published a KB article on the issue?
Anyone from Veeam like to comment on the issue? I suppose it's too late for v9.5 ....
-
- Product Manager
- Posts: 8191
- Liked: 1322 times
- Joined: Feb 08, 2013 3:08 pm
- Full Name: Mike Resseler
- Location: Belgium
- Contact:
Re: DataDomain replication performance issue
Hi Ferrus,
I have no information on this at the moment, but I will try to contact some people to find out more. I will get back to you as soon as possible.
Mike
-
- Veeam ProPartner
- Posts: 300
- Liked: 44 times
- Joined: Dec 03, 2015 3:41 pm
- Location: UK
- Contact:
Re: DataDomain replication performance issue
Thanks for that. I'm going to hold off on creating any new MTrees for now.
Let me know if we can be of any assistance.
Our performance still reflects the graph on the previous page.
-
- Product Manager
- Posts: 8191
- Liked: 1322 times
- Joined: Feb 08, 2013 3:08 pm
- Full Name: Mike Resseler
- Location: Belgium
- Contact:
Re: DataDomain replication performance issue
Ferrus,
It was discussed with DEV and QC today, and unfortunately it seems there is indeed something with the base file relationships that causes the Data Domain VSR capability to break. So yes, this means that DD replication takes longer at this point in time. There is a discussion going on right now between Veeam and EMC to see where a potential solution can be defined, but I'm afraid it might take some time.
Sorry - but I hope it at least helps a bit in your current struggle to know that we are aware of it.
M/
-
- Veeam ProPartner
- Posts: 300
- Liked: 44 times
- Joined: Dec 03, 2015 3:41 pm
- Location: UK
- Contact:
Re: DataDomain replication performance issue
Mike
That's actually great news.
We've had a call in with EMC since July, which has been frustrating us more and more.
At the back of our minds we suspected it was a Veeam issue, or an incorrect configuration in our estate. We put the call in to them to help us identify the source of the problem.
Their response so far has been to deny that a problem exists with our replication, and to try to prove that everything is the same as it ever was - despite us demonstrating that, in the hours following the Veeam upgrade, we went from replication completing cleanly an hour or two after the copy jobs finished, to a constant backlog of between 200-700TB for the last 3-4 months.
Last week they closed the call (again), with no mention of this issue.
Now we know there's a problem, and it's being investigated - the pressure is off us slightly.
We know the replication worked fantastically in Veeam 8 - with 10x less bandwidth - so I'm hoping that can return.
Let us know if you need access to our estate for any testing, etc.
-
- Product Manager
- Posts: 8191
- Liked: 1322 times
- Joined: Feb 08, 2013 3:08 pm
- Full Name: Mike Resseler
- Location: Belgium
- Contact:
Re: DataDomain replication performance issue
Ferrus,
Thank you for your kind offer... We might take you up on it (me or a product manager will PM you).
Thanks
Mike
-
- Enthusiast
- Posts: 63
- Liked: 13 times
- Joined: Jul 21, 2016 5:03 pm
- Full Name: Aaron B
- Contact:
Re: DataDomain replication performance issue
I had the same issue, and here is the solution that worked for me.
I worked with a guy from EMC support named Joseph. He is US Level 3, Sev 1 support - the big guns they call out when everyone else is stuck (getting ready to make a big purchase with EMC always helps when needing support). Anyway, I was also having replication issues after going to Veeam 9. I'm not sure what changed in Veeam to cause it, but we figured out a solution with Joe's help and fixed it.
***Our Solution***
We found that there are only so many streams on a DD - which we knew - but what we did not know is that there is a limit to how many streams a single MTree can use, which I couldn't believe no one in lower support knew. You can have 196 streams on a DD like our new DD4200, but each MTree has a limit on how many it can use for replication. So when you hit that limit, it is capped and will not go any faster - you never fully saturate your link or push it to what it can do.
We had one giant MTree for Veeam, as we have other backup products. What Joe did was create 6 more MTrees. He then set up DDBoost for these MTrees, and we worked to spread our jobs between them. We disabled each job and used fast copy on the DD to move the data over to the new MTree. Once that was done, we created new repositories for the data in the new MTrees and, in each job, changed the path to point to the new repository. We then enabled the job, ran it, and ensured all was OK. Finally, we set up new replication contexts for the new MTrees.
This freaks Veeam out a little and puts the old backup directory and job in Disk (Imported), but that's OK - because fast copy makes a mirror copy of the data, Veeam thinks it's old data. If you look at the job, though, it has all the data in the new location. I gave it a good week and then deleted the old job from Disk (Imported); Joe said it was always wise to give it a few days. Then I went through and deleted each old repository where it was not being used by anything else.
By doing this my replication has not had an issue.
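For anyone repeating this, here's a small sketch of the job-spreading step - a greedy balance of copy jobs across new MTrees by size. Job names, sizes and the MTree count are all made up for illustration:

```python
# Sketch: balance backup copy jobs across several MTrees so no single
# replication context carries most of the data. All inputs are made up.
import heapq

jobs_tb = {"Exchange": 14.0, "SQL": 9.5, "FileServers": 7.0,
           "VDI": 4.2, "Infra": 2.1, "DMZ": 1.4}

def spread(jobs: dict, num_mtrees: int) -> list:
    """Largest job first onto the currently least-loaded MTree."""
    heap = [(0.0, i, []) for i in range(num_mtrees)]  # (load, index, jobs)
    heapq.heapify(heap)
    for name, size in sorted(jobs.items(), key=lambda kv: -kv[1]):
        load, idx, assigned = heapq.heappop(heap)
        heapq.heappush(heap, (load + size, idx, assigned + [name]))
    return sorted(heap, key=lambda t: t[1])

for load, idx, names in spread(jobs_tb, 3):
    print(f"mtree-veeam-{idx + 1}: {load:4.1f} TB -> {', '.join(names)}")
```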
-
- Veeam ProPartner
- Posts: 141
- Liked: 26 times
- Joined: Oct 12, 2015 2:55 pm
- Full Name: Dead-Data
- Location: UK
- Contact:
Re: DataDomain replication performance issue
Just for reference:
We have two DD4200s with twelve source MTrees and twelve mirrors offsite on a DD4500, plus two DD2200s with six MTrees mirrored offsite to the same DD4500.
This was by design, as we were aware of replication streams per MTree being a potential bottleneck - and if DD4x00s support ~128 MTrees, why would you stick everything in a single DDBoost container?
We also intend to periodically create additional MTrees, as we foresee potential issues with extended resync or recovery times as MTrees grow.
The Veeam backup system was built on v9 and so far has 458TB of source VM data backed up and replicated offsite - a notional 6PB of pre-compression restore points.
So far, replication is performing in line with WAN link speeds, which range from 100Mbps through 1Gbps to 10Gbps.
The DDOS code base is a mix of 5.6.0 through 5.7.1.
-
- Influencer
- Posts: 13
- Liked: 2 times
- Joined: Jan 08, 2015 3:56 pm
- Full Name: Neal
- Contact:
Re: DataDomain replication performance issue
We had a very similar problem after upgrading to v9 - I believe we were one of the first support tickets on this issue with EMC. Something did indeed change with the way Veeam was marking data in v9. It resulted in our "pre compressed bytes sent" value on MTree replication bloating incredibly - something like a petabyte needed to be sent before the appliances would be in sync, which would never happen over a 1Gb link. Given the size of our environment, it would have been mathematically impossible to have that volume of data to replicate.
Unfortunately, our only option for quick resolution was to wipe the file system on our remote DD appliance, bring it back to our primary site, and perform a collection replication to resync locally.
There were some new settings in the backup jobs that weren't present prior to v9 which Veeam recommended we have enabled, while EMC said the opposite was true and those should *not* be enabled. I don't recall specifically what those were, but our EMC SR# was 80283034 and the Veeam SR# was 01802332, and you may be able to reference those cases for details. About a day after EMC recommended we change our job settings to THEIR best practices, the Veeam "best practices" article was updated to reflect that.
This was a huge PITA for us - hopefully you can find a resolution that doesn't involve dropping your remote DD's data. I'm still a little salty about it.
-
- Novice
- Posts: 6
- Liked: 2 times
- Joined: Apr 28, 2015 2:12 pm
- Full Name: Jeff White
- Contact:
Re: DataDomain replication performance issue
We have two DD990s. In April this year, we upgraded the DDOS from v5.5.2 to v5.7.1.10.
At the same time, we upgraded Veeam from v8 to v9.
The DD started alerting with replication lag messages about a week later. At one point, the lag was in excess of 100TB.
EMC said it was a Veeam issue; Veeam said it was an EMC issue.
To resolve it, Veeam recommended that we change the backup jobs' 'Advanced Settings / Maintenance' options. We checked 'Defragment And Compact Full Backup File' and run this weekly on selected days. Gradually the lag count dropped and eventually cleared. It has recently re-appeared, but this may be down to some huge file system backups we have started running (20TB plus).
-
- Product Manager
- Posts: 8191
- Liked: 1322 times
- Joined: Feb 08, 2013 3:08 pm
- Full Name: Mike Resseler
- Location: Belgium
- Contact:
Re: DataDomain replication performance issue
Hi Neal,
First, I'm very sorry to hear about the problems you've had. It is indeed a PITA. Thanks for the additional information - we can use all of it to investigate. Really appreciated!
Mike
-
- Influencer
- Posts: 13
- Liked: 2 times
- Joined: Jan 08, 2015 3:56 pm
- Full Name: Neal
- Contact:
Re: DataDomain replication performance issue
To correct my statement above: it was the "pre-compressed bytes remaining" metric (not "pre compressed bytes sent") that was skewed when we monitored replication stats.
Thanks Mike, looking forward to finding out what the fix is, even though we're no longer affected by it.
-
- Veeam ProPartner
- Posts: 300
- Liked: 44 times
- Joined: Dec 03, 2015 3:41 pm
- Location: UK
- Contact:
Re: DataDomain replication performance issue
Thanks to everyone who has posted. The symptoms everyone's reporting match ours completely.
Unfortunately, I don't think we got through to the best support people at EMC!
Seems there's a lot more of us with the issue now, than when I first posted this thread. With a bit of hope, this might lead to a speedier resolution.
--------------------------
Just wondering if there's anything in the causes of the problem that might also affect DD Health Check performance?
I know it's a bit of a stretch, but since upgrading we've also had much longer monthly Health Check times.
I initially put this down to the increases in our job sizes rather than anything to do with the upgrade, but it's becoming a real problem.
Our largest copy job, the Exchange backup, is around 14TB in size. The Health Check against the fibre-connected DD2500 is now well into its 8th day.
We've lost all those daily copy jobs, and two weeklies, while it's been running.
The further it progresses, the slower it appears to get as well. It's still only at 90% at the moment, and I don't think it has increased by a percent today.
At the current rate, this would mean losing half of every month, just for the previous two weeks' health check.
I'd be grateful for any advice.
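For what it's worth, even a crude extrapolation from the progress figures above shows how bad the tail is (the two sample points below are rough guesses around the numbers in this post):

```python
# Rough extrapolation of a crawling health check from two progress samples.
# The sample points are approximations of the figures quoted above.
def eta_days(day_a: float, pct_a: float, day_b: float, pct_b: float) -> float:
    """Linearly extrapolate remaining days from the most recent rate."""
    rate = (pct_b - pct_a) / (day_b - day_a)   # percent per day
    return (100.0 - pct_b) / rate

remaining = eta_days(day_a=7, pct_a=89, day_b=8, pct_b=90)
print(f"~{remaining:.0f} more days at the current rate")   # ~10 days
# At under a percent a day - and slowing - the last 10% alone costs more
# than a week, which is how a 14TB check swallows half the month.
```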
-
- Influencer
- Posts: 13
- Liked: 2 times
- Joined: Jan 08, 2015 3:56 pm
- Full Name: Neal
- Contact:
Re: DataDomain replication performance issue
Is it because the CPU is busier than normal, trying to crunch the replication delta between your appliances caused by the bloated "pre-compressed bytes remaining" value? I noticed the weekly data reclamation task was taking much longer than it previously did while this issue was affecting us, because the CPU was trying to compare the "new" [erroneous] data with what was on the destination appliance.
-
- Veeam ProPartner
- Posts: 300
- Liked: 44 times
- Joined: Dec 03, 2015 3:41 pm
- Location: UK
- Contact:
Re: DataDomain replication performance issue
So after 330+ hours of the Exchange backup job Health Check (just under 14 days), the operation was stopped at 99% by a Windows Update automatic reboot.
Regardless of how it ended, it's unworkable to have a Health Check that runs that long. It's held back 13 nightly Backup Copies, and 2 or 3 archived weekly/monthly restore points, of some of our most important data.
I took a look at the CPU on the DD. I'm not sure what a normal benchmark figure is, but it certainly wasn't using all of the processor capacity.
Not sure if this is related to the replication performance issue - or even if the issue is Veeam or EMC.
Can anyone from Veeam provide any pointers?
-
- Influencer
- Posts: 13
- Liked: 2 times
- Joined: Jan 08, 2015 3:56 pm
- Full Name: Neal
- Contact:
Re: DataDomain replication performance issue
What model and number of disks do you have? I don't recall specifically the CPU utilization of ours when we were dealing with this issue, but unless you have a large number of disks in the appliance, I could see a scenario where the capabilities of the drives would max out well before the CPUs hit max utilization.