Discussions specific to the VMware vSphere hypervisor
mbreitba
Enthusiast
Posts: 85
Liked: 8 times
Joined: Jun 11, 2012 3:17 pm
Contact:

Replication Target performance

Post by mbreitba » Jun 28, 2012 9:20 pm

Just curious as to what people are seeing for replication target performance. I'm running an EMC VNX 5300 (loaded w/ FAST cache, 2TB NL-SAS and 15k 600GB SAS) as the source of the replication, replicating over a 1Gbit backbone link to a separate datacenter. Remote datacenter has a VNX 5300 w/ 200GB FAST cache, and target datastore is comprised of 2TB NL-SAS drives.

When I'm doing replications, I'm seeing some huge amounts of IO on the target storage system (700-1000 read and write IOPS) while there is virtually nothing going on at the source end. Changed block tracking is enabled, and ESX hosts are Cisco UCS B-series blades. Is this normal? I've opened a ticket w/ Veeam (5201543) but have not heard back. I would assume that with changed block tracking on, it would read the changed blocks out on the source, compress/dedupe, send them over the wire, then uncompress/undedupe, and inject them into the snapshot. Based on what I'm seeing on the backend storage, it's reading/writing a _lot_ to the target storage, which doesn't make a whole lot of sense based on how I understand replication to work.
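To make concrete what I mean by that flow, here is a tiny Python sketch of the round trip as I understand it (illustration only, not Veeam's actual code; all names are made up):

```python
import zlib

# Illustration only -- not Veeam's code. The pipeline I'd expect with CBT:
# read changed blocks at the source, compress, ship over the WAN,
# decompress at the target, and write into the replica snapshot.
def replicate(changed_blocks):
    target_snapshot = []
    for block in changed_blocks:        # source: read changed data only
        wire = zlib.compress(block)     # source proxy: compress/dedupe
        data = zlib.decompress(wire)    # target proxy: decompress
        target_snapshot.append(data)    # target: write into the snapshot
    return target_snapshot

blocks = [b"a" * 4096, b"b" * 4096]
print(replicate(blocks) == blocks)  # True -- data round-trips unchanged
```

Note there are no reads against the target storage anywhere in that picture, which is why the observed read IOPS surprise me.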

Gostev
SVP, Product Management
Posts: 24460
Liked: 3413 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Replication Target performance

Post by Gostev » Jun 28, 2012 10:21 pm

This must be due to the old restore point's snapshot being committed by the replica retention policy.


Re: Replication Target performance

Post by mbreitba » Jun 29, 2012 2:17 pm

Does it commit snapshots during the restore process? I'm seeing this behavior at 35% complete, 50% complete, and pretty much at all times during the backup run. I thought the snapshot commit happened at the end of the backup? I'm seeing this behavior for hours on end during the backup.


Re: Replication Target performance

Post by mbreitba » Jun 29, 2012 2:19 pm

Also - just to note, this replication job is only about 6 days old - retention policy is 14 days. It shouldn't be committing _any_ snapshots, should it?


Re: Replication Target performance

Post by mbreitba » Jun 29, 2012 3:27 pm

One more thing to note - this is primarily on retries of replications when the replication fails. Is there something different that happens on these? I just killed off a replication job that had read 50GB of a 150GB VM in 6 hours and created an 8GB snapshot. We are able to flood our WAN link with 750Mbit/sec of traffic during testing, so the WAN connection doesn't appear to be the issue. If the retries of replication jobs are this slow, it'd be better for us to just shovel across a whole new copy of the VM rather than wait a day for whatever Veeam is doing on a 150GB VM. On 2TB VMs this will not be practical.

FYI - previously we were replicating with our backend storage, and replication of changed blocks was taking approx 3 hours (rate limited to 150Mbit/sec) and we never had any issues. While Veeam replication allows us to have the VMs registered and ready for power-on in a DR scenario, if we have to deal with flaky replication we will have to revisit having the backend storage replicate daily rather than using Veeam for the replication.

Vitaliy S.
Product Manager
Posts: 22773
Liked: 1526 times
Joined: Mar 30, 2009 9:13 am
Full Name: Vitaliy Safarov
Contact:

Re: Replication Target performance

Post by Vitaliy S. » Jun 29, 2012 4:40 pm

mbreitba wrote:Does it commit snapshots during the restore process?
No.
mbreitba wrote:I thought the snapshot commit happened at the end of the backup?
That's correct.
mbreitba wrote:Also - just to note, this replication job is only about 6 days old - retention policy is 14 days. It shouldn't be committing _any_ snapshots, should it?
Retention policy is measured in restore points (that is, in actual job runs), so if you run your replication job once a day, the retention policy will kick in on the 15th day of the job run cycle.
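As a back-of-the-envelope illustration (a hypothetical calculation, not Veeam code): with one run per day, the run on which the oldest point is first removed is simply the retention setting plus one.

```python
# Illustration only (not Veeam code): with a retention policy of N restore
# points and one job run per day, the oldest snapshot is first committed
# on run N + 1.
def first_commit_run(retention_points: int) -> int:
    """Return the job-run number on which retention first removes a point."""
    return retention_points + 1

print(first_commit_run(14))  # with 14 restore points, retention kicks in on run 15
```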
mbreitba wrote:One more thing to note - this is primarily on retries of replications when the replication fails. Is there something different that happens on these?
Well, that explains it. On retries/subsequent runs your new (unfinished) restore point is automatically removed; this operation basically reverts your VM to the last working state, which obviously might take some time.


Re: Replication Target performance

Post by mbreitba » Jun 29, 2012 5:42 pm

Question - if it reverts the snapshot, shouldn't it nuke the snapshot and then just diff from the last known good point? I'm still not sure why I'm seeing consistent heavy read/write IO on the target end. I would expect something like this:

Replica fails
Replica retries
Replica sees bad snapshot, reverts snapshot
replica creates new snapshot
replica uses CBT to send only new data

If that's the process, why would there be such heavy read IO on the target side, along with writes at double that rate (seeing 2000 read IOPS, 4000 write IOPS, sometimes higher)? I would expect that it would have to process all changed blocks again, since those would be gone, but it seems as though that's not quite the case. Could someone please explain the exact process that happens when a replica fails and is retried?
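To make the question concrete, here is the retry logic I would expect, sketched in Python (hypothetical, not Veeam's actual code; every name here is made up):

```python
# Hypothetical sketch of the retry flow I'd expect -- not Veeam's code.
class Replica:
    def __init__(self):
        self.snapshots = []
        self.target_writes = 0

    def retry(self, changed_blocks):
        # 1. Discard the unfinished restore point left by the failed run,
        #    reverting the replica to the last known-good state.
        if self.snapshots and not self.snapshots[-1]["complete"]:
            self.snapshots.pop()
        # 2. Open a fresh working snapshot.
        snap = {"blocks": [], "complete": False}
        self.snapshots.append(snap)
        # 3. CBT: transfer only the changed blocks. These are writes only,
        #    which is why the heavy *read* I/O on the target is surprising.
        for block in changed_blocks:
            snap["blocks"].append(block)
            self.target_writes += 1
        snap["complete"] = True
        return snap

r = Replica()
r.retry(["b1", "b2", "b3"])
print(r.target_writes)  # 3 writes, and no reads expected on the target
```

Nothing in that sketch would generate sustained reads against the target storage, so whatever is actually happening must differ from this model.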

tsightler
VP, Product Management
Posts: 5382
Liked: 2215 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Replication Target performance

Post by tsightler » Jun 30, 2012 1:32 am

Where exactly are you measuring this I/O? Can you share a screenshot? Veeam v6 with VMware should not be performing reads on the target in any significant way. Are you replicating to a datastore that has a SAN snapshot? That would produce copy-on-write (CoW) traffic that could explain the behavior you are seeing.


Re: Replication Target performance

Post by mbreitba » Jul 02, 2012 1:59 pm

So, found two problems.

1 - Veeam proxy server was on the same LUN/spindle group as the replication target. Since we were seeing such high IO, it was causing the proxy to time out, causing more issues. Moved the proxy, saw the timeouts and failures stop. Performance was left pretty much unchanged.

2 - One of our engineers found this nugget - http://www.interworks.com/blogs/ijahans ... job-vmware - Once we switched the target side to NBD mode, we saw IO drop by huge amounts on the backend and throughput go through the roof. We were processing around 1MB/sec previously; now we're processing around 40MB/sec. Any thoughts as to what would cause such poor performance and extreme disk thrashing in hot-add mode?


Re: Replication Target performance

Post by tsightler » Jul 02, 2012 4:09 pm

I have confirmed the issue that you are seeing regarding significant read I/O when replicating via Hotadd vs NBD mode. Here are some stats from my disk during two different replication cycles:

Hotadd Mode


Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dm-3            836.00      6688.00         0.00       6688          0
dm-3           3477.00     27816.00         0.00      27816          0
dm-3           3603.00     28824.00         0.00      28824          0
dm-3           3725.00     29800.00         0.00      29800          0
dm-3           3123.00     24984.00         0.00      24984          0
dm-3           3358.00     26864.00         0.00      26864          0
dm-3           5478.00     43824.00         0.00      43824          0
dm-3           4697.00     29384.00      8192.00      29384       8192
dm-3           1555.00     12440.00         0.00      12440          0
dm-3          21606.00      5064.00    167784.00       5064     167784
dm-3           4312.00     34496.00         0.00      34496          0
dm-3           4836.00     38688.00         0.00      38688          0
dm-3            234.00      1872.00         0.00       1872          0
NBD Mode


Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dm-3           6215.00        64.00     49656.00         64      49656
dm-3              5.00        40.00         0.00         40          0
dm-3              3.00        24.00         0.00         24          0
dm-3             15.00       120.00         0.00        120          0
dm-3             22.00       176.00         0.00        176          0
dm-3          19879.21        55.45    158978.22         56     160568
dm-3           2271.00       192.00     17976.00        192      17976
dm-3             14.00       112.00         0.00        112          0
dm-3             16.00       128.00         0.00        128          0
dm-3             10.00        80.00         0.00         80          0
dm-3              4.00        32.00         0.00         32          0
dm-3           2107.00         8.00     16848.00          8      16848
dm-3            117.00       936.00         0.00        936          0
dm-3             14.00       112.00         0.00        112          0
dm-3              8.00        64.00         0.00         64          0
You can see that the amount of data written in both cases is nearly identical, but the amount of reads is astoundingly higher with Hotadd mode. Very interesting indeed.
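To put rough numbers on that, here's a quick throwaway script that totals the cumulative Blk_read/Blk_wrtn columns from samples like the ones above (the six-column layout is assumed from the output format shown):

```python
# Quick throwaway script: total the cumulative Blk_read and Blk_wrtn columns
# from iostat-style samples. Assumed column layout, matching the output above:
# device, tps, Blk_read/s, Blk_wrtn/s, Blk_read, Blk_wrtn.
def totals(iostat_lines):
    read = written = 0
    for line in iostat_lines:
        fields = line.split()
        if len(fields) == 6 and fields[0].startswith("dm-"):
            read += int(fields[4])      # cumulative Blk_read column
            written += int(fields[5])   # cumulative Blk_wrtn column
    return read, written

sample = """dm-3            836.00      6688.00         0.00       6688          0
dm-3          21606.00      5064.00    167784.00       5064     167784"""
print(totals(sample.splitlines()))  # (11752, 167784)
```

Run over the full Hotadd sample above, the read total dwarfs the NBD one while the write totals come out roughly equal.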


Re: Replication Target performance

Post by mbreitba » Jul 02, 2012 6:36 pm

Yup - pretty much the exact same thing that I'm seeing. We were seeing an ABQL (average busy queue length) of 24+ on our VNX5300 array while replicating via hotadd. Haven't looked at the stats since switching to NBD, but the general "seat of the pants" feeling is that it's not hitting the array nearly as hard. I'd be really curious to see why we're seeing this behavior.

deduplicat3d
Expert
Posts: 100
Liked: 11 times
Joined: Nov 04, 2011 8:21 pm
Full Name: Corey
Contact:

Re: Replication Target performance

Post by deduplicat3d » Mar 16, 2013 8:23 pm

Has there been any thought to adding an option to apply the retention policy for all VMs at the end of the job? This would make sure that the job finishes within the backup window.

foggy
Veeam Software
Posts: 18034
Liked: 1533 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: Replication Target performance

Post by foggy » Mar 18, 2013 7:28 am

Corey, this is actually how it works now - the retention policy is applied at the end of each successful job run.

veremin
Product Manager
Posts: 16699
Liked: 1394 times
Joined: Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin
Contact:

Re: Replication Target performance

Post by veremin » Mar 18, 2013 9:31 am

This would make sure that the job finishes within the backup window.
Just out of curiosity – how would applying the retention policy at the end of the job (which, as mentioned above, is how VB&R works nowadays) guarantee that the job finishes within the backup window?

Thanks.


Re: Replication Target performance

Post by deduplicat3d » Mar 19, 2013 6:25 am

The restore point is deleted after each VM is processed, not after each job run (what you are saying is true for backup jobs, but I'm performing a replication job, as noted in the subject). If you hold off on the retention policy until the end and the job bleeds over into production time, the only server impacted would be a DR server, which is not actively used in production.



Re: Replication Target performance

Post by foggy » Mar 19, 2013 1:47 pm

deduplicat3d wrote:The restore point is deleted after each vm is backed up not each job run (what you are saying is true for backup jobs, but I'm performing a replication job as noted in the subject).
Got it, somehow I missed that you were talking about replication. What you have proposed sounds reasonable, thanks for the heads-up.


Re: Replication Target performance

Post by deduplicat3d » Mar 19, 2013 2:46 pm

Thanks!


Re: Replication Target performance

Post by Gostev » Mar 20, 2013 9:56 am

Thanks a lot for a great suggestion, Corey. The change you are suggesting makes total sense. We will consider implementing this enhancement in the short term (I've already checked with R&D on the dev costs, and it's looking good). Thanks again.


Re: Replication Target performance

Post by deduplicat3d » Mar 20, 2013 4:33 pm

That sounds great! It will really help out my environment.
