Discussions specific to the VMware vSphere hypervisor
mbreitba
Enthusiast
Posts: 85
Liked: 8 times
Joined: Jun 11, 2012 3:17 pm
Contact:

Replication Target performance

Post by mbreitba » Jun 28, 2012 9:20 pm

Just curious as to what people are seeing for replication target performance. I'm running an EMC VNX 5300 (loaded w/ FAST cache, 2TB NL-SAS and 15k 600GB SAS) as the source of the replication, replicating over a 1Gbit backbone link to a separate datacenter. Remote datacenter has a VNX 5300 w/ 200GB FAST cache, and target datastore is comprised of 2TB NL-SAS drives.

When I'm doing replications, I'm seeing some huge amounts of IO on the target storage system (700-1000 read and write IOPS) while there is virtually nothing going on at the source end. Changed block tracking is enabled, and ESX hosts are Cisco UCS B-series blades. Is this normal? I've opened a ticket w/ Veeam (5201543) but have not heard back. I would assume that with changed block tracking on, it would read the changed blocks out on the source, compress/dedupe, send them over the wire, then uncompress/undedupe, and inject them into the snapshot. Based on what I'm seeing on the backend storage, it's reading/writing a _lot_ to the target storage, which doesn't make a whole lot of sense based on how I understand replication to work.
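To make concrete what I mean by that flow, here is a tiny Python sketch of the round trip as I understand it (illustration only, not Veeam's actual code; all names are made up):

```python
import zlib

# Illustration only -- not Veeam's code. The pipeline I'd expect with CBT:
# read changed blocks at the source, compress, ship over the WAN,
# decompress at the target, and write into the replica snapshot.
def replicate(changed_blocks):
    target_snapshot = []
    for block in changed_blocks:        # source: read changed data only
        wire = zlib.compress(block)     # source proxy: compress/dedupe
        data = zlib.decompress(wire)    # target proxy: decompress
        target_snapshot.append(data)    # target: write into the snapshot
    return target_snapshot

blocks = [b"a" * 4096, b"b" * 4096]
print(replicate(blocks) == blocks)  # True -- data round-trips unchanged
```

Note there are no reads against the target storage anywhere in that picture, which is why the observed read IOPS surprise me.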

Gostev
SVP, Product Management
Posts: 24460
Liked: 3413 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Replication Target performance

Post by Gostev » Jun 28, 2012 10:21 pm

This must be due to the old restore point's snapshot being committed by the replica retention policy.


Re: Replication Target performance

Post by mbreitba » Jun 29, 2012 2:17 pm

Does it commit snapshots during the restore process? I'm seeing this behavior at 35% complete, 50% complete, and pretty much at all times during the backup run. I thought the snapshot commit happened at the end of the backup? I'm seeing this behavior for hours on end during the backup.


Re: Replication Target performance

Post by mbreitba » Jun 29, 2012 2:19 pm

Also - just to note, this replication job is only about 6 days old - retention policy is 14 days. It shouldn't be committing _any_ snapshots, should it?


Re: Replication Target performance

Post by mbreitba » Jun 29, 2012 3:27 pm

One more thing to note - this is primarily on retries of replications when the replication fails. Is there something different that happens on these? I just killed off a replication job that had read 50GB of a 150GB VM in 6 hours and created an 8GB snapshot. We are able to flood our WAN link with 750Mbit/sec of traffic during testing, so the WAN connection doesn't appear to be the issue. If the retries of replication jobs are this slow, it'd be better for us to just shovel across a whole new copy of the VM rather than wait a day for whatever Veeam is doing on a 150GB VM. On 2TB VMs this will not be practical.

FYI - previously we were replicating with our backend storage, and replication of changed blocks was taking approx 3 hours (rate limited to 150Mbit/sec) and we never had any issues. While Veeam replication allows us to have the VMs registered and ready for power-on in a DR scenario, if we have to deal with flaky replication we will have to revisit having the backend storage replicate daily rather than using Veeam for the replication.

Vitaliy S.
Product Manager
Posts: 22773
Liked: 1526 times
Joined: Mar 30, 2009 9:13 am
Full Name: Vitaliy Safarov
Contact:

Re: Replication Target performance

Post by Vitaliy S. » Jun 29, 2012 4:40 pm

mbreitba wrote:Does it commit snapshots during the restore process?
No.
mbreitba wrote:I thought the snapshot commit happened at the end of the backup?
That's correct.
mbreitba wrote:Also - just to note, this replication job is only about 6 days old - retention policy is 14 days. It shouldn't be committing _any_ snapshots, should it?
Retention policy is measured in restore points (that is, in actual job runs), so if you run your replication job once a day, the retention policy will kick in on the 15th day of the job run cycle.
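As a back-of-the-envelope illustration (a hypothetical calculation, not Veeam code): with one run per day, the run on which the oldest point is first removed is simply the retention setting plus one.

```python
# Illustration only (not Veeam code): with a retention policy of N restore
# points and one job run per day, the oldest snapshot is first committed
# on run N + 1.
def first_commit_run(retention_points: int) -> int:
    """Return the job-run number on which retention first removes a point."""
    return retention_points + 1

print(first_commit_run(14))  # with 14 restore points, retention kicks in on run 15
```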
mbreitba wrote:One more thing to note - this is primarily on retries of replications when the replication fails. Is there something different that happens on these?
Well, that explains it. On retries/subsequent runs your new (unfinished) restore point is automatically removed; this operation basically reverts your VM to the last working state, which obviously might take some time.


Re: Replication Target performance

Post by mbreitba » Jun 29, 2012 5:42 pm

Question - if it reverts the snapshot, shouldn't it nuke the snapshot and then just diff from the last known good point? I'm still not sure why I'm seeing consistent heavy read/write IO on the target end. I would expect something like this:

Replica fails
Replica retries
Replica sees bad snapshot, reverts snapshot
replica creates new snapshot
replica uses CBT to send only new data

If that's the process, why would there be such heavy read IO on the target side, along with writes at double that rate (seeing 2000 read IOPS, 4000 write IOPS, sometimes higher)? I would expect that it would have to process all changed blocks again, since those would be gone, but it seems as though that's not quite the case. Could someone please explain the exact process that happens when a replica fails and is retried?
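To make the question concrete, here is the retry logic I would expect, sketched in Python (hypothetical, not Veeam's actual code; every name here is made up):

```python
# Hypothetical sketch of the retry flow I'd expect -- not Veeam's code.
class Replica:
    def __init__(self):
        self.snapshots = []
        self.target_writes = 0

    def retry(self, changed_blocks):
        # 1. Discard the unfinished restore point left by the failed run,
        #    reverting the replica to the last known-good state.
        if self.snapshots and not self.snapshots[-1]["complete"]:
            self.snapshots.pop()
        # 2. Open a fresh working snapshot.
        snap = {"blocks": [], "complete": False}
        self.snapshots.append(snap)
        # 3. CBT: transfer only the changed blocks. These are writes only,
        #    which is why the heavy *read* I/O on the target is surprising.
        for block in changed_blocks:
            snap["blocks"].append(block)
            self.target_writes += 1
        snap["complete"] = True
        return snap

r = Replica()
r.retry(["b1", "b2", "b3"])
print(r.target_writes)  # 3 writes, and no reads expected on the target
```

Nothing in that sketch would generate sustained reads against the target storage, so whatever is actually happening must differ from this model.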

tsightler
VP, Product Management
Posts: 5382
Liked: 2215 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Replication Target performance

Post by tsightler » Jun 30, 2012 1:32 am

Where exactly are you measuring this I/O? Can you share a screenshot? Veeam v6 with VMware should not be performing reads on the target in any significant way. Are you replicating to a datastore that has a SAN snapshot? That would produce copy-on-write (CoW) traffic that could explain the behavior you are seeing.


Re: Replication Target performance

Post by mbreitba » Jul 02, 2012 1:59 pm

So, found two problems.

1 - Veeam proxy server was on the same LUN/spindle group as the replication target. Since we were seeing such high IO, it was causing the proxy to time out, causing more issues. Moved the proxy, saw the timeouts and failures stop. Performance was left pretty much unchanged.

2 - One of our engineers found this nugget - http://www.interworks.com/blogs/ijahans ... job-vmware - Once we switched the target side to NBD mode, we saw IO drop by huge amounts on the backend and throughput go through the roof. We were processing around 1MB/sec previously; now we're processing around 40MB/sec. Any thoughts as to what would cause such poor performance and extreme disk thrashing in hot-add mode?


Re: Replication Target performance

Post by tsightler » Jul 02, 2012 4:09 pm

I have confirmed the issue that you are seeing regarding significant read I/O when replicating via Hotadd vs NBD mode. Here are some stats from my disk during two different replication cycles:

Hotadd Mode


Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dm-3            836.00      6688.00         0.00       6688          0
dm-3           3477.00     27816.00         0.00      27816          0
dm-3           3603.00     28824.00         0.00      28824          0
dm-3           3725.00     29800.00         0.00      29800          0
dm-3           3123.00     24984.00         0.00      24984          0
dm-3           3358.00     26864.00         0.00      26864          0
dm-3           5478.00     43824.00         0.00      43824          0
dm-3           4697.00     29384.00      8192.00      29384       8192
dm-3           1555.00     12440.00         0.00      12440          0
dm-3          21606.00      5064.00    167784.00       5064     167784
dm-3           4312.00     34496.00         0.00      34496          0
dm-3           4836.00     38688.00         0.00      38688          0
dm-3            234.00      1872.00         0.00       1872          0
NBD Mode


Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dm-3           6215.00        64.00     49656.00         64      49656
dm-3              5.00        40.00         0.00         40          0
dm-3              3.00        24.00         0.00         24          0
dm-3             15.00       120.00         0.00        120          0
dm-3             22.00       176.00         0.00        176          0
dm-3          19879.21        55.45    158978.22         56     160568
dm-3           2271.00       192.00     17976.00        192      17976
dm-3             14.00       112.00         0.00        112          0
dm-3             16.00       128.00         0.00        128          0
dm-3             10.00        80.00         0.00         80          0
dm-3              4.00        32.00         0.00         32          0
dm-3           2107.00         8.00     16848.00          8      16848
dm-3            117.00       936.00         0.00        936          0
dm-3             14.00       112.00         0.00        112          0
dm-3              8.00        64.00         0.00         64          0
You can see that the amount of data written in both cases is nearly identical, but the amount of reads is astoundingly higher with Hotadd mode. Very interesting indeed.
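To put rough numbers on that, here's a quick throwaway script that totals the cumulative Blk_read/Blk_wrtn columns from samples like the ones above (the six-column layout is assumed from the output format shown):

```python
# Quick throwaway script: total the cumulative Blk_read and Blk_wrtn columns
# from iostat-style samples. Assumed column layout, matching the output above:
# device, tps, Blk_read/s, Blk_wrtn/s, Blk_read, Blk_wrtn.
def totals(iostat_lines):
    read = written = 0
    for line in iostat_lines:
        fields = line.split()
        if len(fields) == 6 and fields[0].startswith("dm-"):
            read += int(fields[4])      # cumulative Blk_read column
            written += int(fields[5])   # cumulative Blk_wrtn column
    return read, written

sample = """dm-3            836.00      6688.00         0.00       6688          0
dm-3          21606.00      5064.00    167784.00       5064     167784"""
print(totals(sample.splitlines()))  # (11752, 167784)
```

Run over the full Hotadd sample above, the read total dwarfs the NBD one while the write totals come out roughly equal.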


Re: Replication Target performance

Post by mbreitba » Jul 02, 2012 6:36 pm

Yup - pretty much the exact same thing that I'm seeing. We were seeing an ABQL (average busy queue length) of 24+ on our VNX5300 array while replicating via hotadd. Haven't looked at the stats since switching to NBD, but the general "seat of the pants" feeling is that it's not hitting the array nearly as hard. I'd be really curious to see why we're seeing this behavior.

deduplicat3d
Expert
Posts: 100
Liked: 11 times
Joined: Nov 04, 2011 8:21 pm
Full Name: Corey
Contact:

Re: Replication Target performance

Post by deduplicat3d » Mar 16, 2013 8:23 pm

Has there been any thought to adding an option to apply the retention policy for all VMs at the end of the job? This would make sure that the job finishes within the backup window.

foggy
Veeam Software
Posts: 18034
Liked: 1533 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: Replication Target performance

Post by foggy » Mar 18, 2013 7:28 am

Corey, this is actually how it works now - the retention policy is applied at the end of each successful job run.

veremin
Product Manager
Posts: 16699
Liked: 1394 times
Joined: Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin
Contact:

Re: Replication Target performance

Post by veremin » Mar 18, 2013 9:31 am

This would make sure that the job finishes within the backup window.
Just out of curiosity – how would applying the retention policy at the end of the job (which, as mentioned above, is how VB&R works nowadays) guarantee that the job finishes within the backup window?

Thanks.


Re: Replication Target performance

Post by deduplicat3d » Mar 19, 2013 6:25 am

The restore point is deleted after each VM is processed, not after each job run (what you are saying is true for backup jobs, but I'm performing a replication job, as noted in the subject). If you hold off on the retention policy until the end and the job bleeds over into production time, the only server impacted would be a DR server, which is not actively used in production.



Re: Replication Target performance

Post by foggy » Mar 19, 2013 1:47 pm

deduplicat3d wrote:The restore point is deleted after each vm is backed up not each job run (what you are saying is true for backup jobs, but I'm performing a replication job as noted in the subject).
Got it, somehow I missed that you were talking about replication. What you have proposed sounds reasonable, thanks for the heads-up.


Re: Replication Target performance

Post by deduplicat3d » Mar 19, 2013 2:46 pm

Thanks!


Re: Replication Target performance

Post by Gostev » Mar 20, 2013 9:56 am

Thanks a lot for a great suggestion, Corey. The change you are suggesting makes total sense. We will consider implementing this enhancement in the short term (I've already checked with R&D on the dev costs, and it's looking good). Thanks again.


Re: Replication Target performance

Post by deduplicat3d » Mar 20, 2013 4:33 pm

That sounds great! It will really help out my environment.
