-
- Enthusiast
- Posts: 85
- Liked: 8 times
- Joined: Jun 11, 2012 3:17 pm
- Contact:
Replication Target performance
Just curious as to what people are seeing for replication target performance. I'm running an EMC VNX 5300 (loaded w/ FAST Cache, 2TB NL-SAS, and 15k 600GB SAS drives) as the source of the replication, replicating over a 1Gbit backbone link to a separate datacenter. The remote datacenter has a VNX 5300 w/ 200GB FAST Cache, and the target datastore is composed of 2TB NL-SAS drives.
When I'm doing replications, I'm seeing huge amounts of IO on the target storage system (700-1000 read and write IOPS) while there is virtually nothing going on at the source end. Changed block tracking is enabled, and the ESX hosts are Cisco UCS B-series blades. Is this normal? I've opened a ticket w/ Veeam (5201543) but have not heard back. I would assume that with changed block tracking on, it would read the changed blocks on the source, compress/dedupe them, send them over the wire, then uncompress/undedupe them and inject them into the snapshot. Based on what I'm seeing on the backend storage, it's reading/writing a _lot_ on the target storage, which doesn't make a whole lot of sense based on how I understand replication to work.
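For what it's worth, here is the flow I'm assuming, as a rough plain-Python sketch. The block size, names, and structure are all made up for illustration; this is just my mental model, not Veeam's actual code:
Code:
# My mental model of CBT-based replication (made-up names, not Veeam internals):
# read changed blocks on the source, compress, ship them, decompress, write into the replica.
import zlib

BLOCK_SIZE = 1024 * 1024  # pretend 1 MB blocks, purely for illustration


def source_side(source_disk, changed_offsets):
    """Read only the CBT-flagged blocks on the source and compress them for the WAN."""
    wire_packets = []
    for offset in changed_offsets:
        block = source_disk[offset:offset + BLOCK_SIZE]      # read IO happens on the SOURCE
        wire_packets.append((offset, zlib.compress(block)))  # shrink it before the 1Gbit link
    return wire_packets


def target_side(replica_disk, wire_packets):
    """Decompress and inject blocks into the replica: writes only, no heavy reads expected."""
    for offset, payload in wire_packets:
        replica_disk[offset:offset + BLOCK_SIZE] = zlib.decompress(payload)


# toy run: a 4 MB "disk" with two changed blocks
src = bytes(4 * BLOCK_SIZE)
dst = bytearray(4 * BLOCK_SIZE)
target_side(dst, source_side(src, [0, 2 * BLOCK_SIZE]))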
-
- Chief Product Officer
- Posts: 31899
- Liked: 7396 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Replication Target performance
This must be due to the old restore point's snapshot being committed by the replica retention policy.
-
- Enthusiast
- Posts: 85
- Liked: 8 times
- Joined: Jun 11, 2012 3:17 pm
- Contact:
Re: Replication Target performance
Does it commit snapshots during the restore process? I'm seeing this behavior at 35% complete, 50% complete, and pretty much all the way through the backup run. I thought the snapshot committal process happened at the end of the backup? I'm seeing this behavior for hours on end during the backup.
-
- Enthusiast
- Posts: 85
- Liked: 8 times
- Joined: Jun 11, 2012 3:17 pm
- Contact:
Re: Replication Target performance
Also - just to note, this replication job is only about 6 days old - retention policy is 14 days. It shouldn't be committing _any_ snapshots, should it?
-
- Enthusiast
- Posts: 85
- Liked: 8 times
- Joined: Jun 11, 2012 3:17 pm
- Contact:
Re: Replication Target performance
One more thing to note - this is primarily on retries of replications when the replication fails. Is there something different that happens on these? I just killed off a replication job that had read 50GB of a 150GB VM in 6 hours and created an 8GB snapshot. We are able to flood our WAN link with 750Mbit/sec of traffic during testing, so the WAN connection doesn't appear to be the issue. If retries of replication jobs are this slow, it'd be better for us to just shovel across a whole new copy of the VM rather than wait a day for whatever Veeam is doing on a 150GB VM. On 2TB VMs this will not be practical.
FYI - previously we were replicating with our backend storage, and replication of changed blocks was taking approximately 3 hours (rate limited to 150Mbit/sec), and we never had any issues. While Veeam replication allows us to have the VMs registered and ready for power-on in a DR scenario, if we have to deal with flaky replication we will have to revisit having the backend storage replicate daily rather than using Veeam for the replication.
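To put some rough numbers on that retry and our WAN capacity (decimal units, back-of-envelope only):
Code:
rate_mb_s = 50 * 1000.0 / (6 * 3600)             # 50 GB read in 6 hours on the retry -> ~2.3 MB/sec
print("effective retry rate: %.1f MB/sec" % rate_mb_s)

full_vm_hours = 150 * 1000.0 / rate_mb_s / 3600  # whole 150 GB VM at that rate -> ~18 hours
print("full 150 GB VM at that rate: %.0f hours" % full_vm_hours)

wan_mb_s = 750 / 8.0                             # 750 Mbit/sec WAN -> ~94 MB/sec
seed_minutes = 150 * 1000.0 / wan_mb_s / 60      # shipping a fresh full copy -> ~27 minutes
print("seeding 150 GB over the WAN: %.0f minutes" % seed_minutes)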
-
- VP, Product Management
- Posts: 27442
- Liked: 2817 times
- Joined: Mar 30, 2009 9:13 am
- Full Name: Vitaliy Safarov
- Contact:
Re: Replication Target performance
mbreitba wrote: Does it commit snapshots during the restore process?
No.
mbreitba wrote: I thought the snapshot committal process happened at the end of the backup?
That's correct.
mbreitba wrote: Also - just to note, this replication job is only about 6 days old - retention policy is 14 days. It shouldn't be committing _any_ snapshots, should it?
Retention policy is measured in restore points (in actual job runs), so if you run your replication job once a day, the retention policy will kick in on the 15th day of the job run cycle.
mbreitba wrote: One more thing to note - this is primarily on retries of replications when the replication fails. Is there something different that happens on these?
Well... that explains it. On retries/subsequent runs, your new (unfinished) restore point is automatically removed; this operation basically reverts your VM to the last working state, which obviously might take some time.
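To illustrate the restore point counting, here is a simplified model of a once-a-day job (illustration only, not product code):
Code:
# Simplified model of restore-point retention for a once-a-day job (not product code).
RETENTION_POINTS = 14
restore_points = []

for day in range(1, 17):
    restore_points.append("restore point from day %d" % day)
    removed = None
    if len(restore_points) > RETENTION_POINTS:   # first trims on the 15th run
        removed = restore_points.pop(0)          # this removal is the snapshot commit on the target
    print(day, len(restore_points), "removed:", removed)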
-
- Enthusiast
- Posts: 85
- Liked: 8 times
- Joined: Jun 11, 2012 3:17 pm
- Contact:
Re: Replication Target performance
Question - if it reverts the snapshot, shouldn't it nuke the snapshot and then just diff from the last known good point? I'm still not sure why I'm seeing consistently heavy read/write IO on the target end. I would expect something like this:
Replica fails
Replica retries
Replica sees bad snapshot, reverts snapshot
Replica creates new snapshot
Replica uses CBT to send only new data
If that's the process, why would there be such heavy read IO on the target side, along with roughly double that amount in write IO (seeing 2000 read IOPS and 4000 write IOPS, sometimes higher)? I would expect it to have to process all the changed blocks again, since those would be gone, but it seems as though that's not quite the case. Could someone please explain the exact process that happens when a replica fails and is retried?
-
- VP, Product Management
- Posts: 6035
- Liked: 2863 times
- Joined: Jun 05, 2009 12:57 pm
- Full Name: Tom Sightler
- Contact:
Re: Replication Target performance
Where exactly are you measuring this I/O? Can you share a screenshot? Veeam V6 with VMware should not be performing reads on the target in any significant way. Are you replicating to a datastore that has a SAN snapshot? This would produce CoW traffic that would explain the behavior you are seeing.
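For reference, if there is a Linux box in the data path (for example, whatever serves the datastore), one way to capture per-device numbers is to diff /proc/diskstats once a second. A quick sketch; the device name is only an example, substitute whatever backs your datastore:
Code:
# Sample /proc/diskstats once a second and print read/write sectors per second.
import time

DEVICE = "dm-3"  # example device name only

def sectors(device):
    """Return cumulative (sectors_read, sectors_written) for one device."""
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == device:
                return int(parts[5]), int(parts[9])   # sectors read / sectors written
    raise ValueError("device %r not found" % device)

prev_r, prev_w = sectors(DEVICE)
for _ in range(60):                                   # watch for a minute
    time.sleep(1)
    cur_r, cur_w = sectors(DEVICE)
    print("%s: %d read sectors/sec, %d write sectors/sec"
          % (DEVICE, cur_r - prev_r, cur_w - prev_w))
    prev_r, prev_w = cur_r, cur_w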
-
- Enthusiast
- Posts: 85
- Liked: 8 times
- Joined: Jun 11, 2012 3:17 pm
- Contact:
Re: Replication Target performance
So, found two problems.
1 - The Veeam proxy server was on the same LUN/spindle group as the replication target. Since we were seeing such high IO, it was causing the proxy to time out, causing more issues. We moved the proxy and the timeouts and failures stopped, but performance was left pretty much unchanged.
2 - One of our engineers found this nugget - http://www.interworks.com/blogs/ijahans ... job-vmware - once we switched the target side to NBD mode, we saw IO drop by huge amounts on the backend and throughput go through the roof. We were processing around 1MB/sec previously; now we're processing around 40MB/sec. Any thoughts as to what would cause such poor performance in hot-add mode, and such extreme disk thrashing?
-
- VP, Product Management
- Posts: 6035
- Liked: 2863 times
- Joined: Jun 05, 2009 12:57 pm
- Full Name: Tom Sightler
- Contact:
Re: Replication Target performance
I have confirmed the issue that you are seeing regarding significant read I/O when replicating via Hotadd vs NBD mode. Here are some stats from my disk during two different replication cycles:
You can see below that the amount of data written in both cases is nearly identical, but the amount of reads is astoundingly higher with Hotadd mode. Very interesting indeed.
Hotadd Mode
Code:
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
dm-3 836.00 6688.00 0.00 6688 0
dm-3 3477.00 27816.00 0.00 27816 0
dm-3 3603.00 28824.00 0.00 28824 0
dm-3 3725.00 29800.00 0.00 29800 0
dm-3 3123.00 24984.00 0.00 24984 0
dm-3 3358.00 26864.00 0.00 26864 0
dm-3 5478.00 43824.00 0.00 43824 0
dm-3 4697.00 29384.00 8192.00 29384 8192
dm-3 1555.00 12440.00 0.00 12440 0
dm-3 21606.00 5064.00 167784.00 5064 167784
dm-3 4312.00 34496.00 0.00 34496 0
dm-3 4836.00 38688.00 0.00 38688 0
dm-3 234.00 1872.00 0.00 1872 0
NBD Mode
Code:
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
dm-3 6215.00 64.00 49656.00 64 49656
dm-3 5.00 40.00 0.00 40 0
dm-3 3.00 24.00 0.00 24 0
dm-3 15.00 120.00 0.00 120 0
dm-3 22.00 176.00 0.00 176 0
dm-3 19879.21 55.45 158978.22 56 160568
dm-3 2271.00 192.00 17976.00 192 17976
dm-3 14.00 112.00 0.00 112 0
dm-3 16.00 128.00 0.00 128 0
dm-3 10.00 80.00 0.00 80 0
dm-3 4.00 32.00 0.00 32 0
dm-3 2107.00 8.00 16848.00 8 16848
dm-3 117.00 936.00 0.00 936 0
dm-3 14.00 112.00 0.00 112 0
dm-3 8.00 64.00 0.00 64 0
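If anyone wants to total these up, here is a quick way to sum the cumulative Blk_read / Blk_wrtn columns from saved iostat output (iostat blocks are 512-byte sectors, hence the MB conversion; the capture file names are hypothetical):
Code:
# Sum the per-interval Blk_read / Blk_wrtn columns from saved iostat output.
def totals(path, device="dm-3"):
    read_blocks = written_blocks = 0
    with open(path) as f:
        for line in f:
            parts = line.split()
            if parts and parts[0] == device:
                read_blocks += int(parts[4])      # Blk_read for that interval
                written_blocks += int(parts[5])   # Blk_wrtn for that interval
    to_mb = 512.0 / (1024 * 1024)                 # iostat blocks are 512-byte sectors
    return read_blocks * to_mb, written_blocks * to_mb

for name in ("hotadd.txt", "nbd.txt"):            # hypothetical capture files
    r, w = totals(name)
    print("%s: read %.0f MB, wrote %.0f MB" % (name, r, w))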
-
- Enthusiast
- Posts: 85
- Liked: 8 times
- Joined: Jun 11, 2012 3:17 pm
- Contact:
Re: Replication Target performance
Yup - pretty much the exact same thing I'm seeing. We were seeing ABQL (average busy queue length) of 24+ on our VNX5300 array while replicating via hotadd. I haven't looked at the stats since switching to NBD, but the general "seat of the pants" feeling is that it's not hitting the array nearly as hard. I'd be really curious to see why we're seeing this behavior.
-
- Expert
- Posts: 119
- Liked: 12 times
- Joined: Nov 04, 2011 8:21 pm
- Full Name: Corey
- Contact:
Re: Replication Target performance
Has there been any thought of adding an option to apply the retention policy for all VMs at the end of the job? This would help make sure that the job finishes within the backup window.
-
- Veeam Software
- Posts: 21142
- Liked: 2142 times
- Joined: Jul 11, 2011 10:22 am
- Full Name: Alexander Fogelson
- Contact:
Re: Replication Target performance
Corey, this is actually how it works now - the retention policy is applied at the end of each successful job run.
-
- Product Manager
- Posts: 20448
- Liked: 2317 times
- Joined: Oct 26, 2012 3:28 pm
- Full Name: Vladimir Eremin
- Contact:
Re: Replication Target performance
deduplicat3d wrote: This would make sure that the job finishes within the backup window.
Just out of curiosity - how would having the retention policy applied at the end of the job (which, as mentioned above, is how VB&R works nowadays) guarantee that the jobs finish within the backup window?
Thanks.
-
- Expert
- Posts: 119
- Liked: 12 times
- Joined: Nov 04, 2011 8:21 pm
- Full Name: Corey
- Contact:
Re: Replication Target performance
The restore point is deleted after each VM is backed up, not after each job run (what you are saying is true for backup jobs, but I'm performing a replication job, as noted in the subject). If you hold off on the retention policy until the end and the job bleeds over into production time, the only server impacted would be a DR server, which is not actively used in production.
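To spell out the ordering difference I'm proposing, here is a toy model (made-up function names, not how Veeam actually implements it):
Code:
# Toy model of the two orderings (made-up function names, not Veeam internals).
def replicate(vm):
    print("replicate changed blocks for", vm)

def apply_retention(vm):
    print("commit oldest snapshot for", vm)   # the expensive part on slow DR disk

vms = ["vm1", "vm2", "vm3"]

# What I see today with replication jobs: retention applied per VM, inside the backup window.
for vm in vms:
    replicate(vm)
    apply_retention(vm)

# What I'm proposing: finish every VM first, then do the snapshot commits afterwards.
for vm in vms:
    replicate(vm)
for vm in vms:
    apply_retention(vm)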
-
- Veeam Software
- Posts: 21142
- Liked: 2142 times
- Joined: Jul 11, 2011 10:22 am
- Full Name: Alexander Fogelson
- Contact:
Re: Replication Target performance
deduplicat3d wrote: The restore point is deleted after each VM is backed up, not after each job run (what you are saying is true for backup jobs, but I'm performing a replication job, as noted in the subject).
Got it, somehow I missed that you were talking about replication. What you have proposed sounds reasonable, thanks for the heads up.
-
- Chief Product Officer
- Posts: 31899
- Liked: 7396 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Replication Target performance
Thanks a lot for a great suggestion, Corey. The change you are suggesting makes total sense. We will consider implementing this enhancement in the short term (I've already checked with R&D on the dev costs, looking good). Thanks again.
-
- Expert
- Posts: 119
- Liked: 12 times
- Joined: Nov 04, 2011 8:21 pm
- Full Name: Corey
- Contact:
Re: Replication Target performance
That sounds great! It will really help out my environment.