Comprehensive data protection for all workloads
topry
Enthusiast
Posts: 49
Liked: 1 time
Joined: Jan 07, 2011 9:30 pm
Full Name: Tim O'Pry
Contact:

V6 - Replication initial seed performance question

Post by topry » Dec 09, 2011 5:10 pm

Source: ESXi 4.1
Target: ESXi 5.0
Source repository is on a Dell iSCSI SAN with a dedicated gigabit iSCSI LAN/switch.
Source VMs are on the same iSCSI SAN as the repository.
Target datastore is local to the host (SAS drives).
Backup performance under v6 is very good - comparable to what it was under v5.

To take advantage of the new replication methodology, as well as VMFS5 >2TB datastores, we initialized our replication target by installing vSphere 5 and creating a single large datastore formatted as VMFS5. The target host is physically local to the source (same LAN).

We have run numerous tests replicating VMs using a single proxy (the Veeam VM), an external proxy, and two proxies - including a 2008R2 proxy w/8 cores on the replication target host. All proxies have direct SAN access via a dedicated iSCSI NIC. We have tried using both VMFS3 and VMFS5 datastores on the target (both local - no difference).

We have tried replicating without a seed as well as seeding from the iSCSI repository. All testing was done when no other jobs were running and there was no other traffic on the LAN. When seeding, network monitoring confirms all I/O is using the iSCSI segment.

Regardless of the configuration, the throughput on the initial replication seems slow - a max of 17MB/s processing rate with a max of 53MB/s read rate on the source VMDK. I'm assuming that the variations we see in read rate are due to empty blocks, de-duping, etc., hence the increase in read rate on the larger devices.

For VMs under 100GB the read rate averages 20MB/s; only on 500GB disks does it go above 30MB/s. The jobs always show the bottleneck as being the network, but I do not see any indications of network problems.

I was expecting the read rate, especially when using seeding from a SAN accessible repository to be a lot higher than it is. With several 500GB VMDKs to replicate, this is going to take a longer window than projected, so I wanted to verify that what we are seeing is not abnormally low before proceeding.

Are these numbers 'reasonable'? If not, any ideas on what we might look at?
Lastly - is seeding the fastest way to replicate the initial image?
Will using seeding interfere with running incrementals on the VM being replicated (does Veeam lock the vbk while reading it during seeding)?
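
A back-of-the-envelope estimate of the replication window at these rates (the 500GB disk size and 17MB/s processing rate are taken from the numbers above; shell arithmetic only):

```shell
# Rough seed-window estimate at the observed processing rate.
size_gb=500      # size of the source VMDK
rate_mbps=17     # observed processing rate, MB/s
seconds=$(( size_gb * 1024 / rate_mbps ))
hours=$(( seconds / 3600 ))
echo "Seeding ${size_gb}GB at ${rate_mbps}MB/s takes roughly ${hours} hours (${seconds}s)"
```

At 17MB/s a single 500GB VMDK alone needs roughly an 8-hour window, which is why the projected schedule blows out with several of them queued.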

tsightler
VP, Product Management
Posts: 5383
Liked: 2217 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: V6 - Replication initial seed performance question

Post by tsightler » Dec 09, 2011 6:50 pm

Are your target proxies using hotadd? The speed you are seeing feels about right if the target proxy is using network mode. I've seen speeds quite a bit higher for hotadd. That being said, the fact that your "bottleneck" is showing network is suspicious - what does your target display?

topry
Enthusiast
Posts: 49
Liked: 1 time
Joined: Jan 07, 2011 9:30 pm
Full Name: Tim O'Pry
Contact:

Re: V6 - Replication initial seed performance question

Post by topry » Dec 09, 2011 6:55 pm

When the proxy used is a VM on the target, it shows hotadd/nbd, and it shows network as the primary bottleneck.
If I do not use the proxy on the target, then the source proxy shows san/nbd and the target proxy (same or different, but not on the target) shows nbd - which I understand, since it only has network access to the datastore.

I was 'assuming' that since the proxy on the target has direct SAN access to the repository and local/direct access to the datastore, the transfer rate would be multiples of what we are seeing.

tsightler
VP, Product Management
Posts: 5383
Liked: 2217 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: V6 - Replication initial seed performance question

Post by tsightler » Dec 09, 2011 7:29 pm

So you're only using a single proxy, not proxy-to-proxy?

When the proxy is on the target node, do you see it actually hotadd both the source and target disk to the proxy?

topry
Enthusiast
Posts: 49
Liked: 1 time
Joined: Jan 07, 2011 9:30 pm
Full Name: Tim O'Pry
Contact:

Re: V6 - Replication initial seed performance question

Post by topry » Dec 09, 2011 7:39 pm

I've tried both scenarios:
If using seeding, it uses only one proxy. If that proxy is on the target, it does a hotadd of the disk; if not on the target, it uses san/nbd. It doesn't give an error that it's failing over to network mode, but I would presume that while it can reach the repository in SAN mode, the transfer would have to be over the network, since it would not have direct access to the target datastore.

If not using seeding, it will use two proxies - if the target proxy is on the target, it uses hotadd and the source proxy uses SAN.

Performance, regardless of the configuration and whether or not I use seeding, is comparable (which is why I figured I am doing something incorrectly or simply do not understand how it is supposed to work). It DOES work, but we have some very large systems to replicate, and at this rate it's going to take a few days.

tsightler
VP, Product Management
Posts: 5383
Liked: 2217 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: V6 - Replication initial seed performance question

Post by tsightler » Dec 09, 2011 8:47 pm

Can you clarify what you mean when you say you are "using seeding"? I'm assuming you mean you are seeding from a backup. For local replication, seeding isn't going to save much; heck, it might take longer. Seeding is primarily for use when there is limited bandwidth. It is intended to allow an external method to provide the initial replica, either by restoring from an existing backup or by mapping to a replica that already exists on the target, which lowers the amount of bandwidth required. Seeding doesn't have any bearing on whether you would use one proxy or two. Actually, in almost any scenario where seeding would provide a benefit, two proxies are needed: one on the target side where the seed is located, and another on the source side.

Replication to a target cannot use SAN mode, SAN mode can only be used on the source proxy. The target is always either hotadd or network.

So, it sounds like you have the following scenarios:

1. Single proxy on source - Source: SAN, Target: Network
2. Single proxy on target - Source: Network, Target: Hotadd
3. Dual proxy on source/target - Source: SAN, Target: Hotadd

This third scenario should provide the best performance, since you're talking storage to storage, although the data must still be compressed, deduped, and sent between the two proxies. I would run in that mode and then look at the bottleneck statistics.

topry
Enthusiast
Posts: 49
Liked: 1 time
Joined: Jan 07, 2011 9:30 pm
Full Name: Tim O'Pry
Contact:

Re: V6 - Replication initial seed performance question

Post by topry » Dec 09, 2011 9:02 pm

Yes, seeding from a backup. From the documentation I had the impression it might be quicker, since the files were directly accessible via SAN, rather than replicating from the live VM (no difference in my limited tests - just an assumption).

The scenarios you listed are accurate. When I select automatic mode for source/target, it selects proxies as listed in your #3 scenario, which is what I tried initially. If the performance I'm seeing is not unreasonable, then it's a matter of dividing the initial replication up into smaller pieces to fit our window of available time, so it doesn't overlap other jobs, and then re-creating the replication jobs as needed for deployment.

Thanks for the feedback - just trying to do a sanity check, before I pull the trigger on some of the larger jobs.

tsightler
VP, Product Management
Posts: 5383
Liked: 2217 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: V6 - Replication initial seed performance question

Post by tsightler » Dec 09, 2011 9:28 pm

Well, if network mode is involved, then I'm not at all surprised at 20-30MB/s being the limit. That's actually pretty good for network mode. On the other hand, I'm somewhat surprised not to see better speeds than this with the SAN/hotadd dual-proxy approach. If that method is showing "network" as the bottleneck, then something is up. I would expect that with that method the bottleneck would be the source/target or CPU, not network. The fact that you're showing "network" as the bottleneck with speeds at 20-30MB/s would imply some link in the network chain is running at 100Mb, or a duplex mismatch. If, on the other hand, this method displays source or target as the bottleneck, well, that's a different story, and that may be the best you can expect. Please feel free to post the proxy-to-proxy bottleneck information.
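
A 100Mb hop caps raw throughput at about 12MB/s, which lines up with the low end of the rates reported above, so checking negotiated speed/duplex on every link is cheap insurance. A quick sketch (the interface name is a placeholder; `esxcfg-nics -l` works on both ESXi 4.x and 5.x hosts):

```shell
# Theoretical ceiling of a 100Mb/s link, in MB/s, before protocol overhead;
# real-world TCP throughput on such a link is usually closer to 11MB/s.
echo $(( 100 / 8 ))

# On each ESXi host, confirm every vmnic negotiated 1000Mbps/Full Duplex:
#   esxcfg-nics -l
# On a Linux proxy, check the iSCSI interface (eth0 is a placeholder name):
#   ethtool eth0 | grep -E 'Speed|Duplex'
```

If any NIC or switch port shows 100Mbps or half duplex, that single hop explains the ceiling regardless of proxy placement.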

topry
Enthusiast
Posts: 49
Liked: 1 time
Joined: Jan 07, 2011 9:30 pm
Full Name: Tim O'Pry
Contact:

Re: V6 - Replication initial seed performance question

Post by topry » Dec 09, 2011 9:35 pm

Same sample VM, 20GB disk w/16.1GB in use:

Source proxy using san, target proxy using hotadd:
Processing rate: 11MB/s
Primary bottleneck: Target
Busy: Source 11% > Proxy 71% > Network 97% > Target 99%

Single proxy on the target, source side using nbd, target side using hotadd:
Processing rate: 20MB/s
Primary bottleneck: Target
Busy: Source 21% > Proxy 54% > Network 95% > Target 98%

tsightler
VP, Product Management
Posts: 5383
Liked: 2217 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: V6 - Replication initial seed performance question

Post by tsightler » Dec 09, 2011 9:42 pm

So it's not the network that is the bottleneck in this scenario; it is the target disks. Notice they are at 98 and 99% (this is obviously causing data to queue in the network, pushing that number up as well). Are you replicating thin disks? What type of RAID controller? Is it battery backed, and do you have the write-back cache enabled?

topry
Enthusiast
Posts: 49
Liked: 1 time
Joined: Jan 07, 2011 9:30 pm
Full Name: Tim O'Pry
Contact:

Re: V6 - Replication initial seed performance question

Post by topry » Dec 09, 2011 10:19 pm

In those scenarios, the primary bottleneck is the target. If I use a single proxy on the source (initial tests), then the primary bottleneck was typically network. The percentages vary depending on configuration, but overall performance (limited testing) was comparable.

Sources are Dell R710s with a PERC H700 (yes, battery backed and write cache enabled).
Target is an Intel SRCSASJV (this is an older system that we re-purposed for the replication target). I cannot recall if it has the battery backup option; the Intel RAID Web Console doesn't connect under vSphere, so it would require a reboot to check, which I will do during the next maintenance cycle. The vSphere hardware tab doesn't show a battery for the controller, so my guess would be it does not.

Disks are thin provisioned.
SAN is a Dell 3200i dual controller with SAS drives.

Gostev
SVP, Product Management
Posts: 24473
Liked: 3413 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: V6 - Replication initial seed performance question

Post by Gostev » Dec 09, 2011 11:50 pm

Well, a single proxy will always show the bottleneck as network, because without a target proxy there is no awareness of how the target storage behaves. From the perspective of the product engine, it waits for the network writer to accept more data (but it cannot know that the network, in turn, is also waiting - for the target disks).

As Tom correctly pointed out, your primary bottleneck is the target storage speed.

tsightler
VP, Product Management
Posts: 5383
Liked: 2217 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: V6 - Replication initial seed performance question

Post by tsightler » Dec 10, 2011 4:42 am

The reason I suspect that the target RAID does not have battery backed cache with write-back enabled is that VMFS write performance on RAID without write-back is universally known to be horrible.
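
One way to confirm the write-through penalty is a synchronous write test from inside a guest whose disk lives on the target datastore; without battery-backed write-back cache, per-block-synced writes typically land in the single-digit-to-low-tens of MB/s. A minimal sketch, assuming a Linux guest with GNU dd (the file path is a placeholder):

```shell
# Synchronous write test: oflag=dsync forces a data sync on every block,
# so the reported rate reflects the RAID controller's cache policy
# (write-back vs. write-through) rather than the OS page cache.
dd if=/dev/zero of=./ddtest.bin bs=1M count=64 oflag=dsync
rm -f ./ddtest.bin
```

Running the same test in a guest on the PERC H700 source storage would give a direct before/after comparison of the two controllers.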

topry
Enthusiast
Posts: 49
Liked: 1 time
Joined: Jan 07, 2011 9:30 pm
Full Name: Tim O'Pry
Contact:

Re: V6 - Replication initial seed performance question

Post by topry » Dec 10, 2011 12:40 pm

I appreciate your input (and time) - I had not considered that, nor was I aware of the impact on VMFS without a caching controller.
