Host-based backup of VMware vSphere VMs.
lars@norstat.no
Expert
Posts: 110
Liked: 14 times
Joined: Nov 01, 2011 1:44 pm
Full Name: Lars Skjønberg
Contact:

Qlogic FC HBA capped on proxy

Post by lars@norstat.no »

I have discovered that the Qlogic FC HBA in my physical Veeam proxy is capped at about 25% utilization.

The HBA is 8 Gbit and I should get at least 50%, as I believe the utilization graph is for full duplex and I'm only reading data.

There is no difference whether I use the proxy itself as the backup target or replicate to an ESXi host; I still get the exact same speed.

The server is an IBM x3650 M3 with 2 x 4 cores with Hyperthreading and 24 GB of RAM. Local storage is a 13-disk RAID 5. The server is running Windows Server 2008 R2 SP1.

This is what I have tried so far:

Latest Veeam update
Latest Windows updates
Latest Qlogic Firmware
Latest Qlogic Storport driver
Latest IBM firmware for the machine
Latest IBM MPIO drivers
Activated 64 bit DMA in registry
Set queue depth to different values (see the registry sketch after this list)
Moved the card to a different PCIe slot
Tried a different card of the same make and model
Checked that the PCIe bus is running at the right speed
Cleaned the SFP on the fibre switch
Cleaned the cable and HBA
Tried a brand new FC cable
Checked local disk speed (good)
Tested bus speed internally on the Veeam proxy
Tested 10 Gbit card (higher speed)
Checked BIOS settings
Checked for errors on the fibre switch and HBA
Checked speed between storage and ESXi host (full speed, double that of the Veeam server)
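
For reference, the queue depth change on the QLogic STOR miniport driver is a registry value along these lines (just a sketch, assuming the ql2300 service name; the exact service name and the maximum qd value depend on the driver version, and a reboot is required afterwards):

Code: Select all

reg add "HKLM\SYSTEM\CurrentControlSet\Services\ql2300\Parameters\Device" /v DriverParameter /t REG_SZ /d "qd=254" /f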


Is there anything else someone can think of that I haven't tried? It seems to me that the speed is capped somewhere on the Windows machine and I can't find out where.

Any help, no matter how obvious, would be appreciated :-)
Vitaliy S.
VP, Product Management
Posts: 27115
Liked: 2720 times
Joined: Mar 30, 2009 9:13 am
Full Name: Vitaliy Safarov
Contact:

Re: Qlogic FC HBA capped on proxy

Post by Vitaliy S. »

Hi Lars,

Can you please tell me what the current job bottleneck stats are? Could it be that either the source or the target side cannot send/receive more data than you're actually transmitting through the fibre?

Thanks!
lars@norstat.no
Expert
Posts: 110
Liked: 14 times
Joined: Nov 01, 2011 1:44 pm
Full Name: Lars Skjønberg
Contact:

Re: Qlogic FC HBA capped on proxy

Post by lars@norstat.no »

Stats are:
Source 38%
Proxy 11%
Network 38%
Target 96%

Now I know what you are going to say, but those stats can't be right, because I know that the target is able to receive much more.

It's a brand new Storwize v5000 using round robin, and from the ESXi host I'm getting about 1100 MB/s.

Even if that were the case, I should be able to get more when copying data from the source SAN directly to the Veeam server, but I don't.

So I really don't think there is a problem with Veeam here; I think the problem lies in the combination of FC HBA and Windows. But since you are the experts, I thought I would ask you as well.

I already have support cases open with IBM and VMware, and I'm thinking of opening one with Microsoft as well.

I'm just at a loss here ...

I have tested every single component individually and know that everything works at near link speed. The only thing that is slow is transferring data via Fibre Channel to the Windows host, no matter whether I save the data right there or send it via the 10 Gbit adapter to the target ESXi host. The exact same speed either way.
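
(For reference, a read-only sequential benchmark of the SAN-attached LUN on the Windows proxy can be done with something like Microsoft's diskspd; this is only a sketch, the physical drive number #2 is a placeholder for the VMFS LUN, and -w0 keeps the test strictly read-only:)

Code: Select all

diskspd.exe -b512K -d30 -o8 -t2 -Sh -w0 #2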
Vitaliy S.
VP, Product Management
Posts: 27115
Liked: 2720 times
Joined: Mar 30, 2009 9:13 am
Full Name: Vitaliy Safarov
Contact:

Re: Qlogic FC HBA capped on proxy

Post by Vitaliy S. »

lars@norstat.no wrote:Now I know what you are going to say, but those stats can't be right, because I know that the target is able to receive much more.
It is not something that I wanted to say, but numbers don't lie ;)
lars@norstat.no wrote:There is no difference whether I use the proxy itself as the backup target or replicate to an ESXi host; I still get the exact same speed.
Do you use HotAdd or NBD mode at the target proxy to write VM data to the datastore?
lars@norstat.no
Expert
Posts: 110
Liked: 14 times
Joined: Nov 01, 2011 1:44 pm
Full Name: Lars Skjønberg
Contact:

Re: Qlogic FC HBA capped on proxy

Post by lars@norstat.no »

I use NBD mode.

I also tried something different now ... I was running one backup job and 10 replication jobs at the same time ... This time the FC interface jumped to an impressive 67% utilization!

BUT the 10 Gbit adapter is still hovering around 14-21% ...

So now I'm back to thinking that maybe NBD is not working and the VMware server can't receive data fast enough. This contradicts all the other tests I have done, but why not ... As you say (although you didn't want to), numbers don't lie.

Because when the stats say "target", that really means NBD, since Veeam has no control over the data path from the ESXi host to the SAN.

Then the question is:

Is it a Veeam, Windows, 10 Gbit adapter or VMware problem? :-(
dellock6
VeeaMVP
Posts: 6139
Liked: 1932 times
Joined: Jul 26, 2009 3:39 pm
Full Name: Luca Dell'Oca
Location: Varese, Italy
Contact:

Re: Qlogic FC HBA capped on proxy

Post by dellock6 »

Lars, I'm having a hard time following you. In the first post you say the 8 Gbit FC card seems to be capped, then here you say the cap seems to be on the 10 Gbit adapter (an Ethernet one?). Can you please explain what the FC connection is connecting? Is the v5000 a DAS for the Veeam proxy, or is it the production datastore you extract data from when running backups?

Thanks,
Luca.
Luca Dell'Oca
Principal EMEA Cloud Architect @ Veeam Software

@dellock6
https://www.virtualtothecore.com/
vExpert 2011 -> 2022
Veeam VMCE #1
lars@norstat.no
Expert
Posts: 110
Liked: 14 times
Joined: Nov 01, 2011 1:44 pm
Full Name: Lars Skjønberg
Contact:

Re: Qlogic FC HBA capped on proxy

Post by lars@norstat.no »

Yes, I'm sorry this thread became a little messy. As I wrote in my last post, I tried something new. Let me recap:

First I tried doing only one or two replication jobs to test the speed. This speed was pretty low for 8 Gbit FC. Since I also tested with another program that copies the file directly from the SAN in the same way Veeam does when using SAN mode, and got the same speed, I naturally assumed there was a problem related to reading directly from the SAN with a Windows machine, more specifically the Qlogic card.

After I made this post I tested something different. This time I was running about 10 replication jobs and a backup job at the same time, and I periodically saturated the FC link completely, which means there is no problem at the source end. The reason it is only periodic is that when the Veeam server is sending the data to the target, it runs at an almost steady 2 Gbit/s. Since Veeam always reads more data than it actually transfers, I was still able to saturate the FC link.

In other words, my first test was flawed and led me to the wrong conclusion.

My new tests show that the 10 Gbit Ethernet link from the Veeam server to the target ESXi host is the bottleneck. And it should not be.

This is my setup:

Source SAN = Storwize v7000 with very little load
Link between source SAN and Veeam server = 8 Gbit FC
Veeam Server = IBM 3650 M3, 8 Cores with hyperthreading, 24 GB RAM
Link between Veeam Server and target ESXi host = Emulex 10 Gbit ethernet
Target ESXi Host = IBM x3850 X5, 40 cores with Hyperthreading, 768 GB RAM (does nothing other than receive replicas)
Link between target ESXi and target SAN = 16 Gbit FC
Target SAN = Storwize v5000 (does nothing other than receive replicas)

I have run tests from the VMware side that show I can write data sustained at about 1100 MB/s.

So I think this is about NFC/NBD again. There were a lot of problems with this before, but my understanding was that all throttling on the management interface was gone with 10 Gbit ... It seems not :(

I have done the obvious things on the 10 Gbit adapters like firmware, jumbo frames, RSS, TCP offloading, TCP window size, etc. (a rough sketch of these checks is below).
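
The quick checks look roughly like this on the Windows 2008 R2 proxy (just a sketch; interface names and exact output vary):

Code: Select all

netsh int tcp show global
netsh interface ipv4 show subinterfaces

The first command shows the RSS, chimney offload and autotuning state; the second shows the MTU per interface, which should reflect the jumbo frame size.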

I have also done Google searches for everything I could think of, for about three weeks, 16 hours a day now, and I can't seem to crack this one, although I have found several hundred people with the same NFC/NBD problems.

A virtual proxy with hot add is not an option for me.

Thanks for any help you can provide :-)
Andreas Neufert
VP, Product Management
Posts: 6748
Liked: 1408 times
Joined: May 04, 2011 8:36 am
Full Name: Andreas Neufert
Location: Germany
Contact:

Re: Qlogic FC HBA capped on proxy

Post by Andreas Neufert »

Hi, thanks for the recap.

NBD mode (read and write) is capped by the VMware management interface; you can use around 40% of the maximum interface speed. At 10 GbE that is, if I calculated correctly, about 512 MByte/s (10 x 1024 / 8 x 40%).

Incremental job runs are not good for measuring this because of their random read pattern, so please always use the initial run for testing.

Parallel processing can help saturate the vmkernel beyond 40%.

Writing from VMware directly to the storage has no such limitation.
You said your storage system has a tested speed of 1100 MB/s with sequential writes.
Do you have 2 ESXi hosts on the target side?
Two jobs, each pointing to a different ESXi host, can theoretically bring you to this speed.
But your storage system cannot handle 2 different streams at the same speed as one sequential run.

Anyway, after the first initial full transfer, the random, changed block tracking based reads at the source become the bottleneck, so you need parallel processing to saturate the target at 40% of vmkernel speed.

Check the speed with the initial fulls, but focus more on parallel processing of the incremental runs.

Potentially HotAdd mode can help increase performance, but it comes with a triple-I/O penalty on the target; if you have many spindles this can help. As you said, this is not an option for you, and at your desired 1100 MB/s it is not the best option anyway because of the I/O penalty at the replica target.
lars@norstat.no
Expert
Posts: 110
Liked: 14 times
Joined: Nov 01, 2011 1:44 pm
Full Name: Lars Skjønberg
Contact:

Re: Qlogic FC HBA capped on proxy

Post by lars@norstat.no »

Hi Andreas.

Thanks for your post. This is the first time I have ever been able to confirm that there actually is a cap, and what that cap is. I have opened a support case with VMware to ask them to lift the cap. This server is used only for this, and has a separate management interface for actual management, so this cap is just useless for me.

I think my storage can do more; it's a limitation of the interface. The disks can handle more streams, I think, but are still limited to the interface speed.

I only have one server at the DR site, with two HBAs running round robin with the 1 IOPS per path setting.
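
For reference, the 1 I/O per path round robin setting is applied per device on the ESXi host with something like this (a sketch only; naa.XXXX is a placeholder for the actual device identifier):

Code: Select all

esxcli storage nmp psp roundrobin deviceconfig set --device=naa.XXXX --type=iops --iops=1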
Andreas Neufert
VP, Product Management
Posts: 6748
Liked: 1408 times
Joined: May 04, 2011 8:36 am
Full Name: Andreas Neufert
Location: Germany
Contact:

Re: Qlogic FC HBA capped on proxy

Post by Andreas Neufert »

lars@norstat.no wrote: I have opened a support case with VMware to ask them to lift the cap. This server is used only for this, and has a separate management interface for actual management, so this cap is just useless for me.
I believe that is hard coded, but if you get a solution from VMware, please post it here.
I found no real documentation about this limitation, but my colleague did some tests and we saw it in the monitoring graphs.
dellock6
VeeaMVP
Posts: 6139
Liked: 1932 times
Joined: Jul 26, 2009 3:39 pm
Full Name: Luca Dell'Oca
Location: Varese, Italy
Contact:

Re: Qlogic FC HBA capped on proxy

Post by dellock6 »

Yes, I can confirm the cap is definitely there, but I never managed to find any doc stating it; all the numbers and speculations we came up with were always based on real tests.
I'm joining Andreas in the request for feedback from VMware; it would be awesome to finally have a proper statement from them.

Luca.
Luca Dell'Oca
Principal EMEA Cloud Architect @ Veeam Software

@dellock6
https://www.virtualtothecore.com/
vExpert 2011 -> 2022
Veeam VMCE #1
lars@norstat.no
Expert
Posts: 110
Liked: 14 times
Joined: Nov 01, 2011 1:44 pm
Full Name: Lars Skjønberg
Contact:

Re: Qlogic FC HBA capped on proxy

Post by lars@norstat.no »

So I have been on the phone with VMware doing all sorts of tests that show there is no problem with the hardware or the operating system. I have also set up a "speedy" backup job to the local backup server, which has about half the speed of my replication storage, and there I get about 500 MB/s ... Kind of OK, but with my setup I should be much closer to 1 GB/s (that's bytes, not bits) when doing replication ...

VMware's answer about the cap is: "There is NOT any cap on the management port, nor has there ever been a cap, and there is no cap planned for the future."

Now, after the tests, the VMware guys can't answer what the problem is, and they are asking me to send them detailed information about my setup with screenshots so that they can open a bug case on this. They say that backend engineering will set up the same equipment at their end and do tests, and they believe that will take months.

I am kind of surprised that no one, Veeam included, has gone this far with VMware before. This is not a small problem, and I guess that all replication providers have the same issue. There are a lot of forums around with long threads where people have had the same problem since ESX 3.5 ... Why has no one been able to solve this before?

I have also seen a PowerPoint presentation, from Veeam or someone selling Veeam solutions, with a screenshot of a task manager showing above 60% utilization during a replication job using 10 Gbit adapters.

So what do you guys think? Could it still be a Veeam problem, or is it indeed a VMware bug?
dellock6
VeeaMVP
Posts: 6139
Liked: 1932 times
Joined: Jul 26, 2009 3:39 pm
Full Name: Luca Dell'Oca
Location: Varese, Italy
Contact:

Re: Qlogic FC HBA capped on proxy

Post by dellock6 »

I'm speaking for myself and not as a Veeam employee, but I have always seen this limit on every file operation involving the management interface. After my reply I did some tests in my lab on the speed of a backup in NBD mode from a Fusion-io datastore (so the source cannot possibly be the bottleneck), and on my 1 Gbit connection I always saw around 40 MB/s. The theoretical limit being 125 MB/s, and a little less in real scenarios, this is around 40%, as Andreas said.
I never saw an official statement from VMware regarding the cap, but if you search around there are so many forums talking about this, not only Veeam.
The fact that their support says there is no cap doesn't mean it's not there; otherwise, why is everyone seeing this behaviour?

I don't know what steps we did in the past to discover this issue.

Luca.
Luca Dell'Oca
Principal EMEA Cloud Architect @ Veeam Software

@dellock6
https://www.virtualtothecore.com/
vExpert 2011 -> 2022
Veeam VMCE #1
lars@norstat.no
Expert
Posts: 110
Liked: 14 times
Joined: Nov 01, 2011 1:44 pm
Full Name: Lars Skjønberg
Contact:

Re: Qlogic FC HBA capped on proxy

Post by lars@norstat.no »

I'm not even getting that ... If I run replication jobs I never get more than 21-23%.

If I use the datastore browser I get about 4% ...

VMware said they were trying to think of another way of running NFC traffic against the management port, but they could not come up with anything.

Then there is that presentation from Veeam ... I was wrong, it was at exactly 40% in fact :)
emachabert
Veeam Vanguard
Posts: 388
Liked: 168 times
Joined: Nov 17, 2010 11:42 am
Full Name: Eric Machabert
Location: France
Contact:

Re: Qlogic FC HBA capped on proxy

Post by emachabert »

Hi,

Did you try this kind of setup (entering the ESXi host through a standard interface to an on-host Veeam proxy, which connects to the host management port internally)?
[Image: diagram of the suggested on-host proxy setup]
I often do that and the throughput is good (I generally saturate the inbound interface).

And try to compare with the same kind of I/O pattern. 1100 MB/s is common when dealing with sequential reads/writes, but it is much harder to achieve with a random pattern (I do not know what the v5000 configuration is...). The first full is sequential; every subsequent incremental is random.

Eric.
Veeamizing your IT since 2009/ Veeam Vanguard 2015 - 2023
lars@norstat.no
Expert
Posts: 110
Liked: 14 times
Joined: Nov 01, 2011 1:44 pm
Full Name: Lars Skjønberg
Contact:

Re: Qlogic FC HBA capped on proxy

Post by lars@norstat.no »

Hi, and thanks for the tip ... I will try this.

I can't see the whole drawing there, but I think I know what you mean ... one extra link in the chain, but if it can push at full speed it would be worth it.

The physical server would be the "remote proxy" and the virtual machine would be the "local proxy".
veremin
Product Manager
Posts: 20284
Liked: 2258 times
Joined: Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin
Contact:

Re: Qlogic FC HBA capped on proxy

Post by veremin »

Not sure why the picture isn't shown correctly in your case, but you can use the direct link and see it as a whole. Thanks.
lars@norstat.no
Expert
Posts: 110
Liked: 14 times
Joined: Nov 01, 2011 1:44 pm
Full Name: Lars Skjønberg
Contact:

Re: Qlogic FC HBA capped on proxy

Post by lars@norstat.no »

When I did this I got even less ... EXACTLY 1 Gbit ... That's suspicious ... Why exactly 1 Gbit? Could there be a Veeam mechanism that adjusts the flow of data if it thinks I only have a 1 Gbit adapter?
Gostev
Chief Product Officer
Posts: 31538
Liked: 6710 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Qlogic FC HBA capped on proxy

Post by Gostev »

We have no such mechanisms.
emachabert
Veeam Vanguard
Posts: 388
Liked: 168 times
Joined: Nov 17, 2010 11:42 am
Full Name: Eric Machabert
Location: France
Contact:

Re: Qlogic FC HBA capped on proxy

Post by emachabert »

You should double-check the whole chain; something is limiting the stream somewhere.
Also check the switch counters to see if flow control is kicking in at some point.
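
On the ESXi side, the pause (flow control) settings of the uplink can be checked with something like this (a sketch; vmnic0 is a placeholder for the 10 Gbit uplink):

Code: Select all

esxcli network nic get -n vmnic0

The output includes the Pause RX / Pause TX settings for that NIC.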
Veeamizing your IT since 2009/ Veeam Vanguard 2015 - 2023
Gostev
Chief Product Officer
Posts: 31538
Liked: 6710 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Qlogic FC HBA capped on proxy

Post by Gostev »

I too suspect that one of the ports simply negotiated down from 10 Gb to 1 Gb... we commonly see this in support when 1 Gb ports drop to 100 Mb due to cabling issues.
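
A quick way to rule that out on the ESXi side (just a sketch; the Windows end can be checked in the adapter status dialog):

Code: Select all

esxcli network nic list

The Speed column shows the negotiated link speed per vmnic; the 10 Gbit uplink should report 10000.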
lars@norstat.no
Expert
Posts: 110
Liked: 14 times
Joined: Nov 01, 2011 1:44 pm
Full Name: Lars Skjønberg
Contact:

Re: Qlogic FC HBA capped on proxy

Post by lars@norstat.no »

There is no switch involved, except the virtual one of course :-) There is only a 10 Gbit adapter at both ends with a DAC cable between them. When I use LANBench between the physical Veeam proxy and the virtual ones I have set up as in your diagram, I get 99% of the bandwidth, so there is nothing wrong with the network itself.
emachabert
Veeam Vanguard
Posts: 388
Liked: 168 times
Joined: Nov 17, 2010 11:42 am
Full Name: Eric Machabert
Location: France
Contact:

Re: Qlogic FC HBA capped on proxy

Post by emachabert »

You had my attention, you now have my curiosity :-) :-)

Just to be sure we are on the same page: we are talking about the first replication run (sequential writes)?
What is the configuration of the v5000? Perhaps you are hitting its maximum random-write throughput.

Even if hot-add is not an option for you, have you tried, for testing purposes, using the on-host proxy in virtual appliance mode?
Is this virtual machine using VMXNET3 adapters?
Veeamizing your IT since 2009/ Veeam Vanguard 2015 - 2023
lars@norstat.no
Expert
Posts: 110
Liked: 14 times
Joined: Nov 01, 2011 1:44 pm
Full Name: Lars Skjønberg
Contact:

Re: Qlogic FC HBA capped on proxy

Post by lars@norstat.no »

Yes, I delete a few test replicas every time so that it is the first run every time. It does not matter whether it's 1 or 20 VMs at the same time. The storage is fine; I have tested with other tools and sequential write is about 1200 MB/s sustained. The FC interface is the limiting factor there.

Yes, when I say that hot-add is not an option for me, I don't mean I can't actually use it. I have tested it before, and the problem is that it's buggy, creates enormous amounts of I/O (a stupid amount, in fact) and is very slow to start the actual job. What we are trying to do is get this as close to real time as possible, so that the replicas are as fresh as possible if something happens. I need the bandwidth even though it's incremental, because there are about 13 servers replicating all the time and I would like to add more if I can get it to work, possibly about 70 VMs. The speed once it gets going, for the first replication and indeed the second as well, is good using hot-add though.

If I had a lot of jobs, the I/O itself would kill the SAN completely, as it did to the old one ...

Yes, they are using VMXNET3 adapters, and both machines use jumbo frames and have chimney/RSS enabled. They have 12 GB of RAM and 8 CPUs each, and the server they are running on has 40 cores, so I could add more if needed. The VM server has 768 GB of RAM and was clean-installed with vSphere 5.5 before starting the replication.
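
(For reference, one way to verify the jumbo frame path end to end from the ESXi host is something like the following; just a sketch, where the target IP is a placeholder, -d disables fragmentation, and 8972 bytes leaves room for the IP/ICMP headers:)

Code: Select all

vmkping -d -s 8972 10.10.10.1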


To sum up:

Source SAN to Veeam Proxy 1 = full speed (16 Gbit)
Veeam Proxy 1 to Veeam Proxy 2 = full speed (2 x 10 Gbit)
ESXi host running Veeam Proxy 2 to target SAN = full speed (16 Gbit)

Running actual replication that uses the management port from the outside = 2 Gbit
Running actual replication that uses the management port from the inside = 1 Gbit
tsightler
VP, Product Management
Posts: 6011
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Qlogic FC HBA capped on proxy

Post by tsightler » 1 person likes this post

Hi Lars! I expect that part of the reason no one has ever pursued this is that this is really a very unusual use case for Veeam replication, i.e. attempting to get many servers running with continuous replication. That was never really the target market as there are many products on the market that already offer this level of RPO if that's what you need. Far more common are customers that are replicating every hour, every 4 hours, or every 24 hours, although they may have a couple of machines here or there. In other words, you are really pushing the product to the limits, not that there's anything wrong with that, but you're going to hit issues that are simply not normal for any other real world deployment. That doesn't make it not interesting though!
Stats are: Source 38% Proxy 11% Network 38% Target 96%

Now I know what you are going to say, but those stats can't be right, because I know that the target is able to receive much more.
Well, you're right, I am going to say something about these numbers, as I think it's important to remember what these stats are really showing. Effectively, these stats are a measure of how much time we spent waiting at the various points in the processing chain. When using NBD mode on the target, this isn't a measure of disk performance; it's a measure of how much time the write process spent waiting for writes to complete, which means the bottleneck could be anywhere in the entire stack. This includes the physical network connection between the proxy and the ESXi management port, which isn't likely to be the limiting factor in this case, but it also includes the NFC and NBD network stacks within the management interface, and the overhead of VMFS writes via the management stack, which requires sync writes and locking due to the high number of metadata updates when writing via this method.

So I have a couple of questions. Have you tried running two different replications from two different source VMs to two different management ports? Does the performance double, or does it stay the same overall and just split between the two? Also, if you run esxtop, what type of load are you seeing on CPU0, which normally handles all of the interrupts for the service console? That's another factor: the service console is effectively single-CPU, so it's possible that you may be saturating the console CPU. The NFC/NBD protocol was never really intended for mass data transfers.
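
For example, a batch-mode capture along these lines should be enough to see the per-CPU load (just a sketch; adjust the 5-second interval and the number of samples to cover a full replication cycle, and the resulting CSV can be opened in perfmon or a spreadsheet):

Code: Select all

esxtop -b -d 5 -n 60 > /tmp/esxtop-capture.csv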

Also, I'd strongly suggest measuring the maximum rate at which you can write data to the VMFS volume via the service console. This is usually quite a bit slower than writing from within a VM due to the slower I/O path from the service console and the extra locking and sync required (the I/O path from virtual worlds is highly optimized for performance). Change directory to the datastore in question and run something like:

Code: Select all

time dd if=/dev/zero of=test.dd bs=1M count=100000
This will write a file of 100GB of zeros as fast as possible via the service console to the datastore and print out the time required to do so. Divide 100GB by the number of seconds reported to get a raw MB/s rating that shows the maximum performance you could reasonably expect writing to the VMFS volume from the service console.

You may also want to check the following thread regarding improving restore performance by disabling the hardware zeroing as the overall logic used for writing replication data is exactly the same as used during a restore:

http://forums.veeam.com/vmware-vsphere- ... 92-15.html

Looking forward to seeing your results; let's see how far we can push it.
lars@norstat.no
Expert
Posts: 110
Liked: 14 times
Joined: Nov 01, 2011 1:44 pm
Full Name: Lars Skjønberg
Contact:

Re: Qlogic FC HBA capped on proxy

Post by lars@norstat.no »

When I was talking to the sales manager for Veeam before we bought this, and to the reseller as well, it was clear they were selling a near-realtime replication solution, and they provided use cases showing people using it for that with more servers than I have. They were also pushing this sales pitch at VMworld that year, I believe. When I bought the solution we already had a backup product and only started using Veeam for backup later. If I was misled during the sales phase, that would be concerning. There was no doubt about what we were after from our side.

I will have a look at the material you have provided, do the test you suggested, and come back to you.
tsightler
VP, Product Management
Posts: 6011
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Qlogic FC HBA capped on proxy

Post by tsightler »

Veeam certainly markets the "continuous" scheduling feature of our replication jobs as "Near CDP". Using the term "Near CDP" indicates that we are not a CDP product, but by scheduling jobs to run in a continuous fashion it's possible to achieve results that are "near" to those of traditionally very expensive and sometimes complex CDP style products. It's certainly possible to create a replication job with 70 VMs and run it continuously which, for an environment like yours, would likely provide RPOs in the 15-30 minutes range, which is "near" CDP, in comparison to daily replication.

However, you seem to be focused on achieving more than that, and that will be quite difficult using our snapshot-based technology, and it's unlikely that bandwidth will be the limiting factor. Sure, for the initial replication, bandwidth is key, but once you start performing incremental replication, the amount of time spent transferring data quickly becomes a small fraction of the time spent processing the snapshot replication. The rest of the time is spent communicating with vCenter, taking snapshots, removing snapshots, removing old restore points, etc. Normally, even for a VM with very few changes, it will take 60-90 seconds to perform all of this additional processing, which is probably going to be more than the actual transfer since, if you're replicating "continuously", most VMs will typically have only a few dozen or perhaps a hundred MB of changed data each cycle.

So if a VM has 400MB of changed data, and I can do 400MB/s transfer, the transfer process itself will only take 1 second, but the job will likely spend 60 seconds processing snapshots and restore points. If transfer time isn't the bottleneck, it doesn't do a lot of good attempting to optimize that beyond the already good performance.

Of course, it's possible you have some very high change rate VMs that produce significantly more data per cycle, but even if they produce 10GB every 5-minute cycle (a very high rate of change for such a short time period), at 400MB/s (the numbers you are already getting) that's only 25-30 seconds for the actual transfer, so yet again transfer time is a relatively small fraction of the replication cycle. Sure, being able to get to the theoretical 1100MB/s might cut that to 10 seconds, which would be great, but it would leave the RPO time largely unchanged.

I guess if all 70 of your VMs are very high change rate I could perhaps see the advantage of really pursuing this, however, that would be quite rare indeed. Typically VMs replicating every few minutes will have only a few hundred MBs max, so I'm guessing transfer will be a very, very small fraction of the processing time, and that's why most people aren't going to spend a lot of time if they are already getting 400MB/s.

So I'm only suggesting why customers might not have pursued the 400MB/s limit previously, because they likely didn't find this being a major limiting factor to their replication (in many cases they are replicating across distance so it's not even close to the limiting factor). As far as why there might be a limit, my guess is that it's limited by the TCP/IP implementation on the ESXi service console because the service console has never been optimized for throughput. My guess is they simply don't have enough buffer space to keep the TCP flow optimized for 10GbE. Just a shot in the dark though as I don't have access to any systems fast enough to test this at the moment.
lars@norstat.no
Expert
Posts: 110
Liked: 14 times
Joined: Nov 01, 2011 1:44 pm
Full Name: Lars Skjønberg
Contact:

Re: Qlogic FC HBA capped on proxy

Post by lars@norstat.no »

I did the test and got a really bad result of only 179 MB/s, so that is probably the core of the problem. It looks like several CPUs are involved in the process, but it's still slow.

I tried hot-add again, and although the speed is a little better, it's still incredibly unstable. Suddenly everything stops and then picks up a bit later ... unusable IMO.
lars@norstat.no
Expert
Posts: 110
Liked: 14 times
Joined: Nov 01, 2011 1:44 pm
Full Name: Lars Skjønberg
Contact:

Re: Qlogic FC HBA capped on proxy

Post by lars@norstat.no »

You are right, of course; I sometimes pursue things that might not matter that much, but I still wanted the original jobs for all my VMs to go faster. We will see where the case with VMware goes; if nothing comes of it, I will give this up.

This thread is no longer about the original subject, but let's talk about processing time. You say 60 seconds ... I WISH. If it were 60 seconds and my transfer time was 30 seconds, I would be so happy. But unfortunately it's much, much longer, and I see a lot of time spent processing different things.

Once I asked what you guys were checking for, then asked my colleague to find the same data manually by looking in the vSphere client and read it to me, and it took half the time ...

I got this time down a little bit by denying Veeam access to servers not involved in the replication in vCenter, but this strategy was broken with version 7, where Veeam demands full admin access to everything in vCenter.

If I could get the processing time down, that would be great.
tsightler
VP, Product Management
Posts: 6011
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Qlogic FC HBA capped on proxy

Post by tsightler »

lars@norstat.no wrote:You are right, of course; I sometimes pursue things that might not matter that much, but I still wanted the original jobs for all my VMs to go faster. We will see where the case with VMware goes; if nothing comes of it, I will give this up.
I can understand that, I'm somewhat the same. I want to understand why I can't get the full performance I was expecting. I'm only pointing out how small of a factor this should be in a typical scenario like yours.
lars@norstat.no wrote:This thread is no longer about the original subject, but let's talk about processing time. You say 60 seconds ... I WISH. If it were 60 seconds and my transfer time was 30 seconds, I would be so happy. But unfortunately it's much, much longer, and I see a lot of time spent processing different things.
Threads go that way sometimes. :D

Those numbers were really just "best case" examples, since their purpose was to show how the bandwidth is a small fraction of the time even in this theoretical "best case". More realistic "real-world" per-VM numbers for replication are probably 4-5 minutes best case, although since some operations occur in parallel, the average time per VM might be less (for example, if I have 5 VMs processing in parallel and they all take 5 minutes, that's still only 1 minute per VM on average). What types of processing times are you seeing per VM?

How do you have your jobs set up? Are you perhaps doing one job per VM? I see that commonly for replication, but it adds significant processing time per VM and quite a bit of additional load on the vCenter server.