Much VM's failing / multipath issues

Post by **JorisK** » Jan 28, 2010 12:14 pm

Hi,

1) We back up about 150 VMs each night in total, and per queue (we run four concurrent queues) about 30% usually failed with a strange error message. After that, three more retry jobs picked up the failed VMs, and in the end we usually had two or three VMs that couldn't be backed up during the night. We received lots of different error messages about the failing VMs. We tried VCB and vStorage, but usually with no luck: every night a lot of VMs failed to back up.
I've found this is caused by multipathing: we have 4 paths to the storage system that carries the VM data. I blocked three of the four paths and now the jobs complete with 100% success. Is this a known issue?

2) I back up to local storage. I can set the cache settings to 100% write / 0% read, 0% write / 100% read, or some values in between. Which one is recommended?

Post by **Gostev** » Jan 28, 2010 12:28 pm

Hello Joris,

1. Based on all the feedback I have to date, I understand that multipathing can cause issues with certain SAN makes and models. With some it works fine, with others there are issues. This is in line with the former VMware multipathing support statement for VCB: support was limited, and they had a table of supported SAN devices and multipathing software versions.

For vStorage API, I have since received at least one success report on these forums for the following SANs:
• Dell/EMC CX4
• Dell EqualLogic
• IBM DS3300 iSCSI

It was also reported a few times before that customers were able to resolve multipathing issues by upgrading the multipathing software to the latest version. For example, this was the case with the Dell/EMC CX4 SAN and its PowerPath software.

2. I would recommend 50%/50%, as synthetic backup does both reads and writes... maybe even 66% write / 33% read, as writes are twice as common (read replaced block from VBK, write replaced block to VRB, write new block to VBK). Pure theory though, I am not a big storage guy.

Thanks!
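The 66/33 estimate above is just a count of the three I/O operations listed for each replaced block. Here is a minimal Python sketch of that arithmetic. The names `OPS_PER_REPLACED_BLOCK` and `cache_split` are made up for illustration, and the model assumes every operation costs the same, which real storage will not guarantee.

```python
from fractions import Fraction

# I/O operations per replaced block during a synthetic backup,
# exactly as listed in the post above (illustrative model only).
OPS_PER_REPLACED_BLOCK = [
    ("read", "replaced block from VBK"),
    ("write", "replaced block to VRB"),
    ("write", "new block to VBK"),
]


def cache_split(ops):
    """Return (write_share, read_share) as exact fractions of all operations."""
    writes = sum(1 for kind, _ in ops if kind == "write")
    reads = sum(1 for kind, _ in ops if kind == "read")
    total = writes + reads
    return Fraction(writes, total), Fraction(reads, total)


write_share, read_share = cache_split(OPS_PER_REPLACED_BLOCK)
print(f"write {float(write_share):.0%} / read {float(read_share):.0%}")
```

Two writes against one read gives a 2/3 to 1/3 split, which is where the rough 66/33 figure comes from. Treat it as a starting point and measure on your own storage.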

Post by **JorisK** » Jan 28, 2010 2:16 pm

Thanks Gostev,

We use an HP EVA 4400 with two controllers and two ports per controller, which makes 4 paths. When I enable all my paths, I still see traffic only to Controller 1, port 1. All the others are idle. Does Veeam use all the paths, or does it simply use only one?

Post by **Gostev** » Jan 28, 2010 2:22 pm

Joris, actually all low-level I/O operations with storage are performed by vStorage API. Veeam Backup operates at a higher level, so we cannot really control which paths are used. We request raw virtual disk data from vStorage API, and it returns it to us.

Post by **JorisK** » Jan 28, 2010 11:16 pm

Thanks Gostev,

One last question:
We have four queues. When all are set to use the vStorage API and all start at the same time, our complete SAN starts to shake! Can you or one of your technicians explain this behaviour?

I stopped two of the four queues and that gave our SAN some breathing room again, but it's still kind of strange, since it should be able to handle this easily. When firing up the queues with about 10 minutes between them, nothing bad happens and the backup runs fine without causing SAN trouble. Can you explain this? What happens on the SAN when four queues start at the same time? Does a backup queue scan all the VMs on the SAN during startup?

Regards,
Joris
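For anyone who wants to reproduce the staggered start described above, this is a minimal Python sketch that computes one start time per queue. The four queues and the 10-minute gap come from the post. The 22:00 first start and the helper name `staggered_starts` are assumptions for illustration. The sketch only calculates times; it does not schedule anything in Veeam Backup.

```python
from datetime import datetime, timedelta


def staggered_starts(first_start, queue_count, gap_minutes):
    """Return one start time per queue, each gap_minutes after the previous one."""
    gap = timedelta(minutes=gap_minutes)
    return [first_start + i * gap for i in range(queue_count)]


# Hypothetical 22:00 first start; queue count and gap are taken from the post.
starts = staggered_starts(datetime(2010, 1, 29, 22, 0), queue_count=4, gap_minutes=10)
for number, start in enumerate(starts, start=1):
    print(f"queue {number}: {start:%H:%M}")
```

The resulting times can then be entered by hand as the schedule for each job.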

Post by **drbarker** » Jan 29, 2010 12:07 am

Hi JorisK,

I don't know much about the EVA 4400 array - I know it's ALUA capable, but does it explicitly need enabling on the box? It sounds like you're experiencing path thrashing during a backup - have you set up multipathing on your Veeam host?

Post by **Gostev** » Jan 29, 2010 1:28 am

Joris, VMs are processed sequentially in every job, and not all at the same time. I am not a storage guru at all, but it sounds quite plausible that your SAN cannot handle the load from 4 concurrent jobs well for some reason. It would be best to open a support case with your storage vendor, as they are in a better position to research why the SAN cannot keep up with such a workload. They typically have special SAN-specific monitoring/support tools for troubleshooting, as well as SAN knowledge we do not have.

Anyhow, I never recommend running multiple concurrent jobs against FC, because even a couple of jobs can usually fully load the processing capacity of an average Veeam Backup server in the case of an FC4 SAN... so running multiple jobs usually gives no benefit other than putting additional stress on the SAN.

Post by **biskitboy** » Jan 29, 2010 2:44 am

Just some more ideas for you:

Check your vmkernel logs for multipathing, SCSI, and similar errors. This can give you insight into what is going on. Check your EVA logs and make sure you don't have vdisks changing mastership.

For EVAs, you are probably ALUA ready. Make sure ESX is aware of this (use the esxcli nmp commands, etc). We typically configure our ALUA-aware arrays to round-robin IO to the primary path. You will have 2 Active (IO) paths toward the primary controller (the one you defined for your vDisk), and 2 Active paths to the secondary controller (no IO though).

I would recommend forcing your vDisks to specific controllers (Path A or B with Failover/Failback). If the EVA sees IO destined for the same vDisk on both controllers, it will increase the likelihood of controller mastership changes, which will probably kick off other problems like SCSI reservation errors, timeouts, etc. during that time. If you set the vDisk to a specific path and use ALUA, you should be golden.
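To make the path layout above concrete, here is a toy Python model of it: two controllers with two ports each, IO round-robined only across the two paths to the controller that owns the vDisk. This is not how ESX NMP or the EVA firmware is implemented. `Path`, `build_paths`, `optimized_paths`, and the controller labels "A" and "B" are all hypothetical names for illustration.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Path:
    controller: str  # illustrative label, e.g. "A" or "B"
    port: int


def build_paths(controllers=("A", "B"), ports=(1, 2)):
    """Enumerate every controller/port combination (2 x 2 = 4 paths)."""
    return [Path(c, p) for c in controllers for p in ports]


def optimized_paths(paths, owning_controller):
    """Paths to the controller that owns the vDisk; only these carry IO."""
    return [p for p in paths if p.controller == owning_controller]


paths = build_paths()
io_paths = optimized_paths(paths, owning_controller="A")

# Round-robin four IOs across the two optimized paths.
selector = cycle(io_paths)
issued = [next(selector) for _ in range(4)]
for path in issued:
    print(f"IO sent via controller {path.controller}, port {path.port}")
```

In this model the two paths to controller B stay idle but available, which matches the "Active, no IO" description above.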

Post by **JorisK** » Jan 29, 2010 8:17 am

Biskitboy: Yes, but forcing a vdisk to a path makes our redundancy go to zero, isn't that correct?

Gostev: We need multiple jobs to get the backup done in one night, since some VMs just don't back up as fast as others (about 200+ VMs).

Post by **Gostev** » Jan 29, 2010 11:51 am

Joris, in that case I recommend adding an additional Veeam Backup server, as opposed to running multiple jobs on the same server.

Post by **drbarker** » Jan 29, 2010 4:48 pm

JorisK wrote: Biskitboy: Yes but forcing a vdisk to a path makes our redundancy go zero, isn't that correct?

Changing the multipathing policy doesn't reduce your redundancy - it just changes the failover characteristics... Chad Sakac wrote an article about multipathing & ALUA in VMware. It's a little EMC-centric, but the general principles apply to EVA arrays too.

http://virtualgeek.typepad.com/virtual_ ... notes.html

HP also has a great document on how to configure an EVA with vSphere that might be helpful - http://h20195.www2.hp.com/v2/GetPDF.asp ... 185ENW.pdf
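The point about redundancy can be sketched in a few lines of Python: preferring one controller changes which paths are tried first, but the other controller's paths are still there when the preferred ones fail. This is a deliberately simplified model under stated assumptions, not the actual ESX or EVA failover logic. The function `usable_paths` and the "A"/"B" controller labels are hypothetical.

```python
def usable_paths(paths, preferred_controller, failed=frozenset()):
    """Return surviving paths, preferring those on the preferred controller.

    Each path is a (controller, port) tuple. If no preferred path survives,
    fall back to whatever paths are still alive on the other controller.
    """
    alive = [p for p in paths if p not in failed]
    preferred = [p for p in alive if p[0] == preferred_controller]
    return preferred or alive


# Two controllers with two ports each, as on the EVA 4400 in this thread.
ALL_PATHS = [("A", 1), ("A", 2), ("B", 1), ("B", 2)]

healthy = usable_paths(ALL_PATHS, "A")
one_port_down = usable_paths(ALL_PATHS, "A", failed={("A", 1)})
controller_down = usable_paths(ALL_PATHS, "A", failed={("A", 1), ("A", 2)})

print("healthy:", healthy)
print("one port down:", one_port_down)
print("controller A down:", controller_down)
```

Only when all four paths fail does the list come back empty, so pinning a vDisk to a controller does not take the redundancy to zero in this model.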
