2012 R2 2 Node Cluster - Random VMs Reboot [Case #00629434]

Post by **mjt** » Sep 17, 2014 8:32 pm

Greetings,

I've created a 2012 R2 cluster using an EQL PS6000 for CSV and two M610s with dual x5650, 96GB RAM, local Intel SSDs for host OS. The PS6000 and M610s all have the latest patches, drivers, and firmware. The cluster passed Microsoft's Cluster Validation Wizard without any issues. I've contacted EQL Support, and they can't find anything wrong with the storage. I'm running Veeam 7.0.0.871.

Every time I run a Veeam backup of a VM hosted on the CSV using the EQL HW provider, one or more VMs reboot. Every time I run a Veeam backup of a VM hosted on the CSV using the MS SW provider, one or more VMs become unresponsive. Every time I run a Veeam backup of a VM on local storage, there is no issue. Backing up all VMs simultaneously using the DISKSHADOW command on the CSV and EQL HW provider caused no issue. This issue is consistently reproducible.

Has anyone else run into this issue? I looked at the last couple of months of posts, but didn't see anything that quite matched. 2012 R2 with an EQL CSV seems common enough, though, that I thought someone may have encountered it before.

Please let me know if posting any other information would be useful.

Thanks,
Mitch

Post by **Gostev** » Sep 19, 2014 8:42 pm

Hi. Do you have ODX enabled? If yes, try to disable it. Thanks!

Post by **mjt** » Sep 19, 2014 8:46 pm

Greetings Gostev,

Disabling ODX was one of the first things I did to troubleshoot this issue before calling Veeam support. It is disabled, as per this TechNet article:

http://technet.microsoft.com/en-ca/libr ... cebaseline

The Get-ItemProperty command returns a value of 1 on both hosts. Please let me know if there is any other information I can provide.
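For readers who want to repeat this check from a script, here is a minimal Python sketch. It assumes the article referenced above is Microsoft's ODX deployment guide, which documents the `FilterSupportedFeaturesMode` value under `HKLM\SYSTEM\CurrentControlSet\Control\FileSystem` (1 = ODX disabled, 0 = enabled). The function names `odx_state` and `read_odx_mode` are illustrative, not part of any tool mentioned in this thread.

```python
import sys

# Registry location documented by Microsoft for the ODX switch
# (assumed to be the value the Get-ItemProperty check above reads).
ODX_KEY = r"SYSTEM\CurrentControlSet\Control\FileSystem"
ODX_VALUE = "FilterSupportedFeaturesMode"


def odx_state(mode):
    """Translate the registry value into a readable state."""
    if mode == 1:
        return "disabled"
    if mode == 0:
        return "enabled"
    return "unknown"


def read_odx_mode():
    """Read the value from the local registry; only works on Windows."""
    import winreg  # imported lazily so the module still loads elsewhere

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, ODX_KEY) as key:
        value, _value_type = winreg.QueryValueEx(key, ODX_VALUE)
    return value


if __name__ == "__main__" and sys.platform == "win32":
    print("ODX is", odx_state(read_odx_mode()))
```

Run it on each cluster node; a result of "disabled" corresponds to the value of 1 reported above.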

Thanks,

Mitch

Post by **boje** » Oct 04, 2014 5:46 pm

Hi!

I have a 2012 R2 cluster with 4 nodes. It was running fine with Veeam until 7.0.0.871 patch 4. I installed the patch on Monday, and yesterday, when the full backup ran, the CSV went offline (paused).
I know there was a similar problem before with 2012 Hyper-V clusters using CSV, but since 2012 R2 it had been working fine.
Something happened with the latest patch 4. Are there any known issues with this?
I'm now afraid to run backups because about 65 VMs just died!

Br
Patrik

Post by **boje** » Oct 06, 2014 6:21 am

I don't see this as the same issue.
With 2012 R2 and Veeam patch 3, everything had been working fine until I installed patch 4. The incremental backups ran fine all week, but during Friday's synthetic full backup I received these messages in the Hyper-V cluster:
Cluster Shared Volume '' ('Cluster Disk 1') has entered a paused state because of '(c000000e)'. All I/O will temporarily be queued until a path to the volume is reestablished.
And:
Cluster Shared Volume 'Volume1' ('Cluster Disk 1') has entered a paused state because of '(c0130021)'. All I/O will temporarily be queued until a path to the volume is reestablished.
And:
Cluster Shared Volume 'Volume1' ('Cluster Disk 1') is no longer accessible from this cluster node because of error '(1460)'. Please troubleshoot this node's connectivity to the storage device and network connectivity.
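The codes in these three events can be extracted and looked up with a short script. The following Python sketch pulls the code out of each message and maps it to a symbolic name. The names are an editorial lookup from the Windows SDK headers (`ntstatus.h`, `winerror.h`), not something stated in this thread, so verify them against Microsoft's error-code reference before relying on them.

```python
import re

# Symbolic names as defined in the Windows SDK headers; treat this
# table as an assumption to verify, not as output from the cluster.
KNOWN_CODES = {
    "c000000e": "STATUS_NO_SUCH_DEVICE",
    "c0130021": "STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR",
    "1460": "ERROR_TIMEOUT",
}

# Matches the quoted, parenthesised code, e.g. '(c000000e)' or '(1460)'.
CODE_PATTERN = re.compile(r"'\(([0-9a-fA-F]+)\)'")

EVENTS = [
    "Cluster Shared Volume '' ('Cluster Disk 1') has entered a paused state "
    "because of '(c000000e)'. All I/O will temporarily be queued until a "
    "path to the volume is reestablished.",
    "Cluster Shared Volume 'Volume1' ('Cluster Disk 1') has entered a paused "
    "state because of '(c0130021)'. All I/O will temporarily be queued until "
    "a path to the volume is reestablished.",
    "Cluster Shared Volume 'Volume1' ('Cluster Disk 1') is no longer "
    "accessible from this cluster node because of error '(1460)'.",
]


def extract_codes(messages):
    """Return (code, symbolic name) for each message that carries a code."""
    result = []
    for message in messages:
        match = CODE_PATTERN.search(message)
        if match:
            code = match.group(1).lower()
            result.append((code, KNOWN_CODES.get(code, "unknown")))
    return result


if __name__ == "__main__":
    for code, name in extract_codes(EVENTS):
        print(code, name)
```

If the lookup is right, the sequence reads as the node losing its path to the volume, the CSV auto-pausing, and the operation finally timing out. That would be consistent with a storage or connectivity stall rather than something specific to the backup software, but it is only an interpretation of the codes.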

Again, this has been working just fine until Patch 4 was installed. Please help me!
Br
Patrik

Post by **foggy** » Oct 08, 2014 12:01 pm

Patrik, there were no fixes that could affect Hyper-V backup introduced in patch 4. And still, you had successful increments through the whole week, so I don't believe it is fair to blame the patch. Have you already contacted technical support with this? I wonder what they could say after reviewing the log files.

Post by **mjt** » Oct 09, 2014 5:12 pm

At least in my case, having a tech look at the logs (every conceivable log: Veeam, Windows, Hyper-V Failover Cluster, EqualLogic) has proven fruitless. It has been over a month since I opened my ticket, with precisely zero progress. I've talked to two tier 2 techs, and still nothing. Not sure what I'm supposed to do at this point...

I don't mean this as a criticism of the techs I've worked with, but what more can I do when they've looked at every possible log and still can't figure out the problem? Makes me wish Veeam had some kind of verbose logging that I could enable, but Damon said they just get back whatever Windows Server Backup gives them.

Meanwhile, 38 VMs go unprotected.

Post by **foggy** » Oct 10, 2014 10:30 am

mjt wrote: Makes me wish Veeam had some kind of verbose logging that I could enable, but Damon said they just get back whatever Windows Server Backup gives them.

Could you please elaborate on this? I'm not sure I follow this sentence.

mjt wrote: Meanwhile, 38 VMs go unprotected.

How are these 38 VMs distributed among volumes? Do they all reside on a single volume? What is the max concurrent snapshots setting for the volume(s) (right-click the host in Backup Infrastructure view > Manage Volumes)? Could you also describe the test you've performed with diskshadow in a bit more detail?

Thanks.

Post by **mjt** » Oct 10, 2014 2:31 pm

Greetings Alexander,

Not sure which you mean, so I'll elaborate on both. By verbose, I just mean more detailed. Often, for performance and ease of reading, there are different logging levels. Verbose is always the most detailed. In Veeam, there doesn't seem to be any way to get more detailed logs than what you see when the job is running. I'm working with Damon D. currently, and one thing he mentioned during our last WebEx session was Veeam in a Hyper-V environment simply uses the built in Windows Server Backup to take the backups. So if an error occurs, Veeam is simply reporting what Windows Server Backup told it.

How are these 38 VMs distributed among volumes?
All VMs are on a single CSV volume on the PS6000. The only other volume on the PS6000 is the Quorum. There is only one member in the group. There are no other SANs.

Do they all reside on a single volume?
Yes

What is the max concurrent snapshots setting for the volume(s) (right-click the host in Backup Infrastructure view > Manage Volumes)?
As a part of troubleshooting, I was advised to set all the volumes to 1 concurrent snapshot. Under Options -> Advanced, I have also disabled parallel VM and Virtual Disk processing.

Could you also describe the test you've performed with diskshadow in a bit more detail?
I'm not sure if you have access to the history of my ticket, but if you search the ticket history for "diskshadow" you can see my exchange with Damon S. about this:

I ran the following DISKSHADOW commands on DELTA last night:

SET CONTEXT PERSISTENT
SET OPTION TRANSPORTABLE #TechNet article is incorrect, must be OPTION, not CONTEXT
SET METADATA transHWshadow_p.cab #name is arbitrary, left as per instructions
ADD VOLUME C:\ClusterStorage\Volume2 PROVIDER {d4689bdf-7b60-4f6e-9afb-2d13c01b12ea}
#Specified the CSV volume and the EQL HW provider. If provider is not specified, the "default" is used. Didn't know what that would be, assumed you wanted HW rather than SW provider
CREATE
#At this point, every VM on both hosts of the cluster took a shadow copy. Once complete, the VMs on GOLF merged and returned to normal, while the VMs on DELTA continued to show a status of "Backing up" until I ran the next command
END BACKUP #I left the backup running for over 30 minutes before I ended it

I am happy to report that, even with every single VM on both hosts backing up, no VM rebooted, and there was no perceptible performance issue. Also, I ran the command 'vssadmin list shadows' before, during, and after the DISKSHADOW commands. Every time, the result was the same: "No items found that satisfy the query."

Post by **foggy** » Oct 10, 2014 9:47 pm

mjt wrote: In Veeam, there doesn't seem to be any way to get more detailed logs than what you see when the job is running.

There are six logging levels in Veeam B&R. The default is level 4; level 6 is the most detailed.

mjt wrote: I'm working with Damon D. currently, and one thing he mentioned during our last WebEx session was Veeam in a Hyper-V environment simply uses the built in Windows Server Backup to take the backups. So if an error occurs, Veeam is simply reporting what Windows Server Backup told it.

I don't believe our engineer could have said that. There was probably some sort of misunderstanding here (what he actually meant was probably that both products use the same calls, or something similar). Veeam B&R does not use Windows Server Backup in any way.

mjt wrote: I ran the following DISKSHADOW commands on DELTA last night:

It seems that you only took a snapshot, but didn't try to mount it and read VM data from it (as the backup does). Having all VMs reside on a single volume puts an extremely high load on it during backup, which could result in the observed behavior.

Post by **mjt** » Oct 15, 2014 3:23 pm

Greetings Alexander,

Excellent to hear about the logging level, I'll mention it to Damon and see if he thinks it would help track down this issue.

As you said, miscommunication. Obviously Veeam does not literally use WSB, as it is not installed by default, nor required for Veeam to function. It's the underlying functionality that WSB leverages, that Veeam also leverages.

I had not considered the possibility that it could be the mounting of the snapshot that causes the issue. To be clear, when using the software provider, the moment data starts being copied to the Veeam server, VMs become responsive again. So it's either creating or mounting the snapshot (or both) that results in this issue. How would you suggest I narrow it down to one or the other?

Can you point me to any current documentation/best practices on CSV sizing with Server 2012? I know it used to be the case many years ago that best practices said one VM per LUN. However, that has not been the case for a while now. As I said, before contacting Veeam, I validated my storage with EQL. I've emailed the EQL engineer I'm working with the same question, and am very interested to read any current information on this topic.

It's also worth pointing out that I had 75+ VMs running on the same PS6000 with a single CSV, but with M600 (instead of M610) blades running 2012 (not R2), and experienced no issues whatsoever backing them up with Veeam.

Post by **collinp** » Oct 16, 2014 1:23 am

Latency on the CSV can cause this, especially since your test works on local disk. If you were thin provisioning, using dynamic VHDs, and squeezing 75 VMs onto too few large SATA spindles, there could be performance issues. Since this is a LUN and you have 75+ VMs running on the same LUN, what LUN queue depth is allowed on the EqualLogic? How many drives make up this single CSV, and what speed are they? Are you getting any disk latency alerts from the EqualLogic? Are your hosts low on memory or CPU? You can also try to reproduce the problem and check your multipathing to see whether you still have paths to the SAN, or whether the hosts are freezing and temporarily not seeing any paths to the LUN.

Post by **foggy** » Oct 16, 2014 1:49 pm

mjt wrote: I had not considered the possibility that it could be the mounting of the snapshot that causes the issue. To be clear, when using the software provider, the moment data starts being copied to the Veeam server, VMs become responsive again. So it's either creating or mounting the snapshot (or both) that result in this issue. How would you suggest I narrow it down to one or the other?

You can try to mount the snapshot using the expose command and read its contents.
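Once a shadow copy has been exposed, the "read its content" step can be approximated with a script that reads every file under the mount point, roughly as a backup would. The Python sketch below is only an illustration. It assumes the snapshot is already exposed at a drive letter or folder passed on the command line, and `read_tree` is a hypothetical helper name, not a Veeam or Windows tool.

```python
import os
import sys
import time


def read_tree(root, chunk_size=1024 * 1024):
    """Read every file under root sequentially.

    Returns (file count, bytes read, elapsed seconds).
    """
    files = 0
    total = 0
    start = time.monotonic()
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            with open(path, "rb") as handle:
                while True:
                    chunk = handle.read(chunk_size)
                    if not chunk:
                        break
                    total += len(chunk)
            files += 1
    return files, total, time.monotonic() - start


if __name__ == "__main__" and len(sys.argv) > 1:
    # The argument is wherever the shadow copy was exposed (placeholder path).
    count, size, seconds = read_tree(sys.argv[1])
    print(f"read {count} files, {size} bytes in {seconds:.1f} s")
```

If VMs stay healthy while the snapshot is created and exposed but misbehave once the read starts, that points at read load on the volume. If they misbehave before any data is read, snapshot creation or mounting is the more likely trigger.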
