Best practice for CAS NLB Exchange 2010

Post by **stelben** » Feb 03, 2012 8:25 am

Hi!

We have an environment with two Exchange 2010 CAS servers, load balanced with Windows NLB.
Everything is running on ESX 4.1.
When the Veeam backup runs, it seems to freeze the current CAS for a while (is a standard VMware snapshot being taken?), causing a failover to the other CAS. It takes some time before NLB sorts this out and the CAS service is up again. In practice this means we have a mail outage of 10 minutes every night...

What is the best practice for backing up such a CAS setup?

Thanks in advance

Post by **chrisdearden** » Feb 03, 2012 9:50 am

Can you reproduce the behaviour by taking a snapshot manually?

Post by **stelben** » Feb 03, 2012 11:15 am

Hi!

Yes, the same behaviour occurs when taking a snapshot manually.

Post by **Vitaliy S.** » Feb 08, 2012 7:45 am

Jonas, unfortunately I'm not that familiar with Exchange CAS servers, but if there is an option to extend this timeout (for keeping the CAS connection alive between the nodes), try it and see if that helps.

Feb 08, 2012 9:16 pm

Hi stelben,

Thanks for your enquiry.
I think there are two problems:
1. When you delete a snapshot, the VM freezes.
2. Because of the VM freeze, you have problems with the Windows NLB cluster heartbeat.

So this is not an Exchange problem and, as you said, it also happens when you take the snapshot manually, so it is not a Veeam problem either. It is an infrastructure problem.

Solution for Problem 1:

@all with snapshot freeze problems
@all with DAG cluster failures

NFS datastores => Install the VMware fixes (symptom: the VM freezes when a snapshot is deleted)
SAN datastores => Install the latest VMware versions and check whether your HBA/datastore access profile suits your SAN storage (Dedicated/Round Robin/...)
iSCSI datastores => Install the latest VMware versions and check whether your HBA/datastore access profile suits your SAN storage (Dedicated/Round Robin/...) + use a dedicated enterprise switch for iSCSI VMware traffic
Update your SAN/iSCSI/NAS firmware (during a snapshot commit/delete, VMware issues a large number of random writes). I have seen a lot of old firmware versions that have problems with that.

Do you use disk-system-based synchronous mirroring?
To check whether this is the problem, disable the storage system's synchronous mirroring (I have seen some systems that perform poorly because of firmware bugs).

To find out whether your disk/network environment is the problem, you can move the VM to local disks as a test (Storage vMotion of all volumes).

And use NTP Servers for time sync on each VMware host and VM:
http://kb.vmware.com/selfservice/micros ... nalId=1318

For Problem 2, if Problem 1 cannot be solved:
Extend the heartbeat timeout

http://technet.microsoft.com/en-us/libr ... S.10).aspx

NLB assumes that a host is functioning normally within a cluster as long as it participates in the normal exchange of heartbeat messages between it and the other hosts. If the other hosts do not receive a message from a host for several periods of heartbeat exchange, they initiate convergence. The number of missed messages required to initiate convergence is set to five by default (but can be changed).

You can find the entry here:
http://technet.microsoft.com/de-de/libr ... S.10).aspx
Keyword: "AliveMsgTolerance"
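To get a feeling for what value you would need, here is a rough Python sketch of the arithmetic. It assumes the heartbeat period is 1000 ms (I believe that is the NLB default, the "AliveMsgPeriod" value, but please verify it on your nodes); the function names are made up for this example and the result is only an approximation of the real convergence timing.

```python
# Rough estimate of how long a node may be frozen before NLB starts convergence.
# Assumption: heartbeat period of 1000 ms (check AliveMsgPeriod on your hosts).

def convergence_window_ms(alive_msg_period_ms: int = 1000,
                          alive_msg_tolerance: int = 5) -> int:
    """Approximate silence (in ms) after which the other hosts converge."""
    return alive_msg_period_ms * alive_msg_tolerance


def min_tolerance_for_freeze(freeze_ms: int,
                             alive_msg_period_ms: int = 1000) -> int:
    """Smallest tolerance whose window is strictly longer than the freeze."""
    return freeze_ms // alive_msg_period_ms + 1


if __name__ == "__main__":
    print("default window (ms):", convergence_window_ms())
    # Hypothetical example: the VM is stunned for 8 seconds at snapshot delete.
    print("tolerance needed for an 8 s freeze:", min_tolerance_for_freeze(8000))
```

Measure the real freeze first (for example with a continuous ping during a manual snapshot delete) and then pick a tolerance with some headroom. Keep in mind that a higher tolerance also means a really dead node is detected later.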

In my life before Veeam I saw a lot of problems with NLB unicast mode. If you use it, I recommend changing to IGMP multicast together with your network specialist, because it requires some changes in your network.

Windows NLB is perhaps not the best way to cluster CAS servers, because NLB is not service (Exchange) aware. It only looks at the network, not at whether Exchange CAS is actually running behind it.
A hardware load balancer also checks service availability.
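To illustrate the difference, here is a minimal Python sketch of the kind of service-level probe such a balancer performs. The port 443 and the function names are assumptions for this example only; a real load balancer would normally request an actual Exchange URL and not just open a TCP connection.

```python
import socket

# Minimal service-level probe: is something accepting connections on the
# service port? (Assumption: CAS is published over HTTPS on port 443.)

def service_is_up(host: str, port: int = 443, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Connection refused, timed out, host unreachable, name not resolved...
        return False


def healthy_nodes(nodes, port: int = 443, timeout_s: float = 3.0):
    """Keep only the nodes whose service port answers."""
    return [node for node in nodes if service_is_up(node, port, timeout_s)]
```

A node that still answers NLB heartbeats but has a stopped web service would pass the network check and fail this one, which is exactly the case NLB alone does not catch.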

Let me say again that this is an infrastructure problem, not a Veeam Backup & Replication software problem. Veeam uses standard VMware snapshots for the backup. If these snapshots don't work, I recommend analysing this together with VMware, your storage vendor and your infrastructure service contractor.

Hope this information can help you to fix your problem.

CU Andy
