nWorks Alerts finetuning

Post by **thenamaris** » Nov 03, 2011 5:13 pm

Hello,

We recently purchased the SCOM MP, and it is raising several alerts, most of which close on their own after a few minutes. Because we monitor our systems through the emails SCOM sends, we would like to fine-tune the alerting so that we only receive emails about real issues. Below are two of the most common auto-closing alerts (they appear many times per day). We would like help understanding what they tell us, what we can do about them, and whether we should override them:

Alert: nworks VMware: ESX Host VMHBA has logged SCSI aborts
Source: vmhba3:C0:T0:L8
Path: server1.thenamaris.gr;DISK:server1.thenamaris.gr
Last modified by: System
Last modified time: 11/3/2011 6:58:05 PM
Alert description: VMHost LUN vmhba3:C0:T0:L8 on ESX Host server1.thenamaris.gr has exceeded threshold over 4 samples by logging 6 aborts.

Alert: nworks VMware: ESX Host VMHBA has exceeded threshold for queueLatency
Source: vmhba3:C0:T0:L8
Path: server2.thenamaris.gr;DISK:server2.thenamaris.gr
Last modified by: System
Last modified time: 11/3/2011 3:58:08 PM
Alert description: VMHost LUN vmhba3:C0:T0:L8 on ESX Host server2.thenamaris.gr has exceeded threshold over 2 samples by logging 2390 ms.

The latter also often comes up with TotalReadLatency or TotalWriteLatency.

Many Thanks in advance.

Post by **ZachW** » Nov 04, 2011 12:14 am

Hi,

Please open up a case with support and we would be more than happy to assist you with this.

http://www.veeam.com/support-form.html

-Zach

Post by **thenamaris** » Nov 04, 2011 10:07 am

Hello and many thanks for the answer.
We have opened a support case.

Post by **Alec King** » Nov 07, 2011 7:16 am

Hi! Judging from the two alerts you listed, I would also say you are having a problem with your back-end storage.
The aborts monitor looks for storage commands that have timed out.
The latency monitor looks for storage commands that spend too long in the internal vmkernel queue waiting to be processed.

I'd advise diving into the performance and configuration of that VMHBA on that host. Six aborts is bad but not terrible; a queue latency of 2390 ms, however, is almost two and a half seconds! That is a lifetime of waiting in disk IO terms.

I'd say you have a storage performance issue on that host. And I'd say the nworks MP is working as designed by alerting you to that!

Cheers,
Alec
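Both alerts quoted at the top of the thread report a threshold being exceeded "over N samples". A minimal sketch of that kind of check is shown below, assuming the samples must be consecutive and that a single good reading resets the count. The class name, the 100 ms threshold, and the readings are hypothetical and are not taken from the nworks MP.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveSampleMonitor:
    """Alert once `samples_required` consecutive readings exceed `threshold`.

    Hypothetical illustration only; the real management pack logic may differ.
    """
    threshold: float
    samples_required: int
    _streak: int = 0

    def observe(self, value: float) -> bool:
        # A reading at or below the threshold resets the streak (assumption).
        if value > self.threshold:
            self._streak += 1
        else:
            self._streak = 0
        return self._streak >= self.samples_required


# Hypothetical queue-latency readings in milliseconds.
monitor = ConsecutiveSampleMonitor(threshold=100.0, samples_required=2)
readings = [40, 2390, 35, 1800, 2390, 20]
alerts = [monitor.observe(r) for r in readings]
print(alerts)
```

Under these assumptions an isolated spike does not raise an alert, and an alert clears as soon as one reading drops back under the threshold, which would be consistent with alerts that auto-close after a few minutes.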

Post by **thenamaris** » Nov 08, 2011 8:43 am

Hello Alec and many thanks for the answer.
We have contacted our IT infrastructure support in order to investigate the backend issue.
I will report back as soon as possible.

Post by **thenamaris** » Nov 14, 2011 1:48 pm

Hello all,

Some new issues with totalReadLatency and totalWriteLatency have appeared on some LUNs.

The default threshold levels are:

totalWriteLatency: 60/100
totalReadLatency: 100/250

The ‘problematic’ LUNs produce values that range between:

totalWriteLatency: 65 - 410
totalReadLatency: 110 - 480

No overrides have been set up.

From your experience, do you think that these metrics should be overridden?
Are these thresholds a bit “strict” or should we check our storage infrastructure for bottlenecks?
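To make the comparison concrete, here is a small sketch that classifies the observed ranges against the default thresholds listed above. It assumes each threshold pair is warning/critical in milliseconds and that a value equal to a threshold counts as exceeding it; neither is stated in this thread, and the function and variable names are hypothetical.

```python
def classify(value_ms: float, warning_ms: float, critical_ms: float) -> str:
    """Return the severity of one latency reading (assumed semantics)."""
    if value_ms >= critical_ms:
        return "critical"
    if value_ms >= warning_ms:
        return "warning"
    return "ok"


# Default thresholds from the post, assumed to be (warning, critical) in ms.
THRESHOLDS = {"totalWriteLatency": (60, 100), "totalReadLatency": (100, 250)}
# Lowest and highest values seen on the 'problematic' LUNs.
OBSERVED = {"totalWriteLatency": (65, 410), "totalReadLatency": (110, 480)}

results = {}
overshoot = {}
for counter, (warning, critical) in THRESHOLDS.items():
    low, high = OBSERVED[counter]
    results[counter] = (classify(low, warning, critical),
                        classify(high, warning, critical))
    # How many times the critical threshold the worst reading reached.
    overshoot[counter] = high / critical

for counter in THRESHOLDS:
    print(counter, results[counter], f"{overshoot[counter]:.2f}x critical")
```

On these assumptions, even the lowest readings sit in the warning band and the highest are well past the critical level, so a modest override alone would not silence the alerts.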

Last but not least, we're kind of puzzled by the definition of the deviceReadLatency/deviceWriteLatency counters:

The “Product knowledge” tab for these metrics states:

*** totalReadLatency ***
This totalReadLatency counter shows the latency from vmkernel to device (HBA) through to the back-end storage, e.g. SAN.
Note there is another counter deviceReadLatency that show latency from vmkernel to HBA only, this should help you troubleshoot where the performance bottleneck is located.

*** totalWriteLatency ***
This totalWriteLatency counter shows the latency from vmkernel to device (HBA) through to the back-end storage, e.g. SAN.
Note there is another counter deviceWriteLatency that show latency from vmkernel to HBA only, this should help you troubleshoot where the performance bottleneck is located.

So that means that deviceReadLatency and deviceWriteLatency check the vmkernel <--> HBA path.

But, copying from your “metrics definition” (http://www.veeam.com/support/metrics/dictionary.html):

*** deviceReadLatency ***
The average amount of time taken to complete a read from the physical device.
This is the time from the device to the HBA in milliseconds.

*** deviceWriteLatency ***
The average amount of time taken to complete a write to the physical device.
This is the time from the HBA to the device in milliseconds.

So here, these 2 metrics seem to check the HBA <--> Device (Storage) path.

Can you please clarify which path these metrics exactly monitor?

Thanks in advance.

Post by **vBPav** » Jan 19, 2012 7:05 am

Hello,

We will shortly be releasing our Best Practice and Advanced Configuration (BPAC) Guide, which will explain in detail how you may want to tune the latency monitors. The short answer is YES, you will probably want to tune these monitors for your environment. Disk latency depends on several factors, including:

IO throughput
The LUN's ability to service IO

If you have a LUN on slow storage (iSCSI with SATA disks, for example), you can expect higher latency than on a LUN with fast storage (fiber with fiber disks). Baselining with our reports is the best way to determine which thresholds to set for each vmHBA, and it is always a good idea to baseline each type of storage in your environment. You may find that a 40-60ms response time is expected for faster storage, whereas 100-200ms may be expected for slower storage.
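As a rough illustration of baselining, the sketch below derives candidate warning and critical thresholds per storage tier from a set of latency samples. The use of the 95th and 99th percentiles is an assumption made for this example, not a recommendation from the BPAC Guide, and the tier names and sample values are invented.

```python
import statistics


def suggest_thresholds(samples_ms, warning_pct=95, critical_pct=99):
    """Suggest (warning, critical) thresholds from baseline latency samples.

    The percentile choices are illustrative assumptions.
    """
    # With n=100, quantiles() returns 99 cut points; index p-1 is the p-th
    # percentile. "inclusive" keeps results within the observed min and max.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[warning_pct - 1], cuts[critical_pct - 1]


# Invented baseline samples in milliseconds for two storage tiers.
baseline = {
    "fast-storage": [12, 18, 25, 31, 40, 44, 52, 58, 60, 47, 35, 22],
    "slow-storage": [80, 95, 110, 130, 150, 165, 180, 200, 120, 140, 175, 105],
}

suggested = {tier: suggest_thresholds(samples)
             for tier, samples in baseline.items()}

for tier, (warning, critical) in suggested.items():
    print(f"{tier}: warning >= {warning:.0f} ms, critical >= {critical:.0f} ms")
```

The resulting values would then be applied as per-vmHBA overrides in SCOM rather than changing the defaults for every LUN.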

The "totalWriteLatency" and "totalReadLatency" monitors measure the total time it takes to write/read data from the kernel to the HBA to the SAN and back. deviceReadLatency and deviceWriteLatency measure the time taken from the kernel to the HBA only. High device latency indicates an issue or bottleneck at the host/host HBA level. Low device latency combined with high total latency indicates that the SAN is having performance issues.

Keep a look out for our BPAC Guides. These should be published soon!

Post by **treemon** » Feb 29, 2012 11:21 am

Hi there,

We are also seeing a few latency issues.
Are there any updates on the BPAC guides?

Thanks

Post by **Alec King** » Feb 29, 2012 7:28 pm

Hi, the BPAC Guides have been released and are in the downloads section here - http://www.veeam.com/vmware-microsoft-e ... urces.html
Enjoy!
