Monitoring and reporting for Veeam Backup & Replication, VMware vSphere and Microsoft Hyper-V in a single System Center Operations Manager Console
thenamaris
Novice
Posts: 4
Liked: never
Joined: Nov 03, 2011 4:39 pm
Full Name: Thenamaris Inc.
Contact:

nWorks Alerts finetuning

Post by thenamaris »

Hello,

We recently purchased the SCOM MP and it is raising several alerts, most of which close on their own after a few minutes. Since we monitor our systems through the emails SCOM sends, we would like to fine-tune the alerting so that we only receive emails about real issues. Below are the two most common auto-closing alerts (they appear many times per day). We would appreciate help understanding what they mean, what we can do about them, and whether we should override them:

Alert: nworks VMware: ESX Host VMHBA has logged SCSI aborts
Source: vmhba3:C0:T0:L8
Path: server1.thenamaris.gr;DISK:server1.thenamaris.gr
Last modified by: System
Last modified time: 11/3/2011 6:58:05 PM
Alert description: VMHost LUN vmhba3:C0:T0:L8 on ESX Host server1.thenamaris.gr has exceeded threshold over 4 samples by logging 6 aborts.

Alert: nworks VMware: ESX Host VMHBA has exceeded threshold for queueLatency
Source: vmhba3:C0:T0:L8
Path: server2.thenamaris.gr;DISK:server2.thenamaris.gr
Last modified by: System
Last modified time: 11/3/2011 3:58:08 PM
Alert description: VMHost LUN vmhba3:C0:T0:L8 on ESX Host server2.thenamaris.gr has exceeded threshold over 2 samples by logging 2390 ms.

The latter also often comes up with TotalReadLatency or TotalWriteLatency.
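
For anyone triaging these emails outside SCOM, both alert descriptions above follow the same sentence shape, so the fields can be pulled out with a regular expression. This is only a sketch: the pattern is inferred from the two samples in this post, `parse_alert` is a hypothetical helper name, and other nworks alerts may be worded differently.

```python
import re

# Pattern inferred from the two sample descriptions in this post only;
# it is an assumption that other alerts follow the same wording.
ALERT_RE = re.compile(
    r"VMHost LUN (?P<lun>\S+) on ESX Host (?P<host>\S+) "
    r"has exceeded threshold over (?P<samples>\d+) samples "
    r"by logging (?P<value>\d+) (?P<unit>\w+)\."
)


def parse_alert(description):
    """Return the alert fields as a dict, or None if the text does not match."""
    match = ALERT_RE.search(description)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared or aggregated.
    fields["samples"] = int(fields["samples"])
    fields["value"] = int(fields["value"])
    return fields


if __name__ == "__main__":
    sample = (
        "VMHost LUN vmhba3:C0:T0:L8 on ESX Host server2.thenamaris.gr "
        "has exceeded threshold over 2 samples by logging 2390 ms."
    )
    print(parse_alert(sample))
```

Grouping the parsed results by host and LUN would show whether the noise comes from one path or from many.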

Many Thanks in advance.
ZachW
Enthusiast
Posts: 68
Liked: 10 times
Joined: Aug 02, 2011 6:09 pm
Full Name: Zach Weed
Contact:

Re: nWorks Alerts finetuning

Post by ZachW »

Hi,

Please open up a case with support and we would be more than happy to assist you with this.

http://www.veeam.com/support-form.html

-Zach

Re: nWorks Alerts finetuning

Post by thenamaris »

Hello and many thanks for the answer.
We have opened a support case.
Alec King
VP, Product Management
Posts: 1445
Liked: 362 times
Joined: Jan 01, 2006 1:01 am
Contact:

Re: nWorks Alerts finetuning

Post by Alec King »

Hi! I would also say, from the two alerts that you listed - you are having some problem with your back-end storage.
The aborts monitor is looking for storage commands that have timed out.
And the latency monitor is looking for storage commands which are spending too long in the internal vmkernel queue waiting to be processed.

I'd advise digging into the performance and configuration of that VMHBA on that host. 6 aborts is bad but not terrible; however, a queue latency of 2390 ms is nearly two and a half seconds! That is a lifetime of waiting in disk I/O terms.

I'd say you have a storage performance issue on that host. And I'd say the nworks MP is working as designed by alerting you to that! :wink:

Cheers,
Alec
Alec King
Vice President, Product Management
Veeam Software

Re: nWorks Alerts finetuning

Post by thenamaris »

Hello Alec and many thanks for the answer.
We have contacted our IT infrastructure support in order to investigate the backend issue.
I will report back as soon as possible.

Re: nWorks Alerts finetuning

Post by thenamaris »

Hello all,

Some new issues with totalReadLatency and totalWriteLatency have appeared on some LUNs.

The default threshold levels are:

totalWriteLatency: 60/100
totalReadLatency: 100/250

The ‘problematic’ LUNs produce values that range between:

totalWriteLatency: 65 - 410
totalReadLatency: 110 - 480

No overrides have been set up.

From your experience, do you think that these metrics should be overridden?
Are these thresholds a bit “strict” or should we check our storage infrastructure for bottlenecks?
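
To make the comparison concrete, here is a small sketch that places the observed ranges against the defaults. It assumes each pair such as "60/100" means warning/critical in milliseconds, and that a value at or above a threshold trips it; neither is confirmed by the product documentation quoted in this thread. The `classify` helper is hypothetical.

```python
# Assumption: each pair is (warning, critical) in milliseconds.
DEFAULT_THRESHOLDS_MS = {
    "totalWriteLatency": (60, 100),
    "totalReadLatency": (100, 250),
}

# Lowest and highest values seen on the 'problematic' LUNs, from the post.
OBSERVED_RANGE_MS = {
    "totalWriteLatency": (65, 410),
    "totalReadLatency": (110, 480),
}


def classify(value_ms, warning_ms, critical_ms):
    """Map a latency value to a state (assumes >= trips the threshold)."""
    if value_ms >= critical_ms:
        return "critical"
    if value_ms >= warning_ms:
        return "warning"
    return "healthy"


for counter, (low, high) in OBSERVED_RANGE_MS.items():
    warning, critical = DEFAULT_THRESHOLDS_MS[counter]
    print(
        f"{counter}: low {low} ms is {classify(low, warning, critical)}, "
        f"high {high} ms is {classify(high, warning, critical)}"
    )
```

Under those assumptions, the low end of each range sits just above the warning level, while the high end is several times the critical level.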

Last but not least, we're kind of puzzled by the definition of the deviceReadLatency/deviceWriteLatency counters:

The “Product knowledge” tab for the above metrics states:

*** totalReadLatency ***
This totalReadLatency counter shows the latency from vmkernel to device (HBA) through to the back-end storage, e.g. SAN.
Note there is another counter deviceReadLatency that show latency from vmkernel to HBA only, this should help you troubleshoot where the performance bottleneck is located.

*** totalWriteLatency ***
This totalWriteLatency counter shows the latency from vmkernel to device (HBA) through to the back-end storage, e.g. SAN.
Note there is another counter deviceWriteLatency that show latency from vmkernel to HBA only, this should help you troubleshoot where the performance bottleneck is located.

So that means that deviceReadLatency and deviceWriteLatency check the vmkernel <--> HBA path.

But, copying from your “metrics definition” (http://www.veeam.com/support/metrics/dictionary.html):

*** deviceReadLatency ***
The average amount of time taken to complete a read from the physical device.
This is the time from the device to the HBA in milliseconds.

*** deviceWriteLatency ***
The average amount of time taken to complete a write to the physical device.
This is the time from the HBA to the device in milliseconds.

So here, these 2 metrics seem to check the HBA <--> Device (Storage) path.

Can you please clarify which path these metrics exactly monitor?

Thanks in advance.
vBPav
Expert
Posts: 181
Liked: 13 times
Joined: Jan 13, 2010 6:08 pm
Full Name: Brian Pavnick
Contact:

Re: nWorks Alerts finetuning

Post by vBPav »

Hello,

We will shortly be releasing our Best Practice and Advanced Configuration (BPAC) Guide, which will explain in detail how you may want to tune the latency monitors. The short answer is yes, you will probably want to tune these monitors for your environment. Disk latency depends on several factors, including:

IO throughput
The LUN's ability to service IO

If you have a LUN on slow storage (iSCSI with SATA disks, for example), you can expect higher latency than on a LUN with fast storage (Fibre Channel with Fibre Channel disks). Baselining with our reports is the best way to determine which thresholds to set for each vmHBA, and it is always a good idea to baseline each type of storage in your environment separately. You may find that for faster storage a 40-60 ms response time is expected, whereas for slower storage 100-200 ms may be expected.
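
As a rough illustration of baselining, the sketch below derives candidate warning/critical thresholds from a set of latency samples for one storage tier. It is not how the nworks reports work: taking the 95th and 99th percentiles is an assumed convention, and `suggest_thresholds` is a hypothetical helper. The samples would come from your own exported report data.

```python
import statistics


def suggest_thresholds(samples_ms, warn_pct=95, crit_pct=99):
    """Return (warning, critical) latency thresholds from baseline samples.

    The percentile choices are assumptions; pick ones that match how
    much alert noise you are willing to tolerate.
    """
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples to build a baseline")
    # quantiles(n=100) returns the 99 cut points p1..p99 of the data.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[warn_pct - 1], cuts[crit_pct - 1]


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    baseline_ms = [38, 41, 44, 47, 52, 55, 58, 61, 75, 120]
    warning, critical = suggest_thresholds(baseline_ms)
    print(f"warning >= {warning:.0f} ms, critical >= {critical:.0f} ms")
```

Running it once per storage tier (for example, once for the iSCSI/SATA LUNs and once for the fast LUNs) gives separate override values per tier rather than one global threshold.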

The "totalWriteLatency" and "totalReadLatency" monitors measure the total time it takes to write/read data from the kernel to the HBA to the SAN and back. deviceReadLatency and deviceWriteLatency are the time taken just from the kernel to the HBA. High device latency indicates an issue or bottleneck at the host/host HBA level. Low device latency combined with high total latency indicates that the SAN is having performance issues.

Keep a look out for our BPAC Guides. These should be published soon! :)
Brian Pavnick | Cireson | Solutions Architect

- Follow me on Twitter @ vbpav
- Reach me on e-mail @ brian.pavnick@cireson.com
treemon
Lurker
Posts: 1
Liked: never
Joined: Feb 29, 2012 11:19 am
Contact:

Re: nWorks Alerts finetuning

Post by treemon »

Hi there

we are also getting a few latency issues
are there any updates on the BPAC guides?

tx

Re: nWorks Alerts finetuning

Post by Alec King »

Hi, the BPAC Guides have been released and are in the downloads section here - http://www.veeam.com/vmware-microsoft-e ... urces.html
Enjoy! :D