100k alerts in one week

Post by **sameerdave** » Dec 02, 2009 9:09 pm

We recently deployed nworks in our environment. We have around 30 ESX hosts.
To my surprise, in one week's time we have more than 100k closed alerts from nworks on the SCOM console.

Is there any initial configuration we have to do or run for the monitors behind these alerts?
Our ESX environment looks quite stable, and the two main alerts I have been receiving are "VMWare Tools heartbeat status changed to Red" and "ESX Host VMHBA has exceeded threshold for totalwritelatency". There are hundreds and hundreds of them, appearing and closing themselves every 2-4 minutes.

Any Help?

Thx

Post by **Alec King** » Dec 02, 2009 9:42 pm

Hi sameerdave,

For the VMTools Heartbeat monitor:
The heartbeat events as we receive them from the VI-API have been known to be 'unstable': we can receive the Red event, then within a couple of minutes we receive the "Heartbeat = Green" event, and the guest OS is running fine throughout this time. This is a VI-API issue.
For this reason this Monitor comes with a configurable timer, set to 10 minutes by default. We wait 10 minutes to see if a Green event arrives, and raise an alert only if it does not.
In your environment you may need to increase the timer on this Monitor. This is a standard SCOM override of the parameter 'Correlation Interval' (default 600 seconds).
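
To make the timer behaviour concrete, here is a minimal Python sketch of the wait-for-Green logic described above. It is not the nworks implementation: the function name `alerts_to_raise`, the `(timestamp, status)` event format and the `end_ts` parameter are all assumptions made for illustration. Only the 600-second default comes from the post.

```python
from typing import List, Tuple

# Default 'Correlation Interval' from the post, in seconds.
CORRELATION_INTERVAL_S = 600


def alerts_to_raise(
    events: List[Tuple[int, str]],
    end_ts: int,
    interval_s: int = CORRELATION_INTERVAL_S,
) -> List[int]:
    """Return the times at which an alert would be raised.

    events: (timestamp_in_seconds, status) pairs sorted by time,
            where status is "red" or "green" (hypothetical format).
    end_ts: the time at which we stop observing.
    """
    alerts: List[int] = []
    pending_red = None  # time of the Red event we are still waiting on

    for ts, status in events:
        # If the timer expired before this event arrived, the alert fires.
        if pending_red is not None and ts - pending_red >= interval_s:
            alerts.append(pending_red + interval_s)
            pending_red = None

        if status == "red":
            if pending_red is None:
                pending_red = ts  # start the timer
        elif status == "green":
            pending_red = None  # Green arrived in time: suppress the alert

    # A Red still pending at the end of the observation window.
    if pending_red is not None and end_ts - pending_red >= interval_s:
        alerts.append(pending_red + interval_s)

    return alerts


# Red at t=0 clears after 2 minutes (no alert); Red at t=1000 never clears.
flapping = [(0, "red"), (120, "green"), (1000, "red")]
print(alerts_to_raise(flapping, end_ts=2000))
```

In this sketch a Red that turns Green within the interval never produces an alert, so raising the interval suppresses longer flaps at the cost of slower detection of a real outage.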

For the totalWriteLatency Monitor:
This is also overridable, so you can configure it for your environment.
However before you override it, I would advise deeper analysis. The MP is reporting high latency between hypervisor and backend storage. Even if you see no clear and direct impact on your VMs now, this Monitor is pro-actively advising you that there is some problem in storage performance.
I would examine the following :
- is the monitor firing Warning or Alert level? (The thresholds are 60 and 100 milliseconds)
- what type of storage is the vmhba attached to? (SAN, NAS, iSCSI etc - you can see which Datastore the vmhba is attached to in the nworks Topology view)
- is this latency expected on this type of storage? (100 ms = 1/10 s for a disk write; that is not great performance and is worth checking out)
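
As a rough illustration of how the two thresholds split a latency sample into levels, here is a small Python sketch. The function name and the "healthy" label are hypothetical, and the strict "greater than" comparison is an assumption based on the alert wording "has exceeded threshold"; only the 60 ms and 100 ms values come from this post.

```python
# Thresholds quoted in the post, in milliseconds.
WARNING_MS = 60.0
ALERT_MS = 100.0


def classify_write_latency(
    latency_ms: float,
    warning_ms: float = WARNING_MS,
    alert_ms: float = ALERT_MS,
) -> str:
    """Map a totalWriteLatency sample to a level (illustrative only)."""
    if latency_ms > alert_ms:
        return "alert"
    if latency_ms > warning_ms:
        return "warning"
    return "healthy"


for sample in (45.0, 75.0, 120.0):
    print(sample, classify_write_latency(sample))
```

Overriding a threshold corresponds to passing a different `warning_ms` or `alert_ms` here, which is why it is worth checking what latency your storage type should deliver before changing either value.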

You can use the nworks Disk performance dashboards to gather more info :
- is there high I/O on the Host vmhba at the same time? (Host Disk Dashboard)
- If yes, which VM(s) cause it? (VM Disk Dashboard)

If you do decide that this latency level is acceptable in your environment, then again you can override one or both thresholds on this Monitor in the standard way.
However I definitely would advise you be pro-active and do a deeper dive into this issue - if you have a performance bottleneck in storage, better to find it now rather than when it goes critical!

Hope that helps - any further questions let us know,
Regards

Post by **sameerdave** » Dec 02, 2009 10:14 pm

Thanks Alec for the very analytical answer.
I will definitely get together with my VM and storage admins to see how we can override this.

Thanks again for your prompt reply

Sameer
