Monitoring and reporting for Veeam Backup & Replication, VMware vSphere and Microsoft Hyper-V in a single System Center Operations Manager Console
nico.weytens
Influencer
Posts: 17
Liked: 2 times
Joined: Jul 02, 2012 8:30 am
Full Name: Nico Weytens
Location: Belgium
Contact:

Datastore Latency Analysis monitor

Post by nico.weytens »

We get several alerts every day from the Datastore Latency Analysis monitor. We raise them with our storage team when we see a pattern, but they claim that everything is fine at their end and that we are exaggerating the problem. According to them, there are only sporadic, short latency issues that can safely be ignored.

I've examined the Product Knowledge for this monitor, and the possible overrides, but I'd like some extra info.
The Product Knowledge summary:
This monitor tracks threshold breaches for the following metric: maxDeviceLatency - the highest of maxDeviceReadLatency and maxDeviceWriteLatency
This is a 'Top N' monitor - the top hosts reporting latency, and their I/O to this datastore, will be listed in the alert description.
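
As a rough illustration of the summary above, the metric and the 'Top N' listing could be sketched like this in Python. The `HostLatency` record, its field names, and the `top_n_hosts` helper are hypothetical and only mirror the wording of the Product Knowledge; they are not the management pack's actual data model.

```python
from dataclasses import dataclass


@dataclass
class HostLatency:
    """Hypothetical per-host latency reading for one datastore."""
    host: str
    max_device_read_latency_ms: float
    max_device_write_latency_ms: float

    @property
    def max_device_latency_ms(self) -> float:
        # maxDeviceLatency: the highest of the read and write latency.
        return max(self.max_device_read_latency_ms,
                   self.max_device_write_latency_ms)


def top_n_hosts(readings, n=5):
    """Return the n hosts reporting the highest maxDeviceLatency."""
    return sorted(readings,
                  key=lambda r: r.max_device_latency_ms,
                  reverse=True)[:n]
```

Here `n=5` corresponds to the default Instance Count override.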

Possible overrides, with their default values:
  • Instance Count 5
  • Num Samples 1
  • Threshold1 40
  • Threshold2 80
I figure the Instance Count of 5 stands for the Top N values in the alert, while the thresholds represent a 40 ms warning level and an 80 ms critical level. I'm somewhat confused about the Num Samples value though... The sample interval isn't mentioned anywhere: not in the monitor, nor in the maxDeviceLatency/maxDeviceReadLatency/maxDeviceWriteLatency collection rules.

Am I correct to assume the interval is the collection interval we've set for the collectors in our VES web portal? The default is 5 minutes, but we have it set to 10.
So if we set the Num Samples value to 2, the monitor would only raise an alert when the threshold is breached on 2 consecutive collections, in our case 10 minutes apart.
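
The reasoning above can be sketched as a small Python function. This is only a model of the described behaviour, under some assumptions: a breach means strictly greater than the threshold, breaches must be consecutive, and one non-breaching sample resets the count. The function and parameter names are hypothetical; the defaults mirror the override values listed earlier.

```python
def alert_states(samples_ms, num_samples=1, warning_ms=40, critical_ms=80):
    """Return the monitor state after each delivered sample.

    Assumption: a state is entered only after `num_samples` consecutive
    samples above the corresponding threshold.
    """
    states = []
    consecutive_warning = 0
    consecutive_critical = 0
    for value in samples_ms:
        # A sample at or below the threshold resets the streak.
        consecutive_warning = consecutive_warning + 1 if value > warning_ms else 0
        consecutive_critical = consecutive_critical + 1 if value > critical_ms else 0
        if consecutive_critical >= num_samples:
            states.append("critical")
        elif consecutive_warning >= num_samples:
            states.append("warning")
        else:
            states.append("ok")
    return states
```

With `num_samples=2`, an isolated sample of 45 ms followed by 30 ms raises nothing, while 50 ms followed by 55 ms ends in a warning.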

Any holes in my reasoning? :)
Alec King
VP, Product Management
Posts: 1441
Liked: 361 times
Joined: Jan 01, 2006 1:01 am
Contact:

Re: Datastore Latency Analysis monitor

Post by Alec King »

Hey Nico!

You are entirely correct :D

Num Samples defines how many over-threshold samples we need before we generate an alert. Each sample is delivered on the poll schedule you defined in the Veeam Extensions settings.

So if you override Num Samples to 2, then it will be 2 x 10 minutes = 20 minutes (in your configuration) before average latency generates an alert.

Cheers,
Alec
nico.weytens

Re: Datastore Latency Analysis monitor

Post by nico.weytens »

alright, great :D

Still 2 remarks though...
1. Isn't it 10 mins apart, not 20? On a timeline it's sample-10min-sample-10min-sample-10min etc. I mean: 2 samples are 10 mins apart.
2. What do you mean by 'average' in "before average latency generates an alert"? This monitor doesn't make averages, does it? The samples taken either breach the threshold or they don't. No?
Or is the sample itself already an average over the 10min interval?
If the latter is the case, then I don't see how our storage team can claim there are only short spikes. If the average latency over 10 mins is over 40ms, then that's BAD!

*10min in our situation, the default is 5min
Alec King

Re: Datastore Latency Analysis monitor

Post by Alec King »

1. OK, so what I meant by "20 minutes" was ~20 minutes since sampling started. The timeline could be:
00.00 Collector sampling starts
00.02 high latency starts
sample for 10 mins...
00.10 deliver sample of >40 ms
sample for 10 mins...
00.20 deliver sample of >40 ms
If NumSamples = 2, then now we get an alert.
So, it could be ~20 minutes after high latency started (in my example, 18 minutes after). But you are correct, the time between samples is 10 minutes.
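
The timeline above can be reproduced with a short calculation. This is a sketch that assumes, as in the example, that the sample covering the interval in which the high latency starts already breaches the threshold, and that every later sample breaches it too. The function name and parameters are hypothetical.

```python
import math


def alert_timing(interval_min, num_samples, onset_min):
    """Return (alert_minute, minutes_since_onset).

    Minutes are counted from the moment collector sampling starts.
    """
    # The first breaching sample is delivered at the end of the
    # interval that contains the onset of high latency.
    first_delivery = (math.floor(onset_min / interval_min) + 1) * interval_min
    # Each further required sample adds one full interval.
    alert_minute = first_delivery + (num_samples - 1) * interval_min
    return alert_minute, alert_minute - onset_min
```

For a 10-minute interval, NumSamples = 2 and latency starting at minute 2, `alert_timing(10, 2, 2)` gives an alert at minute 20, which is 18 minutes after the high latency started.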

2. The latency metric is an average over the sample interval: we take "realtime" samples (in vCenter, that's every 20 seconds) and average those to deliver each data point.
So, if you get a sample of >40 ms in SCOM, then in your case that means the average latency over 10 minutes was >40 ms. And I agree with you - that's bad! That's why our default NumSamples setting is 1 :wink:
I'd say the MP is working correctly, as designed, in giving you those latency alerts, and maybe you should talk with your storage team again.....
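
To put numbers on why a delivered sample above 40 ms is hard to explain with a short spike, here is a small sketch. It assumes a plain arithmetic mean of equally weighted 20-second realtime samples (30 of them in a 10-minute interval); the 5 ms baseline and the helper names are made up for illustration.

```python
def delivered_average(realtime_ms):
    """Average of the realtime samples that make up one data point."""
    return sum(realtime_ms) / len(realtime_ms)


def spike_needed_to_breach(baseline_ms, spike_samples, total_samples,
                           threshold_ms=40):
    """Latency the spike samples must exceed (on average) for the
    delivered average to go over the threshold, given a flat baseline."""
    baseline_total = baseline_ms * (total_samples - spike_samples)
    return (threshold_ms * total_samples - baseline_total) / spike_samples


# One 20-second reading of 500 ms among 29 readings of 5 ms:
one_spike = [5] * 29 + [500]
print(delivered_average(one_spike))        # 21.5, still below 40 ms
print(spike_needed_to_breach(5, 1, 30))    # 1055.0 ms for a single reading
```

Under these assumptions a single 20-second spike would have to exceed roughly one second of latency to push a 10-minute average over the 40 ms threshold.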
nico.weytens

Re: Datastore Latency Analysis monitor

Post by nico.weytens »

OK, crystal clear, Alec. Thanks for that.

We'll take this up with our storage guys again.
keithkleiman
Enthusiast
Posts: 42
Liked: never
Joined: May 23, 2011 8:38 pm
Full Name: Keith Kleiman
Contact:

Re: Datastore Latency Analysis monitor

Post by keithkleiman »

Alec,

Some clarification on the following comment:

"2. The latency metric is an average over the sample interval, we take "realtime" samples (in vCenter, that's every 20 seconds) and average those to deliver each data point."

So the "collection interval" in the "collector settings" of the web UI is a sample taken from an average of 20 second samples from vCenter? In other words...

If my "collection interval" in the collector settings is set to 15 minutes, then I am capturing a sample that is written back to SCOM every 15 minutes. That sample (taken every 15 min) is actually an average of 45 samples [15 (collection interval) x3 (vsphere samples per minute)] from vsphere.

So if I increase Num Samples to "2" in the "Veeam VMware: Datastore Latency Analysis" monitor, a performance sample will still be written every 15 minutes (per the collector settings); however, an alert will not be generated by the monitor until 30 minutes have passed and both 15-minute samples written to SCOM average over the threshold.
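
The arithmetic in this post can be checked with a few lines of Python. The 20-second realtime interval is the vCenter figure quoted above; the function names are hypothetical.

```python
REALTIME_INTERVAL_S = 20  # vCenter "realtime" sample interval quoted in this thread


def realtime_samples_per_point(collection_interval_min):
    """Number of 20-second samples averaged into one delivered data point."""
    return collection_interval_min * 60 // REALTIME_INTERVAL_S


def minutes_covered_before_alert(collection_interval_min, num_samples):
    """Minutes of data that must average over the threshold before an alert."""
    return collection_interval_min * num_samples


print(realtime_samples_per_point(15))        # 45
print(minutes_covered_before_alert(15, 2))   # 30
```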

TIA,
Keith
Alec King

Re: Datastore Latency Analysis monitor

Post by Alec King »

Hi Keith,

Yes you got it exactly 8)

Cheers,
Alec
stanyb
Influencer
Posts: 22
Liked: 2 times
Joined: Nov 24, 2010 12:09 pm
Full Name: Stanislas Borgilion
Contact:

Re: Datastore Latency Analysis monitor

Post by stanyb »

Hi Nico,
In the last few weeks we discovered the same issue you had in mid-2014. Did your storage guys do something, or did you find another way to solve it?

rgds,
Stany
sergey.g
Veteran
Posts: 452
Liked: 76 times
Joined: May 02, 2012 1:49 pm
Full Name: Sergey Goncharenko
Contact:

Re: Datastore Latency Analysis monitor

Post by sergey.g »

Hi,

Hopefully Nico can reply with his own experience of dealing with the issue. For datastore latency alarms, I would recommend checking the Datastore Traffic Analysis dashboard for the affected datastore: if you can spot a specific VM with increased activity around the time you receive the latency alarm, it could be the root cause of the issue. If datastore usage is within expected bounds, then it is probably time to review the storage configuration. I would also recommend checking the VM latency values in the Datastore Latency Analysis dashboard: if some group of VMs is more affected than others, it is worth checking whether they reside on the same host, since it could be a host storage issue as well. There is a kernel latency counter on each host, so check that one too.

Hope this is helpful.
Thanks.
stanyb

Re: Datastore Latency Analysis monitor

Post by stanyb »

Hi Sergey,
thanks for your reply. We believe it's storage related, but we have to convince our storage vendor of this. I'm not very familiar with the Veeam MP reports, but I'll check them today.

Concerning the kernel latency counters, I know that the 4 hosts connected to the storage box on which we see these issues sometimes raise these alarms.

rgds,
Stany