Datastore Latency Analysis monitor

Post by **nico.weytens** » May 27, 2014 9:42 am

We get several alerts a day from the Datastore Latency Analysis monitor. When we see a pattern we raise it with our storage team, but they claim everything is fine at their end and that we are exaggerating the problem. According to them there are only sporadic, short latency issues that can safely be ignored.

I've examined the Product Knowledge for this monitor, and the possible overrides, but I'd like some extra info.
The Product Knowledge summary:
This monitor tracks threshold breaches for the following metric: maxDeviceLatency - the highest of maxDeviceReadLatency and maxDeviceWriteLatency
This is a 'Top N' monitor - the top hosts reporting latency, and their I/O to this datastore, will be listed in the alert description.
Possible overrides, with their default values:

Instance Count: 5
Num Samples: 1
Threshold1: 40
Threshold2: 80

I figure the Instance Count of 5 stands for the Top N values in the alert, while the thresholds represent 40ms warning and 80ms critical level. I'm somewhat confused on the Num Samples value though... The sample interval isn't mentioned anywhere: not in the monitor, nor in the maxDeviceLatency/maxDeviceReadLatency/maxDeviceWriteLatency collection rules.

Am I correct to assume the interval is the value we've set in our VES webportal for collection interval of the collectors? The default is 5mins, but we have it on 10.
So if we'd set the Num Samples value to 2, the monitor would only spawn an alert when the threshold is breached over 2 consecutive collections, in our case 10mins apart.

Any holes in my reasoning?

Post by **Alec King** » May 27, 2014 9:48 am

Hey Nico!

You are entirely correct

Num Samples defines how many over-threshold-triggers we need, before we generate an alert. And each sample is delivered on the poll schedule you defined in Veeam Extensions settings.

So if you override Num Samples to 2, then it will be 2 x 10 minutes = 20 minutes (in your configuration) before average latency generates an alert.

Cheers,
Alec
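To make the Num Samples behaviour described above concrete, here is a minimal sketch of a consecutive-breach check. It is not the management pack's actual code: the function name `evaluate`, the strict `>` comparison, and the streak resetting on a healthy sample are all assumptions for illustration. Only the default thresholds (40 and 80) and the "N consecutive over-threshold samples" rule come from the thread.

```python
from typing import Iterable, List, Optional

WARNING_MS = 40   # Threshold1 default quoted in the thread
CRITICAL_MS = 80  # Threshold2 default quoted in the thread


def evaluate(samples_ms: Iterable[float], num_samples: int = 1) -> List[Optional[str]]:
    """Return the alert state after each delivered sample (None means healthy).

    Hypothetical model: a state is raised only once `num_samples`
    consecutive samples have exceeded the matching threshold.
    """
    states: List[Optional[str]] = []
    warn_streak = 0
    crit_streak = 0
    for value in samples_ms:
        # A sample at or below the threshold resets the streak (assumption).
        warn_streak = warn_streak + 1 if value > WARNING_MS else 0
        crit_streak = crit_streak + 1 if value > CRITICAL_MS else 0
        if crit_streak >= num_samples:
            states.append("critical")
        elif warn_streak >= num_samples:
            states.append("warning")
        else:
            states.append(None)
    return states


# With Num Samples = 2, an isolated breach does not alert, two in a row do.
print(evaluate([45, 10, 45, 50], num_samples=2))
```

With the default Num Samples of 1, every over-threshold sample alerts immediately. With 2, the breach has to persist across two consecutive collections.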

Post by **nico.weytens** » May 27, 2014 10:38 am

alright, great

Still 2 remarks though...
1. Isn't it 10mins apart, not 20? On a timeline it's sample-10min-sample-10min-sample-10min etc. I mean: 2 samples are 10mins apart.
2. What do you mean by 'average' in "before average latency generates an alert"? This monitor doesn't compute averages, does it? The samples taken either breach the threshold or they don't. No?
Or is the sample itself already an average over the 10min interval?
If the latter is the case, then I don't see how our storage team can claim there are only short spikes. If the average latency over 10mins is over 40ms, then that's BAD!

*10min in our situation, the default is 5min

Post by **Alec King** » May 27, 2014 10:52 am

1. OK, so what I meant by "20 minutes" was ~20 minutes since sampling started. The timeline could be -
00.00 Collector sampling starts
00.02 high latency starts
sample for 10 mins...
00.10 deliver sample of >40 ms
sample for 10 mins...
00.20 deliver sample of >40 ms
If NumSamples = 2, then now we get an alert.
So, it could be ~20 minutes after high latency started (in my example, 18 minutes after). But you are correct, the time between samples is 10 minutes.
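The timeline above can be reproduced with a few lines of arithmetic. This is only a sketch: the function name `minutes_until_alert` is made up, and it assumes sampling starts at minute 0 and that every collection interval overlapping the high latency delivers an over-threshold sample, as in the example.

```python
import math


def minutes_until_alert(latency_start_min: float, interval_min: float, num_samples: int) -> float:
    """Minutes between high latency starting and the alert firing.

    Assumes sampling starts at t=0 and that every interval overlapping the
    high latency averages above the threshold (as in the example timeline).
    """
    # First delivery that covers the moment latency went up.
    first_breaching_delivery = (math.floor(latency_start_min / interval_min) + 1) * interval_min
    # Each further required sample costs one more full interval.
    alert_time = first_breaching_delivery + (num_samples - 1) * interval_min
    return alert_time - latency_start_min


# Latency starts at 00.02, 10-minute interval, Num Samples = 2.
print(minutes_until_alert(2, 10, 2))
```

The gap between two samples is always one interval. The delay before the alert depends on where in the first interval the latency started.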

2. The latency metric is an average over the sample interval: we take "realtime" samples (in vCenter, that's every 20 seconds) and average those to deliver each data point.
So, if you get a sample of >40ms in SCOM, then in your case that means average latency over 10 minutes was >40 ms. And I agree with you - that's bad! That's why our default NumSamples setting is 1.

I'd say, that MP is working correctly as designed to give you those latency alerts, and maybe you should talk with your storage team again.....
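As a rough illustration of why an averaged data point above 40 ms is hard to explain with a brief spike, here is a small sketch. The 20-second realtime spacing comes from the post above. The helper names and the latency figures (5 ms baseline, 200 ms spike) are invented purely for the example and are not measurements from anyone's environment.

```python
REALTIME_SECONDS = 20  # vCenter realtime sample spacing, per the thread


def data_point(realtime_ms):
    """Average the realtime samples of one collection interval."""
    return sum(realtime_ms) / len(realtime_ms)


def interval_with_spike(interval_min, baseline_ms, spike_ms, spike_seconds):
    """Build the realtime samples for one interval containing a single spike.

    All latency values passed in are hypothetical illustration figures.
    """
    total = interval_min * 60 // REALTIME_SECONDS
    spiking = spike_seconds // REALTIME_SECONDS
    return [spike_ms] * spiking + [baseline_ms] * (total - spiking)


# A 1-minute spike to 200 ms inside a 10-minute interval at 5 ms baseline.
short = data_point(interval_with_spike(10, 5, 200, 60))
# The same spike lasting 2 minutes.
longer = data_point(interval_with_spike(10, 5, 200, 120))
print(short, longer)
```

In this made-up case the 1-minute spike averages to 24.5 ms and stays under the 40 ms threshold, while the 2-minute one averages to 44.0 ms and breaches it. A data point over the threshold therefore implies latency that was either very high or sustained for a good part of the interval.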

Post by **nico.weytens** » May 28, 2014 6:24 am

OK, crystal clear, Alec. Thanks for that.

We'll take this up with our storage guys again.

Post by **keithkleiman** » May 29, 2015 4:39 pm

Alec,

Some clarification on the following comment:

"2. The latency metric is an average over the sample interval, we take "realtime" samples (in vCenter, that's every 20 seconds) and average those to deliver each data point."

So the "collection interval" in the "collector settings" of the web UI is a sample taken from an average of 20 second samples from vCenter? In other words...

If my "collection interval" in the collector settings is set to 15 minutes, then I am capturing a sample that is written back to SCOM every 15 minutes. That sample (taken every 15 min) is actually an average of 45 samples [15 (collection interval) x 3 (vSphere samples per minute)] from vSphere.

So if I increase Num Samples to "2" in the "Veeam VMware: Datastore Latency Analysis" monitor, a performance sample will still be written every 15 minutes (per the collector settings), but the monitor will not generate an alert until 30 minutes have passed and both 15 min samples written to SCOM average over the thresholds.

TIA,
Keith

Post by **Alec King** » May 29, 2015 4:47 pm

Hi Keith,

Yes you got it exactly

Cheers,
Alec
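Keith's arithmetic, spelled out as a tiny sketch. The function names are hypothetical, and the 20-second realtime spacing is the figure quoted earlier in the thread.

```python
def realtime_samples_per_data_point(collection_interval_min, realtime_seconds=20):
    """How many vCenter realtime samples are averaged into one SCOM data point."""
    return collection_interval_min * 60 // realtime_seconds


def minutes_of_sampling_before_alert(collection_interval_min, num_samples):
    """Sampling time needed to deliver `num_samples` consecutive data points."""
    return collection_interval_min * num_samples


# 15-minute collection interval, Num Samples = 2.
print(realtime_samples_per_data_point(15))
print(minutes_of_sampling_before_alert(15, 2))
```

For a 15-minute interval this gives 45 realtime samples per data point, and 30 minutes of sampling before a Num Samples = 2 monitor can alert.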

Post by **stanyb** » Jan 19, 2016 8:35 am

Hi Nico,
In the last few weeks we discovered the same issue you had in mid 2014. Did your storage guys do something, or did you find another way to solve it?

rgds,
Stany

Post by **sergey.g** » Jan 21, 2016 12:52 pm

Hi,

Hopefully Nico can reply to your question with his experience of dealing with the issue. In the meantime, for datastore latency alarms I would recommend checking the Datastore Traffic Analysis dashboard for the affected datastore. If you can spot a specific VM with increased activity around the time you receive the latency alarm, that VM could be the root cause of the issue. If datastore usage is within expected bounds, then it is probably time to review the storage configuration.

I would also recommend checking VM latency values in the Datastore Latency Analysis dashboard. If some group of VMs is more affected than others, check whether they reside on the same host, because it could be a host storage issue as well. There is a kernel latency counter on each host, so check that one too.

Hope this is helpful.
Thanks.

Post by **stanyb** » Jan 22, 2016 8:03 am

Hi Sergey,
thanks for your reply. We believe it's storage related, but we have to convince our storage vendor of this. I'm not very familiar with the Veeam MP reports, but I'll check them today.

Concerning kernel latency counters, I know that the 4 hosts connected to the storage box on which we discovered these issues sometimes raise these alarms.

rgds,
Stany
