Aggregation Options - 95th percentile

Post by **stevehughes** » Jul 25, 2018 3:47 am

I find myself looking for something more than just Min / Avg / Max in the aggregation options.

In my case it's for monitoring Datastore or vdisk latency. Consider a vdisk with latency that typically flickers around 10-30 msec. A quick one-off latency spike to, say, 1 second is no big deal, whereas a steady-state latency jump to even, say, 50 msec indicates an issue.

An average will tend to trip on the quick spike and generate unwanted warnings. So the challenge is to respond to a long-term increase in the readings without being tripped by spikes.

One way I can think of to achieve this is to alert based on the minimum over the interval, but what I really want (I think) is 95th percentile logic that will just discard the abnormal readings and generate an alert based on the maximum of the remaining readings.
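To make the idea concrete, here is a minimal Python sketch of that logic, using the nearest-rank definition of a percentile. The function name and the sample window are made up for illustration; this is not anything the product exposes today.

```python
import math

def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples (nearest-rank method)."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of the data
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

# Hypothetical window: 19 readings between 10 and 30 msec plus one 1000 msec spike
window_ms = [10, 12, 15, 30, 22, 18, 25, 11, 14, 1000,
             20, 27, 13, 16, 19, 21, 24, 17, 28, 29]

average_ms = sum(window_ms) / len(window_ms)      # pulled up by the single spike
p95_ms = percentile_nearest_rank(window_ms, 95)   # the spike is discarded
```

With a 50 msec threshold, the average of this window (68.55) would trip while the 95th percentile (30) would not. If every reading in the window sat at 55 msec, the 95th percentile would also be 55 and the alert would fire as intended.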

Any thoughts?

Steve

Post by **Shestakov** » Jul 25, 2018 10:38 am

Hello Steve,
I think in your case you need to use average aggregation, which helps smooth out the effect of spikes.
The longer the "time period" (observation interval) you choose, the smaller the effect of the spikes.
Nikita
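
As a rough illustration of how a longer observation interval dilutes a spike, the Python sketch below computes the average of a window that contains exactly one spike. The baseline, the spike size, and the function name are assumptions picked for the example, not values from a real system.

```python
def mean_with_one_spike(n_samples, baseline_ms=20.0, spike_ms=1000.0):
    """Average of n_samples readings where exactly one reading is a spike."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return (baseline_ms * (n_samples - 1) + spike_ms) / n_samples

# Compare a short, a medium and a long observation interval
for n in (5, 15, 60):
    print(f"{n:>3} samples -> average {mean_with_one_spike(n):.1f} msec")
```

Under these assumptions the window needs at least 33 samples before a single 1000 msec spike stops pushing the average over a 50 msec threshold. That is the trade-off here: the interval has to be long, or the threshold has to be set higher.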

Post by **stevehughes** » Jul 25, 2018 10:43 am

That's true, but I find that in order to avoid false latency trips I need to set the averaging interval very long or set the threshold higher than I would like. What I'm really looking for is an algorithm that discards the high spikes. Just putting it out there as a feature request. 95th percentile was designed to do just that. I see value in it, others may not agree.

Post by **Shestakov** » Jul 25, 2018 10:54 am

How do you know that those spikes are false?
Some people, on the contrary, want to catch those spikes with alarms.

Post by **stevehughes** » Jul 25, 2018 11:02 am

I don't think the spikes are false, and I agree that there are some instances where you would want to alert on the spikes, but in this instance I'm not interested in alerting on them and would love to find a way to just screen them out and just alert me on the remaining trend.

Post by **Shestakov** » Jul 25, 2018 1:59 pm

If the spikes are real, what's the point of ignoring them and using only the lower 95% of the values?

Post by **stevehughes** » Jul 25, 2018 10:48 pm

I’ll try to explain. Maybe there is a better way to achieve what I’m looking for.

Our storage uses a mix of nearline spinners for capacity and SSD for performance. It aims to keep the hot data on the SSDs and it moves the colder data off to the spinners. The effect is that for a reasonable price we can offer the client a volume that is both large and also apparently very fast. In reality it’s very fast only for the most commonly used data, and it can be much slower if they access cold data, but that’s fine with us and it’s fine with the client.

A graph of latency for such a volume shows a baseline that is very low, usually around 1 msec, but with peaks that commonly hit 50 msec, and can go as high as 1000 msec if the spinners are unusually busy, e.g. during backups. The thing is that we don’t really worry about the peaks because they are expected and are acceptable to both us and the client. What we do care about is that baseline. If the baseline latency rises from 1 msec to 10 msec the client will notice, and it is probably indicative of an issue we need to look at.

Trying to reliably trigger on an increase in the baseline latency without getting false trips from the peaks is proving difficult. I figure that the ideal way to approach this is to first filter out the high values that we aren’t interested in and then form an average from what is left.
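
A sketch of that filter-then-average idea in Python is below. It amounts to a one-sided (upper) trimmed mean. The function name, the share of readings dropped, and the sample windows are assumptions for illustration only.

```python
def upper_trimmed_mean(samples, drop_percent=5):
    """Average of samples after discarding the highest drop_percent of them."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 <= drop_percent < 100:
        raise ValueError("drop_percent must be in the range [0, 100)")
    ordered = sorted(samples)
    # Number of highest readings to throw away (rounded down)
    drop_count = int(len(ordered) * drop_percent // 100)
    kept = ordered[:len(ordered) - drop_count]
    return sum(kept) / len(kept)

# Hypothetical windows: a baseline plus two 50 msec peaks and one 1000 msec peak
healthy_ms = [1] * 17 + [50, 50, 1000]    # baseline around 1 msec
degraded_ms = [10] * 17 + [50, 50, 1000]  # baseline has risen to 10 msec

healthy_baseline = upper_trimmed_mean(healthy_ms, drop_percent=20)
degraded_baseline = upper_trimmed_mean(degraded_ms, drop_percent=20)
```

Dropping the top 20% of each window gives 1.0 msec for the healthy case and 10.0 msec for the degraded one, so a threshold of a few msec separates them even though the peaks are identical. Because the peaks are common on this kind of volume, the share to drop would probably need tuning and may need to be well above 5%.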

Post by **Shestakov** » Jul 26, 2018 10:38 am

Thanks for the details.
So far there is no such option. What I can suggest is suppressing alarms during backup activity, so that these spikes are not taken into account.
