-
- Service Provider
- Posts: 70
- Liked: 10 times
- Joined: Jul 27, 2016 1:39 am
- Full Name: Steve Hughes
- Contact:
Aggregation Options - 95th percentile
I find myself looking for something more than just Min / Avg / Max in the aggregation options.
In my case it's for monitoring Datastore or vdisk latency. Consider a vdisk with latency that typically flickers around 10-30msec. A quick one-off latency spike to say 1 second is no big deal, whereas a steady-state latency jump to even say 50msec indicates an issue.
An average will tend to trip on the quick spike and generate unwanted warnings. So the challenge is to respond to a long-term increase in readings without being tripped by spikes.
One way I can think of to achieve this is to alert based on the minimum over the interval, but what I really want (I think) is 95th percentile logic that will just discard the abnormal readings and generate an alert based on the maximum of the remaining readings.
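For illustration, the percentile idea could look something like this (a minimal numpy sketch; the function name, parameters, and sample numbers are made up for the example and are not an existing product feature):

```python
import numpy as np

def percentile_alert(samples, threshold_ms, pct=95):
    """Discard readings above the given percentile, then alert if the
    maximum of what remains still exceeds the threshold.
    Illustrative only; not a real monitoring-product API."""
    samples = np.asarray(samples, dtype=float)
    cutoff = np.percentile(samples, pct)   # e.g. the 95th percentile
    steady = samples[samples <= cutoff]    # spikes discarded
    return bool(steady.max() > threshold_ms)

# A one-off 1000 ms spike on a 10-30 ms baseline: no alarm.
normal = [12, 15, 28, 10, 1000, 14, 22, 18, 11, 25,
          13, 16, 20, 17, 12, 19, 23, 15, 14, 21]
print(percentile_alert(normal, threshold_ms=50))    # False

# Baseline shifted to ~55-70 ms: alarm fires on the remaining readings.
degraded = [55, 62, 70, 58, 65, 61, 59, 66, 63, 60,
            57, 64, 68, 56, 62, 59, 61, 67, 58, 63]
print(percentile_alert(degraded, threshold_ms=50))  # True
```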
Any thoughts?
Steve
-
- Veteran
- Posts: 7328
- Liked: 781 times
- Joined: May 21, 2014 11:03 am
- Full Name: Nikita Shestakov
- Location: Prague
- Contact:
Re: Aggregation Options - 95th percentile
Hello Steve,
I think in your case you should use average aggregation, which helps smooth out the effect of spikes.
The longer the "time period" (observation interval) you choose, the less effect spikes have.
Nikita
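To illustrate the trade-off: a longer observation interval dilutes a spike's effect on the average. A minimal sketch (the function and the sample numbers are made up for illustration):

```python
import numpy as np

def window_average(samples, window):
    """Mean over the most recent `window` readings - a simple stand-in
    for an 'observation interval'. Illustrative, not a product API."""
    return float(np.mean(samples[-window:]))

# Steady 20 ms baseline with one 1000 ms spike as the latest reading.
latency = [20.0] * 59 + [1000.0]

print(window_average(latency, window=5))   # 216.0 - short window trips easily
print(window_average(latency, window=60))  # ~36.3 - long window dilutes the spike
```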
-
- Service Provider
- Posts: 70
- Liked: 10 times
- Joined: Jul 27, 2016 1:39 am
- Full Name: Steve Hughes
- Contact:
Re: Aggregation Options - 95th percentile
That's true, but I find that in order to avoid false latency trips I need to set the averaging interval very long or set the threshold higher than I would like. What I'm really looking for is an algorithm that discards the high spikes. Just putting it out there as a feature request. 95th percentile was designed to do just that. I see value in it, others may not agree.
-
- Veteran
- Posts: 7328
- Liked: 781 times
- Joined: May 21, 2014 11:03 am
- Full Name: Nikita Shestakov
- Location: Prague
- Contact:
Re: Aggregation Options - 95th percentile
How do you know that those spikes are false?
Some people on the contrary want to identify those spikes with alarms.
-
- Service Provider
- Posts: 70
- Liked: 10 times
- Joined: Jul 27, 2016 1:39 am
- Full Name: Steve Hughes
- Contact:
Re: Aggregation Options - 95th percentile
I don't think the spikes are false, and I agree there are instances where you would want to alert on them, but in this case I'm not interested in alerting on them; I'd love to find a way to screen them out and alert only on the remaining trend.
-
- Veteran
- Posts: 7328
- Liked: 781 times
- Joined: May 21, 2014 11:03 am
- Full Name: Nikita Shestakov
- Location: Prague
- Contact:
Re: Aggregation Options - 95th percentile
If the spikes are real, what's the point of ignoring them and using only the lower 95% of values?
-
- Service Provider
- Posts: 70
- Liked: 10 times
- Joined: Jul 27, 2016 1:39 am
- Full Name: Steve Hughes
- Contact:
Re: Aggregation Options - 95th percentile
I’ll try to explain. Maybe there is a better way to achieve what I’m looking for.
Our storage uses a mix of nearline spinners for capacity and SSD for performance. It aims to keep the hot data on the SSDs and it moves the colder data off to the spinners. The effect is that for a reasonable price we can offer the client a volume that is both large and also apparently very fast. In reality it’s very fast only for the most commonly used data, and it can be much slower if they access cold data, but that’s fine with us and it’s fine with the client.
A graph of latency for such a volume shows a baseline that is very low, usually around 1msec, but with peaks that commonly hit 50 msec, and can go as high as 1000msec if the spinners are unusually busy e.g. during backups. The thing is that we don’t really worry about the peaks because they are expected and are acceptable to both us and the client. What we do care about is that baseline. If the baseline latency rises from 1msec to 10msec the client will notice, and it is probably indicative of an issue we need to look at.
Trying to reliably trigger on an increase in the baseline latency without getting false trips from the peaks is proving difficult. I figure that the ideal way to approach this is to first filter out the high values that we aren’t interested in and then form an average from what is left.
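The filter-then-average idea could be sketched like this (a numpy sketch; the helper name, the 10 ms alert threshold, and the sample series are assumptions for the example, not a product feature):

```python
import numpy as np

def baseline_average(samples, pct=95):
    """Average of the readings at or below the given percentile, so the
    result tracks the baseline rather than the expected peaks.
    Illustrative only; not a real monitoring-product API."""
    samples = np.asarray(samples, dtype=float)
    cutoff = np.percentile(samples, pct)
    return float(samples[samples <= cutoff].mean())

# Healthy volume: ~1 ms baseline with expected 50-1000 ms peaks.
healthy = [1.0, 1.2, 0.9, 50.0, 1.1, 1.0, 1000.0, 0.8, 1.3, 1.1]
# Degraded volume: the baseline itself has risen to ~10 ms.
degraded = [10.0, 11.2, 9.8, 50.0, 10.5, 10.1, 1000.0, 9.9, 10.4, 10.2]

print(np.mean(healthy) > 10.0)            # True  - raw average false-alarms
print(baseline_average(healthy) > 10.0)   # False - filtered average does not
print(baseline_average(degraded) > 10.0)  # True  - real baseline shift still caught
```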
-
- Veteran
- Posts: 7328
- Liked: 781 times
- Joined: May 21, 2014 11:03 am
- Full Name: Nikita Shestakov
- Location: Prague
- Contact:
Re: Aggregation Options - 95th percentile
Thanks for the details.
So far there is no such option. What I can offer is to suppress alarms during backup activity, so those spikes are not taken into account.