-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Default to Prometheus histograms, not summaries #318
Conversation
Prometheus histograms / summaries are complex and hard to wrap one's head around. The difference between them is also quite subtle and confusing to the uninitiated. I'm sure I won't do a good job of explaining the difference in this commit message, so I'll just link to the docs[0]. The key difference for us is that summaries can't be aggregates, but histograms can. For example, if I'm looking at http_request_duration_seconds, we might want to know "what's the 95%ile request time for this controller action". We can answer this question for a particular application instance (i.e. a specific pod) with both histograms and summaries. If we want to know the 95%ile request time aggregated across all instances / pods however, we can only do that with histograms. This is because in summaries, quantiles are computed ahead of time by the prometheus client, so they can only see information for one particular app instance. Histograms on the other hand, defer the calculation of quantiles to query time, which means they can be aggregated (but are less precise). prometheus_exporter defaults to Summary[1], but in our case I think it makes more sense to default to Histogram. There may be some apps where we prefer Summary, so I've allowed it to be passed in as a configuration option. From the summary metrics we have at the moment, we can see that some controller actions take significantly longer than the 10 seconds which prometheus_exporter uses as it's default max bucket. Therefore, I've added a few more buckets so we can see the distribution between 10 and 50 seconds. [0] - https://prometheus.io/docs/practices/histograms/#quantiles [1] - /~https://github.com/discourse/prometheus_exporter#histogram-mode
(Draft while I test that this has the intended behvaiour in one of our app) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
AFAICT Summary solves a pretty rare/obscure use case so it's frustrating that the Prometheus docs give it so much salience. Seems to cause a lot of confusion even among library authors 😕 There's a reason B�rgmon never grew an equivalent of Summary (as far as I know) despite about a decade of use at massive scale on a wide range of different services/applications. I'd love to know what the original use case was that gave rise to Summary. Maybe someone had an inherently global metric (say, coming from an exporter, with no need to aggregate after scrape) and it was really important for them to have super accurate, arbitrary (i.e. not chosen in advance) quantiles? |
This is needed because I want to use a class from prometheus_exporter/metric as the default for a method parameter, which means I need to require the prometheus code before I use it. Previously these requires happened conditionally, so if `should_configure` was false, the prometheus code wouldn't be loaded at all. From a quick look at the prometheus exporter code, I don't think the requires have any side effects, so it should be fine to load them even when we're not configuring prometheus.
Hopefully not an actual problem, but just beware that moving the imports could have side effects (especially given that particular library's history!) Just something to look out for as this rolls out. |
http_request_duration_seconds was a prometheus summary. We switched to prometheus histograms in alphagov/govuk_app_config#318 The histogram metric is called http_request_duration_seconds_bucket (note the extra `_bucket`). Now that most of the apps don't use the http_request_duration_seconds metric, it's missing most of the app labels, making it the wrong place to query to get the list of apps. Instead, the http_requests_total metric has everything we need. I've also removed the "quantile" variable, as we're no longer using the summary metric which includes quantile as a label.
The old Request duration per App/Controller/Action panel was very noisy, and possibly inaccurate. Without an app / quantile specified (i.e. by default) it would try to show 2860 series (one series for every combination of app, controller, action _and_ quantile). Even with an app and quantile specified, some apps (e.g. whitehall) have a large number of controllers and actions, so the graph would still be unusably noisy. And even if you managed to find an interesting series, the query itself is still the average (mean) across pods of a number of summaries (medians, 95th percentiles etc.), which isn't a very meaningful statistic. I think it would be more useful to use the histograms (which we have now, thanks to alphagov/govuk_app_config#318) to show a heatmap. Heatmaps show the change in distribution over time, so they give us a visual indication both of how many requests are happening, and how many fall into each request duration bucket. In this case, by default we'll bundle all the requests to all the apps together, and allow segmenting by app using the variable on the dashboard. It is possible (and useful) to segment further (down to the controller / action), but we're probably trying to do too much with a single dashboard by trying that.
Prometheus histograms / summaries are complex and hard to wrap one's head around. The difference between them is also quite subtle and confusing to the uninitiated. I'm sure I won't do a good job of explaining the difference in this commit message, so I'll just link to the docs[0].
The key difference for us is that summaries can't be aggregates, but histograms can.
For example, if I'm looking at http_request_duration_seconds, we might want to know "what's the 95%ile request time for this controller action". We can answer this question for a particular application instance (i.e. a specific pod) with both histograms and summaries. If we want to know the 95%ile request time aggregated across all instances / pods however, we can only do that with histograms. This is because in summaries, quantiles are computed ahead of time by the prometheus client, so they can only see information for one particular app instance. Histograms on the other hand, defer the calculation of quantiles to query time, which means they can be aggregated (but are less precise).
prometheus_exporter defaults to Summary[1], but in our case I think it makes more sense to default to Histogram. There may be some apps where we prefer Summary, so I've allowed it to be passed in as a configuration option.
From the summary metrics we have at the moment, we can see that some controller actions take significantly longer than the 10 seconds which prometheus_exporter uses as it's default max bucket. Therefore, I've added a few more buckets so we can see the distribution between 10 and 50 seconds.
[0] - https://prometheus.io/docs/practices/histograms/#quantiles
[1] - /~https://github.com/discourse/prometheus_exporter#histogram-mode