Default to Prometheus histograms, not summaries #318

richardTowers · 2023-09-13T11:22:04Z

Prometheus histograms / summaries are complex and hard to wrap one's head around. The difference between them is also quite subtle and confusing to the uninitiated. I'm sure I won't do a good job of explaining the difference in this commit message, so I'll just link to the docs[0].

The key difference for us is that summaries can't be aggregates, but histograms can.

For example, if I'm looking at http_request_duration_seconds, we might want to know "what's the 95%ile request time for this controller action". We can answer this question for a particular application instance (i.e. a specific pod) with both histograms and summaries. If we want to know the 95%ile request time aggregated across all instances / pods however, we can only do that with histograms. This is because in summaries, quantiles are computed ahead of time by the prometheus client, so they can only see information for one particular app instance. Histograms on the other hand, defer the calculation of quantiles to query time, which means they can be aggregated (but are less precise).

prometheus_exporter defaults to Summary[1], but in our case I think it makes more sense to default to Histogram. There may be some apps where we prefer Summary, so I've allowed it to be passed in as a configuration option.

From the summary metrics we have at the moment, we can see that some controller actions take significantly longer than the 10 seconds which prometheus_exporter uses as it's default max bucket. Therefore, I've added a few more buckets so we can see the distribution between 10 and 50 seconds.

[0] - https://prometheus.io/docs/practices/histograms/#quantiles
[1] - /~https://github.com/discourse/prometheus_exporter#histogram-mode

Prometheus histograms / summaries are complex and hard to wrap one's head around. The difference between them is also quite subtle and confusing to the uninitiated. I'm sure I won't do a good job of explaining the difference in this commit message, so I'll just link to the docs[0]. The key difference for us is that summaries can't be aggregates, but histograms can. For example, if I'm looking at http_request_duration_seconds, we might want to know "what's the 95%ile request time for this controller action". We can answer this question for a particular application instance (i.e. a specific pod) with both histograms and summaries. If we want to know the 95%ile request time aggregated across all instances / pods however, we can only do that with histograms. This is because in summaries, quantiles are computed ahead of time by the prometheus client, so they can only see information for one particular app instance. Histograms on the other hand, defer the calculation of quantiles to query time, which means they can be aggregated (but are less precise). prometheus_exporter defaults to Summary[1], but in our case I think it makes more sense to default to Histogram. There may be some apps where we prefer Summary, so I've allowed it to be passed in as a configuration option. From the summary metrics we have at the moment, we can see that some controller actions take significantly longer than the 10 seconds which prometheus_exporter uses as it's default max bucket. Therefore, I've added a few more buckets so we can see the distribution between 10 and 50 seconds. [0] - https://prometheus.io/docs/practices/histograms/#quantiles [1] - /~https://github.com/discourse/prometheus_exporter#histogram-mode

richardTowers · 2023-09-13T11:22:25Z

(Draft while I test that this has the intended behvaiour in one of our app)

sengi

LGTM!

sengi · 2023-09-13T12:13:50Z

AFAICT Summary solves a pretty rare/obscure use case so it's frustrating that the Prometheus docs give it so much salience. Seems to cause a lot of confusion even among library authors 😕

There's a reason B�rgmon never grew an equivalent of Summary (as far as I know) despite about a decade of use at massive scale on a wide range of different services/applications.

I'd love to know what the original use case was that gave rise to Summary. Maybe someone had an inherently global metric (say, coming from an exporter, with no need to aggregate after scrape) and it was really important for them to have super accurate, arbitrary (i.e. not chosen in advance) quantiles?

This is needed because I want to use a class from prometheus_exporter/metric as the default for a method parameter, which means I need to require the prometheus code before I use it. Previously these requires happened conditionally, so if `should_configure` was false, the prometheus code wouldn't be loaded at all. From a quick look at the prometheus exporter code, I don't think the requires have any side effects, so it should be fine to load them even when we're not configuring prometheus.

sengi · 2023-09-14T10:15:53Z

Hopefully not an actual problem, but just beware that moving the imports could have side effects (especially given that particular library's history!) Just something to look out for as this rolls out.

http_request_duration_seconds was a prometheus summary. We switched to prometheus histograms in alphagov/govuk_app_config#318 The histogram metric is called http_request_duration_seconds_bucket (note the extra `_bucket`). Now that most of the apps don't use the http_request_duration_seconds metric, it's missing most of the app labels, making it the wrong place to query to get the list of apps. Instead, the http_requests_total metric has everything we need. I've also removed the "quantile" variable, as we're no longer using the summary metric which includes quantile as a label.

The old Request duration per App/Controller/Action panel was very noisy, and possibly inaccurate. Without an app / quantile specified (i.e. by default) it would try to show 2860 series (one series for every combination of app, controller, action _and_ quantile). Even with an app and quantile specified, some apps (e.g. whitehall) have a large number of controllers and actions, so the graph would still be unusably noisy. And even if you managed to find an interesting series, the query itself is still the average (mean) across pods of a number of summaries (medians, 95th percentiles etc.), which isn't a very meaningful statistic. I think it would be more useful to use the histograms (which we have now, thanks to alphagov/govuk_app_config#318) to show a heatmap. Heatmaps show the change in distribution over time, so they give us a visual indication both of how many requests are happening, and how many fall into each request duration bucket. In this case, by default we'll bundle all the requests to all the apps together, and allow segmenting by app using the variable on the dashboard. It is possible (and useful) to segment further (down to the controller / action), but we're probably trying to do too much with a single dashboard by trying that.

sengi approved these changes Sep 13, 2023

View reviewed changes

richardTowers added 2 commits September 14, 2023 10:18

Add release notes for prometheus change

d13c747

richardTowers marked this pull request as ready for review September 14, 2023 09:22

richardTowers requested a review from sengi September 14, 2023 09:22

sengi approved these changes Sep 14, 2023

View reviewed changes

richardTowers merged commit eec4ec5 into main Sep 14, 2023

richardTowers deleted the switch-default-aggregation branch September 14, 2023 10:15

chrisroos mentioned this pull request Sep 18, 2023

Bump govuk_app_config from 9.1.0 to 9.2.0 alphagov/authenticating-proxy#462

Merged

richardTowers mentioned this pull request Oct 4, 2023

Update request rates, errors, durations dashboard alphagov/govuk-helm-charts#1302

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Default to Prometheus histograms, not summaries #318

Default to Prometheus histograms, not summaries #318

richardTowers commented Sep 13, 2023

richardTowers commented Sep 13, 2023

sengi left a comment

sengi commented Sep 13, 2023 •

edited

Loading

sengi commented Sep 14, 2023

Default to Prometheus histograms, not summaries #318

Default to Prometheus histograms, not summaries #318

Conversation

richardTowers commented Sep 13, 2023

richardTowers commented Sep 13, 2023

sengi left a comment

Choose a reason for hiding this comment

sengi commented Sep 13, 2023 • edited Loading

sengi commented Sep 14, 2023

sengi commented Sep 13, 2023 •

edited

Loading