Prometheus request metrics tracked twice #1759
One more point I just realised about a possible fix: not tracking on the entrypoint level as described above would also mean we lose all 404 trackings. So the request tracking in general needs a second thought and probably some rework, so that we can track all requests (redirects, not-founds, normal requests) and still have the ability to get some specific information per backend. Probably this requires creating a separate metric, though as Traefik works a bit differently. One more thing I would maybe do in this turn: rename
@marco-jantke can we close this issue? Related to #1968
No, we cannot yet. It's still a valid issue. I will soonish come up with a follow-up PR for the metrics, which will also tackle this issue.
@marco-jantke any news on this issue?
Hi @mmatur, unfortunately not. I still plan to come up with a PR that improves the Prometheus metrics and fixes this bug alongside. It was delayed so long because we kept finding issues with the implementation at my work. We are soon switching routing for our production services, and I would like to wait until I see that our current implementation of the Prometheus metrics works in production as expected; only afterwards will I open the PR here. Again, sorry for the long delay and that it might take a few weeks more until I can fix this issue 🐼
This commit extends the metrics Registry interface and extends its usage in order to get more insightful metrics. It also renames the Prometheus metrics and the service label (to either backend or entrypoint) in order to fix traefik#1759. In order to extend the Registry interface, the MultiRegistry is used, so that single metric providers do not necessarily have to implement all of the metrics, but only a subset. At the moment only the Prometheus implementation supports all new metrics, but extending the other providers should be straightforward and can be done afterwards.

Additionally, the Prometheus exporting logic was adapted. The problem before was that the Prometheus client library is stateful and samples all tracked metrics on the client. This is a problem when the things you are tracking are dynamic. We have a metric traefik_backend_server_up which can be used to alert when too many servers of a backend become unhealthy. When Traefik learns about a new backend, it creates the concrete metric with the labels for this backend. The problem is that when the backend is undeployed, Traefik will still export the metric of the old backend, because it holds the Prometheus metric state, and the metric will only be gone after a restart of Traefik. This leads to wrong results when you calculate e.g. how many of your backends are healthy out of all backends, and thus breaks alerting. To fix this, one has to implement custom logic for the metric sampling; I got confirmation for this approach from Prometheus maintainers on the Prometheus users Google group. Concretely, I implemented configuration-generation logic so that metrics that belong to a dynamic configuration (e.g. backends, entrypoints) eventually get removed again when they belong to an old configuration and have not been tracked recently.
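The generation-based cleanup described above can be sketched roughly as follows. This is a minimal Python simulation of the idea, not Traefik's actual Go implementation; the class and method names are made up for illustration:

```python
class GenerationRegistry:
    """Tracks metric label sets and drops ones not seen in recent configurations."""

    def __init__(self, keep_generations=1):
        self.keep_generations = keep_generations
        self.generation = 0
        # label set -> (value, generation in which it was last tracked)
        self.samples = {}

    def track(self, labels, value):
        self.samples[labels] = (value, self.generation)

    def new_configuration(self):
        """Called whenever a new dynamic configuration is loaded."""
        self.generation += 1
        cutoff = self.generation - self.keep_generations
        # Drop label sets last tracked in a configuration older than the
        # cutoff, so metrics of undeployed backends disappear from /metrics.
        self.samples = {
            labels: (value, gen)
            for labels, (value, gen) in self.samples.items()
            if gen >= cutoff
        }

    def export(self):
        return {labels: value for labels, (value, _) in self.samples.items()}
```

A backend that keeps being tracked survives each configuration rollover; one that stops being tracked (e.g. because it was undeployed) is removed after it falls behind by more than `keep_generations` configurations.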
https://groups.google.com/forum/#!topic/prometheus-users/EbumSJYsJyc

This PR also splits the metric collection of requests, request durations and retries into separate metrics for entrypoints and backends. This will fix traefik#1759. Before, there was only one metric, traefik_requests_total, which was tracked at the entrypoint level with the label service="$entrypointname" and at the backend level with service="$backendname". So for each request, two traefik_requests_total metrics were tracked, and a calculation such as sum(traefik_requests_total) would yield twice the number of requests that actually happened. This is fixed by splitting the metric into two. The other metric providers had the same problem; as I did not adapt them yet, they will now only count requests against the backends. Extending them should happen afterwards.
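The double counting and the effect of the split can be illustrated with a small simulation. Plain Python counters stand in for the Prometheus client here; the metric names follow the description above, everything else is hypothetical:

```python
from collections import Counter

# Before: one metric, tracked once per entrypoint and once per backend.
before = Counter()

def handle_request_before(entrypoint, backend):
    before[("traefik_requests_total", f"service={entrypoint}")] += 1
    before[("traefik_requests_total", f"service={backend}")] += 1

# After: two metrics, each tracked exactly once per request.
after = Counter()

def handle_request_after(entrypoint, backend):
    after[("traefik_entrypoint_requests_total", f"entrypoint={entrypoint}")] += 1
    after[("traefik_backend_requests_total", f"backend={backend}")] += 1

for _ in range(10):
    handle_request_before("http", "backend1")
    handle_request_after("http", "backend1")

# Summing the single shared metric over all label values double-counts:
print(sum(v for (name, _), v in before.items()
          if name == "traefik_requests_total"))          # 20, not 10
# With split metrics, each metric sums to the real request count:
print(sum(v for (name, _), v in after.items()
          if name == "traefik_backend_requests_total"))  # 10
```

The aggregate over the old shared metric reports 20 requests for 10 actual ones, while each of the split metrics reports 10, which is the behavior the PR aims for.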
Do you want to request a feature or report a bug?
Bug
What did you do?
Setup a Traefik instance that provides Prometheus metrics.
What did you expect to see?
Correct recording of http requests / http request durations metrics with the `service` label set to `backend1`.

What did you see instead?

Doubled recording of http requests / http request durations metrics: one time with the `service` label `http` (entrypoint) + one time with `backend1`.

Output of `traefik version`: (What version of Traefik are you using?)

What is your environment & configuration (arguments, toml, provider, platform, ...)?
Problem Description
At the moment the Traefik metrics middleware is included twice in the middleware chain. One time the metrics are included in `startHTTPServers` here, and a second time when loading the config here. This leads to the following generated metrics when requesting the `/metrics` endpoint.

This is problematic because, effectively, when you aggregate the metrics in your Prometheus queries without considering the labels, you end up with twice as many requests as actually happened. Therefore the proper fix should be to remove the integration when starting the HTTP servers. As one point to consider, this would mean that redirects are no longer tracked. When tracking of the redirects is desired, I guess we should create a new metric to handle this cleanly.
Can you confirm that this is a bug for you, and maybe give a recommendation on how to approach the fix?