Prometheus request metrics tracked twice #1759

Closed
m3co-code opened this issue Jun 16, 2017 · 6 comments · Fixed by #2567
Labels
area/middleware/metrics · kind/bug/possible (a possible bug that needs analysis before it is confirmed or fixed) · status/5-frozen-due-to-age

@m3co-code
Contributor

Do you want to request a feature or report a bug?

Bug

What did you do?

Set up a Traefik instance that provides Prometheus metrics.

What did you expect to see?

Correct recording of the HTTP request / request duration metrics, with the service label set to backend1.

What did you see instead?

Doubled recording of the HTTP request / request duration metrics: once with the service label set to http (the entrypoint) and once with it set to backend1.

Output of traefik version: (What version of Traefik are you using?)

Version:      dev
Codename:     cheddar
Go version:   go1.8.3
Built:        I don't remember exactly
OS/Arch:      darwin/amd64

What is your environment & configuration (arguments, toml, provider, platform, ...)?

# traefik.toml
logLevel = "DEBUG"
defaultEntryPoints = ["http"]

[entryPoints]
  [entryPoints.http]
  address = ":8080"

[file]

# rules
[backends]
  [backends.backend1]
     [backends.backend1.servers.server1]
     url = "http://localhost:8081"
     [backends.backend1.servers.server2]
     url = "http://localhost:8082"
     [backends.backend1.servers.server3]
     url = "http://localhost:8083"

[frontends]
  [frontends.frontend1]
  backend = "backend1"
    [frontends.frontend1.routes.test_1]
    rule = "Path:/ok"

[retry]
attempts = 3

[web]
address = ":8000"
  [web.metrics.prometheus]
    Buckets=[0.1,0.3,1.2,5.0]

Problem Description

At the moment the Traefik metrics middleware is included twice in the middleware chain: once when the HTTP servers are started in startHTTPServers and a second time when the configuration is loaded.

This leads to the following generated metrics when requesting the /metrics endpoint.

# HELP traefik_backend_retries_total How many request retries happened in total.
# TYPE traefik_backend_retries_total counter
traefik_backend_retries_total{service="backend1"} 14

# HELP traefik_request_duration_seconds How long it took to process the request.
# TYPE traefik_request_duration_seconds histogram
traefik_request_duration_seconds_bucket{service="backend1",le="0.1"} 7
traefik_request_duration_seconds_bucket{service="backend1",le="0.3"} 7
traefik_request_duration_seconds_bucket{service="backend1",le="1.2"} 7
traefik_request_duration_seconds_bucket{service="backend1",le="5"} 7
traefik_request_duration_seconds_bucket{service="backend1",le="+Inf"} 7
traefik_request_duration_seconds_sum{service="backend1"} 0.036858719
traefik_request_duration_seconds_count{service="backend1"} 7
traefik_request_duration_seconds_bucket{service="http",le="0.1"} 7
traefik_request_duration_seconds_bucket{service="http",le="0.3"} 7
traefik_request_duration_seconds_bucket{service="http",le="1.2"} 7
traefik_request_duration_seconds_bucket{service="http",le="5"} 7
traefik_request_duration_seconds_bucket{service="http",le="+Inf"} 7
traefik_request_duration_seconds_sum{service="http"} 0.037084179
traefik_request_duration_seconds_count{service="http"} 7

# HELP traefik_requests_total How many HTTP requests processed, partitioned by status code and method.
# TYPE traefik_requests_total counter
traefik_requests_total{code="200",method="GET",service="backend1"} 7
traefik_requests_total{code="200",method="GET",service="http"} 7

This is problematic because it effectively means that when you aggregate the metrics in your Prometheus queries without considering the labels, you end up with twice as many requests as actually happened. Therefore the proper fix would be to remove the integration when starting the HTTP servers. One point to consider: this would mean that redirects are no longer tracked. If tracking of redirects is desired, I guess we should create a new metric to handle this cleanly.
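For illustration, against the sample output above, a label-agnostic aggregation in PromQL double-counts every request, while scoping the query to the backend does not (a sketch only; the numbers are taken from the output above):

# Sums all samples of the family: 7 (backend1) + 7 (http) = 14,
# although only 7 requests were actually made.
sum(traefik_requests_total)

# Restricting the query to the backend yields the correct 7.
sum(traefik_requests_total{service="backend1"})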

Can you confirm that this is a bug for you, and maybe give a recommendation on how to approach the fix?

@ldez added the area/middleware/metrics and kind/bug/possible labels on Jun 16, 2017
@m3co-code
Contributor Author

One more point I just realised about a possible fix: not tracking on the entrypoint level, as described above, would also mean we would lose all 404 tracking. So request tracking in general needs a second thought and probably some rework, so that we can track all requests (redirects, not-founds, normal requests) and still be able to get specific information per backend. This probably requires creating a separate metric. HAProxy, for example, splits its metrics into frontend and backend.

Though, as Traefik works a bit differently, this should probably be entrypoints and backends, e.g. traefik_entrypoint_requests_total with the label entrypoint and traefik_backend_requests_total with the label service. To keep compatibility we should probably stay with traefik_requests_total, which also makes sense semantically. WDYT about this approach?
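As a rough sketch of what that split could look like on /metrics (metric and label names taken from the proposal above, the code/method labels from the existing traefik_requests_total; not an actual implementation):

# Tracked once per request at the entrypoint level.
traefik_entrypoint_requests_total{code="200",method="GET",entrypoint="http"} 7

# Tracked once per request at the backend level.
traefik_backend_requests_total{code="200",method="GET",service="backend1"} 7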

One more thing I would maybe do in the same pass: rename service to server. Then we would follow the naming scheme that is used throughout the code. WDYT?

@ldez
Contributor

ldez commented Sep 1, 2017

@marco-jantke can we close this issue? Related to #1968

@m3co-code
Contributor Author

No, we cannot yet. It's still a valid issue. I will soonish come up with a follow-up PR for the metrics, which will also tackle this issue.

@mmatur
Member

mmatur commented Dec 8, 2017

@marco-jantke any news on this issue?

@m3co-code
Contributor Author

Hi @mmatur, unfortunately not. I still plan to come up with a PR that improves the Prometheus metrics and fixes this bug alongside. It was delayed so long because we kept finding issues with the implementation at my work. We are soon switching routing for our production services, and I would like to wait until I see that our current implementation of the Prometheus metrics works in production as expected, and only afterwards open the PR here. Again, sorry for the long delay and that it might take a few weeks more until I can fix this issue 🐼

@m3co-code
Contributor Author

@mmatur I finally did it! #2567

traefiker pushed a commit to m3co-code/traefik that referenced this issue Jan 26, 2018
This commit extends the metrics Registry interface and its usage in order to get more insightful metrics. It also renames the Prometheus metrics and the service label (to either backend or entrypoint) in order to fix traefik#1759.

In order to extend the Registry interface, the MultiRegistry is used so that individual metric providers do not necessarily have to implement all of the metrics, but only a subset. At the moment only the Prometheus implementation supports all of the new metrics, but extending the other providers should be straightforward and can be done afterwards.

Additionally, the Prometheus exporting logic was adapted. The problem before was that the Prometheus client library is stateful and samples all tracked metrics on the client. This is a problem when the things you are tracking are dynamic. We have a metric traefik_backend_server_up which can be used to alert when too many servers of a backend become unhealthy. When Traefik learns about a new backend, it creates the concrete metric with the labels for this backend. The problem is that when the backend is undeployed, Traefik will still export the metric of the old backend, because it holds the Prometheus metric state, and the metric will only be gone after a restart of Traefik. This leads to wrong results when you calculate e.g. healthy backends / all backends, and thus breaks alerting. To fix this, one has to implement custom logic for the metric sampling; I got confirmation for this approach from Prometheus maintainers on the Prometheus users Google group (https://groups.google.com/forum/#!topic/prometheus-users/EbumSJYsJyc). To fix the problem, I implemented a configuration generation logic so that metrics that belong to a dynamic configuration (e.g. backends, entrypoints) eventually get removed again once they belong to an old configuration and have not been tracked recently.
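For example (a sketch only; the backend and url label names here are assumptions, not taken from the actual implementation), after backend1 is undeployed a stateful client would keep exporting its last sample until Traefik restarts:

# Stale sample of a server that no longer exists, still reported as up.
traefik_backend_server_up{backend="backend1",url="http://localhost:8081"} 1

A query like sum(traefik_backend_server_up) / count(traefik_backend_server_up) then still counts the removed server and reports a wrong healthy ratio.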

This PR also splits up the metric collection of requests, request durations, and retries into separate metrics for entrypoints and backends. This will fix traefik#1759. Before, there was only one metric, traefik_requests_total, which was tracked at the entrypoint level with the label service="$entrypointname" and at the backend level with service="$backendname". So for each request two traefik_requests_total samples were tracked, and a calculation of e.g. sum(traefik_requests_total) would yield twice the requests that actually happened. This is fixed by splitting up the metric into two. For the other metric providers the problem was the same; as I did not adapt those yet, they now only count requests against the backends. Extending them should happen afterwards.
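With that split in place, the two levels can be aggregated separately without double counting, along these lines (metric and label names as proposed earlier in the thread; the exact names in the released version may differ):

# Request rate as seen at the entrypoints.
sum(rate(traefik_entrypoint_requests_total[5m])) by (entrypoint)

# Request rate as seen by the backends.
sum(rate(traefik_backend_requests_total[5m])) by (backend)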
@traefiker added this to the 1.6 milestone on Jan 26, 2018
@traefik locked and limited conversation to collaborators on Sep 1, 2019