Prometheus request metrics tracked twice #1759

Closed
m3co-code opened this issue Jun 16, 2017 · 6 comments · Fixed by #2567
Labels
area/middleware/metrics · kind/bug/possible (a possible bug that needs analysis before it is confirmed or fixed) · status/5-frozen-due-to-age

@m3co-code
Contributor

Do you want to request a feature or report a bug?

Bug

What did you do?

Set up a Traefik instance that provides Prometheus metrics.

What did you expect to see?

Correct recording of the HTTP request / request duration metrics, with the service label set to backend1.

What did you see instead?

Doubled recording of the HTTP request / request duration metrics: once with the service label set to http (the entrypoint) and once with it set to backend1.

Output of traefik version: (What version of Traefik are you using?)

Version:      dev
Codename:     cheddar
Go version:   go1.8.3
Built:        I don't remember exactly
OS/Arch:      darwin/amd64

What is your environment & configuration (arguments, toml, provider, platform, ...)?

# traefik.toml
logLevel = "DEBUG"
defaultEntryPoints = ["http"]

[entryPoints]
  [entryPoints.http]
  address = ":8080"

[file]

# rules
[backends]
  [backends.backend1]
     [backends.backend1.servers.server1]
     url = "http://localhost:8081"
     [backends.backend1.servers.server2]
     url = "http://localhost:8082"
     [backends.backend1.servers.server3]
     url = "http://localhost:8083"

[frontends]
  [frontends.frontend1]
  backend = "backend1"
    [frontends.frontend1.routes.test_1]
    rule = "Path:/ok"

[retry]
attempts = 3

[web]
address = ":8000"
  [web.metrics.prometheus]
    Buckets=[0.1,0.3,1.2,5.0]

Problem Description

At the moment the Traefik metrics middleware is included twice in the middleware chain: once when the HTTP servers are started in startHTTPServers and a second time when the configuration is loaded.

This leads to the following generated metrics when requesting the /metrics endpoint.

# HELP traefik_backend_retries_total How many request retries happened in total.
# TYPE traefik_backend_retries_total counter
traefik_backend_retries_total{service="backend1"} 14

# HELP traefik_request_duration_seconds How long it took to process the request.
# TYPE traefik_request_duration_seconds histogram
traefik_request_duration_seconds_bucket{service="backend1",le="0.1"} 7
traefik_request_duration_seconds_bucket{service="backend1",le="0.3"} 7
traefik_request_duration_seconds_bucket{service="backend1",le="1.2"} 7
traefik_request_duration_seconds_bucket{service="backend1",le="5"} 7
traefik_request_duration_seconds_bucket{service="backend1",le="+Inf"} 7
traefik_request_duration_seconds_sum{service="backend1"} 0.036858719
traefik_request_duration_seconds_count{service="backend1"} 7
traefik_request_duration_seconds_bucket{service="http",le="0.1"} 7
traefik_request_duration_seconds_bucket{service="http",le="0.3"} 7
traefik_request_duration_seconds_bucket{service="http",le="1.2"} 7
traefik_request_duration_seconds_bucket{service="http",le="5"} 7
traefik_request_duration_seconds_bucket{service="http",le="+Inf"} 7
traefik_request_duration_seconds_sum{service="http"} 0.037084179
traefik_request_duration_seconds_count{service="http"} 7

# HELP traefik_requests_total How many HTTP requests processed, partitioned by status code and method.
# TYPE traefik_requests_total counter
traefik_requests_total{code="200",method="GET",service="backend1"} 7
traefik_requests_total{code="200",method="GET",service="http"} 7

This is problematic because it effectively means that when you aggregate the metrics in your Prometheus queries without considering the labels, you end up with twice as many requests as actually happened. Therefore the proper fix would be to remove the integration when starting the HTTP servers. One point to consider: this would mean that redirects are no longer tracked. If tracking of redirects is desired, I guess we should create a new metric to handle this cleanly.
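For illustration, against the sample output above, a label-agnostic aggregation in PromQL double-counts every request, while scoping the query to the backend does not (a sketch only; the numbers are taken from the output above):

# Sums all samples of the family: 7 (backend1) + 7 (http) = 14,
# although only 7 requests were actually made.
sum(traefik_requests_total)

# Restricting the query to the backend yields the correct 7.
sum(traefik_requests_total{service="backend1"})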

Can you confirm that this is a bug for you, and maybe give a recommendation on how to approach the fix?

@ldez added the area/middleware/metrics and kind/bug/possible labels on Jun 16, 2017
@m3co-code
Contributor Author

One more point I just realised about a possible fix: not tracking on the entrypoint level, as described above, would also mean we would lose all 404 tracking. So request tracking in general needs a second thought and probably some rework, so that we can track all requests (redirects, not-founds, normal requests) and still be able to get specific information per backend. This probably requires creating a separate metric. HAProxy, for example, splits its metrics into frontend and backend.

Though, as Traefik works a bit differently, this should probably be entrypoints and backends, e.g. traefik_entrypoint_requests_total with the label entrypoint and traefik_backend_requests_total with the label service. To keep compatibility we should probably stay with traefik_requests_total, which also makes sense semantically. WDYT about this approach?
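As a rough sketch of what that split could look like on /metrics (metric and label names taken from the proposal above, the code/method labels from the existing traefik_requests_total; not an actual implementation):

# Tracked once per request at the entrypoint level.
traefik_entrypoint_requests_total{code="200",method="GET",entrypoint="http"} 7

# Tracked once per request at the backend level.
traefik_backend_requests_total{code="200",method="GET",service="backend1"} 7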

One more thing I would maybe do in the same pass: rename service to server. Then we would follow the naming scheme that is used throughout the code. WDYT?

@ldez
Contributor

ldez commented Sep 1, 2017

@marco-jantke can we close this issue? Related to #1968

@m3co-code
Contributor Author

No, we cannot yet. It's still a valid issue. I will soonish come up with a follow-up PR for the metrics, which will also tackle this issue.

@mmatur
Member

mmatur commented Dec 8, 2017

@marco-jantke any news on this issue?

@m3co-code
Contributor Author

Hi @mmatur, unfortunately not. I still plan to come up with a PR that improves the Prometheus metrics and fixes this bug alongside. It was delayed so long because we kept finding issues with the implementation at my work. We are soon switching routing for our production services, and I would like to wait until I see that our current implementation of the Prometheus metrics works in production as expected, and only afterwards open the PR here. Again, sorry for the long delay and that it might take a few weeks more until I can fix this issue 🐼

@m3co-code
Contributor Author

@mmatur I finally did it! #2567

traefiker pushed a commit to m3co-code/traefik that referenced this issue Jan 26, 2018
This commit extends the metrics Registry interface and its usage in order to get more insightful metrics. It also renames the Prometheus metrics and the service label (to either backend or entrypoint) in order to fix traefik#1759.

In order to extend the Registry interface, the MultiRegistry is used so that individual metric providers do not necessarily have to implement all of the metrics, but only a subset. At the moment only the Prometheus implementation supports all of the new metrics, but extending the other providers should be straightforward and can be done afterwards.

Additionally, the Prometheus exporting logic was adapted. The problem before was that the Prometheus client library is stateful and samples all tracked metrics on the client. This is a problem when the things you are tracking are dynamic. We have a metric traefik_backend_server_up which can be used to alert when too many servers of a backend become unhealthy. When Traefik learns about a new backend, it creates the concrete metric with the labels for this backend. The problem is that when the backend is undeployed, Traefik will still export the metric of the old backend, because it holds the Prometheus metric state, and the metric will only be gone after a restart of Traefik. This leads to wrong results when you calculate e.g. healthy backends / all backends, and thus breaks alerting. To fix this, one has to implement custom logic for the metric sampling; I got confirmation for this approach from Prometheus maintainers on the Prometheus users Google group (https://groups.google.com/forum/#!topic/prometheus-users/EbumSJYsJyc). To fix the problem, I implemented a configuration generation logic so that metrics that belong to a dynamic configuration (e.g. backends, entrypoints) eventually get removed again once they belong to an old configuration and have not been tracked recently.
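For example (a sketch only; the backend and url label names here are assumptions, not taken from the actual implementation), after backend1 is undeployed a stateful client would keep exporting its last sample until Traefik restarts:

# Stale sample of a server that no longer exists, still reported as up.
traefik_backend_server_up{backend="backend1",url="http://localhost:8081"} 1

A query like sum(traefik_backend_server_up) / count(traefik_backend_server_up) then still counts the removed server and reports a wrong healthy ratio.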

This PR also splits up the metric collection of requests, request durations, and retries into separate metrics for entrypoints and backends. This will fix traefik#1759. Before, there was only one metric, traefik_requests_total, which was tracked at the entrypoint level with the label service="$entrypointname" and at the backend level with service="$backendname". So for each request two traefik_requests_total samples were tracked, and a calculation of e.g. sum(traefik_requests_total) would yield twice the requests that actually happened. This is fixed by splitting up the metric into two. For the other metric providers the problem was the same; as I did not adapt those yet, they now only count requests against the backends. Extending them should happen afterwards.
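With that split in place, the two levels can be aggregated separately without double counting, along these lines (metric and label names as proposed earlier in the thread; the exact names in the released version may differ):

# Request rate as seen at the entrypoints.
sum(rate(traefik_entrypoint_requests_total[5m])) by (entrypoint)

# Request rate as seen by the backends.
sum(rate(traefik_backend_requests_total[5m])) by (backend)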
@traefiker added this to the 1.6 milestone on Jan 26, 2018
@traefik locked and limited conversation to collaborators on Sep 1, 2019