
Getting "Kubernetes connection error failed to decode watch event : unexpected EOF" every minute in Traefik log #732

Closed
kashook opened this issue Oct 12, 2016 · 20 comments

@kashook commented Oct 12, 2016

I am testing Traefik 1.1.0-rc1 as an ingress controller for a Kubernetes cluster. I've noticed that the log for Traefik has this error message showing up every minute on the dot:

time="2016-10-12T13:29:49Z" level=error msg="Kubernetes connection error failed to decode watch event: GET \"https://10.254.0.1:443/apis/extensions/v1beta1/ingresses\" : unexpected EOF, retrying in 709.591937ms"

At the exact same instant, I notice a corresponding panic in the log of the Kubernetes API server when using Kubernetes 1.4+. (I've tested against 1.3.7, 1.3.8, 1.4.0, and 1.4.1. No panic shows in the log on the 1.3 versions, but I get the same error log from Traefik regardless).

This is the panic observed from Kubernetes 1.4.1:

Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: I1012 13:29:49.792607    9712 logs.go:41] http: panic serving 10.75.16.51:60576: kill connection/stream
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: goroutine 21581 [running]:
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: net/http.(*conn).serve.func1(0xc823f5d980)
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: /usr/local/go/src/net/http/server.go:1389 +0xc1
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: panic(0x3601100, 0xc82024dc70)
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: /usr/local/go/src/runtime/panic.go:443 +0x4e9
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: k8s.io/kubernetes/pkg/apiserver.(*baseTimeoutWriter).timeout(0xc8235d8ca0, 0x0, 0x0)
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/handlers.go:321 +0x50f
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: k8s.io/kubernetes/pkg/apiserver.(*timeoutHandler).ServeHTTP(0xc820545520, 0x7fce6fee43e8, 0xc82345c680, 0xc8222e8620)
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/handlers.go:201 +0x211
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: k8s.io/kubernetes/pkg/apiserver.MaxInFlightLimit.func1(0x7fce6fee43e8, 0xc82345c680, 0xc8222e8620)
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/handlers.go:119 +0x11d
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: net/http.HandlerFunc.ServeHTTP(0xc822260db0, 0x7fce6fee43e8, 0xc82345c680, 0xc8222e8620)
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: /usr/local/go/src/net/http/server.go:1618 +0x3a
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: net/http.serverHandler.ServeHTTP(0xc82227e800, 0x7fce6fee43e8, 0xc82345c680, 0xc8222e8620)
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: /usr/local/go/src/net/http/server.go:2081 +0x19e
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: net/http.(*conn).serve(0xc823f5d980)
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: /usr/local/go/src/net/http/server.go:1472 +0xf2e
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: created by net/http.(*Server).Serve
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: /usr/local/go/src/net/http/server.go:2137 +0x44e

It takes exactly one minute after the Traefik pod starts for the first error message to show up. I did a bit of looking around and discovered that the default timeout for requests to the Kubernetes API server appears to be one minute. It looks like Traefik initiates a request, and then it times out after one minute.

If I get on the Kubernetes master and curl the /apis/extensions/v1beta1/ingresses endpoint, it returns the expected information immediately with no error. I can also list ingress information with kubectl without error. I'm not sure which side is misbehaving, but since other tools seem not to have an issue requesting this information, Traefik seemed like the right place to start.

Note that despite this error, ingresses do still show up and remain accessible. I can provide more details about my cluster setup and my Traefik configuration if needed.
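
For context, here is a minimal sketch (hypothetical, not Traefik's actual provider code) of the kind of long-lived watch request and JSON decode loop that produces this message when the server cuts the stream; the apiserver address is a placeholder and TLS/auth setup is omitted:

package main

import (
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	// Placeholder URL; a real client would configure TLS and a bearer token.
	resp, err := http.Get("http://127.0.0.1:8080/apis/extensions/v1beta1/ingresses?watch=")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	dec := json.NewDecoder(resp.Body)
	for {
		var event map[string]interface{}
		// When the apiserver kills the connection mid-stream (its default
		// request timeout is 60s), Decode returns an unexpected EOF, which
		// is what surfaces in the Traefik log.
		if err := dec.Decode(&event); err != nil {
			log.Printf("failed to decode watch event: %v", err)
			return
		}
		log.Printf("watch event: %v", event["type"])
	}
}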

@emilevauge (Member)

The problem is that Traefik opens long-polling requests on Kubernetes using the watch APIs (which is probably not the case for most other tools). It seems Kubernetes is killing those requests periodically. Could you try customizing your kube-apiserver config with --min-request-timeout?

An optional field indicating the minimum number of seconds a handler must keep a request open before timing it out. Currently only honored by the watch request handler, which picks a randomized value above this number as the connection timeout, to spread out load. (default 1800)

http://kubernetes.io/docs/admin/kube-apiserver/

@kashook (Author) commented Oct 17, 2016

I noticed the --min-request-timeout option before I logged this issue, but since the default value is 1800 seconds, it seemed unrelated: this message appears every 60 seconds, with the first occurrence 60 seconds after the Traefik pod starts. I went ahead and tried setting it to 3600, but the result was exactly the same. I then tried setting it to 30. That did cause the error message in the Traefik log to start showing up every 30 seconds instead of every 60, and it also made the panics stop showing up in the API server log.

We have an internal tool that watches the cluster for ingress updates. Instead of using the Kubernetes HTTP API directly, it uses the Kubernetes client libraries. I don't notice any panics on the API server or errors in our tool, and I'm not sure what the Kubernetes libraries do differently to avoid the issue. (Here is an example usage of the Kubernetes libraries I'm talking about.)
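
For illustration, a minimal sketch of this kind of library-based watch, written against a recent client-go (not necessarily the exact packages the tool uses) and assuming in-cluster configuration:

package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the program runs in a pod with a service account mounted.
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// The client library sets watch=true on the underlying request.
	w, err := clientset.NetworkingV1().Ingresses(metav1.NamespaceAll).Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	defer w.Stop()

	for event := range w.ResultChan() {
		log.Printf("ingress event: %s", event.Type)
	}
}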

@emilevauge (Member)

If I get on the Kubernetes master and curl the /apis/extensions/v1beta1/ingresses endpoint it gives the expected information immediately with no error.

Could you retry the curl with /apis/extensions/v1beta1/ingresses?watch=true ?

@kashook (Author) commented Oct 21, 2016

I tried the curl you asked for. Note that this test was done against Kubernetes 1.4.4 which was just released.

I deleted my Traefik pods so that nothing would be causing panics in the API server log. I then followed the API server log in one terminal, and ran the curl in a second terminal. The curl initially spit out data about all the ingresses currently in the cluster (they showed as being added) and then hung (as expected). I then let it sit for 15 minutes. During the 15 minutes, no panics were observed in the API server log, and the curl just sat there.

I then did another test where I started the curl and let it sit for several minutes, and then added an ingress to the cluster. The curl printed data about the ingress being added. I deleted the ingress, and the curl printed data about the ingress being deleted.

@jonaz (Contributor) commented Oct 27, 2016

I've also had this issue since upgrading from 1.3.7 to 1.4.4.

@krancour (Contributor)

I'm also seeing this.

@TerraTech (Contributor) commented Oct 31, 2016

I am seeing this as well and added some visibility instrumentation in TerraTech@e631afe.

I have a custom Docker image that includes this patch and is built on a busybox base to ease debugging:
https://hub.docker.com/r/terratech/traefik-dev/tags/
(based on: traefik-v1.1.0-rc3)

$ docker pull terratech/traefik-dev:FQ

The output from that image is:

time="2016-10-31T01:00:04Z" level=error msg="Kubernetes connection error failed to decode watch event: GET \"https://10.69.11.1:443/apis/extensions/v1beta1/ingresses\" : unexpected EOF, retrying in 687.685323ms"
time="2016-10-31T01:01:05Z" level=info msg="[TT] URL: \"https://10.69.11.1:443/apis/extensions/v1beta1/ingresses\", queryData: {\"resourceVersion\":\"29419097\",\"watch\":\"\"}"
time="2016-10-31T01:01:05Z" level=error msg="Kubernetes connection error failed to decode watch event: GET \"https://10.69.11.1:443/apis/extensions/v1beta1/ingresses\" : unexpected EOF, retrying in 294.993311ms"
time="2016-10-31T01:02:05Z" level=info msg="[TT] URL: \"https://10.69.11.1:443/apis/extensions/v1beta1/ingresses\", queryData: {\"resourceVersion\":\"29419225\",\"watch\":\"\"}"
time="2016-10-31T01:02:05Z" level=error msg="Kubernetes connection error failed to decode watch event: GET \"https://10.69.11.1:443/apis/extensions/v1beta1/ingresses\" : unexpected EOF, retrying in 645.26959ms"
time="2016-10-31T01:03:06Z" level=info msg="[TT] URL: \"https://10.69.11.1:443/apis/extensions/v1beta1/ingresses\", queryData: {\"resourceVersion\":\"29419355\",\"watch\":\"\"}"
time="2016-10-31T01:03:06Z" level=error msg="Kubernetes connection error failed to decode watch event: GET \"https://10.69.11.1:443/apis/extensions/v1beta1/ingresses\" : unexpected EOF, retrying in 324.777677ms"
time="2016-10-31T01:04:06Z" level=info msg="[TT] URL: \"https://10.69.11.1:443/apis/extensions/v1beta1/ingresses\", queryData: {\"resourceVersion\":\"29419484\",\"watch\":\"\"}"
time="2016-10-31T01:04:06Z" level=error msg="Kubernetes connection error failed to decode watch event: GET \"https://10.69.11.1:443/apis/extensions/v1beta1/ingresses\" : unexpected EOF, retrying in 681.540863ms"
time="2016-10-31T01:05:07Z" level=info msg="[TT] URL: \"https://10.69.11.1:443/apis/extensions/v1beta1/ingresses\", queryData: {\"resourceVersion\":\"29419612\",\"watch\":\"\"}"
time="2016-10-31T01:05:07Z" level=error msg="Kubernetes connection error failed to decode watch event: GET \"https://10.69.11.1:443/apis/extensions/v1beta1/ingresses\" : unexpected EOF, retrying in 319.191714ms"

I also ran a curl test with:

$ curl -sk --cacert ca.pem --cert worker.pem --key worker-key.pem 'https://10.69.11.1/apis/extensions/v1beta1/ingresses?watch=true&resourceVersion=29419097'
Normal output; there were no problems with the watch, no errors from kube-apiserver, and it has been running for several hours now.

kube-apiserver version:

$ curl -sk --cacert ca.pem --cert worker.pem --key worker-key.pem 'https://10.69.11.1/version'
{
  "major": "1",
  "minor": "4",
  "gitVersion": "v1.4.3+coreos.0",
  "gitCommit": "7819c84f25e8c661321ee80d6b9fa5f4ff06676f",
  "gitTreeState": "clean",
  "buildDate": "2016-10-17T21:19:17Z",
  "goVersion": "go1.6.3",
  "compiler": "gc",
  "platform": "linux/amd64"
}

@emilevauge (Member)

@TerraTech thanks a lot for investigating. Can you also confirm that you got panics in the Kubernetes logs?

@TerraTech (Contributor)

@emilevauge correct. I didn't include them since my kube-apiserver logs pretty much mirrored the panic shown in #732 (comment).

@emilevauge (Member)

I finally reproduced this issue with curl (note resourceVersion=6508):

curl -s -v "http://127.0.0.1:8080/api/v1/endpoints?resourceVersion=6508&watch="                                                                               1m
*   Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 8080 (#0)
> GET /api/v1/endpoints?resourceVersion=6508&watch= HTTP/1.1
> Host: 127.0.0.1:8080
> User-Agent: curl/7.47.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: application/json
< Transfer-Encoding: chunked
< Date: Wed, 02 Nov 2016 12:56:35 GMT
< Transfer-Encoding: chunked
< 
{"type":"MODIFIED","object":{"kind":"Endpoints","apiVersion":"v1","metadata":{"name":"kube-scheduler","namespace":"kube-system","selfLink":"/api/v1/namespaces/kube-system/endpoints/kube-scheduler","uid":"8465c101-a0f8-11e6-9bdc-484520ebc058","resourceVersion":"6509","creationTimestamp":"2016-11-02T12:33:11Z","annotations":{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"spectre\",\"leaseDurationSeconds\":15,\"acquireTime\":\"2016-11-02T12:33:11Z\",\"renewTime\":\"2016-11-02T12:47:04Z\",\"leaderTransitions\":0}"}},"subsets":[]}}
...
{"type":"MODIFIED","object":{"kind":"Endpoints","apiVersion":"v1","metadata":{"name":"kube-scheduler","namespace":"kube-system","selfLink":"/api/v1/namespaces/kube-system/endpoints/kube-scheduler","uid":"8465c101-a0f8-11e6-9bdc-484520ebc058","resourceVersion":"7195","creationTimestamp":"2016-11-02T12:33:11Z","annotations":{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"spectre\",\"leaseDurationSeconds\":15,\"acquireTime\":\"2016-11-02T12:33:11Z\",\"renewTime\":\"2016-11-02T12:57:33Z\",\"leaderTransitions\":0}"}},"subsets":[]}}
* transfer closed with outstanding read data remaining
* Closing connection 0

with this panic on Kubernetes apiserver:

2016-11-02T12:47:03.602023449Z I1102 12:47:03.601907       1 logs.go:41] http: panic serving 127.0.0.1:49428: kill connection/stream
2016-11-02T12:47:03.602062435Z goroutine 23651 [running]:
2016-11-02T12:47:03.602069034Z net/http.(*conn).serve.func1(0xc822884480)
2016-11-02T12:47:03.602073419Z  /usr/local/go/src/net/http/server.go:1389 +0xc1
2016-11-02T12:47:03.602077964Z panic(0x423d2c0, 0xc8203c5170)
2016-11-02T12:47:03.602082007Z  /usr/local/go/src/runtime/panic.go:443 +0x4e9
2016-11-02T12:47:03.602086549Z k8s.io/kubernetes/pkg/apiserver.(*baseTimeoutWriter).timeout(0xc823f41960, 0x0, 0x0)
2016-11-02T12:47:03.602091384Z  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/handlers.go:321 +0x50f
2016-11-02T12:47:03.602096430Z k8s.io/kubernetes/pkg/apiserver.(*timeoutHandler).ServeHTTP(0xc821a444a0, 0x7f07a1deaaf0, 0xc82360e5b0, 0xc8221cb0a0)
2016-11-02T12:47:03.602101298Z  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/handlers.go:201 +0x211
2016-11-02T12:47:03.602105993Z net/http.serverHandler.ServeHTTP(0xc8221eb780, 0x7f07a1deaaf0, 0xc82360e5b0, 0xc8221cb0a0)
2016-11-02T12:47:03.602110745Z  /usr/local/go/src/net/http/server.go:2081 +0x19e
2016-11-02T12:47:03.602115042Z net/http.(*conn).serve(0xc822884480)
2016-11-02T12:47:03.602119571Z  /usr/local/go/src/net/http/server.go:1472 +0xf2e
2016-11-02T12:47:03.602123929Z created by net/http.(*Server).Serve
2016-11-02T12:47:03.602128510Z  /usr/local/go/src/net/http/server.go:2137 +0x44e

This is due to the Kubernetes default request timeout of 60s.
I will try watching events over a websocket instead.

@codablock (Contributor)

I also hit this problem. It appears that "true" is missing from the watch URLs.

This gives the reported errors and panics:

curl -s -v "http://127.0.0.1:8080/apis/extensions/v1beta1/ingresses?resourceVersion=6081&watch="

This works fine and waits as long as specified by the Kubernetes --min-request-timeout flag (default 1800s):

curl -s -v "http://127.0.0.1:8080/apis/extensions/v1beta1/ingresses?resourceVersion=6081&watch=true"

I tried to understand the Kubernetes code when it comes to timeout handling. A default global timeout is applied to every request that does not match one of the known long-running request URL patterns, and that check also looks for watch=true. The global timeout is enforced via a panic, so the apiserver panic is expected behavior.
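
A rough sketch of the check described above (an illustration, not the actual Kubernetes source): a request only escapes the global timeout when its watch parameter actually parses as true, so "watch=" falls through to the 60s default while "watch=true" is treated as long-running.

package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// isWatchRequest mimics the idea of the long-running request check:
// only treat the request as a watch if the watch parameter is a boolean true.
func isWatchRequest(rawQuery string) bool {
	values, err := url.ParseQuery(rawQuery)
	if err != nil {
		return false
	}
	watch, err := strconv.ParseBool(values.Get("watch"))
	return err == nil && watch
}

func main() {
	fmt.Println(isWatchRequest("resourceVersion=6081&watch="))     // false: global 60s timeout applies
	fmt.Println(isWatchRequest("resourceVersion=6081&watch=true")) // true: honors --min-request-timeout
}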

@codablock (Contributor)

I created a PR that adds the missing value to the requests. So far it looks like I'm not getting these errors anymore (I've only been testing for a few minutes).

@emilevauge (Member)

Whoa @codablock, I would never have imagined this issue was caused by that 😲! Good catch!

@valentin2105

How do I get the fixed version? Is there an experimental Docker image?

Thanks

@TerraTech (Contributor)

@valentin2105 I've uploaded the patched version to my docker repo at:
docker pull terratech/traefik-dev:v1.1.0

The patch has been working pretty well for me so far.

@elghazal-a

@TerraTech I've tested your Docker image and I still encounter the same issue.

time="2016-11-25T10:34:05Z" level=error msg="Kubernetes connection error failed to decode watch event: GET \"http://172.17.4.100:8080/apis/extensions/v1beta1/ingresses\" : unexpected EOF, retrying in 355.075717ms"

@TerraTech (Contributor)

@geniousphp Those errors will still pop up from time to time, but at least this patch has reduced their frequency by quite a bit. Prior to the patch, my logs show they were being emitted every minute.
#732 (comment)

@codablock (Contributor)

The errors will still happen, but much less regularly. How often depends on the value passed to the apiserver with --min-request-timeout (default 1800). The actual timeout is randomly calculated and always falls between min-request-timeout and 2*min-request-timeout.
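
A tiny sketch of that behavior (an illustration of the description above, not the actual kube-apiserver code): the effective watch timeout is picked at random in [min-request-timeout, 2*min-request-timeout), so with the default of 1800s a healthy watch is closed somewhere between 30 and 60 minutes.

package main

import (
	"fmt"
	"math/rand"
	"time"
)

func watchTimeout(minRequestTimeout time.Duration) time.Duration {
	// Random duration in [minRequestTimeout, 2*minRequestTimeout).
	return minRequestTimeout + time.Duration(rand.Int63n(int64(minRequestTimeout)))
}

func main() {
	fmt.Println(watchTimeout(1800 * time.Second)) // e.g. 42m17s
}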

@elghazal-a

Okay, thanks. @TerraTech I didn't realize that your patch is only an improvement rather than a complete fix. I must have another issue not related to this. Thanks.

@emilevauge (Member)

Fixed by #874

traefik locked and limited conversation to collaborators Sep 1, 2019