
Getting "Kubernetes connection error failed to decode watch event : unexpected EOF" every minute in Traefik log #732

Closed
kashook opened this issue Oct 12, 2016 · 20 comments

@kashook commented Oct 12, 2016

I am testing Traefik 1.1.0-rc1 as an ingress controller for a Kubernetes cluster. I've noticed that the log for Traefik has this error message showing up every minute on the dot:

time="2016-10-12T13:29:49Z" level=error msg="Kubernetes connection error failed to decode watch event: GET \"https://10.254.0.1:443/apis/extensions/v1beta1/ingresses\" : unexpected EOF, retrying in 709.591937ms"

At the exact same instant, I notice a corresponding panic in the log of the Kubernetes API server when using Kubernetes 1.4+. (I've tested against 1.3.7, 1.3.8, 1.4.0, and 1.4.1. No panic shows in the log on the 1.3 versions, but I get the same error log from Traefik regardless).

This is the panic observed from Kubernetes 1.4.1:

Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: I1012 13:29:49.792607    9712 logs.go:41] http: panic serving 10.75.16.51:60576: kill connection/stream
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: goroutine 21581 [running]:
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: net/http.(*conn).serve.func1(0xc823f5d980)
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: /usr/local/go/src/net/http/server.go:1389 +0xc1
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: panic(0x3601100, 0xc82024dc70)
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: /usr/local/go/src/runtime/panic.go:443 +0x4e9
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: k8s.io/kubernetes/pkg/apiserver.(*baseTimeoutWriter).timeout(0xc8235d8ca0, 0x0, 0x0)
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/handlers.go:321 +0x50f
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: k8s.io/kubernetes/pkg/apiserver.(*timeoutHandler).ServeHTTP(0xc820545520, 0x7fce6fee43e8, 0xc82345c680, 0xc8222e8620)
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/handlers.go:201 +0x211
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: k8s.io/kubernetes/pkg/apiserver.MaxInFlightLimit.func1(0x7fce6fee43e8, 0xc82345c680, 0xc8222e8620)
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/handlers.go:119 +0x11d
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: net/http.HandlerFunc.ServeHTTP(0xc822260db0, 0x7fce6fee43e8, 0xc82345c680, 0xc8222e8620)
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: /usr/local/go/src/net/http/server.go:1618 +0x3a
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: net/http.serverHandler.ServeHTTP(0xc82227e800, 0x7fce6fee43e8, 0xc82345c680, 0xc8222e8620)
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: /usr/local/go/src/net/http/server.go:2081 +0x19e
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: net/http.(*conn).serve(0xc823f5d980)
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: /usr/local/go/src/net/http/server.go:1472 +0xf2e
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: created by net/http.(*Server).Serve
Oct 12 13:29:49 ip-10-75-16-50 kube-apiserver[9712]: /usr/local/go/src/net/http/server.go:2137 +0x44e

It takes exactly one minute after the Traefik pod starts for the first error message to show up. I did a bit of looking around and discovered that the default timeout for requests to the Kubernetes API server appears to be one minute. It looks like Traefik initiates a request, and then it times out after one minute.

If I get on the Kubernetes master and curl the /apis/extensions/v1beta1/ingresses endpoint, it returns the expected information immediately with no error. I can also list ingress information with kubectl without error. I'm not sure which side is misbehaving, but since other tools seem not to have an issue requesting this information, Traefik seemed like the right place to start.

Note that despite this error, ingresses do still show up and remain accessible. I can provide more details about my cluster setup and my Traefik configuration if needed.
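
For context, here is a minimal sketch (hypothetical, not Traefik's actual provider code) of the kind of long-lived watch request and JSON decode loop that produces this message when the server cuts the stream; the apiserver address is a placeholder and TLS/auth setup is omitted:

package main

import (
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	// Placeholder URL; a real client would configure TLS and a bearer token.
	resp, err := http.Get("http://127.0.0.1:8080/apis/extensions/v1beta1/ingresses?watch=")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	dec := json.NewDecoder(resp.Body)
	for {
		var event map[string]interface{}
		// When the apiserver kills the connection mid-stream (its default
		// request timeout is 60s), Decode returns an unexpected EOF, which
		// is what surfaces in the Traefik log.
		if err := dec.Decode(&event); err != nil {
			log.Printf("failed to decode watch event: %v", err)
			return
		}
		log.Printf("watch event: %v", event["type"])
	}
}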

@emilevauge (Member)

The problem is that Traefik opens long-polling requests on Kubernetes using the watch APIs (which is probably not the case for most other tools). It seems Kubernetes is killing those requests periodically. Could you try customizing your kube-apiserver config with --min-request-timeout?

An optional field indicating the minimum number of seconds a handler must keep a request open before timing it out. Currently only honored by the watch request handler, which picks a randomized value above this number as the connection timeout, to spread out load. (default 1800)

http://kubernetes.io/docs/admin/kube-apiserver/

@kashook (Author) commented Oct 17, 2016

I noticed the --min-request-timeout option before I logged this issue, but since the default value is 1800 seconds, it seemed unrelated: this message appears every 60 seconds, with the first occurrence 60 seconds after the Traefik pod starts. I went ahead and tried setting it to 3600, but the result was exactly the same. I then tried setting it to 30. That did cause the error message in the Traefik log to start showing up every 30 seconds instead of every 60, and it also made the panics stop showing up in the API server log.

We have an internal tool that watches the cluster for ingress updates. Instead of using the Kubernetes HTTP API directly, it uses the Kubernetes client libraries. I don't notice any panics on the API server or errors in our tool, and I'm not sure what the Kubernetes libraries do differently to avoid the issue. (Here is an example usage of the Kubernetes libraries I'm talking about.)
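
For illustration, a minimal sketch of this kind of library-based watch, written against a recent client-go (not necessarily the exact packages the tool uses) and assuming in-cluster configuration:

package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the program runs in a pod with a service account mounted.
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// The client library sets watch=true on the underlying request.
	w, err := clientset.NetworkingV1().Ingresses(metav1.NamespaceAll).Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	defer w.Stop()

	for event := range w.ResultChan() {
		log.Printf("ingress event: %s", event.Type)
	}
}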

@emilevauge (Member)

If I get on the Kubernetes master and curl the /apis/extensions/v1beta1/ingresses endpoint it gives the expected information immediately with no error.

Could you retry the curl with /apis/extensions/v1beta1/ingresses?watch=true ?

@kashook (Author) commented Oct 21, 2016

I tried the curl you asked for. Note that this test was done against Kubernetes 1.4.4 which was just released.

I deleted my Traefik pods so that nothing would be causing panics in the API server log. I then followed the API server log in one terminal, and ran the curl in a second terminal. The curl initially spit out data about all the ingresses currently in the cluster (they showed as being added) and then hung (as expected). I then let it sit for 15 minutes. During the 15 minutes, no panics were observed in the API server log, and the curl just sat there.

I then did another test where I started the curl and let it sit for several minutes, and then added an ingress to the cluster. The curl printed data about the ingress being added. I deleted the ingress, and the curl printed data about the ingress being deleted.

@jonaz (Contributor) commented Oct 27, 2016

I've also had this issue since upgrading from 1.3.7 to 1.4.4.

@krancour (Contributor)

I'm also seeing this.

@TerraTech (Contributor) commented Oct 31, 2016

I am seeing this as well and added some visibility instrumentation in TerraTech@e631afe.

I have a custom Docker image that includes this patch and is built on a busybox base to ease debugging:
https://hub.docker.com/r/terratech/traefik-dev/tags/
(based on: traefik-v1.1.0-rc3)

$ docker pull terratech/traefik-dev:FQ

The output from that image is:

time="2016-10-31T01:00:04Z" level=error msg="Kubernetes connection error failed to decode watch event: GET \"https://10.69.11.1:443/apis/extensions/v1beta1/ingresses\" : unexpected EOF, retrying in 687.685323ms"
time="2016-10-31T01:01:05Z" level=info msg="[TT] URL: \"https://10.69.11.1:443/apis/extensions/v1beta1/ingresses\", queryData: {\"resourceVersion\":\"29419097\",\"watch\":\"\"}"
time="2016-10-31T01:01:05Z" level=error msg="Kubernetes connection error failed to decode watch event: GET \"https://10.69.11.1:443/apis/extensions/v1beta1/ingresses\" : unexpected EOF, retrying in 294.993311ms"
time="2016-10-31T01:02:05Z" level=info msg="[TT] URL: \"https://10.69.11.1:443/apis/extensions/v1beta1/ingresses\", queryData: {\"resourceVersion\":\"29419225\",\"watch\":\"\"}"
time="2016-10-31T01:02:05Z" level=error msg="Kubernetes connection error failed to decode watch event: GET \"https://10.69.11.1:443/apis/extensions/v1beta1/ingresses\" : unexpected EOF, retrying in 645.26959ms"
time="2016-10-31T01:03:06Z" level=info msg="[TT] URL: \"https://10.69.11.1:443/apis/extensions/v1beta1/ingresses\", queryData: {\"resourceVersion\":\"29419355\",\"watch\":\"\"}"
time="2016-10-31T01:03:06Z" level=error msg="Kubernetes connection error failed to decode watch event: GET \"https://10.69.11.1:443/apis/extensions/v1beta1/ingresses\" : unexpected EOF, retrying in 324.777677ms"
time="2016-10-31T01:04:06Z" level=info msg="[TT] URL: \"https://10.69.11.1:443/apis/extensions/v1beta1/ingresses\", queryData: {\"resourceVersion\":\"29419484\",\"watch\":\"\"}"
time="2016-10-31T01:04:06Z" level=error msg="Kubernetes connection error failed to decode watch event: GET \"https://10.69.11.1:443/apis/extensions/v1beta1/ingresses\" : unexpected EOF, retrying in 681.540863ms"
time="2016-10-31T01:05:07Z" level=info msg="[TT] URL: \"https://10.69.11.1:443/apis/extensions/v1beta1/ingresses\", queryData: {\"resourceVersion\":\"29419612\",\"watch\":\"\"}"
time="2016-10-31T01:05:07Z" level=error msg="Kubernetes connection error failed to decode watch event: GET \"https://10.69.11.1:443/apis/extensions/v1beta1/ingresses\" : unexpected EOF, retrying in 319.191714ms"

I also ran a curl test with:

$ curl -sk --cacert ca.pem --cert worker.pem --key worker-key.pem 'https://10.69.11.1/apis/extensions/v1beta1/ingresses?watch=true&resourceVersion=29419097'
Normal output; there were no problems with the watch, no errors from kube-apiserver, and it has been running for several hours now.

kube-apiserver version:

$ curl -sk --cacert ca.pem --cert worker.pem --key worker-key.pem 'https://10.69.11.1/version'
{
  "major": "1",
  "minor": "4",
  "gitVersion": "v1.4.3+coreos.0",
  "gitCommit": "7819c84f25e8c661321ee80d6b9fa5f4ff06676f",
  "gitTreeState": "clean",
  "buildDate": "2016-10-17T21:19:17Z",
  "goVersion": "go1.6.3",
  "compiler": "gc",
  "platform": "linux/amd64"
}

@emilevauge (Member)

@TerraTech thanks a lot for investigating. Can you also confirm that you got panics in the Kubernetes logs?

@TerraTech (Contributor)

@emilevauge correct. I didn't include them since my kube-apiserver logs pretty much mirrored the panic shown in #732 (comment).

@emilevauge (Member)

I finally reproduced this issue with curl (note resourceVersion=6508):

curl -s -v "http://127.0.0.1:8080/api/v1/endpoints?resourceVersion=6508&watch="                                                                               1m
*   Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 8080 (#0)
> GET /api/v1/endpoints?resourceVersion=6508&watch= HTTP/1.1
> Host: 127.0.0.1:8080
> User-Agent: curl/7.47.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: application/json
< Transfer-Encoding: chunked
< Date: Wed, 02 Nov 2016 12:56:35 GMT
< Transfer-Encoding: chunked
< 
{"type":"MODIFIED","object":{"kind":"Endpoints","apiVersion":"v1","metadata":{"name":"kube-scheduler","namespace":"kube-system","selfLink":"/api/v1/namespaces/kube-system/endpoints/kube-scheduler","uid":"8465c101-a0f8-11e6-9bdc-484520ebc058","resourceVersion":"6509","creationTimestamp":"2016-11-02T12:33:11Z","annotations":{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"spectre\",\"leaseDurationSeconds\":15,\"acquireTime\":\"2016-11-02T12:33:11Z\",\"renewTime\":\"2016-11-02T12:47:04Z\",\"leaderTransitions\":0}"}},"subsets":[]}}
...
{"type":"MODIFIED","object":{"kind":"Endpoints","apiVersion":"v1","metadata":{"name":"kube-scheduler","namespace":"kube-system","selfLink":"/api/v1/namespaces/kube-system/endpoints/kube-scheduler","uid":"8465c101-a0f8-11e6-9bdc-484520ebc058","resourceVersion":"7195","creationTimestamp":"2016-11-02T12:33:11Z","annotations":{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"spectre\",\"leaseDurationSeconds\":15,\"acquireTime\":\"2016-11-02T12:33:11Z\",\"renewTime\":\"2016-11-02T12:57:33Z\",\"leaderTransitions\":0}"}},"subsets":[]}}
* transfer closed with outstanding read data remaining
* Closing connection 0

with this panic on Kubernetes apiserver:

2016-11-02T12:47:03.602023449Z I1102 12:47:03.601907       1 logs.go:41] http: panic serving 127.0.0.1:49428: kill connection/stream
2016-11-02T12:47:03.602062435Z goroutine 23651 [running]:
2016-11-02T12:47:03.602069034Z net/http.(*conn).serve.func1(0xc822884480)
2016-11-02T12:47:03.602073419Z  /usr/local/go/src/net/http/server.go:1389 +0xc1
2016-11-02T12:47:03.602077964Z panic(0x423d2c0, 0xc8203c5170)
2016-11-02T12:47:03.602082007Z  /usr/local/go/src/runtime/panic.go:443 +0x4e9
2016-11-02T12:47:03.602086549Z k8s.io/kubernetes/pkg/apiserver.(*baseTimeoutWriter).timeout(0xc823f41960, 0x0, 0x0)
2016-11-02T12:47:03.602091384Z  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/handlers.go:321 +0x50f
2016-11-02T12:47:03.602096430Z k8s.io/kubernetes/pkg/apiserver.(*timeoutHandler).ServeHTTP(0xc821a444a0, 0x7f07a1deaaf0, 0xc82360e5b0, 0xc8221cb0a0)
2016-11-02T12:47:03.602101298Z  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/handlers.go:201 +0x211
2016-11-02T12:47:03.602105993Z net/http.serverHandler.ServeHTTP(0xc8221eb780, 0x7f07a1deaaf0, 0xc82360e5b0, 0xc8221cb0a0)
2016-11-02T12:47:03.602110745Z  /usr/local/go/src/net/http/server.go:2081 +0x19e
2016-11-02T12:47:03.602115042Z net/http.(*conn).serve(0xc822884480)
2016-11-02T12:47:03.602119571Z  /usr/local/go/src/net/http/server.go:1472 +0xf2e
2016-11-02T12:47:03.602123929Z created by net/http.(*Server).Serve
2016-11-02T12:47:03.602128510Z  /usr/local/go/src/net/http/server.go:2137 +0x44e

This is due to the Kubernetes default request timeout of 60s.
I will try watching events over a websocket instead.

@codablock (Contributor)

I also hit this problem. It appears that "true" is missing from the watch URLs.

This gives the reported errors and panics:

curl -s -v "http://127.0.0.1:8080/apis/extensions/v1beta1/ingresses?resourceVersion=6081&watch="

This works fine and waits as long as specified by the Kubernetes --min-request-timeout flag (default 1800s):

curl -s -v "http://127.0.0.1:8080/apis/extensions/v1beta1/ingresses?resourceVersion=6081&watch=true"

I tried to understand the Kubernetes code when it comes to timeout handling. A default global timeout is applied to every request that does not match one of the known long-running request URL patterns, and that check also looks for watch=true. The global timeout is enforced via a panic, so the apiserver panic is expected behavior.
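
A rough sketch of the check described above (an illustration, not the actual Kubernetes source): a request only escapes the global timeout when its watch parameter actually parses as true, so "watch=" falls through to the 60s default while "watch=true" is treated as long-running.

package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// isWatchRequest mimics the idea of the long-running request check:
// only treat the request as a watch if the watch parameter is a boolean true.
func isWatchRequest(rawQuery string) bool {
	values, err := url.ParseQuery(rawQuery)
	if err != nil {
		return false
	}
	watch, err := strconv.ParseBool(values.Get("watch"))
	return err == nil && watch
}

func main() {
	fmt.Println(isWatchRequest("resourceVersion=6081&watch="))     // false: global 60s timeout applies
	fmt.Println(isWatchRequest("resourceVersion=6081&watch=true")) // true: honors --min-request-timeout
}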

@codablock (Contributor)

I created a PR that adds the missing value to the requests. So far it looks like I'm not getting these errors anymore (I've only been testing for a few minutes).

@emilevauge (Member)

Whoa @codablock, I would never have imagined this issue was caused by that 😲! Good catch!

@valentin2105

How do I get the fixed version? Is there an experimental Docker image?

Thanks

@TerraTech (Contributor)

@valentin2105 I've uploaded the patched version to my docker repo at:
docker pull terratech/traefik-dev:v1.1.0

The patch has been working pretty well for me so far.

@elghazal-a

@TerraTech I've tested your Docker image and I still encounter the same issue.

time="2016-11-25T10:34:05Z" level=error msg="Kubernetes connection error failed to decode watch event: GET \"http://172.17.4.100:8080/apis/extensions/v1beta1/ingresses\" : unexpected EOF, retrying in 355.075717ms"

@TerraTech (Contributor)

@geniousphp Those errors will still pop up from time to time, but at least this patch has reduced their frequency by quite a bit. Prior to the patch, my logs show they were being emitted every minute.
#732 (comment)

@codablock (Contributor)

The errors will still happen, but much less regularly. How often depends on the value passed to the apiserver with --min-request-timeout (default 1800). The actual timeout is randomly calculated and always falls between min-request-timeout and 2*min-request-timeout.
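
A tiny sketch of that behavior (an illustration of the description above, not the actual kube-apiserver code): the effective watch timeout is picked at random in [min-request-timeout, 2*min-request-timeout), so with the default of 1800s a healthy watch is closed somewhere between 30 and 60 minutes.

package main

import (
	"fmt"
	"math/rand"
	"time"
)

func watchTimeout(minRequestTimeout time.Duration) time.Duration {
	// Random duration in [minRequestTimeout, 2*minRequestTimeout).
	return minRequestTimeout + time.Duration(rand.Int63n(int64(minRequestTimeout)))
}

func main() {
	fmt.Println(watchTimeout(1800 * time.Second)) // e.g. 42m17s
}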

@elghazal-a

Okay, thanks. @TerraTech I didn't realize that your patch is only an improvement rather than a complete fix. I must have another issue not related to this. Thanks.

@emilevauge (Member)

Fixed by #874

traefik locked and limited conversation to collaborators Sep 1, 2019