Getting "Kubernetes connection error failed to decode watch event : unexpected EOF" every minute in Traefik log #732
The problem is that Traefik is opening long-polling requests against Kubernetes using the watch APIs (which is probably not the case for most other tools). It seems Kubernetes is killing those requests periodically. Could you try to customize your config with
We have an internal tool that watches the cluster for ingress updates. Instead of using the Kubernetes HTTP API directly, we use libraries from Kubernetes. I don't notice any panics on the API server or errors in our tool. I'm not sure what the Kubernetes libraries are doing differently to avoid the issue. (Here is an example usage of the Kubernetes libraries I'm talking about.)
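For comparison, here is a minimal sketch of what such a library-based watch looks like. It uses today's client-go packages rather than the 2016-era libraries the linked example used, so treat it as illustrative only; the relevant point is that the client library sets watch=true on the request for you (and its informer machinery handles reconnects).

```go
// Illustrative sketch: watching ingresses through client-go rather than the raw
// HTTP API. Assumes in-cluster credentials; the packages and signatures differ
// from the 2016-era libraries referenced in the comment above.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Watch ingresses in all namespaces; the client adds ?watch=true to the
	// request for us.
	w, err := clientset.NetworkingV1().Ingresses("").Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for event := range w.ResultChan() {
		fmt.Printf("ingress event: %s\n", event.Type)
	}
}
```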
Could you retry with curl with
I tried the curl you asked for. Note that this test was done against Kubernetes 1.4.4 which was just released. I deleted my Traefik pods so that nothing would be causing panics in the API server log. I then followed the API server log in one terminal, and ran the curl in a second terminal. The curl initially spit out data about all the ingresses currently in the cluster (they showed as being added) and then hung (as expected). I then let it sit for 15 minutes. During the 15 minutes, no panics were observed in the API server log, and the curl just sat there. I then did another test where I started the curl and let it sit for several minutes, and then added an ingress to the cluster. The curl printed data about the ingress being added. I deleted the ingress, and the curl printed data about the ingress being deleted.
I've also got this issue since upgrading from 1.3.7 to 1.4.4.
I'm also seeing this.
I am seeing this as well and added some visibility instrumentation in TerraTech@e631afe. I have a custom Docker image with this patch included; it rides on a busybox base to ease debugging:
$ docker pull terratech/traefik-dev:FQ
The output from that image is:
I also ran a curl test with:
$ curl -sk --cacert ca.pem --cert worker.pem --key worker-key.pem 'https://10.69.11.1/apis/extensions/v1beta1/ingresses?watch=true&resourceVersion=29419097'
It produced normal output, there were no problems with the watch or any errors from kube-apiserver, and it has been running for several hours now. kube-apiserver version:
$ curl -sk --cacert ca.pem --cert worker.pem --key worker-key.pem 'https://10.69.11.1/version'
{
"major": "1",
"minor": "4",
"gitVersion": "v1.4.3+coreos.0",
"gitCommit": "7819c84f25e8c661321ee80d6b9fa5f4ff06676f",
"gitTreeState": "clean",
"buildDate": "2016-10-17T21:19:17Z",
"goVersion": "go1.6.3",
"compiler": "gc",
"platform": "linux/amd64"
}
@TerraTech thanks a lot for investigating. Do you also confirm that you got a panic in the Kubernetes logs?
@emilevauge correct, I didn't add them, as my kube-apiserver logs pretty much mirrored the panic in #732 (comment).
I finally reproduced this issue with
with this panic on Kubernetes
This is due to the Kubernetes default request timeout, which is set to 60s.
I also got this problem. The cause seems to be that "true" is missing in the watch URLs. This gives the reported errors and panics.
This works fine and waits as long as specified by the Kubernetes --min-request-timeout flag (default 1800s).
I tried to understand the Kubernetes code when it comes to timeout handling. A default global timeout is applied to every request that does not match one of the known long-running request URL patterns; that check also looks for watch=true. The global timeout is enforced by a panic, so the panic itself is expected behavior.
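To make that concrete, here is a simplified sketch of the check being described. This is not the actual apiserver source; the path pattern is a rough stand-in, but it shows why a bare watch parameter falls through to the 60-second global timeout while watch=true does not.

```go
// Simplified, illustrative version of the long-running request check described
// above. A request escapes the global timeout only if its path matches a known
// long-running pattern or its query string carries watch=true.
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// Loosely modeled on the proxy/exec/attach/log style paths that the apiserver
// treats as long-running; the real pattern differs.
var longRunningPaths = regexp.MustCompile(`(/watch/|/proxy/|/exec|/attach|/log|/portforward)`)

func isLongRunning(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	if longRunningPaths.MatchString(u.Path) {
		return true
	}
	// This is the check Traefik's requests were failing: a bare "watch"
	// parameter is not the same as watch=true.
	return u.Query().Get("watch") == "true"
}

func main() {
	base := "https://10.69.11.1/apis/extensions/v1beta1/ingresses"
	// Bare "watch": falls through to the 60s global timeout and its panic.
	fmt.Println(isLongRunning(base + "?watch&resourceVersion=29419097")) // false
	// Explicit watch=true: excluded from the global timeout, bounded only by
	// --min-request-timeout.
	fmt.Println(isLongRunning(base + "?watch=true&resourceVersion=29419097")) // true
}
```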
I created a PR which adds the missing value to the requests. It looks like I'm not getting these errors anymore (I've only been testing for a few minutes).
Woah @codablock, I can't believe this issue was caused by that 😲! Good catch!
How do I get the fixed version? Is there an experimental Docker image? Thanks
@valentin2105 I've uploaded the patched version to my Docker repo at: The patch has been working pretty well for me so far.
@TerraTech I've tested your Docker image and I still encounter the same issue.
@geniousphp Those errors will still pop up from time to time, but at least this patch has reduced their frequency quite a bit. Prior to the patch, my logs showed they were being emitted every minute.
The errors should still happen, but much less regularly. How often depends on the value provided to the apiserver with "--min-request-timeout" (default 1800). The actual timeout is randomly calculated and always falls between min-request-timeout and 2*min-request-timeout.
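A small sketch of that calculation, assuming the randomization works as described above (the real apiserver implementation may differ in detail):

```go
// Sketch of the watch timeout described above: a random duration between
// --min-request-timeout and 2*--min-request-timeout.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func watchTimeout(minRequestTimeout time.Duration) time.Duration {
	// Add a random extra amount in [0, minRequestTimeout) so that many watchers
	// started at the same time do not all expire and reconnect at once.
	return minRequestTimeout + time.Duration(rand.Int63n(int64(minRequestTimeout)))
}

func main() {
	minRequestTimeout := 1800 * time.Second // apiserver default
	for i := 0; i < 3; i++ {
		fmt.Println(watchTimeout(minRequestTimeout))
	}
}
```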
Okay, thanks. @TerraTech I didn't realize that your patch is only an enhancement. I must have another issue not related to this. Thanks
Fixed by #874
I am testing Traefik 1.1.0-rc1 as an ingress controller for a Kubernetes cluster. I've noticed that the log for Traefik has this error message showing up every minute on the dot:
At the exact same instant, I notice a corresponding panic in the log of the Kubernetes API server when using Kubernetes 1.4+. (I've tested against 1.3.7, 1.3.8, 1.4.0, and 1.4.1. No panic shows in the log on the 1.3 versions, but I get the same error log from Traefik regardless).
This is the panic observed from Kubernetes 1.4.1:
It takes exactly one minute for the first error message to show up after the Traefik pod starts. I did a little bit of looking around and discovered that the default timeout for requests to the Kubernetes API server appears to be 1 minute. It looks like Traefik initiates a request, which then times out after 1 minute. If I get on the Kubernetes master and curl the
/apis/extensions/v1beta1/ingresses
endpoint, it gives the expected information immediately with no error. I can also list ingress information with kubectl without error. I'm not sure which side is misbehaving, but since other tools seem to not have an issue requesting this information, it seemed like Traefik was the right place to start. Note that despite this error, ingresses do still appear to show up and be accessible. I can provide more details about the setup of my cluster and how I've configured Traefik if needed.
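As a rough illustration of that plain (non-watch) request, here is a minimal sketch. The insecure local port 8080 on the master is an assumption for the sake of the example; a real cluster would need TLS client certs or a bearer token instead, as in the curl commands earlier in the thread.

```go
// Minimal sketch of the non-watch list request described above, which returns
// immediately. Assumes the apiserver's insecure local port on the master.
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:8080/apis/extensions/v1beta1/ingresses")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// Without ?watch=true this is a plain list request, so the 60s global
	// timeout never comes into play and the ingress list is returned at once.
	fmt.Println(string(body))
}
```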