Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Traefik sporadically failing when proxying requests #615

Closed
daltonmatos opened this issue Aug 12, 2016 · 12 comments
Closed

Traefik sporadically failing when proxying requests #615

daltonmatos opened this issue Aug 12, 2016 · 12 comments

Comments

@daltonmatos
Copy link

Hello,

I will show here a simple scenario that reproduces the problem that I'm having with traefik. I have a traefik instance, configured with 1 backend and proxying all requests to it. The backend is just a uWSGI instance serving a 1024 bytes static file.

First I will present the results and what I could observe and then I will show how to reproduce the problem.

Both systems are running on the same host and traefik is proxying content using 127.0.0.1. I'm using apache ab to generate the load.

Here is the current traefik config file:

graceTimeOut = 600
MaxIdleConnsPerHost = 100000

logLevel = "DEBUG"
defaultEntryPoints = ["http"]
[entryPoints]
[entryPoints.http]
address = ":90"

[web]
address = ":8082"

[file]
[backends]
  [backends.backend1]
    [backends.backend1.servers.server1]
    url = "http://127.0.0.1:9993"
    weight = 0
    [backends.backend1.LoadBalancer]
      method = "drr"
    [backends.backend1.maxconn]
      amount = 2048
      extractorfunc = "request.host"
[frontends]
  [frontends.frontend1]
  backend = "backend1"
  passHostHeader = true
    [frontends.frontend1.routes.test_1]
    rule = "Host:myhost"

I'm running this version of traefik, downloaded from github.

Version:      v1.0.2
Codename:     reblochon
Go version:   go1.6.2
Built:        2016-08-02_05:35:02PM
OS/Arch:      linux/amd64

And when I run apache ab, even with a low load I see sporadic failing requests:

The command: ab -c 2 -n 10 -H "Host: myhost" http://localhost:90/static/data

The result from apache ab:

Document Path:          /static/data
Document Length:        1024 bytes

Concurrency Level:      2
Time taken for tests:   0.016 seconds
Complete requests:      10
Failed requests:        1
   (Connect: 0, Receive: 0, Length: 1, Exceptions: 0)
Non-2xx responses:      1
Total transferred:      10849 bytes
HTML transferred:       9237 bytes
Requests per second:    630.28 [#/sec] (mean)
Time per request:       3.173 [ms] (mean)
Time per request:       1.587 [ms] (mean, across all concurrent requests)
Transfer rate:          667.76 [Kbytes/sec] received

And traefik logs shows:

DEBU[2016-08-12T18:01:14-03:00] Round trip: http://127.0.0.1:9993, code: 200, duration: 1.703455ms 
DEBU[2016-08-12T18:01:14-03:00] Round trip: http://127.0.0.1:9993, code: 200, duration: 2.42295ms 
DEBU[2016-08-12T18:01:14-03:00] Round trip: http://127.0.0.1:9993, code: 200, duration: 1.628134ms 
DEBU[2016-08-12T18:01:14-03:00] Round trip: http://127.0.0.1:9993, code: 200, duration: 2.788108ms 
DEBU[2016-08-12T18:01:14-03:00] Round trip: http://127.0.0.1:9993, code: 200, duration: 1.647956ms 
WARN[2016-08-12T18:01:14-03:00] Error forwarding to http://127.0.0.1:9993, err: net/http: HTTP/1 transport connection broken: readLoopPeekFailLocked: EOF 
DEBU[2016-08-12T18:01:14-03:00] Round trip: http://127.0.0.1:9993, code: 200, duration: 1.542621ms 
DEBU[2016-08-12T18:01:14-03:00] Round trip: http://127.0.0.1:9993, code: 200, duration: 1.591373ms 
DEBU[2016-08-12T18:01:14-03:00] Round trip: http://127.0.0.1:9993, code: 200, duration: 1.308187ms 
DEBU[2016-08-12T18:01:14-03:00] Round trip: http://127.0.0.1:9993, code: 200, duration: 1.281541ms 

Interesting to note is that only 9 requests are shown in uWSGI logs, that is, the failed request didn't even make to the backend server.

Important to note is that I can run this same ab command directly to uWSGI instance and no requests fails. Even with higher loads, for example:

concurrency: 64, total requests: 2048

Document Path:          /static/data
Document Length:        1024 bytes

Concurrency Level:      64
Time taken for tests:   0.463 seconds
Complete requests:      2048
Failed requests:        0
Total transferred:      2275328 bytes
HTML transferred:       2097152 bytes
Requests per second:    4419.50 [#/sec] (mean)
Time per request:       14.481 [ms] (mean)
Time per request:       0.226 [ms] (mean, across all concurrent requests)
Transfer rate:          4794.98 [Kbytes/sec] received

0 failed requests when running directly to uWSGI server.

Do you have any insights of what could be happening here? At the same time that I think traefik has a problem I think that uWSGI might be also part of the problem, because if I switch uWSGI with nginx, all requests works as expected. But how to explain the fact that uWSGI responds to all requests when accessed directly (not proxied by traefik)?

Thanks a LOT for any insights you may have on this problem.

How to reproduce

Running uWSGI

You need to install uwsgi. You can do this wih pip install uwsgi. Create a static/ folder and create the 1024 bytes file inside it:

mkdir static
dd if=/dev/zero bs=1 count=1024 of=static/data

Run uwsgi with:
uwsgi --buffer-size 32768 --http-socket 127.0.0.1:9993 --processes 4 --http-timeout 60 --static-map2=/static=./

Running Traefik

Just run traefik with the provided config.

Running apache ab

ab -c 2 -n 10 -H "Host: myhost" http://localhost:90/static/data

@andersbetner
Copy link
Contributor

This is just my guess, but it seems like uwsgi closes a connection that traefik thinks should be active (connection: keep-alive).

There is an issue related to this in golang golang/go#15446 (resolved in golang 1.7 due out any day now...)

That issue only mentions POST requests, but the symptom you get is the same.

What happens if you disable keep-alive in uwsgi? Do you get the same result?

@daltonmatos
Copy link
Author

Hello @andersbetner, thanks for the insight. Actually my current uwsgi setup dos not add the Connection: keep-alive header. So my first results were gathered with keep-alive turned off.

I jusst ran a quick test and added the option --add-header "Connection: Keep-Alive" to my uwsgi instance. The header is being added corectly:

➜  ~ curl -siH http://localhost:31100/static/css/skins/skin.css | head -5
HTTP/1.1 200 OK
Content-Type: text/css
Content-Length: 5331
Last-Modified: Tue, 28 Jun 2016 17:19:21 GMT
Connection: Keep-Alive

I re-ran the ab test and still see failing requests, same behavior.

I also tried to compile traefik changing the FROM golang:1.6.3 line to FROM golang:1.7 (inside build.Dockerfile), but it did not work. =D

Indeed it will be a good test to dun traefik with Go 1.7.

Thanks!!

@daltonmatos
Copy link
Author

Hello all,

I did the following test:

Ran the same ab command line (2 concurent connections with 10 total requests), but this time I captured all the traffic with tcpdump. In this run I had 1 failed request.

Here is the tcpdump saved file: https://s3-us-west-2.amazonaws.com/daltonmatos-public/ab-lo.tcpdump

Note: In this test traefik is running on port 90 and uwsgi on port 9000.

I opened the saved tcpdump file in wireshark and applied these filters:

tcp.dstport == 90 && tcp.connection.syn Returns 10 results, that is, apache ab opened 10 connections to traefik.
tcp.dstport == 9000 && tcp.connection.syn Returns 9 results, that is, traefik is only connecting 9 times to uwsgi.
http && http.request.uri == "/static/collect/ok" && tcp.dstport == 9000 Returns 9 results, thats is, only 9 http requests were sent to uwsgi.

This shows me that traefik is missing one connection to my backend, in this case, uwsgi.

Any insight about what could cause this?

@emilevauge

@errm
Copy link
Contributor

errm commented Sep 12, 2016

@daltonmatos it looks to me like the issue linked to by @andersbetner is to blame. could you try your tests with containous/traefik:pr-641 as that image should have been built with go 1.7...

@daltonmatos
Copy link
Author

Ok @errm, I will test it again and report back. Thank you very much!

@daltonmatos
Copy link
Author

Hello @errm, I just ran the tests again. The error still occurs, but the error message is now much more understandable. Look:

This is the log of a fisrt run of 10 requests, concurrency of 2.

time="2016-09-13T14:02:15Z" level=debug msg="Round trip: http://127.0.0.1:9000, code: 200, duration: 768.454µs" 
time="2016-09-13T14:02:15Z" level=debug msg="Round trip: http://127.0.0.1:9000, code: 200, duration: 804.515µs" 
time="2016-09-13T14:02:15Z" level=debug msg="Round trip: http://127.0.0.1:9000, code: 200, duration: 659.699µs" 
time="2016-09-13T14:02:15Z" level=debug msg="Round trip: http://127.0.0.1:9000, code: 200, duration: 866.266µs" 
time="2016-09-13T14:02:15Z" level=warning msg="Error forwarding to http://127.0.0.1:9000, err: http: server closed idle connection" 
time="2016-09-13T14:02:15Z" level=debug msg="Round trip: http://127.0.0.1:9000, code: 200, duration: 637.583µs" 
time="2016-09-13T14:02:15Z" level=debug msg="Round trip: http://127.0.0.1:9000, code: 200, duration: 386.634µs" 
time="2016-09-13T14:02:15Z" level=debug msg="Round trip: http://127.0.0.1:9000, code: 200, duration: 654.521µs" 
time="2016-09-13T14:02:15Z" level=debug msg="Round trip: http://127.0.0.1:9000, code: 200, duration: 430.565µs" 
time="2016-09-13T14:02:15Z" level=debug msg="Round trip: http://127.0.0.1:9000, code: 200, duration: 466.203µs"

With this error I can now search about how uwsgi behaves with idle connections.

On a second run (same parameters) there was new errors.

time="2016-09-13T14:02:49Z" level=debug msg="Round trip: http://127.0.0.1:9000, code: 200, duration: 488.213µs" 
time="2016-09-13T14:02:49Z" level=debug msg="Round trip: http://127.0.0.1:9000, code: 200, duration: 794.536µs" 
time="2016-09-13T14:02:49Z" level=debug msg="Round trip: http://127.0.0.1:9000, code: 200, duration: 451.204µs" 
time="2016-09-13T14:02:49Z" level=debug msg="Round trip: http://127.0.0.1:9000, code: 200, duration: 582.514µs" 
time="2016-09-13T14:02:49Z" level=debug msg="Round trip: http://127.0.0.1:9000, code: 200, duration: 367.537µs" 
time="2016-09-13T14:02:49Z" level=warning msg="Error forwarding to http://127.0.0.1:9000, err: EOF" 
time="2016-09-13T14:02:49Z" level=debug msg="New attempt 2 for request: /static/collect/ok" 
time="2016-09-13T14:02:49Z" level=debug msg="Round trip: http://127.0.0.1:9000, code: 200, duration: 392.388µs" 
time="2016-09-13T14:02:49Z" level=debug msg="Round trip: http://127.0.0.1:9000, code: 200, duration: 400.474µs" 
time="2016-09-13T14:02:49Z" level=warning msg="Error forwarding to http://127.0.0.1:9000, err: http: server closed idle connection" 
time="2016-09-13T14:02:49Z" level=debug msg="Round trip: http://127.0.0.1:9000, code: 200, duration: 312.167µs" 
time="2016-09-13T14:02:49Z" level=debug msg="Round trip: http://127.0.0.1:9000, code: 200, duration: 480.283µs" 

Both runs resulted in 1 request failed, which means that traefik seems to be correctly retrying the request when this "EOF" error occurs.

About the idle connections, does it make sense to traefik detect this closed connection and automatically reconnect with the backend?

I will now search about uwsgi and idle connections.

Thanks a lot for the help!

Also, are there any plans to officially release traefik built with golang 1.7? Thanks!

@daltonmatos
Copy link
Author

Hey @errm and @andersbetner,

@andersbetner suggested earlier that keep alive could be the problem. At that time I "validated" this by only adding the Connection: keep-alive header in the responses but this is wrong since uwsgi is not really working with keep alive turned on, after all this is just a header like any other.

I now found out the --http-keepalive option of uwsgi (I found it searching about idle connections being closed) and now it all works!

I ran a test with 128 concurrent requests and a total of 10k requests, no errors!

At the same time that I thank you all for the help I also apologize for making you loose some time on this problem. Although the problem could have been solved before, the "idle connection" hint was the key to finally solve it.

Thank you very much!

Even so, I still hope to see traefik officially built on golang 1.7! ;-)

@andersbetner
Copy link
Contributor

@daltonmatos
No worries, I'm glad you solved your problem.

@daltonmatos
Copy link
Author

Just closing, since the problem is now solved.

Thanks all!

@nathan-osman
Copy link
Contributor

@daltonmatos I am having the exact same issue but uWSGI doesn't seem to have a --http-keepalive flag:

# uwsgi --http-keepalive
uwsgi: unrecognized option: http-keepalive
getopt_long() error

What version of uWSGI were you using?

@nathan-osman
Copy link
Contributor

nathan-osman commented Dec 3, 2016

Seems like the http plugin needs to be loaded in order for the --http-keepalive flag to work:

# uwsgi --plugin http --http-keepalive

However, even with that set, I'm still getting the same http: server closed idle connection error. Strangely enough, running ab with the keepalive (-k) flag produces the following output:

# ab -c 2 -n 100 -k http://myserver:80/static/img/photo/CYC5207.jpg
...
Concurrency Level:      2
Time taken for tests:   0.073 seconds
Complete requests:      100
Failed requests:        0
Keep-Alive requests:    0
Total transferred:      29774500 bytes
HTML transferred:       29763000 bytes

The Keep-Alive requests: 0 is leading me to believe that I'm still missing a setting or flag somewhere.

Attempting the requests with cURL (using the -v flag) reveals the following:

* Connection #0 to host myserver left intact
* Found bundle for host myserver: 0x55b52c8b9630 [can pipeline]
* Connection 0 seems to be dead!
* Closing connection 0

@jpic
Copy link

jpic commented Mar 2, 2017

@nathan-osman you need to add the http plugin, ie. --plugin python27,http --http-keepalive

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

7 participants