
Random 500 Error returned by Traefik #1926

Closed
karthimohan opened this issue Aug 7, 2017 · 18 comments · Fixed by vulcand/oxy#155
Labels: kind/enhancement, priority/P2, status/5-frozen-due-to-age
Milestone: 1.6

@karthimohan

karthimohan commented Aug 7, 2017

Report a bug

What version of Traefik are you using (traefik version)?

v1.3.0

What did you do?

We are using Traefik as a load balancer with the Marathon provider in production.

Below is our setup

Edge Proxy - Haproxy
Internal proxy Layer - Traefik
We use HAProxy as our edge proxy due to complex rewrites/redirects, and its backends are configured with the Traefik servers.
All our internal service-to-service communication also goes through Traefik.

Here is our traffic flow

API Call --> Haproxy --> Traefik Servers --> Docker containers

Initially, when we introduced Traefik in production, we only switched the inter-service communication that used to go through internal HAProxy load balancers, and we did not see any issues.
We then decided to route the edge proxy traffic via Traefik instead of sending it directly to the app containers. We made this change for only 10% of our traffic, to observe before switching everything over.

So out of 3 edge HAProxy servers, only one sends traffic via Traefik; the other 2 send directly to the app containers.

What did you expect to see?

No Change in Behavior

What did you see instead?

After we made this change, we started observing random 500 errors returned by the HAProxy server that sends traffic via Traefik, but we don't see the issue on the other proxy servers for similar API calls. We also don't see those failed requests reaching the backend app containers, as there are no corresponding events in their logs during this time.

Log Event from Haproxy

Aug 6 22:56:52 localhost.localdomain haproxy[118832]: ::ffff:12:23:234:12:35449 [06/Aug/2017:22:56:51.741] HTTPS~ appname/traefik-80 345/0/0/231/576 500 157 - - ---- 35/34/0/0/0 0/0 {api.abc.cloud||||okhttp/3.4.2} "GET /v1.1/abc/c45f21be84824984b34c8538bd8b519a/abc?offset=1000&count=100 HTTP/1.1"

Log Event from Traefik

101.110.24.120 - - [06/Aug/2017:22:56:52 -0700] "GET /api/abc/c45f21be84824984b34c8538bd8b519a/abc HTTP/1.1" 500 21 "" "okhttp/3.4.2" 44609802 "appname" "http://hostname:31020" 229ms

What is your environment & configuration (arguments, toml, provider, platform, ...)?

configuration
################################################################
# Global configuration
################################################################

# Timeout in seconds.
# Duration to give active requests a chance to finish during hot-reloads
#
# Optional
# Default: 10
#
graceTimeOut = 10

# Traefik logs file
# If not defined, logs to stdout
#
# Optional
#
traefikLogsFile = "/var/log/traefik/traefik.log"


# Access logs file
#
# Optional
#
accessLogsFile = "/var/log/traefik/traefik-access.log"

# Log level
#
# Optional
# Default: "ERROR"
#
logLevel = "DEBUG"

# Backends throttle duration: minimum duration between 2 events from providers
# before applying a new configuration. It avoids unnecessary reloads if multiples events
# are sent in a short amount of time.
#
# Optional
# Default: "2s"
#
ProvidersThrottleDuration = 1


# If non-zero, controls the maximum idle (keep-alive) to keep per-host.  If zero, DefaultMaxIdleConnsPerHost is used.
# If you encounter 'too many open files' errors, you can either change this value, or change `ulimit` value.
#
# Optional
# Default: http.DefaultMaxIdleConnsPerHost
#
MaxIdleConnsPerHost = 200

# If set to true invalid SSL certificates are accepted for backends.
# Note: This disables detection of man-in-the-middle attacks so should only be used on secure backend networks.
# Optional
# Default: false
#
InsecureSkipVerify = false

# Entrypoints to be used by frontends that do not specify any entrypoint.
# Each frontend can specify its own entrypoints.
#
# Optional
# Default: ["http"]
#
defaultEntryPoints = ["http"]

################################################################
# Web configuration backend
################################################################

[web]
address = ":8080"

################################################################
# Mesos/Marathon configuration backend
################################################################

# Enable Marathon configuration backend
#
# Optional
#
#[marathon]
[marathon]


# Marathon server endpoint.
# You can also specify multiple endpoint for Marathon:
# endpoint := "http://10.241.1.71:8080,10.241.1.72:8080,10.241.1.73:8080"
#
# Required
#
endpoint = "http://101.211.1.11:8080,101.211.1.13:8080,101.211.1.13:8080"

# Enable watch Marathon changes
#
# Optional
#
watch = true

# Default domain used.
#
# Required
#
domain = "traefik.service.consul"

# Override default configuration template. For advanced users :)
#
# Optional
#
# filename = "marathon.tmpl"
filename = "/opt/traefik/conf/marathon.tmpl"

# Expose Marathon apps by default in traefik
#
# Optional
# Default: false
#
exposedByDefault = false


# Convert Marathon groups to subdomains
# Default behavior: /foo/bar/myapp => foo-bar-myapp.{defaultDomain}
# with groupsAsSubDomains enabled: /foo/bar/myapp => myapp.bar.foo.{defaultDomain}
#
# Optional
# Default: false
#
groupsAsSubDomains = false

# Enable Marathon basic authentication
#
# Optional
#
[marathon.basic]
httpBasicAuthUser = "xxx"
httpBasicPassword = "xxxxxxx"

# To enable more detailed statistics
[web.statistics]
   RecentErrors = 10
# To enable Traefik to export internal metrics to Prometheus
[web.metrics.prometheus]
   Buckets=[0.1,0.3,1.2,5.0]

# TLS client configuration. https://golang.org/pkg/crypto/tls/#Config
#
# Optional
#
# [marathon.TLS]
# InsecureSkipVerify = true


[retry]

We use the Traefik healthcheck, and below are the Marathon labels for one of the services where we see the failure:

  "labels": {
    "consul": "servicename-5654",
    "external": "tag",
    "internal": "tag",
    "prod": "tag",
    "5654": "tag",
    "traefik.backend.healthcheck.path": "/healthcheck",
    "traefik.backend.healthcheck.interval": "10s",
    "traefik.frontend.rule": "PathPrefixStrip:/servicename",
    "traefik.backend": "-servicename",
    "traefik.portIndex": "0",
    "traefik.enable": "true"
  },

We are not able to identify a pattern. We were running the Traefik servers on c4.xlarge instances and changed the instance type to c4.2xlarge to rule out a network bandwidth/resource issue. We'd appreciate your help in troubleshooting this issue.

@ldez
Contributor

ldez commented Aug 7, 2017

Thanks for your interest in Traefik !

Could you try with the latest version (1.3.5)?

@karthimohan
Author

@ldez I am testing version 1.3.5 in preprod. Will update this issue with my results in a day.

@karthimohan
Author

@ldez Tested with 1.3.5 and it broke all websocket connections, related to change #1930. So I can't say for sure whether version 1.3.5 fixed my originally reported issue.

I tried the patch mentioned in #1930 and the websocket issue is fixed. So I am going to apply the patch to the rest of my preprod stack and test.

Do you have any timelines on when #1930 will be merged and released?

@karthimohan
Author

@ldez After applying the patch mentioned in #1930, I don't see the websocket issue, but I am back to the old random failures on other HTTP requests, as described in my bug report.

@ldez ldez added kind/bug/confirmed a confirmed bug (reproducible). area/websocket kind/bug/possible a possible bug that needs analysis before it is confirmed or fixed. and removed status/0-needs-triage area/websocket kind/bug/confirmed a confirmed bug (reproducible). labels Aug 19, 2017
@ldez
Contributor

ldez commented Aug 19, 2017

Could you provide more logs? (traefik --debug)

@karthimohan
Author

@ldez Are you looking for any specific events, so I can filter them out? Below are some of the log events for which we received a 500 status code.

Haproxy
Aug 21 10:09:12 localhost.localdomain haproxy[31211]: ::ffff:127.0.0.1:41450 [21/Aug/2017:10:08:12.613] HTTPS~ app/traefikserver-80 0/0/0/60001/60001 500 157 - - ---- 1/1/0/0/0 0/0 {api.abc.cloud||||Apache-HttpClient/4.3.6 (java 1.5)} "GET /v1.1/abc/applications?offset=0 HTTP/1.1"
Aug 21 10:16:51 localhost.localdomain haproxy[31211]: ::ffff:127.0.0.1:41818 [21/Aug/2017:10:15:51.368] HTTPS~ app/traefikserver-80 0/0/0/60001/60001 500 157 - - ---- 1/1/0/0/0 0/0 {api.abc.cloud||||Apache-HttpClient/4.3.6 (java 1.5)} "GET /v1.1/abc/applications?offset=0 HTTP/1.1"
Aug 21 10:32:53 localhost.localdomain haproxy[31211]: ::ffff:127.0.0.1:42757 [21/Aug/2017:10:31:53.376] HTTPS~ app/traefikserver-80 0/0/0/60004/60004 500 157 - - ---- 1/1/0/0/0 0/0 {api.abc.cloud||||Apache-HttpClient/4.3.6 (java 1.5)} "GET /v1.1/applications HTTP/1.1"
Aug 21 10:41:51 localhost.localdomain haproxy[31211]: ::ffff:127.0.0.1:43500 [21/Aug/2017:10:40:51.598] HTTPS~ app/traefikserver-80 0/0/0/60001/60001 500 157 - - ---- 1/1/0/0/0 0/0 {api.abc.cloud||||Apache-HttpClient/4.3.6 (java 1.5)} "GET /v1.1/abc/applications HTTP/1.1"
Aug 21 10:43:53 localhost.localdomain haproxy[31211]: ::ffff:127.0.0.1:43625 [21/Aug/2017:10:42:53.851] HTTPS~ app/traefikserver-80 0/0/0/60001/60001 500 157 - - ---- 1/1/0/0/0 0/0 {api.abc.cloud||||Apache-HttpClient/4.3.6 (java 1.5)} "GET /v1.1/api/3c19537140ed438f83ce423c4e6a4701/applications?offset=0&count=100 HTTP/1.1"
Aug 21 11:07:30 localhost.localdomain haproxy[31211]: ::ffff:127.0.0.1:45088 [21/Aug/2017:11:07:18.893] HTTPS~ app/traefikserver-80 0/0/0/12044/12044 500 157 - - ---- 1/1/0/0/0 0/0 {api.abc.cloud|||142|Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.3} "PUT /v1.1/api/dt7e29b491d8fb4ad79ca9beccd7d21a96 HTTP/1.1"
Traefik-access.log
101.202.100.100 - - [21/Aug/2017:10:08:12 -0700] "GET /abc/applications HTTP/1.1" 500 21 "" "Apache-HttpClient/4.3.6 (java 1.5)" 153198 "appfrontendA" "http://hostname:31574" 60000ms
101.202.100.100 - - [21/Aug/2017:10:15:51 -0700] "GET /abc/applications HTTP/1.1" 500 21 "" "Apache-HttpClient/4.3.6 (java 1.5)" 153384 "appfrontendA" "http://hostname:31574" 60000ms
101.202.100.100 - - [21/Aug/2017:10:31:53 -0700] "GET /applications HTTP/1.1" 500 21 "" "Apache-HttpClient/4.3.6 (java 1.5)" 153818 "appfrontendA" "http://hostname:31485" 60000ms
101.202.100.100 - - [21/Aug/2017:10:40:51 -0700] "GET /abc/applications HTTP/1.1" 500 21 "" "Apache-HttpClient/4.3.6 (java 1.5)" 154226 "appfrontendA" "http://hostname" 60000ms
101.202.100.100 - - [21/Aug/2017:10:42:53 -0700] "GET /api/3c19537140ed438f83ce423c4e6a4701/applications HTTP/1.1" 500 21 "" "Apache-HttpClient/4.3.6 (java 1.5)" 154308 "appfrontendA" "http://hostname:31574" 60000ms
101.202.100.100 - - [21/Aug/2017:11:07:18 -0700] "PUT api/dt7e29b491d8fb4ad79ca9beccd7d21a96 HTTP/1.1" 500 21 "https://URI/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36" 155003 "appfrontendA" "http://hostname:31485" 12042ms
Traefik Log
time="2017-08-21T10:09:12-07:00" level=warning msg="Error forwarding to http://hostname:31574, err: context canceled"
time="2017-08-21T10:16:51-07:00" level=warning msg="Error forwarding to http://hostname:31574, err: context canceled"
time="2017-08-21T10:32:53-07:00" level=warning msg="Error forwarding to http://hostname:31485, err: context canceled"
time="2017-08-21T10:41:51-07:00" level=warning msg="Error forwarding to http://hostname:31574, err: context canceled"
time="2017-08-21T10:43:53-07:00" level=warning msg="Error forwarding to http://hostname:31574, err: context canceled"
time="2017-08-21T11:07:30-07:00" level=warning msg="Error forwarding to http://hostname:31485, err: context canceled"

@emilevauge
Member

emilevauge commented Aug 23, 2017

@karthimohan that's weird 🤔
Could you try commenting out these sections?

[web.statistics]
   RecentErrors = 10
# To enable Traefik to export internal metrics to Prometheus
[web.metrics.prometheus]
   Buckets=[0.1,0.3,1.2,5.0]

[retry]

@karthimohan
Author

@emilevauge Not sure how relevant removing those is, but I just made the suggested change in my preprod and am going to monitor it for a day and update here.

@warroyo

warroyo commented Oct 27, 2017

I am seeing the context canceled error as well. It is extremely noticeable under heavy load.

I ran a performance test with wrk:

wrk -t10 -c100 -d20s --latency

here are the results:

10 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   527.92ms  218.30ms   1.89s    65.72%
    Req/Sec    19.46     11.15    60.00     82.91%
  Latency Distribution
     50%  517.02ms
     75%  684.67ms
     90%  800.57ms
     99%    1.06s
  3023 requests in 20.09s, 3.21MB read
  Socket errors: connect 0, read 0, write 0, timeout 99
Requests/sec:    150.48
Transfer/sec:    163.84KB

Notice the 99 timeouts; these are all time="2017-10-27T21:44:46Z" level=warning msg="Error forwarding to http://10.42.24.15:80, err: context canceled"

If I increase the number of connections it gets much worse, sometimes to the point where all requests fail.

Are there some settings that need to be tweaked for heavier load?

Not sure if this could be related to #1849.

@timoreimann
Contributor

timoreimann commented Nov 20, 2017

One reason why this can apparently happen is due to disappearing clients (see also golang/go#20071). Is there a chance that Traefik is running out of capacity or backends cannot keep up with the request rate causing clients to eventually lose patience and prematurely terminate the request to Traefik?

@thuandt

thuandt commented Dec 29, 2017

I got the same problem, and it's quite weird: most of the backends are OK, but one of them always returns 500 with err: context canceled.

wrk -t10 -c100 -d20s -H "Authorization: Basic xxx==" https://xyz.abc/services/health_check                                                       
Running 20s test @ https://xyz.abc/services/health_check                    
  10 threads and 100 connections               
  Thread Stats   Avg      Stdev     Max   +/- Stdev                                            
    Latency   513.82ms  141.49ms   1.29s    73.54%                                             
    Req/Sec    20.66     12.13    80.00     82.34%                                             
  3696 requests in 20.05s, 1.59MB read         
  Non-2xx or 3xx responses: 3696               
Requests/sec:    184.38                        
Transfer/sec:     81.03KB

I tried with v1.5.0-rc3 and 1.4 stable, but got the same results.

@mrnugget

mrnugget commented Jan 11, 2018

We're experiencing the same issue: requests where the client disappears before a response could be sent are logged as errors.

The problem is really easy to reproduce: just load a site in your browser that takes 1s+ to load and hit reload multiple times before the page is fully rendered to "reissue" the request. The client has now disappeared for the first requests, and Traefik logs a 500.

I think it's debatable whether this should be reported as a status 500 error. IMHO it's not an error. But even if we classify it as an error, I don't think it should be reported as an "Internal Server Error".

Our alerting system (based on grafana + prometheus) goes nuts if users are impatient and hit reload multiple times.

@ortz
Contributor

ortz commented Mar 14, 2018

Hey, we're experiencing the same issue.
It seems that sometimes Traefik returns a lot of 500s with the following message: "Error forwarding to http://IP:PORT, err: context canceled".
The servers weren't under unusual load, and things return to normal after a couple of minutes.

Any suggestions?

@nmengin nmengin added kind/enhancement a new or improved feature. priority/P2 need to be fixed in the future and removed status/0-needs-triage area/provider/marathon labels Mar 15, 2018
@nmengin nmengin removed the kind/bug/possible a possible bug that needs analysis before it is confirmed or fixed. label Mar 15, 2018
@hsmade
Contributor

hsmade commented Jul 31, 2018

I suggest starting Traefik with the following env var: GODEBUG="http2debug=2". Then look for 500s and check whether you see an RST_STREAM right before them. Those are requests that were reset by the (web) client.

@coufalja
Contributor

coufalja commented Aug 2, 2018

I think the issue lies in the vulcand/oxy library. The default error handler does this:

func (e *StdHandler) ServeHTTP(w http.ResponseWriter, req *http.Request, err error) {
	statusCode := http.StatusInternalServerError
	if e, ok := err.(net.Error); ok {
		if e.Timeout() {
			statusCode = http.StatusGatewayTimeout
		} else {
			statusCode = http.StatusBadGateway
		}
	} else if err == io.EOF {
		statusCode = http.StatusBadGateway
	}
	w.WriteHeader(statusCode)
	w.Write([]byte(http.StatusText(statusCode)))
	log.Debugf("'%d %s' caused by: %v", statusCode, http.StatusText(statusCode), err)
}

So for a context.Canceled error it writes a 500 status code, which is then caught by the metrics middleware in Traefik and reported as a 500. The question is whether this should be fixed in Traefik by using a custom ErrorHandler, or in oxy. It drives our monitoring crazy, so I would like to help with a PR; I just want to know the preference for where and how to fix it.

@kristinn

kristinn commented Aug 7, 2018

I want to confirm @mrnugget's findings. I'm able to reproduce this error when running siege and cancelling it midway.

An example:

echo "GET https://example.com" | vegeta attack -duration=600s -rate=10 | tee results.bin | vegeta report -reporter=json | jq

I'm able to reproduce the issue if I CTRL+C midway through the stress test (the same effect as constantly refreshing a page in the browser before the client has received a response).

It doesn't seem to affect our users in any way; it feels more like a "false alarm" in how Traefik logs these errors.

@coufalja
Contributor

coufalja commented Aug 8, 2018

Submitted vulcand/oxy#155 with the fix to oxy, and confirmed with wrk that it no longer reports those false alarms.

@traefiker
Contributor

Closed by #3777.

@traefiker traefiker added this to the 1.6 milestone Aug 20, 2018
@traefik traefik locked and limited conversation to collaborators Sep 1, 2019