
Disabling GET requests on upstream retries logic makes upstream selection unreliable #5838

Closed
b4stien opened this issue Sep 26, 2023 · 23 comments · Fixed by #5841
Labels
help wanted 🆘 Extra attention is needed

Comments

@b4stien

b4stien commented Sep 26, 2023

In a reverse proxy scenario I don't want Caddy to retry GET requests once an upstream has accepted the request from Caddy. As far as I understand, this is doable in Caddy by setting the following config:

"retry_match": [
  {
    "method": []
  }
]

My problem is that, doing so, I get intermittent "no upstreams available" errors from Caddy. I've put a full repro on this repository.

As "no upstreams available" suggests that the error occurred before a connection was established to the upstream, and after reading this comment, I don't get why the request is not retried.

@francislavoie
Member

The logic for this is here:

https://github.com/caddyserver/caddy/blob/master/modules/caddyhttp/reverseproxy/reverseproxy.go#L1014-L1031

I don't understand why you would want to disable retries for GET; can you explain your use case?
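
Roughly paraphrased, the decision goes something like this (a simplified sketch, not the actual source -- names and types here are made up for illustration):

package retrysketch

import (
	"net/http"
	"time"
)

// shouldRetry is a rough paraphrase of the retry decision: within the
// try_duration / retries budget, a failed dial is always retryable;
// otherwise the request must match retry_match (or be a GET when no
// retry_match is configured).
func shouldRetry(tryDuration time.Duration, maxRetries, retries int,
	start time.Time, dialFailed bool, retryMethods []string, req *http.Request) bool {

	// stop once try_duration or the retry budget is exhausted
	if tryDuration > 0 && time.Since(start) >= tryDuration {
		return false
	}
	if maxRetries > 0 && retries >= maxRetries {
		return false
	}

	// a failed dial means no connection was ever established,
	// so retrying is treated as safe
	if dialFailed {
		return true
	}

	// otherwise retry only if the request matches retry_match;
	// with no retry_match configured, only GET requests are retried
	if retryMethods == nil {
		return req.Method == http.MethodGet
	}
	// an empty method list ({"method": []}) matches nothing, so it
	// disables retries entirely at this point
	for _, m := range retryMethods {
		if req.Method == m {
			return true
		}
	}
	return false
}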

@mholt
Member

mholt commented Sep 26, 2023

Thanks for the work to document this.

As "no upstreams available" suggests that the error occurred before a connection was established to the upstream

I'm not so sure about THAT necessarily; but it DOES at least mean that the load balancer could not find an upstream within parameters (healthy, not too many connections, etc) to use for the request.

I suppose I could run the repro -- thank you for providing that -- but if you have the (debug) logs handy, could you post them? I'm curious why the requests are failing in the first place, but I'm short on time 😅 so that may lead to faster resolution.

@mholt mholt added the help wanted 🆘 Extra attention is needed label Sep 26, 2023
@francislavoie
Member

I think your max_requests configured to 1 on the upstream is what's causing it to be unavailable. Try without that.

@mholt
Member

mholt commented Sep 26, 2023

I thought that too, but the try_duration is set for 20s. (Haven't had a chance to look at the code yet, been in graphics/frontend world today) -- I imagine it should try again after half a second and then again after another half a second, and by then the other request should be done... 🤔 at some point within those 20s you'd think it'd consider the upstream available again.
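
For context, here's roughly how I'd expect try_duration and try_interval to interact (a simplified sketch with made-up names, not the actual proxy loop):

package retrysketch

import "time"

// proxyWithRetries keeps re-selecting an upstream until try_duration
// elapses, waiting try_interval between attempts -- unless the retry
// decision (retryAllowed) says to stop early.
func proxyWithRetries(tryDuration, tryInterval time.Duration,
	proxyOnce func() error, retryAllowed func(error) bool) error {

	start := time.Now()
	for {
		err := proxyOnce() // select an upstream and attempt to proxy
		if err == nil {
			return nil
		}
		if !retryAllowed(err) {
			return err // e.g. the retry_match check says no
		}
		if tryDuration > 0 && time.Since(start) >= tryDuration {
			return err // overall retry budget spent
		}
		time.Sleep(tryInterval) // wait before the next attempt
	}
}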

@francislavoie
Member

I don't think it does retry after an interval, @mholt, because of the return false in the code I quoted above.

@mholt
Member

mholt commented Sep 27, 2023

Oh, right -- so, I mean, I feel like that's working as intended then yeah?

Retries are disabled, the one upstream is unavailable, so... accurate error message. Right?

@b4stien
Author

b4stien commented Sep 27, 2023

Here is the log for 10 iterations of the loop (10 sets of 2 simultaneous requests):

2023/09/27 05:38:12.978	DEBUG	http.handlers.reverse_proxy	selected upstream	{"dial": "localhost:8011", "total_upstreams": 1}
2023/09/27 05:38:12.978	ERROR	http.log.error	no upstreams available	{"request": {"remote_ip": "::1", "remote_port": "65314", "client_ip": "::1", "proto": "HTTP/1.1", "method": "GET", "host": "localhost:8010", "uri": "/ok-1s", "headers": {"Accept": ["*/*"], "Accept-Encoding": ["gzip, deflate"], "Connection": ["keep-alive"], "User-Agent": ["python-httpx/0.23.1"]}}, "duration": 0.000033292, "status": 503, "err_id": "6h672sx5x", "err_trace": "reverseproxy.(*Handler).proxyLoopIteration (reverseproxy.go:483)"}
2023/09/27 05:38:13.981	DEBUG	http.handlers.reverse_proxy	upstream roundtrip	{"upstream": "localhost:8011", "duration": 1.003431375, "request": {"remote_ip": "::1", "remote_port": "65313", "client_ip": "::1", "proto": "HTTP/1.1", "method": "GET", "host": "localhost:8010", "uri": "/ok-1s", "headers": {"Accept-Encoding": ["gzip, deflate"], "X-Forwarded-For": ["::1"], "User-Agent": ["python-httpx/0.23.1"], "X-Forwarded-Proto": ["http"], "X-Forwarded-Host": ["localhost:8010"], "Accept": ["*/*"]}}, "headers": {"Content-Length": ["2"], "Server": ["gunicorn"], "Date": ["Wed, 27 Sep 2023 05:38:13 GMT"], "Content-Type": ["text/html; charset=utf-8"]}, "status": 200}
2023/09/27 05:38:14.504	DEBUG	http.handlers.reverse_proxy	selected upstream	{"dial": "localhost:8011", "total_upstreams": 1}
2023/09/27 05:38:14.505	ERROR	http.log.error	no upstreams available	{"request": {"remote_ip": "::1", "remote_port": "65318", "client_ip": "::1", "proto": "HTTP/1.1", "method": "GET", "host": "localhost:8010", "uri": "/ok-1s", "headers": {"Accept-Encoding": ["gzip, deflate"], "Connection": ["keep-alive"], "User-Agent": ["python-httpx/0.23.1"], "Accept": ["*/*"]}}, "duration": 0.000167833, "status": 503, "err_id": "d22fhaq4j", "err_trace": "reverseproxy.(*Handler).proxyLoopIteration (reverseproxy.go:483)"}
2023/09/27 05:38:15.509	DEBUG	http.handlers.reverse_proxy	upstream roundtrip	{"upstream": "localhost:8011", "duration": 1.005102625, "request": {"remote_ip": "::1", "remote_port": "65317", "client_ip": "::1", "proto": "HTTP/1.1", "method": "GET", "host": "localhost:8010", "uri": "/ok-1s", "headers": {"Accept": ["*/*"], "Accept-Encoding": ["gzip, deflate"], "X-Forwarded-For": ["::1"], "User-Agent": ["python-httpx/0.23.1"], "X-Forwarded-Proto": ["http"], "X-Forwarded-Host": ["localhost:8010"]}}, "headers": {"Content-Type": ["text/html; charset=utf-8"], "Content-Length": ["2"], "Server": ["gunicorn"], "Date": ["Wed, 27 Sep 2023 05:38:15 GMT"]}, "status": 200}
2023/09/27 05:38:16.038	DEBUG	http.handlers.reverse_proxy	selected upstream	{"dial": "localhost:8011", "total_upstreams": 1}
2023/09/27 05:38:16.039	ERROR	http.log.error	no upstreams available	{"request": {"remote_ip": "::1", "remote_port": "65322", "client_ip": "::1", "proto": "HTTP/1.1", "method": "GET", "host": "localhost:8010", "uri": "/ok-1s", "headers": {"Accept-Encoding": ["gzip, deflate"], "Connection": ["keep-alive"], "User-Agent": ["python-httpx/0.23.1"], "Accept": ["*/*"]}}, "duration": 0.000080458, "status": 503, "err_id": "hd5em04qd", "err_trace": "reverseproxy.(*Handler).proxyLoopIteration (reverseproxy.go:483)"}
2023/09/27 05:38:17.044	DEBUG	http.handlers.reverse_proxy	upstream roundtrip	{"upstream": "localhost:8011", "duration": 1.005131333, "request": {"remote_ip": "::1", "remote_port": "65321", "client_ip": "::1", "proto": "HTTP/1.1", "method": "GET", "host": "localhost:8010", "uri": "/ok-1s", "headers": {"X-Forwarded-Proto": ["http"], "X-Forwarded-Host": ["localhost:8010"], "Accept": ["*/*"], "Accept-Encoding": ["gzip, deflate"], "X-Forwarded-For": ["::1"], "User-Agent": ["python-httpx/0.23.1"]}}, "headers": {"Date": ["Wed, 27 Sep 2023 05:38:17 GMT"], "Content-Type": ["text/html; charset=utf-8"], "Content-Length": ["2"], "Server": ["gunicorn"]}, "status": 200}
2023/09/27 05:38:17.571	DEBUG	http.handlers.reverse_proxy	selected upstream	{"dial": "localhost:8011", "total_upstreams": 1}
2023/09/27 05:38:17.571	DEBUG	http.handlers.reverse_proxy	selected upstream	{"dial": "localhost:8011", "total_upstreams": 1}
2023/09/27 05:38:18.579	DEBUG	http.handlers.reverse_proxy	upstream roundtrip	{"upstream": "localhost:8011", "duration": 1.007792542, "request": {"remote_ip": "::1", "remote_port": "65326", "client_ip": "::1", "proto": "HTTP/1.1", "method": "GET", "host": "localhost:8010", "uri": "/ok-1s", "headers": {"Accept": ["*/*"], "Accept-Encoding": ["gzip, deflate"], "X-Forwarded-Proto": ["http"], "X-Forwarded-Host": ["localhost:8010"], "X-Forwarded-For": ["::1"], "User-Agent": ["python-httpx/0.23.1"]}}, "headers": {"Content-Type": ["text/html; charset=utf-8"], "Content-Length": ["2"], "Server": ["gunicorn"], "Date": ["Wed, 27 Sep 2023 05:38:18 GMT"]}, "status": 200}
2023/09/27 05:38:19.583	DEBUG	http.handlers.reverse_proxy	upstream roundtrip	{"upstream": "localhost:8011", "duration": 2.011548916, "request": {"remote_ip": "::1", "remote_port": "65325", "client_ip": "::1", "proto": "HTTP/1.1", "method": "GET", "host": "localhost:8010", "uri": "/ok-1s", "headers": {"X-Forwarded-Host": ["localhost:8010"], "Accept": ["*/*"], "Accept-Encoding": ["gzip, deflate"], "X-Forwarded-For": ["::1"], "User-Agent": ["python-httpx/0.23.1"], "X-Forwarded-Proto": ["http"]}}, "headers": {"Date": ["Wed, 27 Sep 2023 05:38:19 GMT"], "Content-Type": ["text/html; charset=utf-8"], "Content-Length": ["2"], "Server": ["gunicorn"]}, "status": 200}
2023/09/27 05:38:20.111	DEBUG	http.handlers.reverse_proxy	selected upstream	{"dial": "localhost:8011", "total_upstreams": 1}
2023/09/27 05:38:20.111	DEBUG	http.handlers.reverse_proxy	selected upstream	{"dial": "localhost:8011", "total_upstreams": 1}
2023/09/27 05:38:21.120	DEBUG	http.handlers.reverse_proxy	upstream roundtrip	{"upstream": "localhost:8011", "duration": 1.008337166, "request": {"remote_ip": "::1", "remote_port": "65332", "client_ip": "::1", "proto": "HTTP/1.1", "method": "GET", "host": "localhost:8010", "uri": "/ok-1s", "headers": {"X-Forwarded-For": ["::1"], "User-Agent": ["python-httpx/0.23.1"], "Accept": ["*/*"], "Accept-Encoding": ["gzip, deflate"], "X-Forwarded-Proto": ["http"], "X-Forwarded-Host": ["localhost:8010"]}}, "headers": {"Server": ["gunicorn"], "Date": ["Wed, 27 Sep 2023 05:38:21 GMT"], "Content-Type": ["text/html; charset=utf-8"], "Content-Length": ["2"]}, "status": 200}
2023/09/27 05:38:22.124	DEBUG	http.handlers.reverse_proxy	upstream roundtrip	{"upstream": "localhost:8011", "duration": 2.012443208, "request": {"remote_ip": "::1", "remote_port": "65331", "client_ip": "::1", "proto": "HTTP/1.1", "method": "GET", "host": "localhost:8010", "uri": "/ok-1s", "headers": {"Accept": ["*/*"], "Accept-Encoding": ["gzip, deflate"], "X-Forwarded-For": ["::1"], "User-Agent": ["python-httpx/0.23.1"], "X-Forwarded-Proto": ["http"], "X-Forwarded-Host": ["localhost:8010"]}}, "headers": {"Server": ["gunicorn"], "Date": ["Wed, 27 Sep 2023 05:38:22 GMT"], "Content-Type": ["text/html; charset=utf-8"], "Content-Length": ["2"]}, "status": 200}
2023/09/27 05:38:22.649	DEBUG	http.handlers.reverse_proxy	selected upstream	{"dial": "localhost:8011", "total_upstreams": 1}
2023/09/27 05:38:22.649	ERROR	http.log.error	no upstreams available	{"request": {"remote_ip": "::1", "remote_port": "65338", "client_ip": "::1", "proto": "HTTP/1.1", "method": "GET", "host": "localhost:8010", "uri": "/ok-1s", "headers": {"Accept-Encoding": ["gzip, deflate"], "Connection": ["keep-alive"], "User-Agent": ["python-httpx/0.23.1"], "Accept": ["*/*"]}}, "duration": 0.000049625, "status": 503, "err_id": "fp9phbp6k", "err_trace": "reverseproxy.(*Handler).proxyLoopIteration (reverseproxy.go:483)"}
2023/09/27 05:38:23.655	DEBUG	http.handlers.reverse_proxy	upstream roundtrip	{"upstream": "localhost:8011", "duration": 1.005424459, "request": {"remote_ip": "::1", "remote_port": "65337", "client_ip": "::1", "proto": "HTTP/1.1", "method": "GET", "host": "localhost:8010", "uri": "/ok-1s", "headers": {"X-Forwarded-Host": ["localhost:8010"], "X-Forwarded-For": ["::1"], "User-Agent": ["python-httpx/0.23.1"], "Accept": ["*/*"], "Accept-Encoding": ["gzip, deflate"], "X-Forwarded-Proto": ["http"]}}, "headers": {"Server": ["gunicorn"], "Date": ["Wed, 27 Sep 2023 05:38:23 GMT"], "Content-Type": ["text/html; charset=utf-8"], "Content-Length": ["2"]}, "status": 200}
2023/09/27 05:38:24.185	DEBUG	http.handlers.reverse_proxy	selected upstream	{"dial": "localhost:8011", "total_upstreams": 1}
2023/09/27 05:38:24.185	ERROR	http.log.error	no upstreams available	{"request": {"remote_ip": "::1", "remote_port": "65342", "client_ip": "::1", "proto": "HTTP/1.1", "method": "GET", "host": "localhost:8010", "uri": "/ok-1s", "headers": {"Connection": ["keep-alive"], "User-Agent": ["python-httpx/0.23.1"], "Accept": ["*/*"], "Accept-Encoding": ["gzip, deflate"]}}, "duration": 0.000077375, "status": 503, "err_id": "a401n3g8c", "err_trace": "reverseproxy.(*Handler).proxyLoopIteration (reverseproxy.go:483)"}
2023/09/27 05:38:25.190	DEBUG	http.handlers.reverse_proxy	upstream roundtrip	{"upstream": "localhost:8011", "duration": 1.004839625, "request": {"remote_ip": "::1", "remote_port": "65341", "client_ip": "::1", "proto": "HTTP/1.1", "method": "GET", "host": "localhost:8010", "uri": "/ok-1s", "headers": {"X-Forwarded-Proto": ["http"], "X-Forwarded-Host": ["localhost:8010"], "Accept": ["*/*"], "Accept-Encoding": ["gzip, deflate"], "X-Forwarded-For": ["::1"], "User-Agent": ["python-httpx/0.23.1"]}}, "headers": {"Server": ["gunicorn"], "Date": ["Wed, 27 Sep 2023 05:38:25 GMT"], "Content-Type": ["text/html; charset=utf-8"], "Content-Length": ["2"]}, "status": 200}
2023/09/27 05:38:25.719	DEBUG	http.handlers.reverse_proxy	selected upstream	{"dial": "localhost:8011", "total_upstreams": 1}
2023/09/27 05:38:25.719	DEBUG	http.handlers.reverse_proxy	selected upstream	{"dial": "localhost:8011", "total_upstreams": 1}
2023/09/27 05:38:26.724	DEBUG	http.handlers.reverse_proxy	upstream roundtrip	{"upstream": "localhost:8011", "duration": 1.004529458, "request": {"remote_ip": "::1", "remote_port": "65346", "client_ip": "::1", "proto": "HTTP/1.1", "method": "GET", "host": "localhost:8010", "uri": "/ok-1s", "headers": {"X-Forwarded-For": ["::1"], "User-Agent": ["python-httpx/0.23.1"], "Accept": ["*/*"], "Accept-Encoding": ["gzip, deflate"], "X-Forwarded-Proto": ["http"], "X-Forwarded-Host": ["localhost:8010"]}}, "headers": {"Server": ["gunicorn"], "Date": ["Wed, 27 Sep 2023 05:38:26 GMT"], "Content-Type": ["text/html; charset=utf-8"], "Content-Length": ["2"]}, "status": 200}
2023/09/27 05:38:27.733	DEBUG	http.handlers.reverse_proxy	upstream roundtrip	{"upstream": "localhost:8011", "duration": 2.013806083, "request": {"remote_ip": "::1", "remote_port": "65345", "client_ip": "::1", "proto": "HTTP/1.1", "method": "GET", "host": "localhost:8010", "uri": "/ok-1s", "headers": {"X-Forwarded-Host": ["localhost:8010"], "Accept": ["*/*"], "Accept-Encoding": ["gzip, deflate"], "X-Forwarded-For": ["::1"], "User-Agent": ["python-httpx/0.23.1"], "X-Forwarded-Proto": ["http"]}}, "headers": {"Content-Type": ["text/html; charset=utf-8"], "Content-Length": ["2"], "Server": ["gunicorn"], "Date": ["Wed, 27 Sep 2023 05:38:27 GMT"]}, "status": 200}
2023/09/27 05:38:28.262	DEBUG	http.handlers.reverse_proxy	selected upstream	{"dial": "localhost:8011", "total_upstreams": 1}
2023/09/27 05:38:28.263	DEBUG	http.handlers.reverse_proxy	selected upstream	{"dial": "localhost:8011", "total_upstreams": 1}
2023/09/27 05:38:29.268	DEBUG	http.handlers.reverse_proxy	upstream roundtrip	{"upstream": "localhost:8011", "duration": 1.005228167, "request": {"remote_ip": "::1", "remote_port": "65352", "client_ip": "::1", "proto": "HTTP/1.1", "method": "GET", "host": "localhost:8010", "uri": "/ok-1s", "headers": {"X-Forwarded-Proto": ["http"], "X-Forwarded-Host": ["localhost:8010"], "Accept": ["*/*"], "Accept-Encoding": ["gzip, deflate"], "X-Forwarded-For": ["::1"], "User-Agent": ["python-httpx/0.23.1"]}}, "headers": {"Content-Type": ["text/html; charset=utf-8"], "Content-Length": ["2"], "Server": ["gunicorn"], "Date": ["Wed, 27 Sep 2023 05:38:29 GMT"]}, "status": 200}
2023/09/27 05:38:30.276	DEBUG	http.handlers.reverse_proxy	upstream roundtrip	{"upstream": "localhost:8011", "duration": 2.012581625, "request": {"remote_ip": "::1", "remote_port": "65351", "client_ip": "::1", "proto": "HTTP/1.1", "method": "GET", "host": "localhost:8010", "uri": "/ok-1s", "headers": {"Accept": ["*/*"], "Accept-Encoding": ["gzip, deflate"], "X-Forwarded-For": ["::1"], "User-Agent": ["python-httpx/0.23.1"], "X-Forwarded-Proto": ["http"], "X-Forwarded-Host": ["localhost:8010"]}}, "headers": {"Server": ["gunicorn"], "Date": ["Wed, 27 Sep 2023 05:38:30 GMT"], "Content-Type": ["text/html; charset=utf-8"], "Content-Length": ["2"]}, "status": 200}
2023/09/27 05:38:30.804	DEBUG	http.handlers.reverse_proxy	selected upstream	{"dial": "localhost:8011", "total_upstreams": 1}
2023/09/27 05:38:30.804	DEBUG	http.handlers.reverse_proxy	selected upstream	{"dial": "localhost:8011", "total_upstreams": 1}
2023/09/27 05:38:31.808	DEBUG	http.handlers.reverse_proxy	upstream roundtrip	{"upstream": "localhost:8011", "duration": 1.004579875, "request": {"remote_ip": "::1", "remote_port": "65357", "client_ip": "::1", "proto": "HTTP/1.1", "method": "GET", "host": "localhost:8010", "uri": "/ok-1s", "headers": {"User-Agent": ["python-httpx/0.23.1"], "X-Forwarded-Proto": ["http"], "X-Forwarded-Host": ["localhost:8010"], "Accept": ["*/*"], "Accept-Encoding": ["gzip, deflate"], "X-Forwarded-For": ["::1"]}}, "headers": {"Date": ["Wed, 27 Sep 2023 05:38:31 GMT"], "Content-Type": ["text/html; charset=utf-8"], "Content-Length": ["2"], "Server": ["gunicorn"]}, "status": 200}
2023/09/27 05:38:32.816	DEBUG	http.handlers.reverse_proxy	upstream roundtrip	{"upstream": "localhost:8011", "duration": 2.011665417, "request": {"remote_ip": "::1", "remote_port": "65358", "client_ip": "::1", "proto": "HTTP/1.1", "method": "GET", "host": "localhost:8010", "uri": "/ok-1s", "headers": {"User-Agent": ["python-httpx/0.23.1"], "X-Forwarded-Proto": ["http"], "X-Forwarded-Host": ["localhost:8010"], "Accept": ["*/*"], "Accept-Encoding": ["gzip, deflate"], "X-Forwarded-For": ["::1"]}}, "headers": {"Server": ["gunicorn"], "Date": ["Wed, 27 Sep 2023 05:38:32 GMT"], "Content-Type": ["text/html; charset=utf-8"], "Content-Length": ["2"]}, "status": 200}

@francislavoie
Member

You haven't answered my question though:

I don't understand why you would want to disable retries for GET; can you explain your use case?

That's key to understanding the motivation here.

As-is, this seems to be working as intended, but you're configuring Caddy in a way that doesn't make sense.

@b4stien
Author

b4stien commented Sep 27, 2023

I don't understand why you would want to disable retries for GET; can you explain your use case?

I want GET requests to be handled exactly the same as POST/PUT/DELETE/etc. I want them to be:

  • Retried before a connection could be established to an upstream;
  • Not retried after a connection has been established to an upstream.

I don't see why they should be handled differently than the rest of the methods.

Besides, in my test case, I do not understand the "glitchiness" of the process: why would it "sometimes" succeed in finding an upstream for both requests and why "sometimes" not? This looks like a bug to me (the rest is probably me misunderstanding the docs).

For the rest, it might be a misunderstanding of the docs on my part, but broadly: I don't understand what is retried by default, what is not, and what is disabled when setting "retry_match": [{"method": []}]. I don't see why you would ever disable retry at the "upstream selection" step, regardless of the method. The documentation agrees, stating "If the connection to the upstream failed, a retry is always allowed.".

@francislavoie
Member

francislavoie commented Sep 27, 2023

I don't see why they should be handled differently than the rest of the methods.

Because by the HTTP spec, GET is meant to be read-only and idempotent (which means safely retryable with the same result). They should not affect state on the server. If they do, then that's a bug in the application that needs to be fixed.

Besides, in my test case, I do not understand the "glitchiness" of the process: why would it "sometimes" succeed in finding an upstream for both requests and why "sometimes" not?

I think it's because of your max_requests: 1 configuration, which limits concurrent requests. If the first request is handled and responds before the second one comes in, then it'll succeed. If not and the second comes in while the first is still being handled, it'll fail.

Having such a restrictive limitation of 1 concurrent request seems very weird. Why would you ever want a server that can only handle one request at a time? That seems like an incredibly poor bottleneck. Browsers make multiple requests in parallel to load CSS/JS/image assets, so this would make it impossible to serve any kind of website for a single user, let alone multiple users.

I don't see why you would ever disable retry at the "upstream selection" step, regardless of the method. The documentation agrees, stating "If the connection to the upstream failed, a retry is always allowed.".

I'm still looking into this; I agree that seems probably wrong. I think we might have unintentionally broken this with a refactor a few versions ago (rework of the proxy loop).

@b4stien
Author

b4stien commented Sep 27, 2023

Because by the HTTP spec, GET is meant to be read-only and idempotent. They should not affect state on the server. If they do, then that's a bug in the application that needs to be fixed.

This is why I specified "before a connection could be established to an upstream". Before a connection could be established to an upstream, no request, regardless of its method, could alter the state on the server... because it hasn't reached it.

But to answer another possible angle of your question, which could be "why would you want GET requests not to be retried after they reached an upstream": some of them fail because of an upstream issue (think an external API that we're calling ourselves) and we don't want to blindly retry. In those scenarios our clients are in charge of the retry (we inform them with a "Retry-After" header).

I think it's because of your max_requests: 1 configuration, which limits concurrent requests. If the first request is handled and responds before the second one comes in, then it'll succeed. If not and the second comes in while the first is still being handled, it'll fail.

The first request will not be handled before the second comes in: in my test case the upstream server will sleep for 1s before answering (which gives plenty of time for the second request to come in).

Having such a restrictive limitation of 1 concurrent request seems very weird. Why would you ever want a server that can only handle one request at a time? That seems like an incredibly poor bottleneck. Browsers make multiple requests in parallel to load CSS/JS/image assets, so this would make it impossible to serve any kind of website for a single user, let alone multiple users.

That's a pretty broad question to answer but I'll try:

  • Not every server is in charge of "browser workload"; a significant proportion of them handle "machine to machine" scenarios, which are radically different (basically every "API server" out there).
  • In these scenarios, limiting concurrency is pretty standard, especially if your server has to make requests to a traditional database like PostgreSQL. These databases do not accept an arbitrary number of connections, so you have to set some limit somehow. Classic solutions are limiting incoming concurrency (what we're doing) or putting DB connections behind a threadpool, but everything is possible on this front.
  • Especially in our case: our web application is written in "synchronous Python", which has a specific stance regarding concurrency: no concurrency. We solve that by having multiple upstream servers (UNIX processes) per physical server (EC2 VMs); it works pretty well.

@francislavoie
Member

Could you give #5841 a try? I think it might do what you expect.

@b4stien
Author

b4stien commented Sep 28, 2023

It works! Thanks a lot for your work on this issue!

But re-reading this conversation and considering my test case, I think there are two issues:

  1. Requests are not retried when upstream selection fails; reverseproxy: Fix retries on "upstreams unavailable" error #5841 solves this!
  2. Concurrent upstream selection succeeds when it should not (because of max_requests: 1); this is not solved.

In my test case, the following Caddy logs should not happen: there is a single upstream with max_requests: 1, so this looks like a data race or bad coordination in the max_requests enforcement:

2023/09/27 05:38:17.571	DEBUG	http.handlers.reverse_proxy	selected upstream	{"dial": "localhost:8011", "total_upstreams": 1}
2023/09/27 05:38:17.571	DEBUG	http.handlers.reverse_proxy	selected upstream	{"dial": "localhost:8011", "total_upstreams": 1}
2023/09/27 05:38:18.579	DEBUG	http.handlers.reverse_proxy	upstream roundtrip	{"upstream": "localhost:8011", "duration": 1.007792542, "request": {"remote_ip": "::1", "remote_port": "65326", "client_ip": "::1", "proto": "HTTP/1.1", "method": "GET", "host": "localhost:8010", "uri": "/ok-1s", "headers": {"Accept": ["*/*"], "Accept-Encoding": ["gzip, deflate"], "X-Forwarded-Proto": ["http"], "X-Forwarded-Host": ["localhost:8010"], "X-Forwarded-For": ["::1"], "User-Agent": ["python-httpx/0.23.1"]}}, "headers": {"Content-Type": ["text/html; charset=utf-8"], "Content-Length": ["2"], "Server": ["gunicorn"], "Date": ["Wed, 27 Sep 2023 05:38:18 GMT"]}, "status": 200}
2023/09/27 05:38:19.583	DEBUG	http.handlers.reverse_proxy	upstream roundtrip	{"upstream": "localhost:8011", "duration": 2.011548916, "request": {"remote_ip": "::1", "remote_port": "65325", "client_ip": "::1", "proto": "HTTP/1.1", "method": "GET", "host": "localhost:8010", "uri": "/ok-1s", "headers": {"X-Forwarded-Host": ["localhost:8010"], "Accept": ["*/*"], "Accept-Encoding": ["gzip, deflate"], "X-Forwarded-For": ["::1"], "User-Agent": ["python-httpx/0.23.1"], "X-Forwarded-Proto": ["http"]}}, "headers": {"Date": ["Wed, 27 Sep 2023 05:38:19 GMT"], "Content-Type": ["text/html; charset=utf-8"], "Content-Length": ["2"], "Server": ["gunicorn"]}, "status": 200}

@b4stien
Author

b4stien commented Sep 28, 2023

If you'd rather have this conversation in a new issue I can open a new one.

@mholt
Member

mholt commented Sep 28, 2023

Oh, right -- this is indeed approximate currently -- it's not a data race because we use sync/atomic to access the count of requests, but we do read the count and then increment it after choosing it.

We could implement locking but I don't know the performance implications there.

Do you require an exact restriction? I'm not sure I've heard of that use case before.
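
Here's a minimal standalone demo of the approximation (assumed names, not our actual code): both goroutines can observe the count below the max before either one increments it.

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type upstream struct {
	numRequests int64 // read and written with sync/atomic
	maxRequests int64
}

// available does the check; the increment happens later, when the
// request is actually proxied, leaving a window for over-admission.
func (u *upstream) available() bool {
	return atomic.LoadInt64(&u.numRequests) < u.maxRequests
}

func (u *upstream) countRequest() {
	atomic.AddInt64(&u.numRequests, 1)
}

func main() {
	u := &upstream{maxRequests: 1}
	var admitted int64
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if u.available() { // both goroutines may pass this check...
				atomic.AddInt64(&admitted, 1)
				u.countRequest() // ...before either has incremented the count
			}
		}()
	}
	wg.Wait()
	fmt.Println("admitted:", admitted) // often 2, despite maxRequests = 1
}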

@b4stien
Author

b4stien commented Sep 28, 2023

Do you require an exact restriction?

Well, I have mixed feelings on this question. It's not the end of the world if the max request count is "best effort" and approximate, but:

  • I feel like if you're willing to limit the concurrent request count to your upstream you can afford a couple of milliseconds (1 or 2 at most, right?) lost in upstream selection.
  • Especially because it could be a net gain overall. In the test case above, if I had 2 upstreams instead of one, and if max_requests was enforced with a lock, both requests would be answered within 1s (the upstream time to answer) + 1ms (the locking penalty). Without the max_requests enforcement, if both requests end up on the same upstream, the first one will be answered in 1s whereas the second one will take 2s (1s waiting for the first one to finish + 1s for its own processing time).
    • This "if" is not a hypothetical, in prod in my infra we have 8 to 12 upstreams (independent UNIX process) per server.

Either way, it should be documented there.

Thanks a lot for your time on this issue.

@francislavoie
Member

francislavoie commented Sep 28, 2023

I feel like if you're willing to limit the concurrent request count to your upstream you can afford a couple of milliseconds (1 or 2 at most, right?) lost in upstream selection.

I'd argue that no, you wouldn't want any added latency if you set the number to something still reasonably high (like 100+). And that's the use case we intended with this; we didn't expect anyone to try it with very low amounts like 1 (and IMO I still think that's extremely weird).

Also, we increase the request count even if max requests isn't configured, because we have the least_conn load balancing policy (among others) that makes use of that information. So adding any kind of latency in general is a very bad idea (opt-in added latency is probably fine, though, with an option that specifically warns about it).

Since you're using Python, shouldn't you be using something like gunicorn to allow for multiple concurrent requests via a worker model? This seems like it would totally be a non-issue if you just used the right tools for the job. It looks like your example uses Flask, and their docs recommend using gunicorn as well: https://flask.palletsprojects.com/en/2.3.x/deploying/gunicorn/

@mholt
Member

mholt commented Sep 28, 2023

Well, I'm not entirely sure if it would add any noticeable latency. Locks don't tend to increase throughput, however...

I cannot think of a way to implement the lock right now in a way that would allow more than 1 request to be load-balanced at any given moment. At first I thought we could just lock within Available() so that if it would return true, then the counter would already be incremented while locked, rather than later when actually proxying. The problem is that some load balancing policies don't always use the first available upstream; they might add each one to a pool and then choose from that, so we would be incrementing the request counter even if it isn't proxied to.

If there's an elegant solution for this I'd be down to try it.

In the meantime we can document that it's approximate to maximize throughput.

@b4stien
Author

b4stien commented Sep 28, 2023

I'd argue that no, you wouldn't want any added latency if you set the number to something still reasonably high (like 100+).

I agree.

And that's the use case we intended with this; we didn't expect anyone to try it with very low amounts like 1 (and IMO I still think that's extremely weird).

For the first part, if you could document it, that would be super useful to your users; otherwise others, like me, will "hold it wrong". For the second part, our industry is pretty vast; what is "extremely weird" to you could be someone else's day job.

Also, we increase the request count even if max requests isn't configured, because we have the least_conn load balancing policy (among others) that makes use of that information. So adding any kind of latency in general is a very bad idea (opt-in added latency is probably fine, though, with an option that specifically warns about it).

Understood.

Since you're using Python, shouldn't you be using something like gunicorn to allow for multiple concurrent requests via a worker model? This seems like it would totally be a non-issue if you just used the right tools for the job. It looks like your example uses Flask, and their docs recommend using gunicorn as well: https://flask.palletsprojects.com/en/2.3.x/deploying/gunicorn/

We do use gunicorn, which is (almost) always living behind a reverse proxy (nginx, Apache, HAProxy, potentially Caddy?); you're forced to if you want TLS termination or protection against Slowloris attacks, for instance.

At our scale gunicorn comes with its tradeoffs, and we're moving away from the "gunicorn as a load balancer" style of deployment (but keeping the "gunicorn as a WSGI server" part). If you're interested, this paper explains some of the reasons why. I'm pretty sure they, at Clubhouse, searched pretty hard for the "right tool for the job", making it a "non-issue" ;).

Regardless, thank you for your work on #5841; I hope it'll make its way into the mainline, as it solves an important issue.

@francislavoie
Member

Most people who use Python with Caddy tend to be using something like gunicorn, from what we've seen. So yes, that would be the way to go, IMO.

Trying to configure Caddy to do gunicorn's job via max_requests and such seems like a bad idea.

Regardless, thank you for your work on #5841; I hope it'll make its way into the mainline, as it solves an important issue.

👍

@mholt
Member

mholt commented Sep 28, 2023

To be clear, I want the count to be exact. 💯 ... if we can find a good way to do it.

@b4stien
Author

b4stien commented Sep 29, 2023

To be clear, I want the count to be exact. 💯 ... if we can find a good way to do it.

As I'm not familiar with Caddy's source code (yet), take the following with a huge grain of salt. It's only "food for thought" at this stage.

I guess the "ongoing requests count for a given upstream" is increased after a successful ack from an upstream (and decreased after the response has been handed out to caddy). It would explain why max_requests is approximate.

I think you could introduce another count, representing the "possibly incoming requests count for a given upstream", which would be increased when selecting the upstream and decreased when the connection with the upstream is settled (whether it's a success or a failure doesn't matter).

When given a max_requests setting, the "increase" step for this new count could lock -> compare "ongoing requests count for a given upstream" + "possibly incoming requests count for a given upstream" to max_requests -> validate this upstream and increase the count if it's a success or go back to upstream selection otherwise -> unlock.
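
A rough sketch of what I have in mind (illustrative only, I haven't checked this against Caddy's actual code):

package reservesketch

import "sync"

// An upstream tracks both in-flight requests and "reservations" taken
// at selection time, so that in-flight + reserved never exceeds
// max_requests.
type upstream struct {
	mu          sync.Mutex
	inFlight    int // requests currently being proxied
	reserved    int // selections whose connection hasn't settled yet
	maxRequests int
}

// tryReserve is called during upstream selection.
func (u *upstream) tryReserve() bool {
	u.mu.Lock()
	defer u.mu.Unlock()
	if u.maxRequests > 0 && u.inFlight+u.reserved >= u.maxRequests {
		return false // over the limit: go back to upstream selection
	}
	u.reserved++
	return true
}

// settle is called once the connection attempt succeeds or fails.
func (u *upstream) settle(connected bool) {
	u.mu.Lock()
	defer u.mu.Unlock()
	u.reserved--
	if connected {
		u.inFlight++
	}
}

// done is called after the response has been handed back to the client.
func (u *upstream) done() {
	u.mu.Lock()
	u.inFlight--
	u.mu.Unlock()
}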

@mholt
Member

mholt commented Sep 30, 2023

Yeah, I actually was thinking something similar, but I think it still gives us an approximation.

I'll think on it more. But I might be busy for a few weeks with a baby on the way.
