Fix high memory usage in retry middleware #2740
Conversation
We have reviewed this internally already, so I'm able to give my LGTM right away.
@marco-jantke could you rebase? 🎢
LGTM, this is a very neat solution!
force-pushed from 1be1406 to 999eaff
@ldez done.
@@ -114,107 +100,69 @@ func (l RetryListeners) Retried(req *http.Request, attempt int) {
 	}
 }

-type retryResponseRecorder interface {
+type retryResponseWriter interface {
Why rename retryResponseRecorder to retryResponseWriter and make an errorPagesResponseRecorder?
I renamed it because the current implementation does not record anything anymore. It writes directly through to the original response writer in case the request should not be retried, so it is no longer a recorder.

Regarding the errorPagesResponseRecorder: as the former response recorder was also used by the error handler, I moved it there. I think the error handler is not that critical when it comes to response sizes, e.g. no files will be transferred through it I guess, but there is also room for improvement in the future.

Basically, the retryResponseRecorder was re-used in the error pages middleware and I didn't want to make changes there. As the error pages middleware is the last place where the original retryResponseRecorder is used, I renamed it accordingly.
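For readers following along, here is a minimal sketch of the write-through idea; the names and the retry bookkeeping are assumptions for illustration, not the exact code from this PR:

```go
// A minimal write-through response writer (illustrative sketch, assumed names).
package middleware

import "net/http"

type retryResponseWriter struct {
	rw          http.ResponseWriter
	shouldRetry bool        // true while this attempt may still be retried
	trash       http.Header // header sink used while a retry is still possible
}

func newRetryResponseWriter(rw http.ResponseWriter, shouldRetry bool) *retryResponseWriter {
	return &retryResponseWriter{rw: rw, shouldRetry: shouldRetry, trash: make(http.Header)}
}

// DisableRetries marks the attempt as final, e.g. once the backend connection
// was established successfully and the response is about to be forwarded.
func (r *retryResponseWriter) DisableRetries() {
	r.shouldRetry = false
}

func (r *retryResponseWriter) Header() http.Header {
	if r.shouldRetry {
		return r.trash // discard headers of an attempt that will be retried
	}
	return r.rw.Header()
}

func (r *retryResponseWriter) WriteHeader(code int) {
	if r.shouldRetry {
		return
	}
	r.rw.WriteHeader(code)
}

func (r *retryResponseWriter) Write(p []byte) (int, error) {
	if r.shouldRetry {
		// Nothing is recorded anymore; output of a retried attempt is dropped.
		return len(p), nil
	}
	// Delegate directly to the original writer; no temporary buffer involved.
	return r.rw.Write(p)
}
```

The point is that nothing is buffered: once an attempt is known to be final, every Write goes straight through to the underlying connection.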
@marco-jantke This looks great. Just one question: have you tested the case where you download, let's say, a 2 GB file from an endpoint? The current version of the retry logic with caching handled this correctly because it caches all the data and only transmits it once Traefik has received everything from the endpoint. But here it looks like the data is streamed (maybe I am missing something, I haven't checked it out and tested it), and if the endpoint restarts when you reach 1 GB, the next retry transmits the 2 GB again, since …
@iahmedov Very good point. I have to say I am utterly surprised, but the case you describe works exactly as expected! What did I do? I have a simple file configuration with one frontend and two backends, which in turn are nginx servers that just serve files. I start both backends and Traefik and then request a file of about 1.5 GB through Traefik in the browser. It hits server A, and when the download reaches 1 GB I kill server A hard. You can see a short pause of the download in the browser and then it resumes at the proper location. I also verified that the result is not corrupted and has the correct size. I am not sure what makes this work so well, but it should be some primitive of HTTP (chunked transfer maybe?) or TCP(?).
After digging deeper and taking the debug logs apart, I figured out that this actually works due to chunked transfer encoding: the client (in my case Chromium) and the backend server (nginx) initiate the chunked transfer properly and behave correctly. To gather more knowledge, can you verify whether this approach also works in your scenario? Knowing that there is a way to resume downloads with plain HTTP, without extra logic in the reverse proxy, I'd argue we should keep the proxy as simple as possible.
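As a side note, this behaviour is easy to reproduce outside Traefik. The snippet below is a standalone illustration (not part of this PR): Go's net/http switches to chunked transfer encoding automatically when a handler streams a body without setting Content-Length.

```go
// Standalone illustration: streaming handler that is served with
// Transfer-Encoding: chunked because no Content-Length is set.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	http.HandleFunc("/stream", func(w http.ResponseWriter, req *http.Request) {
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		// No Content-Length is set, so the response goes out chunked.
		for i := 0; i < 5; i++ {
			fmt.Fprintf(w, "chunk %d\n", i)
			flusher.Flush() // each flush is sent as one or more HTTP chunks
			time.Sleep(500 * time.Millisecond)
		}
	})
	_ = http.ListenAndServe(":8080", nil)
}
```

Requesting it with `curl -v http://localhost:8080/stream` shows `Transfer-Encoding: chunked` in the response headers.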
I guess Traefik should define some kind of rules that it follows when retrying:
Reason:
I think that if a retry is issued after a single byte has already been streamed downstream, then something is wrong already; you never know how the backend behaves for a retried request when it handled the original request correctly but failed to deliver the response to you because of a network error.
This is the case. Retries only happen when a network error occurred during the initial connection establishment.

As above.

We had a discussion about that in the past and concluded it's not safe to retry something that already delivered a 5xx header. In the end, the state in the backend server could already have been partially changed. In case you think this is a valuable feature, though, please open a separate feature request where we can discuss it.

Basically the same as above. You don't know in which environments Traefik will be used and whether everyone implements HTTP as suggested. Basically, you can't guarantee that apps change no state on a GET.
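A rough sketch of that rule, with assumed names (this is not the actual Traefik implementation): an attempt only qualifies for a retry if a network error occurred and no byte of the response has been forwarded downstream yet.

```go
// Rough sketch of the retry rule discussed above; names are assumptions.
package middleware

import "net"

type retryAttempt struct {
	netErr       bool // a network error occurred for this attempt
	bytesWritten int  // bytes already forwarded downstream
}

// recordError marks the attempt as retryable only for genuine network errors,
// e.g. a failure while establishing the connection to the backend.
func (a *retryAttempt) recordError(err error) {
	if _, ok := err.(net.Error); ok {
		a.netErr = true
	}
}

// shouldRetry applies the rule: retry only if a network error happened, nothing
// has been streamed to the client yet, and the attempt budget is not exhausted.
func (a *retryAttempt) shouldRetry(attempt, maxAttempts int) bool {
	return a.netErr && a.bytesWritten == 0 && attempt < maxAttempts
}
```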
LGTM 👏
The retry middleware will be refactored in the next commit so that it doesn't need the original retryResponseRecorder anymore. The retryResponseRecorder, however, was also used for the error pages, so in this commit I am moving it to that file and renaming it accordingly.
force-pushed from 999eaff to c8cbacd
This PR reduces the amount of RAM Traefik needs to transfer the response body to the client when the retry middleware is enabled. This is accomplished by rebuilding the response recorder for the retry middleware. The principle is now quite simple: the new response recorder no longer has a temporary buffer, but directly delegates the calls to the original response writer in case the response should not be retried. This means that the only "overhead" left is the 32 KB buffer the standard library uses in its Write calls on the HTTP connection.

As the former response recorder was also used by the error handler, I moved it there. I think the error handler is not that critical when it comes to response sizes, e.g. no files will be transferred through it I guess, but there is also room for improvement in the future.
To also get something started, I created a very basic integration test for retries. It could be extended in the future with other things like websockets, but I didn't want to make this PR too huge and time is sparse :-)
Below are two snapshots from the pprof heap utility. In both cases my setup looks the same: I am proxying through Traefik to a service that delivers a 1.5 GB file. At the moment when about 1.0 GB has been downloaded, I called the /heap endpoint to see the difference. Please note the total values: the old version holds 235.86 MB, the new one only 2.06 MB!
old version
new version
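For anyone who wants to reproduce such measurements, heap snapshots like these can be taken with Go's standard pprof tooling; the snippet below is generic Go, not Traefik-specific.

```go
// Generic Go pprof setup: expose the profiling endpoints so a heap snapshot
// can be taken while a large transfer is in progress.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

While the transfer is running, `go tool pprof http://localhost:6060/debug/pprof/heap` fetches and inspects the heap profile.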