
Fix high memory usage in retry middleware #2740

Merged: 3 commits into traefik:master on Jan 26, 2018

Conversation

m3co-code (Contributor)

This PR reduces the amount of RAM Traefik needs to transfer the response body to the client when the retry middleware is enabled. This is accomplished by rebuilding the response recorder of the retry middleware. The principle is now quite simple: the new response recorder no longer holds a temporary buffer, but delegates calls directly to the original response writer whenever the response should not be retried. This means the only "overhead" left is the 32KB buffer the standard library uses in its Write calls on the HTTP connection.
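
To make the delegation idea concrete, here is a minimal sketch in Go of such a pass-through writer, under illustrative field names (this is not the verbatim code from the PR):

package middlewares

import "net/http"

// retryResponseWriter streams directly to the client instead of recording the
// body. shouldRetry reports whether the current attempt may still be retried.
type retryResponseWriter struct {
	responseWriter http.ResponseWriter
	shouldRetry    func() bool
}

func (rr *retryResponseWriter) Header() http.Header { return rr.responseWriter.Header() }

func (rr *retryResponseWriter) Write(buf []byte) (int, error) {
	if rr.shouldRetry() {
		// A retryable failure: drop this body instead of buffering it.
		return len(buf), nil
	}
	// No temporary buffer; the only remaining copy is the ~32KB chunk the
	// standard library uses internally when writing to the connection.
	return rr.responseWriter.Write(buf)
}

func (rr *retryResponseWriter) WriteHeader(code int) {
	if rr.shouldRetry() {
		return
	}
	rr.responseWriter.WriteHeader(code)
}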

As the former response recorder was also used by the error handler, I moved it there. I think the error handler is not that critical when it comes to response sizes, e.g. no files should be transferred through it, but there is still room for improvement in the future.

To get something started, I also created a very basic integration test for retries. It could be extended in the future with other cases like WebSockets, but I didn't want to make this PR too huge and time is sparse :-)

Below are two snapshots from the pprof heap utility. In both cases my setup is the same: I am proxying through Traefik to a service that delivers a 1.5 GB file. At the moment when about 1.0 GB had been downloaded, I called the /heap endpoint to see the difference. Please note the total values: the old version holds 235.86MB, the new one only 2.06MB!
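
For reference, the heap profiles below come from Go's standard net/http/pprof endpoint; a minimal sketch of how such an endpoint is typically exposed (the generic pattern, not necessarily Traefik's exact wiring; the port 9090 is assumed to match the URLs below):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// The heap profile can then be inspected with:
	//   go tool pprof http://localhost:9090/debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:9090", nil))
}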

old version

Fetching profile over HTTP from http://localhost:9090/debug/pprof/heap
Saved profile in /home/marco/pprof/pprof.traefik.alloc_objects.alloc_space.inuse_objects.inuse_space.012.pb.gz
File: traefik
Type: inuse_space
Time: Jan 23, 2018 at 11:51am (CET)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 10
Showing nodes accounting for 234.36MB, 99.36% of 235.86MB total
Dropped 28 nodes (cum <= 1.18MB)
Showing top 10 nodes out of 27
      flat  flat%   sum%        cum   cum%
  234.36MB 99.36% 99.36%   234.36MB 99.36%  bytes.makeSlice /usr/local/go/src/bytes/buffer.go
         0     0% 99.36%   234.36MB 99.36%  bytes.(*Buffer).Write /usr/local/go/src/bytes/buffer.go
         0     0% 99.36%   234.36MB 99.36%  bytes.(*Buffer).grow /usr/local/go/src/bytes/buffer.go
         0     0% 99.36%   234.36MB 99.36%  github.com/containous/traefik/middlewares.(*EmptyBackendHandler).ServeHTTP /home/marco/go/src/github.com/containous/traefik/middlewares/empty_backend_handler.go
         0     0% 99.36%   234.36MB 99.36%  github.com/containous/traefik/middlewares.(*HandlerSwitcher).ServeHTTP /home/marco/go/src/github.com/containous/traefik/middlewares/handlerSwitcher.go
         0     0% 99.36%   234.36MB 99.36%  github.com/containous/traefik/middlewares.(*Retry).ServeHTTP /home/marco/go/src/github.com/containous/traefik/middlewares/retry.go
         0     0% 99.36%   234.36MB 99.36%  github.com/containous/traefik/middlewares.(*StripPrefix).ServeHTTP /home/marco/go/src/github.com/containous/traefik/middlewares/stripPrefix.go
         0     0% 99.36%   234.36MB 99.36%  github.com/containous/traefik/middlewares.(*StripPrefix).serveRequest /home/marco/go/src/github.com/containous/traefik/middlewares/stripPrefix.go
         0     0% 99.36%   234.36MB 99.36%  github.com/containous/traefik/middlewares.(*retryResponseRecorderWithoutCloseNotify).Write /home/marco/go/src/github.com/containous/traefik/middlewares/retry.go
         0     0% 99.36%   234.36MB 99.36%  github.com/containous/traefik/middlewares/accesslog.(*SaveBackend).ServeHTTP /home/marco/go/src/github.com/containous/traefik/middlewares/accesslog/save_backend.go

new version

Fetching profile over HTTP from http://localhost:9090/debug/pprof/heap
Saved profile in /home/marco/pprof/pprof.traefik.alloc_objects.alloc_space.inuse_objects.inuse_space.011.pb.gz
File: traefik
Type: inuse_space
Time: Jan 23, 2018 at 11:49am (CET)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 2065.01kB, 100% of 2065.01kB total
Showing top 10 nodes out of 29
      flat  flat%   sum%        cum   cum%
 1536.84kB 74.42% 74.42%  1536.84kB 74.42%  reflect.mapassign /usr/local/go/src/runtime/hashmap.go
  528.17kB 25.58%   100%   528.17kB 25.58%  regexp.(*bitState).reset /usr/local/go/src/regexp/backtrack.go
         0     0%   100%  1536.84kB 74.42%  encoding/json.(*decodeState).object /usr/local/go/src/encoding/json/decode.go
         0     0%   100%  1536.84kB 74.42%  encoding/json.(*decodeState).unmarshal /usr/local/go/src/encoding/json/decode.go
         0     0%   100%  1536.84kB 74.42%  encoding/json.(*decodeState).value /usr/local/go/src/encoding/json/decode.go
         0     0%   100%  1536.84kB 74.42%  encoding/json.Unmarshal /usr/local/go/src/encoding/json/decode.go
         0     0%   100%  1536.84kB 74.42%  github.com/containous/traefik/configuration.init <autogenerated>
         0     0%   100%  1536.84kB 74.42%  github.com/containous/traefik/provider/kubernetes.init <autogenerated>
         0     0%   100%   528.17kB 25.58%  github.com/containous/traefik/vendor/github.com/containous/mux.(*Route).Match /home/marco/go/src/github.com/containous/traefik/vendor/github.com/containous/mux/route.go
         0     0%   100%   528.17kB 25.58%  github.com/containous/traefik/vendor/github.com/containous/mux.(*Router).Match /home/marco/go/src/github.com/containous/traefik/vendor/github.com/containous/mux/mux.go

@timoreimann (Contributor) left a comment

We have reviewed this internally already, so I'm able to give my LGTM right away.

@errm self-requested a review on January 23, 2018 22:06
@ldez (Contributor) commented Jan 24, 2018

@marco-jantke could you rebase? 🎢

@errm (Contributor) left a comment

LGTM, this is a very neat solution!

@m3co-code force-pushed the fix-high-memory-usage-in-retry branch from 1be1406 to 999eaff on January 25, 2018 08:01
@m3co-code (Contributor, Author)

@ldez done.

@ldez added the kind/bug/fix label on Jan 25, 2018
@@ -114,107 +100,69 @@ func (l RetryListeners) Retried(req *http.Request, attempt int) {
 	}
 }

-type retryResponseRecorder interface {
+type retryResponseWriter interface {
Contributor

Why rename retryResponseRecorder to retryResponseWriter and create an errorPagesResponseRecorder?

m3co-code (Contributor, Author)

I renamed it because the current implementation does not record anything anymore. It writes directly through to the original response writer when the request should not be retried, so it is no longer a recorder.

Regarding the errorPagesResponseRecorder:

As the former response recorder was also used by the error handler, I moved it there. I think the error handler is not that critical when it comes to response sizes, e.g. no files should be transferred through it, but there is still room for improvement in the future.

It's basically that the retryResponseRecorder was re-used in the error pages middleware and I didn't want to make changes there. As the error pages middleware is the last place where the original retryResponseRecorder is used, I renamed it accordingly.
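
For contrast, a short sketch of the buffering approach that stays behind for the error pages middleware (illustrative, not the verbatim source): the entire body is held in memory before anything reaches the client, which is exactly the behaviour that caused the high memory usage while it sat in the retry path.

package middlewares

import (
	"bytes"
	"net/http"
)

// errorPagesResponseRecorder keeps the whole response in memory so the error
// pages middleware can later decide whether to forward it or replace it.
type errorPagesResponseRecorder struct {
	responseWriter http.ResponseWriter
	body           bytes.Buffer
	code           int
}

func (r *errorPagesResponseRecorder) Header() http.Header { return r.responseWriter.Header() }

func (r *errorPagesResponseRecorder) Write(p []byte) (int, error) {
	// Every byte is buffered; for large bodies this is where the memory goes.
	return r.body.Write(p)
}

func (r *errorPagesResponseRecorder) WriteHeader(code int) { r.code = code }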

@iahmedov

@marco-jantke This looks great.

Just one question: have you tested the case where you are downloading, let's say, a 2 GB file from an endpoint and, when the download reaches 1 GB, you restart the endpoint? Is it going to continue correctly (without curl/wget retry options)?

The current version of the retry logic with the cache handled this correctly, because it caches all the data and only transmits it once Traefik has received everything from the endpoint. But here it looks like the data is streamed (maybe I am missing something, I haven't checked it out and tested it), and when the endpoint restarts once you reach 1 GB, the next retry transmits the 2 GB again. Since rr.responseWriter is not closed, in total you may receive 3 GB of data when it was only 2 GB.

m3co-code (Contributor, Author) commented Jan 26, 2018

@iahmedov very good point. I have to say I am utterly surprised, but the case you describe works exactly as expected!

What did I do? I have a simple file configuration with one frontend and two backends, which in turn are nginx servers that just serve files. I start both backends and Traefik and then request a file of about 1.5 GB through Traefik in the browser. The request goes to server A, and when the download is at 1 GB I kill server A hard. You can see a short pause in the download in the browser, and then it resumes at the proper location. I also verified that the result is not corrupted and has the correct size.

I am not sure what makes this work so well, but it should be some primitive of HTTP (chunked transfer, maybe?) or TCP(?).

@m3co-code (Contributor, Author)

After digging deeper and taking the debug logs apart, I figured out that this actually works thanks to chunked transfer encoding: the client (in my case Chromium) and the backend server (nginx) initiate the chunked transfer properly and behave correctly. To gather more knowledge, can you verify whether this approach also works in your scenario?

But knowing that there is a way to resume downloads with plain HTTP, without extra logic in the reverse proxy, I'd argue we should keep the proxy as simple as possible.

@iahmedov

I guess Traefik should define some kind of rules that it follows when retrying:

  • a retry should never happen once data has started streaming to the downstream client.
  • retry if Traefik could not connect to the endpoint.
  • retry if a 5xx error happened (this should be configurable; maybe someone wrote an API where even 5xx errors update database state, in which case a retry could update the data even more, think of a simple counter).
  • retry if requests are idempotent (plain GET requests that don't change server state; this should also be configurable to switch on or off).

Reason:

  • Data could be updated by a request; if you try again, you may update it multiple times.
  • As soon as a single byte has been received by the downstream client that requested data through Traefik, that client must handle backend termination by itself. Traefik should not handle this case, because it doesn't know how the backend works or what its properties are.

I think that issuing a retry after a single byte has already been streamed to the downstream client is wrong in itself: you never know how the backend behaves for a retried request that it had handled correctly but failed to deliver to you because of a network error.

@m3co-code (Contributor, Author)

a retry should never happen once data has started streaming to the downstream client.

This is the case. Retries only happen when a network error occurs during the initial connection establishment.

retry if Traefik could not connect to the endpoint.

As above.

retry if a 5xx error happened (this should be configurable; maybe someone wrote an API where even 5xx errors update database state, in which case a retry could update the data even more, think of a simple counter).

We had a discussion about this in the past and concluded that it is not safe to retry a request for which the backend has already delivered a 5xx header; the state in the backend server could already have been partially changed. If you think this is a valuable feature, though, please open a separate feature request where we can discuss it.

retry if requests are idempotent (plain GET requests that don't change server state; this should also be configurable to switch on or off).

Basically the same as above. You don't know in which environments Traefik will be used and whether everyone implements HTTP as intended; basically, you can't guarantee that applications change no state on a GET.
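
A hedged, self-contained sketch of the retry rule described above: an attempt is retried only when the proxy flagged a connection-level network error, which by definition happens before any response data was streamed to the client. Names such as attemptWriter and serveWithRetries are illustrative, not Traefik's API.

package middlewares

import (
	"context"
	"net/http"
)

type netErrKey struct{}

// attemptWriter drops the error body of a retryable failed attempt and streams
// everything else directly to the client.
type attemptWriter struct {
	http.ResponseWriter
	netErr   *bool
	mayRetry bool
}

func (a *attemptWriter) Write(p []byte) (int, error) {
	if *a.netErr && a.mayRetry {
		return len(p), nil // drop the error body; the next attempt writes the real response
	}
	return a.ResponseWriter.Write(p)
}

func (a *attemptWriter) WriteHeader(code int) {
	if *a.netErr && a.mayRetry {
		return
	}
	a.ResponseWriter.WriteHeader(code)
}

func serveWithRetries(rw http.ResponseWriter, req *http.Request, next http.Handler, maxAttempts int) {
	for attempt := 1; ; attempt++ {
		netErrorOccurred := false
		// The proxy's error handler is expected to flip this flag (reachable via
		// the request context) when the backend could not be reached at all.
		ctx := context.WithValue(req.Context(), netErrKey{}, &netErrorOccurred)

		next.ServeHTTP(&attemptWriter{
			ResponseWriter: rw,
			netErr:         &netErrorOccurred,
			mayRetry:       attempt < maxAttempts,
		}, req.WithContext(ctx))

		if !netErrorOccurred || attempt >= maxAttempts {
			// The backend answered, or attempts are exhausted: never retry once
			// data may have reached the client.
			return
		}
	}
}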

@ldez (Contributor) left a comment

LGTM 👏

The retry middleware will be refactored in the next commit so that it no longer needs the original retryResponseRecorder. The retryResponseRecorder, however, was also used for the error pages, so this commit moves it to the error pages file and renames it accordingly.
@traefiker force-pushed the fix-high-memory-usage-in-retry branch from 999eaff to c8cbacd on January 26, 2018 17:04
@traefiker merged commit ef4aa20 into traefik:master on Jan 26, 2018
@m3co-code deleted the fix-high-memory-usage-in-retry branch on January 26, 2018 22:13