Reuse connections for S3 Get requests #247

Open
paulmach opened this issue Oct 15, 2014 · 11 comments

@paulmach

For each bucket.Get() a new http.Client is created (https://github.com/crowdmob/goamz/blob/94ebb8df2d498469ee950fbc4e2ab3c901dea007/s3/s3.go#L1023). Also, the request is explicitly told to close when finished (https://github.com/crowdmob/goamz/blob/94ebb8df2d498469ee950fbc4e2ab3c901dea007/s3/s3.go#L1004). So each request requires a new TCP connection, which creates overhead as well as leaving many sockets in the TIME_WAIT state if you're making a lot of requests.

I've "hacked in a fix" where the http.Client is only created once for each Bucket object. and then the req.Close is set to false. This fixes the issue and it does reuse connections properly. However, I'm not totally sure how to fix this in the context of this library, specifically the timeouts would not be fixed.

This may apply to all S3 requests, but I'm currently only interested in downloading lots of data fast. :) My use case is downloading small (10-50k) files from S3 to an ec2 instance. At rates above 200 qps I get lots of sockets in the TIME_WAIT state and I can't create new connections until they close out. This fix allowed me to max out at 400 qps (up from 300), with the code now CPU bound, presumably on the logic of processing the data.
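
To illustrate, here's a minimal before/after sketch (not the actual goamz code; fetchOnce and fetchReused are made-up names):

```go
package main

import (
	"io/ioutil"
	"net/http"
)

// Before: roughly what the library does today. A fresh client per call
// plus req.Close = true forces a new TCP connection for every request
// and leaves the old socket behind in TIME_WAIT.
func fetchOnce(url string) ([]byte, error) {
	client := &http.Client{} // new client (and transport) per request
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	req.Close = true // close the TCP connection after this response
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return ioutil.ReadAll(resp.Body)
}

// After: one shared client. As long as the body is read to EOF and
// closed, the connection returns to the transport's idle pool and is
// reused by the next request.
var sharedClient = &http.Client{}

func fetchReused(url string) ([]byte, error) {
	resp, err := sharedClient.Get(url) // req.Close defaults to false
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return ioutil.ReadAll(resp.Body)
}
```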

@paulmach (Author)

For what it's worth, this is the commit I used to fix it. There is an option to turn on ReuseHTTPClient. I also modified my use of the library so I only update the timeouts at startup, before the first request.

strava@0013fca

Note that the connection will close based on the ReadTimeout, which seems to be 2 seconds if left at 0. The next request made to the client after this timeout will fail and need to be retried. Still working on a fix for this.
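
As far as I can tell, the mechanism looks roughly like this (a sketch, not the actual library code): the read deadline is set once at dial time, so it caps the lifetime of the whole connection rather than any individual read.

```go
package main

import (
	"net"
	"net/http"
	"time"
)

// Sketch: a transport whose Dial sets the read deadline once, at
// connect time. Once readTimeout elapses, every Read on this
// connection fails with an i/o timeout, even if the connection has
// been sitting healthy and idle in the pool.
func newTransport(connectTimeout, readTimeout time.Duration) *http.Transport {
	return &http.Transport{
		Dial: func(network, addr string) (net.Conn, error) {
			c, err := net.DialTimeout(network, addr, connectTimeout)
			if err != nil {
				return nil, err
			}
			if readTimeout > 0 {
				// The deadline is absolute; later reads and later
				// requests do not push it forward.
				c.SetReadDeadline(time.Now().Add(readTimeout))
			}
			return c, nil
		},
	}
}
```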

@alimoeeny (Contributor)

@paulmach thanks,
If I remember correctly, there have been quite a few attempts to deal with the issues you are working on.
I am no expert in this matter, but I have a question. Is it possible for the http client (or the connection, to be less specific) to get into some intermediate state (due to network issues, S3 server-side glitches, ...) where it just waits (or is blocked in some state) for a long time? What will happen if a single client is being reused? If this is possible, wouldn't it be safer to create new clients for each operation? Or is it possible to have a pool of connections (maybe accessed through a channel)?
Have you measured how much overhead this recreation of the clients adds?
In any case, please send your PR, and I am sure others who are better positioned to comment on this matter and have already worked on this issue will also chime in.
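
Something like this is what I have in mind for the channel idea (just a sketch with made-up names, and possibly unnecessary, since a single http.Client is documented as safe for concurrent use):

```go
package main

import "net/http"

// clientPool hands out clients through a buffered channel. get blocks
// when all clients are checked out, so a client stuck on a bad
// connection only stalls its current caller, not everyone.
type clientPool chan *http.Client

func newClientPool(n int) clientPool {
	p := make(clientPool, n)
	for i := 0; i < n; i++ {
		p <- &http.Client{}
	}
	return p
}

func (p clientPool) get() *http.Client  { return <-p }
func (p clientPool) put(c *http.Client) { p <- c }
```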

@paulmach (Author)

I just created a PR, let the discussion start :) https://github.com/crowdmob/goamz/pull/248

It looks like this issue has been attacked before (https://github.com/crowdmob/goamz/pull/171, https://github.com/crowdmob/goamz/issues/108) but the changes were reverted. I don't know why.

For starters, the golang net/http documentation states "Clients and Transports are safe for concurrent use by multiple goroutines and for efficiency should only be created once and re-used."

In my experience, having used this library for a long time, s3 can have many issues. You can end up waiting forever, get odd errors, etc., so I've been setting low timeouts and implementing retries in my code. I think this change will only make that better, as problems during connection creation will happen less often since there will be fewer connections created overall.

The net/http Client maintains its own connection pool, keeping up to Transport.MaxIdleConnsPerHost connections in the pool at any time (default is 2). So every time you make a request, it checks the pool; if there isn't an idle connection, it makes a new one. When you're done, if the connection wasn't closed, it puts it back, unless there are already too many waiting, in which case it just closes it.
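
If one client is shared across a bursty workload like mine, that idle limit probably needs raising too. A sketch (50 is just an example value):

```go
package main

import "net/http"

// One transport/client for the whole process, with a larger idle pool
// so concurrent S3 requests can all find a reusable connection.
var transport = &http.Transport{
	MaxIdleConnsPerHost: 50, // http.DefaultMaxIdleConnsPerHost is only 2
}

var client = &http.Client{Transport: transport}
```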

The problem you can run into is that you get a connection from the pool that originally had a ReadTimeout of 60 seconds, but now it only has 100ms left (it's been 59.9 seconds). During your 200ms request to s3 the connection will close and an i/o timeout error will bubble up to the library user. However, if it's been 61 seconds since the connection was created, the http.Client will create a new connection.

Anyway, I've been digging into this quite a bit over the last few days and that's the way I understand it.

@alimoeeny (Contributor)

Let's make sure people who have been involved in those conversations are mentioned so they can contribute. I have @qpingu, @bmatsuo, @nabeken, @moorage. Who else?

@alimoeeny (Contributor)

Maybe, among more recent contributors to /s3, also @cfstras, @jtblin, @drombosky, @dylanmei, @jigish. Sorry if I missed others.

@paulmach (Author)

So I've been running this code over the weekend. First, for bulk usage, this has fixed my socket issues. I'm doing 1000 qps with my script using this branch, no problem. Before, I would run out of local TCP ports. So connection reuse is working.

For my other workload, which is probably more typical (a web request requires 5ish concurrent 10kb gets from s3), the results are mixed. Connection errors are down, obviously, since I'm making fewer connections. I still get a few errors, but they seem to fail fast and my retry code handles them fine (the old/current version of this lib was getting these too; probably an s3 thing).

The bigger issue is that sometimes the read can take 2+ seconds versus the average 100-200 milliseconds to complete the get. I think this is due to the ReadTimeout I'm setting. I'm setting it to 15 minutes, so the s3 connections live that long. However, when making a "bad" s3 request there is virtually no timeout. Previously I had the ReadTimeout set to 1 second and my code would retry, bailing on these bad requests.

I'm not sure how to get a ReadTimeout, a ConnectTimeout, and a ConnectionTimeout all working together. The net library docs talk about resetting the raw connection's deadline after each request, but that would require messing with net/http, I think.
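
One possibility would be a conn wrapper that pushes the read deadline forward on every Read, wired in through the transport's Dial. An untested sketch with made-up names:

```go
package main

import (
	"net"
	"net/http"
	"time"
)

// timeoutConn pushes its read deadline forward on every Read, so a
// healthy pooled connection never expires mid-request; only an actual
// stall longer than readTimeout trips the deadline.
type timeoutConn struct {
	net.Conn
	readTimeout time.Duration
}

func (c *timeoutConn) Read(b []byte) (int, error) {
	c.Conn.SetReadDeadline(time.Now().Add(c.readTimeout))
	return c.Conn.Read(b)
}

func newDeadlineTransport(connectTimeout, readTimeout time.Duration) *http.Transport {
	return &http.Transport{
		Dial: func(network, addr string) (net.Conn, error) {
			c, err := net.DialTimeout(network, addr, connectTimeout)
			if err != nil {
				return nil, err
			}
			return &timeoutConn{Conn: c, readTimeout: readTimeout}, nil
		},
	}
}
```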

I'm going to implement a timeout in my code to handle this, but I'm not sure how to deal with this issue in the context of this library :(

@alimoeeny (Contributor)

Thanks for sharing your results. I wish I had something to say to help, but I don't. Please keep up the good work.

@meson10 commented Oct 30, 2014

@paulmach So, at this moment, what's the eventual solution for this?

@alimoeeny (Contributor)

I don't know, but I suggest having a look at https://github.com/goamz/goamz/blob/master/aws/client.go (this is not crowdmob/goamz, it is goamz/goamz).

@meson10 commented Oct 31, 2014

@alimoeeny Thanks for the pointer. But I think I am missing something here.
Is there anything I can do to not face the "net/http: TLS handshake timeout" error?
I tried using crowdmob/goamz instead of goamz/goamz and then I ran into "read tcp 54.231.244.8:443: connection reset by peer".
I suspect it's something I am doing incorrectly on my end, else others would have faced the same issue.

@alimoeeny (Contributor)

Hi @meson10, sorry, I don't know. I must have confused two sets of conversations and the threads got mixed in my mind. I hope I have not wasted much of your time.
