Resource leak with aleph.http/create-connection and unreachable hosts #152

Closed
benmoss opened this issue Feb 17, 2015 · 12 comments

Comments

@benmoss

benmoss commented Feb 17, 2015

I'm not sure if this is really a problem with aleph or with manifold, but posting it here.

The original problem I had was in a test where I spin up an HTTP server in a daemon, run some tests against it, and then shut it down. The problem first manifested as the tests passing the first time, then failing with timed-out HTTP requests on subsequent runs in the same JVM process. After a lot of tracing, I've boiled it down to this gist: https://gist.github.com/benmoss/79acf300d8d2ba573648.

When you run lookup on an unreachable host, like in the comments at the bottom of the gist, you'll see the traced aleph.http/create-connection getting called repeatedly, with pauses in between. With a reachable host, the problem doesn't show up. My hunch is that timeout! isn't fully cancelling the repeated attempts on http/request.
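For reference, a minimal sketch of the pattern described above; the exact URL handling and the 1000 ms timeout are illustrative assumptions, and the full repro is in the linked gist:

(require '[aleph.http :as http]
         '[manifold.deferred :as d])

(defn lookup [url]
  ;; timeout! errors the response deferred after 1000 ms, but (as discussed
  ;; below) it does not necessarily cancel the underlying request machinery
  (d/timeout! (http/get url) 1000))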

@ztellman
Collaborator

Yes, timeout! will affect the underlying response deferred, but not necessarily the entire request machinery. Have you tried setting the :pool-timeout, :connection-timeout, or :request-timeout on the request itself?

@benmoss
Author

benmoss commented Feb 19, 2015

I hadn't, but trying them now with a timeout value of 1 on all three of them, I am still seeing requests go out via create-connection long after the function was called and the deferred resolved.

@ztellman
Collaborator

Okay, I'll investigate further, thanks.

@benmoss
Author

benmoss commented Feb 19, 2015

From my investigation so far:

  • It can be triggered with a GET request to "0.0.0.0:6666"; any local port with nothing listening works
  • It doesn't actually have anything to do with manifold; all the code I have now is just
(defn lookup [url]
  (http/get url {:pool-timeout 1
                 :request-timeout 1
                 :connection-timeout 1}))
  • Seems to only occur upon repeated failed requests
  • The shouldIncrement and generate fns seem to be called once per request the first two times my lookup fn is called, but 3-4 times on subsequent calls. shouldIncrement always returns true. After some number of calls (6-7), shouldIncrement and generate just keep getting called, as sketched below.
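A hedged sketch of how the repeated calls can be observed at a REPL; clojure.tools.trace is an assumption here (the original report only says that aleph.http/create-connection was traced):

(require '[aleph.http :as http]
         '[manifold.deferred :as d]
         '[clojure.tools.trace :as trace])

;; print every call to create-connection
(trace/trace-vars aleph.http/create-connection)

;; with nothing listening on the port, each call fails quickly, yet the trace
;; keeps showing create-connection being invoked after the deferreds resolve
(dotimes [_ 10]
  @(d/catch (lookup "http://0.0.0.0:6666")
            (fn [_] :failed)))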

@ztellman
Collaborator

I'm very close to having a fix for this; thanks for digging into it, though.

@ztellman
Collaborator

This looks to have been fallout from #140, which I should have caught. I've made it so that your original code, which applied timeout! to the response, will short-circuit acquiring a connection and sending the request if the timeout fires in time.

I don't see any further room for this sort of issue, but it's possible there's still some corner case lurking. Please let me know if you see anything else.

@benmoss
Author

benmoss commented Feb 19, 2015

Great, thank you!

@Gonzih

Gonzih commented Mar 3, 2016

Hi guys,

Was this change released? I just spotted another issue: workers leak in the connection pool if I remove the target server in the middle of a request (so the connection is not closed properly). After that, the num-workers stat constantly fluctuates between two values (even when no requests are coming to the server), so in the stats logs I see something like this:

connections.localhost-9002 -> 620
connections.localhost-9002 -> 557
connections.localhost-9002 -> 620
connections.localhost-9002 -> 557
connections.localhost-9002 -> 620
connections.localhost-9002 -> 557
connections.localhost-9002 -> 620
connections.localhost-9002 -> 557

I tried all the timeout options without success.
I was able to reproduce it by creating a lot of http/get requests to a simple web server that just calls Thread/sleep and blocks them. Before the timeout fires I send kill -9 to the target server to simulate failure, and after some time not all connections are released. I started digging into this because I noticed resources slowly leaking from my server until there are no available workers left. Is this related to this issue? Should I create a separate issue? Am I missing something here?

Thanks a lot!

UPD

This is the minimal code snippet I was able to reproduce the issue with:

;; assumed requires: [aleph.http :as aleph] and [aleph.flow :as flow]
(defonce client-connection-pool
  (aleph/connection-pool
   {:response-executor
    (flow/utilization-executor
      0.9 512
      {:stats-callback (partial stats-callback :client)})
    :connections-per-host 7000
    :total-connections 8000
    :target-utilization 0.9
    :stats-callback connections-stats-callback ; dumps stats into a file
    :connection-options {:keep-alive? false}}))

;; fire 15000 requests at the test server and force them with doall
(->> 15000
     range
     (map (fn [_]
            (aleph/get "http://localhost:4567"
                       {:follow-redirects false
                        :throw-exceptions false
                        :pool client-connection-pool
                        :pool-timeout 1e4
                        :connection-timeout 1e4
                        :request-timeout 4e4})))
     doall
     (map (fn [_] nil)))
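
Since the two stats callbacks above aren't shown, here is a hypothetical stand-in for them; the originals dump stats to a file, and nothing is assumed about the shape of the stats argument beyond it being printable:

;; hypothetical stand-ins for the callbacks referenced above
(defn stats-callback [component stats]
  ;; used as (partial stats-callback :client) for the response executor
  (println component stats))

(defn connections-stats-callback [stats]
  ;; append the raw connection-pool stats to a file
  (spit "connection-stats.log" (str stats "\n") :append true))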

Server:

require 'sinatra'

get '/' do
  sleep 120
  "HI"
end

Killing the ruby process in the middle of execution leaks workers in the pool.

@ztellman
Collaborator

ztellman commented Mar 3, 2016

This fix was included in 0.4.0, so I don't think these two issues are the same.

I'm a little unclear on the nature of your issue, though. You mention num-workers, which I think relates to the response executor, but also the number of connections, which is unrelated to the threads that are actively processing the requests. Are the total connections for the pool being exhausted, or threads in the response executor?

@Gonzih

Gonzih commented Mar 3, 2016

Connections for the connection pool are being exhausted.

@ztellman
Collaborator

ztellman commented Mar 3, 2016

Okay, can you open a new issue for this? I'll take a look.

@Gonzih

Gonzih commented Mar 3, 2016

Created #217, thanks!
