
Hackney hangs in connect #183

Closed
rlipscombe opened this issue Apr 20, 2015 · 35 comments

Comments

@rlipscombe

I'm using hackney in my system tests. Each suite starts a web server (implemented in node.js) and then I use hackney to talk to that web server.

After a number of tests have run, hackney simply hangs, and I have to wait for CT to kill the test suite.

I use httpc:get() to ensure that the web server is up before running the tests, so that's not the problem.

@rlipscombe
Author

It appears to be blocked in gen_server:call(..., {checkout, ...}):

<0.1673.0> (ct@roger-pc) 5> erlang:process_display(pid(0,1669,0), backtrace).
Program counter: 0x00002b5471956410 (gen:do_call/4 + 392)
CP: 0x0000000000000000 (invalid)
arity = 0

0x00002b549b939570 Return addr 0x00002b5473510d98 (gen_server:call/3 + 128)
y(0)     #Ref<0.0.0.45220>
y(1)     infinity
y(2)     {checkout,{"localhost",8000,hackney_tcp_transport},<0.1669.0>,#Ref<0.0.0.45218>}
y(3)     '$gen_call'
y(4)     <0.787.0>
y(5)     []

0x00002b549b9395a8 Return addr 0x00002b5498c58350 (hackney_pool:checkout/4 + 512)
y(0)     infinity
y(1)     {checkout,{"localhost",8000,hackney_tcp_transport},<0.1669.0>,#Ref<0.0.0.45218>}
y(2)     <0.787.0>
y(3)     Catch 0x00002b5473510d98 (gen_server:call/3 + 128)

0x00002b549b9395d0 Return addr 0x00002b5499406460 (hackney_connect:socket_from_pool/4 + 576)
y(0)     []
y(1)     "localhost"
y(2)     default
y(3)     hackney_tcp_transport
y(4)     8000
y(5)     #Ref<0.0.0.45218>

0x00002b549b939608 Return addr 0x00002b5499404f60 (hackney_connect:connect/5 + 720)
y(0)     {client,{1429,548571,153450},hackney_dummy_metrics,hackney_tcp_transport,"localhost",8000,<<14 bytes>>,[],nil,nil,#Ref<0.0.0.45218>,true,hackney_pool,infinity,false,5,false,5,nil,nil,nil,undefined,start,nil,normal,false,false,false,nil,waiting,nil,4096,<<0 bytes>>,[],undefined,nil,nil,nil,nil,undefined,nil}
y(1)     hackney_pool
y(2)     hackney_dummy_metrics
y(3)     default
y(4)     hackney_tcp_transport
y(5)     8000
y(6)     "localhost"

0x00002b549b939648 Return addr 0x00002b549907e288 (hackney:request/5 + 1024)
y(0)     []
y(1)     []
y(2)     []
y(3)     []
y(4)     []

@rlipscombe
Author

...but the pool appears not to have noticed the call (it's still sitting in gen_server:loop/6), or the caller didn't notice the response:

<0.1673.0> (ct@roger-pc) 6> erlang:process_display(pid(0,787,0), backtrace). 
Program counter: 0x00002b54735127c8 (gen_server:loop/6 + 264)
CP: 0x0000000000000000 (invalid)
arity = 0

0x00002b5498969d58 Return addr 0x00002b547197c470 (proc_lib:init_p_do_apply/3 + 56)
y(0)     []
y(1)     infinity
y(2)     hackney_pool
y(3)     {state, ... LOTS OF STATE ELIDED ...}
y(4)     <0.787.0>
y(5)     <0.117.0>

0x00002b5498969d90 Return addr 0x00000000008b3708 (<terminate process normally>)
y(0)     Catch 0x00002b547197c490 (proc_lib:init_p_do_apply/3 + 88)

@rlipscombe
Author

Any ideas how I can debug this further?

@rlipscombe
Author

Hmmm. If I reset hackney in init_per_suite (actually in a CTH), it seems to work fine:

application:stop(hackney),
timer:sleep(125),
{ok, _} = application:ensure_all_started(hackney),

I'd still like to have a proper solution, though...

Update

it seems to work fine

Spoke too soon 😞

Update update

Could be a red herring. I've been playing with CT timetrap and various other things that time out. Restarting hackney appears to work.

@benoitc
Owner

benoitc commented Apr 21, 2015

Are you correctly releasing the sockets in your tests? I.e. by reading the body?

Anyway, there is a new pool system coming, among other changes. I had to put that work on hold these last 2 weeks, but I expect to release the new hackney version at the end of the week.

Among the changes, the number of connections is no longer limited; the pool is a lot simpler on that point. It also improves the handling of SSL connections.

@rlipscombe
Author

Are you correctly releasing the sockets in your tests? I.e. by reading the body?

I believe so. Is there any way to find out which sockets are still open / bodies still unread by poking around in the state? I've got a lot of tests, and it'd be useful to figure out which ones aren't cleaning up.

It'd be nice if -- for simple scenarios -- I didn't need to worry about it; see #184

@rlipscombe
Author

I guess my questions wrt debugging this are:

  1. How does the pooling work? Has the pool process gone back to loop because it's waiting for a worker to announce that it's done? If so, that looks like pool starvation. If not, then there's a missed message somewhere (which is possible -- CT has a habit of sending/receiving messages in the process under test...)
  2. What, if anything, can I inspect to see the list of busy workers, or otherwise the state of the pool?
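
For question 2, one generic way to peek at the pool on recent OTP releases is sys:get_state/1 on the pool pid (a sketch; pid(0,787,0) is the pool process found in the backtrace above, and the record layout it returns is private to hackney_pool):

%% In the remote shell. sys:get_state/1 works on any gen_server, but the
%% #state{} it returns is hackney_pool's internal format, so treat it as
%% opaque and just print it.
PoolPid = pid(0, 787, 0),
PoolState = sys:get_state(PoolPid),
io:format("~p~n", [PoolState]).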

@benoitc
Owner

benoitc commented Jun 16, 2015

Sorry, somehow I didn't notice any notification on this topic.

Pooling works that way in 1.x.x:

  1. A request is started; the socket is fetched from the pool, or created if the max limit has not been reached.
  2. The socket is released to the pool when it's read completely (HEAD requests, 204, 304, redirects when handled, or by reading the body).

If the process handling the request goes down, the socket is closed.

To debug, you can use the trace facility available in 1.1.0, or the metrics, to check the number of sockets running, waiting in the pool, etc.

Most of what I said above will stay true in the 2.x release, except that fetching a socket from the pool won't block anymore when the limit is reached. Instead, a socket will always be created if the pool is empty. Also, an additional check has been added to test whether the socket is closed before releasing it to the pool.
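
For reference, the happy path of that cycle looks like this from the caller's side (a minimal sketch against the 1.x API; the checkout itself happens inside hackney:request/5):

%% Sketch: the socket is checked out inside request/5 and checked back
%% in once the response is fully read.
{ok, _Status, _Headers, Ref} =
    hackney:request(get, <<"http://localhost:8000/">>, [], <<>>, []),
{ok, _Body} = hackney:body(Ref),   %% releases the socket to the pool
%% hackney:close(Ref) would also free the slot, by closing the socket
%% instead of reusing it.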

@benoitc benoitc added this to the 2.0.0 milestone Jun 16, 2015
hexedpackets added a commit to hexedpackets/docker-elixir that referenced this issue Aug 3, 2015
This is a temporary hacky workaround for benoitc/hackney#183
hexedpackets added a commit to hexedpackets/hauler that referenced this issue Aug 3, 2015
@zyro
Contributor

zyro commented Aug 17, 2015

@benoitc Is using the with_body option on every request a decent way to ensure this happens?

The socket is released to the pool when it's read completely (HEAD requests, 204, 304, redirects when handled, or by reading the body)

@benoitc
Owner

benoitc commented Aug 20, 2015

@zyro Using this option has the same result as getting the body yourself. So yes, once the body is fetched, the socket is released to the pool.
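
In code, that looks like the following sketch (with with_body, the fourth element of the result is the body itself rather than a client ref):

%% The body is read before request/5 returns, so the socket is already
%% back in the pool by the time you get the result.
{ok, Status, _Headers, Body} =
    hackney:request(get, Url, [], <<>>, [with_body]),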

@eproxus

eproxus commented Sep 1, 2015

I had a similar problem: not reading the bodies of all requests eventually consumed the whole pool. Would it be possible to raise an exception in this case instead of timing out forever? That would have helped a lot in troubleshooting.

@ericmj
Contributor

ericmj commented Sep 1, 2015

I've also had similar problems. I've worked around it by doing all requests without a pool.

@lexmag
Contributor

lexmag commented Sep 1, 2015

Would it be possible to raise an exception in this case instead of timing out forever? That would have helped a lot in troubleshooting.

I think we need to use something other than the infinity timeout value on checkout here.

The problem comes from the noreply return here, which makes gen_server:call/3 hang with the infinity timeout.
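
For readers less familiar with OTP: a handle_call/3 clause that returns {noreply, State} defers the reply, and if gen_server:reply/2 is never called later, the caller stays blocked in gen_server:call/3 until its timeout fires, which with infinity is never. Schematically (a generic illustration with hypothetical free_socket/2 and queue_waiter/3 helpers, not hackney's actual code):

%% Generic illustration, not hackney_pool's real code.
handle_call({checkout, Dest}, From, State) ->
    case free_socket(Dest, State) of
        {ok, Socket, State1} ->
            {reply, {ok, Socket}, State1};
        none ->
            %% The caller is queued. Unless gen_server:reply(From, ...)
            %% runs later, gen_server:call(Pool, {checkout, Dest}, infinity)
            %% never returns.
            {noreply, queue_waiter(From, Dest, State)}
    end.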

@ericmj
Contributor

ericmj commented Sep 1, 2015

@lexmag That's not a solution to the underlying problem (if I understand the problem, that is). The problem is that all sockets in the pool are busy because they hang somewhere, so the pool is essentially empty forever. Adding a timeout to the checkout won't help, because the checkout would never succeed until the pool is restarted.

Non-infinity timeouts are almost always the best idea, but with this design infinity is needed: if the call times out just as the pool replies with a socket, you will have a leak.

@lexmag
Contributor

lexmag commented Sep 1, 2015

@ericmj I'm actually not saying it'll solve this problem. 😄
I was just suggesting how we could improve the feedback to the user. Sorry if I wasn't clear on this.

@benoitc
Owner

benoitc commented Sep 1, 2015

Right now the pool expects that all requests release or close their sockets. The max limit really means that you can't have more sockets out of the pool until you release them or notify the pool to ignore them (close). This is how it was redesigned years ago to answer some users' needs (not running out of fds).

I do not think it should be handled that way today. There are probably better solutions that fix it directly in your app. The coming version (first bits will land tonight) fixes it, removing the barrier and solving the following:

  • make the connection process more efficient and faster
  • handle more concurrent requests
  • improve the way we reuse the connections (with a view to letting you handle the congestion problem more efficiently)

I will commit a branch later in the day. Any tests will be much appreciated :)

@benoitc benoitc mentioned this issue Sep 12, 2015
@silviucpp

Hello,

Any news on this problem, or a workaround? We are using the library in production and we have found a lot of processes blocked by this issue.

Silviu

@benoitc
Owner

benoitc commented Oct 2, 2015

@silviucpp Do you leave any requests open (i.e. not reading the socket)? I have a patch coming, but it took more time than expected.

@silviucpp

No I don't :)

I just call:

hackney:request(post, Url, Headers, Payload, Options),

And from time to time this call never returns. Looking at the process where it's blocked, I see:
gen_server:call(..., {checkout, ...})

Silviu

@rlipscombe
Author

hackney:request(post, Url, Headers, Payload, Options) leaves the request open. You need to read the response body, either by using the with_body option or by calling hackney:body(Ref).

@rlipscombe
Author

fwiw, I finally found one spot in the system tests (mentioned in the original description) where we weren't correctly reading the body. Being able to control the timeout, or to inspect the pool statistics, would be incredibly useful.

@silviucpp

Thanks for the feedback! That makes sense: we read the body only when the result was 20x, but not for errors :)

I will review my code.

@silviucpp

I checked my code and it looks like:

{ok, StatusCode, _RespHeaders, ClientRef} = hackney:request(Method, Url, Headers, PayloadJson, Options),
{ok, Body} = hackney:body(ClientRef),

So the body is not read only if:

  1. hackney throws an exception, or
  2. the expression doesn't match (this happens when you have a connection timeout).

I don't think you need to call body/1 in those cases...
So I'm not sure there is a problem with this code...
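
That said, a defensive pattern that keeps the pool healthy is to consume or close the client ref on every path, whatever the status code (a sketch using the real hackney:body/1 and hackney:close/1 calls; the success/error split is just an example):

%% Sketch: release the socket on every path, not only for 20x.
case hackney:request(Method, Url, Headers, PayloadJson, Options) of
    {ok, Status, _RespHeaders, ClientRef} when Status >= 200, Status < 300 ->
        {ok, Body} = hackney:body(ClientRef),   %% checks the socket back in
        {ok, Status, Body};
    {ok, Status, _RespHeaders, ClientRef} ->
        hackney:close(ClientRef),               %% tell the pool to drop it
        {error, Status};
    {error, Reason} ->
        %% request/5 failed before a ref was handed out; nothing to release.
        {error, Reason}
end.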

@benoitc
Owner

benoitc commented Oct 2, 2015

@rlipscombe You can use the metrics, but that can definitely be improved. Which kind of timeout? The connect timeout?

@silviucpp How many connections do you request at the same time? The current implementation blocks at max pool size until a connection becomes available.

I will have a new connection system available for tests by tomorrow anyway, and will update the thread ASAP with it.

@silviucpp

Hello,

My pool looks like:

ok = hackney_pool:start_pool(?HACKNEY_POOL, [{timeout, 150000}, {max_connections, 20}, {follow_redirect, true}, {max_redirect, 3}]),

I see processes frozen in gen_server:call(..., {checkout, ...}) for more than 10 days, and I'm sure we never run 20 connections at the same time.

I also see other connections running fine, so only some of them get frozen.

Might the big timeout be the problem?

Silviu

@ericmj
Contributor

ericmj commented Oct 5, 2015

@silviucpp I was seeing exactly the same with hex.

@benoitc
Owner

benoitc commented Oct 6, 2015

Which timeout are you talking about?

@silviucpp

The one from start_pool: {timeout, 150000}

@benoitc
Owner

benoitc commented Oct 6, 2015

The default keepalive timeout can be changed in the application environment; it's the "timeout" setting.
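
Assuming the standard OTP application environment, that would be something like (a sketch; the "timeout" key is per the comment above, and the value is in milliseconds):

%% Set before hackney starts, e.g. from your application's start-up code:
application:set_env(hackney, timeout, 30000),   %% 30s keepalive
%% or equivalently in sys.config:
%% [{hackney, [{timeout, 30000}]}].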

@benoitc
Owner

benoitc commented Oct 6, 2015

@silviucpp Did you try changing it? Also, did you enable the metrics? I would be interested in the pool metrics.

@silviucpp

No, I didn't try changing it. I have metrics enabled in production using folsom. Let me know exactly what you need.

Silviu

@benoitc
Owner

benoitc commented Oct 6, 2015

If you can provide me a sample of the connect time and the metrics from the pool (take_rate, no_socket, in_use_count), that would help me figure out what's going on.

I will be stuck on a flight for the next 11 hours anyway, so I will have plenty of time to work on it.


@silviucpp

Hello,

On the default pool, I currently see around 10 frozen processes on the node where the stats are generated.

Stats are:

io:format(<<"~p">>,[folsom_metrics:get_metric_value([hackney_pool, default, take_rate ])]).
[]

io:format(<<"~p ~n">>,[folsom_metrics:get_metric_value([hackney_pool, default, no_socket ])]).    
74 

io:format(<<"~p">>,[folsom_metrics:get_metric_value([hackney_pool, default, in_use_count])]).     
[11,16,9,5,30,49,40,38,36,14,31,16,26,38,7,42,17,13,29,0,27,28,37,10,40,18,15,
 10,8,26,22,45,34,42,33,44,24,4,21,23,2,36,32,22,46,28,48,6,6,12,34,12,19,14,
 18,48,30,25,2,32,41,47,11,43,44,46,24,4,1,3,20,20,39,35]

io:format(<<"~p ~n">>,[folsom_metrics:get_metric_value([hackney_pool, default, free_count ])]).   
[-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
 -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
 -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1] 

io:format(<<"~p ~n">>,[folsom_metrics:get_metric_value([hackney_pool, default, queue_counter ])]).
[] 

@silviucpp

Something interesting as well. I removed my node from production and I'm just looking at the following stats:

folsom_metrics:get_metric_value([hackney, nb_requests]).                                       
63
folsom_metrics:get_metric_value([hackney, total_requests]).        
62408
folsom_metrics:get_metric_value([hackney, finished_requests]).
62345

I'm sure there is no request in progress at this moment, and I was expecting nb_requests to be 0...
I waited a few minutes and it's still at 63.

benoitc added a commit that referenced this issue Oct 25, 2015
pool is now working.

- connections are lazily handled
- fix #181
- fix #206
- fix #226
- fix #247
- fix #183

TODO:
-----

- For now HTTP requests don't work; they need to be linked to the new
  connection system
- unit tests need to be added
- pool metrics
benoitc added a commit that referenced this issue Oct 25, 2015 (same commit message as above)
benoitc added a commit that referenced this issue Dec 7, 2015 (same commit message as above)
@benoitc benoitc modified the milestone: 2.0.0 Jan 25, 2016
@benoitc
Owner

benoitc commented Jan 25, 2016

Superseded by #276. Thanks for all the feedback. WIP :)
