
Hackney hangs in connect #183

Closed
rlipscombe opened this issue Apr 20, 2015 · 35 comments

Comments

@rlipscombe

I'm using hackney in my system tests. Each suite starts a web server (implemented in node.js) and then I use hackney to talk to that web server.

After a number of tests have run, hackney simply hangs, and I have to wait for CT to kill the test suite.

I use httpc:get() to ensure that the web server is up before running the tests, so that's not the problem.

@rlipscombe
Author

It appears to be blocked in gen_server:call(..., {checkout, ...}):

<0.1673.0> (ct@roger-pc) 5> erlang:process_display(pid(0,1669,0), backtrace).
Program counter: 0x00002b5471956410 (gen:do_call/4 + 392)
CP: 0x0000000000000000 (invalid)
arity = 0

0x00002b549b939570 Return addr 0x00002b5473510d98 (gen_server:call/3 + 128)
y(0)     #Ref<0.0.0.45220>
y(1)     infinity
y(2)     {checkout,{"localhost",8000,hackney_tcp_transport},<0.1669.0>,#Ref<0.0.0.45218>}
y(3)     '$gen_call'
y(4)     <0.787.0>
y(5)     []

0x00002b549b9395a8 Return addr 0x00002b5498c58350 (hackney_pool:checkout/4 + 512)
y(0)     infinity
y(1)     {checkout,{"localhost",8000,hackney_tcp_transport},<0.1669.0>,#Ref<0.0.0.45218>}
y(2)     <0.787.0>
y(3)     Catch 0x00002b5473510d98 (gen_server:call/3 + 128)

0x00002b549b9395d0 Return addr 0x00002b5499406460 (hackney_connect:socket_from_pool/4 + 576)
y(0)     []
y(1)     "localhost"
y(2)     default
y(3)     hackney_tcp_transport
y(4)     8000
y(5)     #Ref<0.0.0.45218>

0x00002b549b939608 Return addr 0x00002b5499404f60 (hackney_connect:connect/5 + 720)
y(0)     {client,{1429,548571,153450},hackney_dummy_metrics,hackney_tcp_transport,"localhost",8000,<<14 bytes>>,[],nil,nil,#Ref<0.0.0.45218>,true,hackney_pool,infinity,false,5,false,5,nil,nil,nil,undefined,start,nil,normal,false,false,false,nil,waiting,nil,4096,<<0 bytes>>,[],undefined,nil,nil,nil,nil,undefined,nil}
y(1)     hackney_pool
y(2)     hackney_dummy_metrics
y(3)     default
y(4)     hackney_tcp_transport
y(5)     8000
y(6)     "localhost"

0x00002b549b939648 Return addr 0x00002b549907e288 (hackney:request/5 + 1024)
y(0)     []
y(1)     []
y(2)     []
y(3)     []
y(4)     []

@rlipscombe
Author

...but the pool appears not to have noticed the call (it's still sitting in gen_server:loop/6), or the caller didn't notice the response:

<0.1673.0> (ct@roger-pc) 6> erlang:process_display(pid(0,787,0), backtrace). 
Program counter: 0x00002b54735127c8 (gen_server:loop/6 + 264)
CP: 0x0000000000000000 (invalid)
arity = 0

0x00002b5498969d58 Return addr 0x00002b547197c470 (proc_lib:init_p_do_apply/3 + 56)
y(0)     []
y(1)     infinity
y(2)     hackney_pool
y(3)     {state, ... LOTS OF STATE ELIDED ...}
y(4)     <0.787.0>
y(5)     <0.117.0>

0x00002b5498969d90 Return addr 0x00000000008b3708 (<terminate process normally>)
y(0)     Catch 0x00002b547197c490 (proc_lib:init_p_do_apply/3 + 88)

@rlipscombe
Author

Any ideas how I can debug this further?

@rlipscombe
Author

Hmmm. If I reset hackney in init_per_suite (actually in a CTH), it seems to work fine:

application:stop(hackney),
timer:sleep(125),
{ok, _} = application:ensure_all_started(hackney),

I'd still like to have a proper solution, though...

Update

it seems to work fine

Spoke too soon 😞

Update update

Could be a red herring. I've been playing with CT timetrap and various other things that time out. Restarting hackney appears to work.

@benoitc
Owner

benoitc commented Apr 21, 2015

Are you correctly releasing the sockets in your tests? I.e. by reading the body?

Anyway, there is a new pool system coming, among other changes. I had to put that work on hold these last 2 weeks, but I expect to release the new hackney version at the end of the week.

Among the changes, the number of connections is no longer limited; the pool is a lot simpler on that point. It also improves the handling of SSL connections.

@rlipscombe
Author

Are you correctly releasing the sockets in your tests? I.e. by reading the body?

I believe so. Is there any way to find out which sockets are still open / bodies still unread by poking around in the state? I've got a lot of tests, and it'd be useful to figure out which ones aren't cleaning up.

It'd be nice if -- for simple scenarios -- I didn't need to worry about it; see #184

@rlipscombe
Author

I guess my questions wrt debugging this are:

  1. How does the pooling work? Has the pool process gone back to loop because it's waiting for a worker to announce that it's done? If so, that looks like pool starvation. If not, then there's a missed message somewhere (which is possible -- CT has a habit of sending/receiving messages in the process under test...)
  2. What, if anything, can I inspect to see the list of busy workers, or otherwise the state of the pool?
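
For question 2, one generic way to peek at the pool on recent OTP releases is sys:get_state/1 on the pool pid (a sketch; pid(0,787,0) is the pool process found in the backtrace above, and the record layout it returns is private to hackney_pool):

%% In the remote shell. sys:get_state/1 works on any gen_server, but the
%% #state{} it returns is hackney_pool's internal format, so treat it as
%% opaque and just print it.
PoolPid = pid(0, 787, 0),
PoolState = sys:get_state(PoolPid),
io:format("~p~n", [PoolState]).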

@benoitc
Owner

benoitc commented Jun 16, 2015

Sorry, somehow I didn't notice any notification on this topic.

Pooling works that way in 1.x.x:

  1. A request is started; the socket is fetched from the pool, or created if the max limit has not been reached.
  2. The socket is released to the pool when it's read completely (HEAD requests, 204, 304, redirects when handled, or by reading the body).

If the process handling the request goes down, the socket is closed.

To debug, you can use the trace facility available in 1.1.0, or the metrics, to check the number of sockets running, waiting in the pool, etc.

Most of what I said above will stay true in the 2.x release, except that fetching a socket from the pool won't block anymore when the limit is reached. Instead, a socket will always be created if the pool is empty. Also, an additional check has been added to test whether the socket is closed before releasing it to the pool.
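
For reference, the happy path of that cycle looks like this from the caller's side (a minimal sketch against the 1.x API; the checkout itself happens inside hackney:request/5):

%% Sketch: the socket is checked out inside request/5 and checked back
%% in once the response is fully read.
{ok, _Status, _Headers, Ref} =
    hackney:request(get, <<"http://localhost:8000/">>, [], <<>>, []),
{ok, _Body} = hackney:body(Ref),   %% releases the socket to the pool
%% hackney:close(Ref) would also free the slot, by closing the socket
%% instead of reusing it.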

@benoitc benoitc added this to the 2.0.0 milestone Jun 16, 2015
hexedpackets added a commit to hexedpackets/docker-elixir that referenced this issue Aug 3, 2015
This is a temporary hacky workaround for benoitc/hackney#183
hexedpackets added a commit to hexedpackets/hauler that referenced this issue Aug 3, 2015
@zyro
Contributor

zyro commented Aug 17, 2015

@benoitc Is using the with_body option on every request a decent way to ensure this happens?

The socket is released to the pool when it's read completely (HEAD requests, 204, 304, redirects when handled, or by reading the body)

@benoitc
Owner

benoitc commented Aug 20, 2015

@zyro Using this option has the same result as getting the body yourself. So yes, once the body is fetched, the socket is released to the pool.
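
In code, that looks like the following sketch (with with_body, the fourth element of the result is the body itself rather than a client ref):

%% The body is read before request/5 returns, so the socket is already
%% back in the pool by the time you get the result.
{ok, Status, _Headers, Body} =
    hackney:request(get, Url, [], <<>>, [with_body]),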

@eproxus

eproxus commented Sep 1, 2015

I had a similar problem: not reading the bodies of all requests eventually consumed the whole pool. Would it be possible to raise an exception in this case instead of timing out forever? That would have helped a lot in troubleshooting.

@ericmj
Contributor

ericmj commented Sep 1, 2015

I've also had similar problems. I've worked around it by doing all requests without a pool.

@lexmag
Contributor

lexmag commented Sep 1, 2015

Would it be possible to raise an exception in this case instead of timing out forever? That would have helped a lot in troubleshooting.

I think we need to use something other than the infinity timeout value on checkout here.

The problem comes from the noreply return here, which makes gen_server:call/3 hang with the infinity timeout.
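
For readers less familiar with OTP: a handle_call/3 clause that returns {noreply, State} defers the reply, and if gen_server:reply/2 is never called later, the caller stays blocked in gen_server:call/3 until its timeout fires, which with infinity is never. Schematically (a generic illustration with hypothetical free_socket/2 and queue_waiter/3 helpers, not hackney's actual code):

%% Generic illustration, not hackney_pool's real code.
handle_call({checkout, Dest}, From, State) ->
    case free_socket(Dest, State) of
        {ok, Socket, State1} ->
            {reply, {ok, Socket}, State1};
        none ->
            %% The caller is queued. Unless gen_server:reply(From, ...)
            %% runs later, gen_server:call(Pool, {checkout, Dest}, infinity)
            %% never returns.
            {noreply, queue_waiter(From, Dest, State)}
    end.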

@ericmj
Contributor

ericmj commented Sep 1, 2015

@lexmag That's not a solution to the underlying problem (if I understand the problem, that is). The problem is that all sockets in the pool are busy because they hang somewhere, so the pool is essentially empty forever. Adding a timeout to the checkout won't help, because the checkout would never succeed until the pool is restarted.

Non-infinity timeouts are almost always the best idea, but with this design infinity is needed: if the call times out just as the pool replies with a socket, you will have a leak.

@lexmag
Contributor

lexmag commented Sep 1, 2015

@ericmj I'm actually not saying it'll solve this problem. 😄
I was just suggesting how we could improve the feedback to the user. Sorry if I wasn't clear on this.

@benoitc
Owner

benoitc commented Sep 1, 2015

Right now the pool expects that all requests release or close their sockets. The max limit really means that you can't have more sockets out of the pool until you release them or notify the pool to ignore them (close). This is how it was redesigned years ago to answer some users' needs (not running out of fds).

I do not think it should be handled that way today. There are probably better solutions that fix it directly in your app. The coming version (first bits will land tonight) fixes it, removing the barrier and solving the following:

  • make the connection process more efficient and faster
  • handle more concurrent requests
  • improve the way we reuse the connections (with a view to letting you handle the congestion problem more efficiently)

I will commit a branch later in the day. Any tests will be much appreciated :)

@benoitc benoitc mentioned this issue Sep 12, 2015
@silviucpp

Hello,

Any news on this problem, or a workaround? We are using the library in production and we have found a lot of processes blocked by this issue.

Silviu

@benoitc
Owner

benoitc commented Oct 2, 2015

@silviucpp Do you leave any requests open (i.e. not reading the socket)? I have a patch coming, but it took more time than expected.

@silviucpp

No I don't :)

I just call:

hackney:request(post, Url, Headers, Payload, Options),

And from time to time this call never returns. Looking at the process where it's blocked, I see:
gen_server:call(..., {checkout, ...})

Silviu

@rlipscombe
Author

hackney:request(post, Url, Headers, Payload, Options) leaves the request open. You need to read the response body, either by using the with_body option or by calling hackney:body(Ref).

@rlipscombe
Author

fwiw, I finally found one spot in the system tests (mentioned in the original description) where we weren't correctly reading the body. Being able to control the timeout, or to inspect the pool statistics, would be incredibly useful.

@silviucpp

Thanks for the feedback! That makes sense: we read the body only when the result was 20x, but not for errors :)

I will review my code.

@silviucpp

I checked my code and it looks like:

{ok, StatusCode, _RespHeaders, ClientRef} = hackney:request(Method, Url, Headers, PayloadJson, Options),
{ok, Body} = hackney:body(ClientRef),

So the body is not read only if:

  1. hackney throws an exception, or
  2. the expression doesn't match (this happens when you have a connection timeout).

I don't think you need to call body/1 in those cases...
So I'm not sure there is a problem with this code...
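
That said, a defensive pattern that keeps the pool healthy is to consume or close the client ref on every path, whatever the status code (a sketch using the real hackney:body/1 and hackney:close/1 calls; the success/error split is just an example):

%% Sketch: release the socket on every path, not only for 20x.
case hackney:request(Method, Url, Headers, PayloadJson, Options) of
    {ok, Status, _RespHeaders, ClientRef} when Status >= 200, Status < 300 ->
        {ok, Body} = hackney:body(ClientRef),   %% checks the socket back in
        {ok, Status, Body};
    {ok, Status, _RespHeaders, ClientRef} ->
        hackney:close(ClientRef),               %% tell the pool to drop it
        {error, Status};
    {error, Reason} ->
        %% request/5 failed before a ref was handed out; nothing to release.
        {error, Reason}
end.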

@benoitc
Owner

benoitc commented Oct 2, 2015

@rlipscombe You can use the metrics, but that can definitely be improved. Which kind of timeout? The connect timeout?

@silviucpp How many connections do you request at the same time? The current implementation blocks at max pool size until a connection becomes available.

I will have a new connection system available for tests by tomorrow anyway, and will update the thread ASAP with it.

@silviucpp

Hello,

My pool looks like:

ok = hackney_pool:start_pool(?HACKNEY_POOL, [{timeout, 150000}, {max_connections, 20}, {follow_redirect, true}, {max_redirect, 3}]),

I see processes frozen in gen_server:call(..., {checkout, ...}) for more than 10 days, and I'm sure we never run 20 connections at the same time.

I also see other connections running fine, so only some of them get frozen.

Might the big timeout be the problem?

Silviu

@ericmj
Contributor

ericmj commented Oct 5, 2015

@silviucpp I was seeing exactly the same with hex.

@benoitc
Owner

benoitc commented Oct 6, 2015

Which timeout are you talking about?

@silviucpp

The one from start_pool: {timeout, 150000}

@benoitc
Owner

benoitc commented Oct 6, 2015

The default keepalive timeout can be changed in the application environment; it's the "timeout" setting.
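
Assuming the standard OTP application environment, that would be something like (a sketch; the "timeout" key is per the comment above, and the value is in milliseconds):

%% Set before hackney starts, e.g. from your application's start-up code:
application:set_env(hackney, timeout, 30000),   %% 30s keepalive
%% or equivalently in sys.config:
%% [{hackney, [{timeout, 30000}]}].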

@benoitc
Owner

benoitc commented Oct 6, 2015

@silviucpp Did you try changing it? Also, did you enable the metrics? I would be interested in the pool metrics.

@silviucpp

No, I didn't try changing it. I have metrics enabled in production using folsom. Let me know exactly what you need.

Silviu

@benoitc
Owner

benoitc commented Oct 6, 2015

If you can provide me a sample of the connect time and the metrics from the pool (take_rate, no_socket, in_use_count), that would help me figure out what's going on.

I will be stuck on a flight for the next 11 hours anyway, so I will have plenty of time to work on it.


@silviucpp

Hello,

On the default pool, I currently see around 10 frozen processes on the node where the stats are generated.

Stats are:

io:format(<<"~p">>,[folsom_metrics:get_metric_value([hackney_pool, default, take_rate ])]).
[]

io:format(<<"~p ~n">>,[folsom_metrics:get_metric_value([hackney_pool, default, no_socket ])]).    
74 

io:format(<<"~p">>,[folsom_metrics:get_metric_value([hackney_pool, default, in_use_count])]).     
[11,16,9,5,30,49,40,38,36,14,31,16,26,38,7,42,17,13,29,0,27,28,37,10,40,18,15,
 10,8,26,22,45,34,42,33,44,24,4,21,23,2,36,32,22,46,28,48,6,6,12,34,12,19,14,
 18,48,30,25,2,32,41,47,11,43,44,46,24,4,1,3,20,20,39,35]

io:format(<<"~p ~n">>,[folsom_metrics:get_metric_value([hackney_pool, default, free_count ])]).   
[-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
 -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
 -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1] 

io:format(<<"~p ~n">>,[folsom_metrics:get_metric_value([hackney_pool, default, queue_counter ])]).
[] 

@silviucpp

Something interesting as well. I removed my node from production and I'm just looking at the following stats:

folsom_metrics:get_metric_value([hackney, nb_requests]).                                       
63
folsom_metrics:get_metric_value([hackney, total_requests]).        
62408
folsom_metrics:get_metric_value([hackney, finished_requests]).
62345

I'm sure there is no request in progress at this moment, and I was expecting nb_requests to be 0...
I waited a few minutes and it's still at 63.

benoitc added a commit that referenced this issue Oct 25, 2015
pool is now working.

- connections are lazily handled
- fix #181
- fix #206
- fix #226
- fix #247
- fix #183

TODO:
-----

- For now HTTP requests don't work; they need to be linked to the new
  connection system
- unit tests need to be added
- pool metrics
benoitc added a commit that referenced this issue Oct 25, 2015 (same commit message as above)
benoitc added a commit that referenced this issue Dec 7, 2015 (same commit message as above)
@benoitc benoitc modified the milestone: 2.0.0 Jan 25, 2016
@benoitc
Owner

benoitc commented Jan 25, 2016

Superseded by #276. Thanks for all the feedback. WIP :)
