Crash when fetching from slow HTTP server: shotgun:wait_response/3 not exported #96

kennethlakin · 2015-09-19T11:45:43Z

This bug can be triggered by doing the following:

Find a HTTP server that takes many seconds to connect, or simply find one with a single IP address, and add an iptables DROP rule for that server
Compile shotgun e5ae062
Start an erlang shell, passing in the proper -pa args
Run the following, replacing very.slow.website.or.blocked.website with the FQDN or IP of the server in question:

application:ensure_all_started(shotgun).
{ok, Pid}=shotgun:open("very.slow.website.or.blocked.website", 80).
TestFun=fun(F) ->
            case shotgun:get(Pid, "/") of 
              {ok, Response} -> io:format("ok~n");
              {error, {timeout, _}} -> F(F)
            end
        end.
TestFun(TestFun).
f().

Lightly redacted, but otherwise complete conversation with an Erlang shell follows:

Erlang/OTP 18 [erts-7.0] [source] [smp:2:2] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V7.0  (abort with ^G)
(erl-node@host)1> application:ensure_all_started(shotgun).
{ok,[ranch,crypto,cowlib,asn1,public_key,ssl,gun,shotgun]}
(erl-node@host)2> {ok, Pid}=shotgun:open("very.slow.website.or.blocked.website", 80).
{ok,<0.68.0>}
(erl-node@host)3> TestFun=fun(F) ->
(erl-node@host)3>             case shotgun:get(Pid, "/") of 
(erl-node@host)3>               {ok, Response} -> io:format("ok~n");
(erl-node@host)3>               {error, {timeout, _}} -> F(F)
(erl-node@host)3>             end
(erl-node@host)3>         end.
#Fun<erl_eval.6.54118792>
(erl-node@host)4> TestFun(TestFun).

=ERROR REPORT==== 19-Sep-2015::04:13:11 ===
** State machine <0.68.0> terminating 
** Last message in was {'$gen_sync_event',
                           {<0.41.0>,#Ref<0.0.2.128>},
                           {get,{"/",[],[]}}}
** When State == wait_response
**      Data  == #{async => false,
                   async_mode => binary,
                   buffer => <<>>,
                   data => <<>>,
                   from => {<0.41.0>,#Ref<0.0.2.122>},
                   handle_event => undefined,
                   headers => undefined,
                   pid => <0.69.0>,
                   responses => {[],[]},
                   status_code => undefined,
                   stream => #Ref<0.0.2.123>}
** Reason for termination = 
** {'function not exported',
       [{shotgun,wait_response,
            [{get,{"/",[],[]}},
             {<0.41.0>,#Ref<0.0.2.128>},
             #{async => false,
               async_mode => binary,
               buffer => <<>>,
               data => <<>>,
               from => {<0.41.0>,#Ref<0.0.2.122>},
               handle_event => undefined,
               headers => undefined,
               pid => <0.69.0>,
               responses => {[],[]},
               status_code => undefined,
               stream => #Ref<0.0.2.123>}],
            []},
        {gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,518}]},
        {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
** exception error: no case clause matching {error,{{undef,[{shotgun,wait_response,
                                                                     [{get,{"/",[],[]}},
                                                                      {<0.41.0>,#Ref<0.0.2.128>},
                                                                      #{async => false,
                                                                        async_mode => binary,
                                                                        buffer => <<>>,
                                                                        data => <<>>,
                                                                        from => {<0.41.0>,#Ref<0.0.2.122>},
                                                                        handle_event => undefined,
                                                                        headers => undefined,
                                                                        pid => <0.69.0>,
                                                                        responses => {[],[]},
                                                                        status_code => undefined,
                                                                        stream => #Ref<0.0.2.123>}],
                                                                     []},
                                                            {gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,518}]},
                                                            {proc_lib,init_p_do_apply,3,
                                                                      [{file,"proc_lib.erl"},{line,239}]}]},
                                                    {gen_fsm,sync_send_event,
                                                             [<0.68.0>,{get,{"/",[],[]}},5000]}}}
(erl-node@host)5> f().
ok
(erl-node@host)6>

After making and exporting this zero-effort change in shotgun.erl

wait_response(Event, _, StateData) ->
  wait_response(Event, StateData).

I still get an error, which is as follows:

=ERROR REPORT==== 19-Sep-2015::04:26:54 ===
** State machine <0.68.0> terminating 
** Last message in was {'$gen_sync_event',
                           {<0.41.0>,#Ref<0.0.2.128>},
                           {get,{"/",[],[]}}}
** When State == wait_response
**      Data  == #{async => false,
                   async_mode => binary,
                   buffer => <<>>,
                   data => <<>>,
                   from => {<0.41.0>,#Ref<0.0.2.122>},
                   handle_event => undefined,
                   headers => undefined,
                   pid => <0.69.0>,
                   responses => {[],[]},
                   status_code => undefined,
                   stream => #Ref<0.0.2.123>}
** Reason for termination = 
** {unexpected,{get,{"/",[],[]}}}
** exception error: no case clause matching {error,{{unexpected,{get,{"/",[],[]}}},
                                                    {gen_fsm,sync_send_event,
                                                             [<0.68.0>,{get,{"/",[],[]}},5000]}}}

If I handle that case in the test program, then it looks like either shotgun or gun has crashed, as every subsequent call to shotgun:get returns a tuple that looks like:

{error,{noproc,{gen_fsm,sync_send_event,
                                   [<0.69.0>,{get,{"/",[],[]}},5000]}}}

The text was updated successfully, but these errors were encountered:

kennethlakin · 2015-09-19T12:00:41Z

I ran into this because I work with an Internet link that is sometimes flaky. I would like to have a chance to detect a bunch of failed connections and restart -just- the link, rather than having to contain the crashing client, and restart it and the Internet link.

(httpc:request also behaves badly in at least one circumstance; hanging forever despite being handed a timeout value. It is this behavior that set me on the search for a replacement.)

jfacorro · 2015-10-08T18:07:08Z

@kennethlakin Sorry for the late response. This is definitely a bug, but I'm not sure what the correct behavior should be though. Do you have any suggestions?

This commit does two things: * Addresses upstream issue inaka#96. * Changes shotgun:open to optionally accept a map containing Ranch transport options (keyed to transport_opts) and/or a gun:await_up timeout (keyed to timeout). The default values are [] and 5000, respectively. Issue inaka#96 is partially caused by calling gun:request before gun:open has succeeded. shotgun:open now *blocks* until either gun establishes a connection, or the user-specified timeout. If the timeout is reached, the FSM will be stopped with reason 'gun_open_timeout'. If gun fails while attempting to establish a connection, the FSM will be stopped with reason 'gun_open_failed'. See the comments near the code that handles this stuff for some somewhat important information. In addition to the changes surrounding the use of gun:await_up, one can now pass options through to the underlying Ranch transport. This allows you to -for instance- bind a particular shotgun connection to a particular interface on a multi-homed machine. Existing users of open/2 or open/3 will not be broken by this change.

kennethlakin · 2015-10-11T08:03:42Z

This is definitely a bug, but I'm not sure what the correct behavior should be though. Do you have any suggestions?

I would expect shotgun:request to return an error of some kind, rather than having the FSM blow up. How should this be done? Sadly, I have no idea. I spent a little bit of time looking at both shotgun and gun's code, before coming up with a partial solution that does _not_ solve the root problem.

Figuring out the root cause of the error I reported will take a fair bit of work. When I was investigating, it looked like part of the problem was related to calling gun:request before gun:open had succeeded. Calling shotgun:request a second time after the previous request failed with a timeout causes the failure I reported.

I have made some changes that make it easier to fail to call shotgun:request before gun:open has succeeded:

shotgun:open now blocks on gun:await_up to ensure that gun:open succeeds. If gun:open fails shotgun:open returns {error, gun_open_failed}. If gun:open fails to complete before the specified timeout, shotgun:open returns {error, gun_open_timeout}. I'm not sure how useful this level of error reporting granularity will turn out to be, but a failure of gun to connect does ensure that the shotgun FSM shuts down.

PR #99 contains these changes, along with a few other useful modifications.

I hope to soon get some extra time, so I can see if I can strike at the root cause of the problem.

kennethlakin · 2015-10-13T01:07:23Z

Okay. This might already be obvious, but here's what I've found.

Maybe I'm wrong, but the root cause appears to be our ability to send HTTP verb events to the shotgun FSM when it's in a state other than at_rest. It looks like the FSM-killing crash I reported happens when the FSM receives a GET request while it is in the wait_response state. The first crash I reported happens when the FSM gets a GET request while it's in the wait_response state. I suspect that if we were to send a GET request while we were in any other state (except for at_rest), we would just crash. (It looks like the other state handlers have function heads that only cover gun messages or 'DOWN' messages.)

So, I'm just spitballing here, and haven't given this a lot of thought, but what if these changes were made?

Add function heads to each state (that's not at_rest) that handle all of the available HTTP verbs.
- In these function heads, reply to the sender with an error that tells the sender to wait for a bit and retry the request, 'cause we're not in a state to handle it right now.
Modify request/6 to understand this error and loop (maybe with a small sleep) until it gets a success reply.

Alternatively, what if those newly-added function heads appended the request to a request list in the State variable? at_rest could then be modified to check this list and -I guess- drain it before handling any other tasks. I can't immediately figure out how this would work, seeing as how shotgun:request ~~and the HTTP verb calls~~ blocks and returns the result. Maybe ~~they~~ it would perform an async call internally, and wait for wait_response to send the result? Perhaps at_rest would be in charge of populating the from variable in the FSM state from the oldest request in the requests list in the FSM state.

Does either one of these modifications sound reasonable? Like I said, I haven't spent all that much time thinking about them.

Oh. The kinda crummy graphviz diagram that I created to help understand the flow of the shotgun FSM is attached. Rectangles represent function calls (or returning {stop, Reason}). Ellipses represent FSM states. Arrow labels represent either a high-level description of the conditions for the transition, or the most pertinent arguments or conditions that lead to the transition.

I can provide the dot file if needed, but I'm not sure that it'll be any clearer than the graph.

kennethlakin · 2015-10-14T08:43:16Z

Okay. PR #100 fixes this issue.

As alluded to in the PR, I still have some concerns surrounding leftover mailbox messages from timed-out requests, and our inability to cancel in-progress gun requests. The documentation for gun:cancel states that there's no way to cancel an HTTP 1.1 stream, but gun will stop reporting events for it if you "cancel" a stream. This could be combined with a purge of the timed-out request from the shotgun work queue and a reset back to the at_rest work checking state. Or, one could maybe handle the gen_fsm call made in shotgun:request in a child process that is either killed by the exception from the timeout, or reports back if the request succeeds.

I created and ran a little test program that exercises the additions to shotgun:open and the shotgun FSM fixes, and demonstrates that shotgun now behaves more like I would expect. Here's an annotated conversation with an erlang shell from the run. Lengthy, useless spew is elided.

1> application:ensure_all_started(shotgun).
{ok,[ranch,cowlib,asn1,public_key,ssl,gun,shotgun]}
2> %Test program executed
Got timeout from shotgun:get.
Got timeout from shotgun:get.
%At this point, I remove the iptables DROP rule. A few seconds later...
Response get!
ok
3> os:cmd("sudo -k iptables -D OUTPUT -d " ++ Host ++ " -j " ++ IptablesAction), 
3> flush().
Shell got {#Ref<0.0.2.443>,
%spew elided
Shell got {#Ref<0.0.2.447>,
%spew elided
ok
4> f().
ok
5> application:stop(shotgun).
ok
6> 
=INFO REPORT==== 14-Oct-2015::01:18:01 ===
    application: shotgun
    exited: stopped
    type: temporary
6>

The little test program follows:

%This is intended to be pasted into an Erlang shell
%that already has shotgun and its libraries loaded.

application:ensure_all_started(shotgun).
IptablesAction="DROP",
Host="example.com",
Timeout=timer:seconds(5), %infinity works.
%%Test that our open timeout works correctly, or test
%%that our open failure detection code works correctly:
%os:cmd("sudo -k iptables -A OUTPUT -d " ++ Host ++ " -j " ++ IptablesAction),
case shotgun:open(Host, 80, #{timeout => Timeout}) of
  {ok, Pid} ->
    %%Test that our get failure paths work as expected.
    os:cmd("sudo -k iptables -A OUTPUT -d " ++ Host ++ " -j " ++ IptablesAction),
    TestFun=fun(F) ->
                case shotgun:get(Pid, "/", #{}, #{timeout => Timeout}) of 
                  {ok, Response} -> 
                    io:format("Response get!~n");
                  {error, {closed, Message}} -> 
                    io:format("Gun connection was closed. Message ~p~n", [Message]);
                  {error, {timeout, _}} -> 
                    io:format("Got timeout from shotgun:get.~n"),
                    F(F)
                end
            end,
    TestFun(TestFun);
   {error, gun_open_failed} -> bah_gun_open_failed;
   {error, gun_open_timeout} -> bah_gun_open_timeout
end.
os:cmd("sudo -k iptables -D OUTPUT -d " ++ Host ++ " -j " ++ IptablesAction),
flush().
f().
application:stop(shotgun).
application:stop(gun).

jfacorro · 2015-11-02T16:20:34Z

@kennethlakin We are closing this issue. We can address the messages from timed-out requests in issue #109.

jfacorro added the bug label Oct 8, 2015

kennethlakin mentioned this issue Oct 11, 2015

Changes to open and check_uri #99

Closed

kennethlakin mentioned this issue Oct 14, 2015

shotgun FSM fixes #100

Merged

jfacorro mentioned this issue Nov 2, 2015

Correctly handle the messages from time-out requests #109

Open

jfacorro closed this as completed Nov 2, 2015

jfacorro mentioned this issue Dec 23, 2015

call to undefined function shotgun:wait_response/3 #134

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Crash when fetching from slow HTTP server: shotgun:wait_response/3 not exported #96

Crash when fetching from slow HTTP server: shotgun:wait_response/3 not exported #96

kennethlakin commented Sep 19, 2015

kennethlakin commented Sep 19, 2015

jfacorro commented Oct 8, 2015

kennethlakin commented Oct 11, 2015

kennethlakin commented Oct 13, 2015

kennethlakin commented Oct 14, 2015

jfacorro commented Nov 2, 2015

Crash when fetching from slow HTTP server: shotgun:wait_response/3 not exported #96

Crash when fetching from slow HTTP server: shotgun:wait_response/3 not exported #96

Comments

kennethlakin commented Sep 19, 2015

kennethlakin commented Sep 19, 2015

jfacorro commented Oct 8, 2015

kennethlakin commented Oct 11, 2015

kennethlakin commented Oct 13, 2015

kennethlakin commented Oct 14, 2015

jfacorro commented Nov 2, 2015