Crash when fetching from slow HTTP server: shotgun:wait_response/3 not exported #96
I ran into this because I work with an Internet link that is sometimes flaky. I would like to have a chance to detect a bunch of failed connections and restart -just- the link, rather than having to contain the crashing client and restart both it and the Internet link. (httpc:request also behaves badly in at least one circumstance: it hangs forever despite being handed a timeout value. That behavior is what set me on the search for a replacement.)
@kennethlakin Sorry for the late response. This is definitely a bug, but I'm not sure what the correct behavior should be. Do you have any suggestions?
This commit does two things:

* Addresses upstream issue inaka#96.
* Changes shotgun:open to optionally accept a map containing Ranch transport options (keyed to transport_opts) and/or a gun:await_up timeout (keyed to timeout). The default values are [] and 5000, respectively.

Issue inaka#96 is partially caused by calling gun:request before gun:open has succeeded. shotgun:open now *blocks* until either gun establishes a connection, or the user-specified timeout is reached. If the timeout is reached, the FSM will be stopped with reason 'gun_open_timeout'. If gun fails while attempting to establish a connection, the FSM will be stopped with reason 'gun_open_failed'. See the comments near the code that handles this stuff for some somewhat important information.

In addition to the changes surrounding the use of gun:await_up, one can now pass options through to the underlying Ranch transport. This allows you to -for instance- bind a particular shotgun connection to a particular interface on a multi-homed machine.

Existing users of open/2 or open/3 will not be broken by this change.
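As a rough illustration of the options map described in the commit message, here is a minimal sketch. It assumes a shotgun build that includes this commit; the module name open_sketch and its functions are hypothetical, and {ip, Address} is the standard gen_tcp option for choosing the local address, which is assumed to be passed through unchanged by the transport.

```erlang
-module(open_sketch).
-export([opts/2, open/4]).

%% Build the options map described in the commit message.
%% timeout: how long (in milliseconds) to wait for the connection to come up.
%% transport_opts: options handed to the underlying transport; {ip, LocalIp}
%% is the gen_tcp option that selects the local address to bind to.
opts(LocalIp, TimeoutMs) ->
    #{timeout => TimeoutMs,
      transport_opts => [{ip, LocalIp}]}.

%% Open a connection bound to LocalIp. Assumes shotgun:open/3 accepts
%% the map as its third argument, as in the test program in this thread.
open(Host, Port, LocalIp, TimeoutMs) ->
    shotgun:open(Host, Port, opts(LocalIp, TimeoutMs)).
```

For example, open_sketch:open("example.com", 80, {192,168,1,10}, 5000) would try to connect from the interface that owns 192.168.1.10 and give up after five seconds.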
I would expect shotgun:request to return an error of some kind, rather than having the FSM blow up. How should this be done? Sadly, I have no idea. I spent a little bit of time looking at both shotgun and gun's code before coming up with a partial solution that does _not_ solve the root problem.

Figuring out the root cause of the error I reported will take a fair bit of work. When I was investigating, it looked like part of the problem was related to calling gun:request before gun:open had succeeded. Calling shotgun:request a second time after the previous request failed with a timeout causes the failure I reported.

I have made some changes that make it harder to call shotgun:request before gun:open has succeeded: shotgun:open now blocks on gun:await_up to ensure that gun:open succeeds. If gun:open fails, shotgun:open returns {error, gun_open_failed}. If gun:open fails to complete before the specified timeout, shotgun:open returns {error, gun_open_timeout}. I'm not sure how useful this level of error reporting granularity will turn out to be, but a failure of gun to connect does ensure that the shotgun FSM shuts down.

PR #99 contains these changes, along with a few other useful modifications. I hope to get some extra time soon, so I can see if I can strike at the root cause of the problem.
Okay. This might already be obvious, but here's what I've found. Maybe I'm wrong, but the root cause appears to be our ability to send HTTP verb events to the shotgun FSM when it's in a state other than at_rest.

It looks like both the first crash I reported and the FSM-killing crash happen when the FSM receives a GET request while it is in the wait_response state. I suspect that if we were to send a GET request while we were in any other state (except for at_rest), we would just crash. (It looks like the other state handlers have function heads that only cover gun messages or 'DOWN' messages.)

So, I'm just spitballing here, and haven't given this a lot of thought, but what if these changes were made?
Alternatively, what if those newly-added function heads appended the request to a request list in the State variable? at_rest could then be modified to check this list and -I guess- drain it before handling any other tasks. I can't immediately figure out how this would work, seeing as how shotgun:request

Does either one of these modifications sound reasonable? Like I said, I haven't spent all that much time thinking about them.

Oh. The kinda crummy graphviz diagram that I created to help understand the flow of the shotgun FSM is attached. Rectangles represent function calls (or returning {stop, Reason}). Ellipses represent FSM states. Arrow labels represent either a high-level description of the conditions for the transition, or the most pertinent arguments or conditions that lead to the transition. I can provide the dot file if needed, but I'm not sure that it'll be any clearer than the graph.
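To make the request-list idea above concrete, here is a minimal, self-contained sketch. It is not shotgun's code: it uses gen_statem rather than gen_fsm, the module name queue_sketch and the finish/1 call (a stand-in for the gun message that ends a response) are hypothetical, and it drains the queue when a response finishes instead of from at_rest. The state names at_rest and wait_response are borrowed from the discussion.

```erlang
-module(queue_sketch).
-behaviour(gen_statem).

-export([start_link/0, request/2, finish/1]).
-export([init/1, callback_mode/0, at_rest/3, wait_response/3]).

start_link() -> gen_statem:start_link(?MODULE, [], []).

%% Hypothetical client call: blocks until the request has been served.
request(Pid, Req) -> gen_statem:call(Pid, {request, Req}).

%% Hypothetical stand-in for "the response has fully arrived".
finish(Pid) -> gen_statem:cast(Pid, response_done).

callback_mode() -> state_functions.

init([]) ->
    {ok, at_rest, #{current => undefined, pending => queue:new()}}.

%% Idle: start working on the request right away.
at_rest({call, From}, {request, Req}, Data) ->
    {next_state, wait_response, Data#{current := {From, Req}}};
at_rest(cast, response_done, _Data) ->
    keep_state_and_data.

%% Busy: instead of crashing on a new request, append it to the queue.
wait_response({call, From}, {request, Req}, #{pending := Q} = Data) ->
    {keep_state, Data#{pending := queue:in({From, Req}, Q)}};
%% Response finished: reply to the caller, then take the next queued
%% request if there is one, otherwise go back to at_rest.
wait_response(cast, response_done,
              #{current := {From, Req}, pending := Q} = Data) ->
    Reply = {reply, From, {ok, Req}},
    case queue:out(Q) of
        {{value, Next}, Q1} ->
            {keep_state, Data#{current := Next, pending := Q1}, [Reply]};
        {empty, _} ->
            {next_state, at_rest, Data#{current := undefined}, [Reply]}
    end.
```

The point of the sketch is only that every state has a function head for request events, so a request arriving at the wrong time is queued rather than killing the FSM.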
Okay. PR #100 fixes this issue.

As alluded to in the PR, I still have some concerns surrounding leftover mailbox messages from timed-out requests, and our inability to cancel in-progress gun requests. The documentation for gun:cancel states that there's no way to cancel an HTTP 1.1 stream, but gun will stop reporting events for it if you "cancel" a stream. This could be combined with a purge of the timed-out request from the shotgun work queue and a reset back to the at_rest work checking state. Or, one could maybe handle the gen_fsm call made in shotgun:request in a child process that is either killed by the exception from the timeout, or reports back if the request succeeds.

I created and ran a little test program that exercises the additions to shotgun:open and the shotgun FSM fixes, and demonstrates that shotgun now behaves more like I would expect. Here's an annotated conversation with an Erlang shell from the run. Lengthy, useless spew is elided.
The little test program follows:

%This is intended to be pasted into an Erlang shell
%that already has shotgun and its libraries loaded.
application:ensure_all_started(shotgun).
IptablesAction="DROP",
Host="example.com",
Timeout=timer:seconds(5), %infinity works.
%%Test that our open timeout works correctly, or test
%%that our open failure detection code works correctly:
%os:cmd("sudo -k iptables -A OUTPUT -d " ++ Host ++ " -j " ++ IptablesAction),
case shotgun:open(Host, 80, #{timeout => Timeout}) of
  {ok, Pid} ->
    %%Test that our get failure paths work as expected.
    os:cmd("sudo -k iptables -A OUTPUT -d " ++ Host ++ " -j " ++ IptablesAction),
    %%TestFun takes itself as an argument so that it can
    %%retry after a timeout.
    TestFun=fun(F) ->
      case shotgun:get(Pid, "/", #{}, #{timeout => Timeout}) of
        {ok, _Response} ->
          io:format("Response get!~n");
        {error, {closed, Message}} ->
          io:format("Gun connection was closed. Message ~p~n", [Message]);
        {error, {timeout, _}} ->
          io:format("Got timeout from shotgun:get.~n"),
          F(F)
      end
    end,
    TestFun(TestFun);
  {error, gun_open_failed} -> bah_gun_open_failed;
  {error, gun_open_timeout} -> bah_gun_open_timeout
end.
%%Remove the iptables rule, then clean up the shell.
os:cmd("sudo -k iptables -D OUTPUT -d " ++ Host ++ " -j " ++ IptablesAction),
flush().
f().
application:stop(shotgun).
application:stop(gun).
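The child-process idea mentioned earlier in this comment (run the blocking call in a separate process that either dies from the timeout exception or reports back) could look roughly like the sketch below. It is generic Erlang, not shotgun code: the module name call_in_child and the Fun argument are hypothetical, and Fun stands in for whatever blocking call shotgun:request would make.

```erlang
-module(call_in_child).
-export([call/2]).

%% Run Fun in a monitored child process.
%% Returns {ok, Result} if Fun returns, {error, Reason} if the child
%% dies (for example from a timeout exception raised inside Fun), or
%% {error, timeout} if nothing happens within Timeout milliseconds.
call(Fun, Timeout) ->
    Parent = self(),
    Ref = make_ref(),
    {Pid, MRef} = spawn_monitor(fun() -> Parent ! {Ref, Fun()} end),
    receive
        {Ref, Result} ->
            erlang:demonitor(MRef, [flush]),
            {ok, Result};
        {'DOWN', MRef, process, Pid, Reason} ->
            {error, Reason}
    after Timeout ->
        %% Kill the child and wait for it to be gone, then drop any
        %% reply it managed to send first, so that nothing is left
        %% behind in the caller's mailbox.
        exit(Pid, kill),
        receive {'DOWN', MRef, process, Pid, _} -> ok end,
        receive {Ref, _Late} -> ok after 0 -> ok end,
        {error, timeout}
    end.
```

Because late replies are addressed to the short-lived child or explicitly discarded, this shape would also sidestep the leftover-mailbox-message concern for the calling process, though it does nothing about the request still being in flight inside gun.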
@kennethlakin We are closing this issue. We can address the messages from timed-out requests in issue #109.
This bug can be triggered by doing the following:
Lightly redacted, but otherwise complete conversation with an Erlang shell follows:
After making and exporting this zero-effort change in shotgun.erl
I still get an error, which is as follows:
If I handle that case in the test program, then it looks like either shotgun or gun has crashed, as every subsequent call to shotgun:get returns a tuple that looks like: