0.8.0 short-time till "leave" leads to EOF #2880
Comments
Note the test is just
|
Gentle ping on this. |
We are probably closing down the web server too soon - this is a bit of a race condition since the agent is shutting down, so we need a clean way to keep the agent up until the OK is sent back from the leave request. |
This might already be in #3037 Lines 1029 to 1046 in 92fb316
|
Gentle ping on this. |
This one is a bit of a challenge. I've recently changed the code so that we shut down the external endpoints before we shut down the internal ones. Since you are triggering a |
When the agent is triggered to shut down via an external 'consul leave' command delivered via the HTTP API, the client expects to receive a response when the agent is down. This creates a race on when to shut down the agent itself (the RPC server, the checks and the state) and the external endpoints (DNS and HTTP). Ideally, the external endpoints should be shut down before the internal state, but if the goal is to respond reliably that the agent is down then this is not possible. This patch splits the agent shutdown into two parts implemented in a single method to keep it simple and unambiguous for the caller. The first stage shuts down the internal state, checks, RPC server, ... synchronously and then triggers the shutdown of the external endpoints asynchronously. This way the caller is guaranteed that the internal state services are down when Shutdown returns and there remains enough time to send a response. Fixes #2880
@ilovezfs could you check whether this fixes it for you please? |
pls wait. I was too quick |
Same thing even if they're in separate windows. |
not for me:
|
The window where I started
|
Asking colleagues to verify. My working hypothesis is that this is on your machine. What is your shell? |
🤦♂️ bash-3.2 ... |
Tried with bash 3.2 and 4.4. |
Colleagues can't repro as well. Which macOS version are you using? |
10.11. Going to try on a 10.12 box .... |
I'm on 10.12.5 |
ruby version? |
Same deal on 10.12:
|
Just for grins. Can you test with |
|
Can you DM me on Twitter? |
Are you on IRC? |
not really. Would prefer something with audio, FaceTime, Skype. |
skype doesn't work for me. If you send me an email to frank at hashicorp.com then I'll send you a link. |
I'm off to lunch now. Back in 30 min. |
It seems to be non-deterministic. About a third of the time I see the behavior you're describing. |
OK, then we'll leave it as is. I'll ask for some more feedback internally. |
When the agent is triggered to shutdown via an external 'consul leave' command delivered via the HTTP API then the client expects to receive a response when the agent is down. This creates a race on when to shutdown the agent itself like the RPC server, the checks and the state and the external endpoints like DNS and HTTP.

This patch splits the shutdown process into two parts:

* shutdown the agent
* shutdown the endpoints (http and dns)

They can be executed multiple times, concurrently and in any order but should be executed first agent, then endpoints to provide consistent behavior across all use cases. Both calls have to be executed for a proper shutdown.

This could be partially hidden in a single function but would introduce some magic that happens behind the scenes which one has to know of but isn't obvious.

Fixes #2880
@ilovezfs our current assumption is that somewhere in the logging path we're buffering something. I'll keep looking. |
@magiconair no problem. I am very happy the primary issue here was addressed because it will fix our CI for the |
The following issue did not affect 0.7.5, but it does affect 0.8.0 and HEAD.
The error is "Error leaving: Put http://127.0.0.1:8500/v1/agent/leave: EOF"
If I crank the sleep up to 30 seconds in the test before running 'consul leave', then it exits gracefully, as follows: