This repository has been archived by the owner on Jan 8, 2024. It is now read-only.
internal/runner: make Accept resilient to the server going down #3097
This PR starts by addressing only the initial job stream opening. It does not yet add any resilience after the job stream is open (i.e. while a job is running).
I'm opening this PR early since it required some changes to how we track the state of the underlying runner config, and that change is complex enough on its own that I didn't want to fold it into a future PR. Additionally, even on its own, this PR incrementally makes the runner more resilient to failure modes.
This PR builds on #3087 and adds the following scenarios the runner can now withstand:

- The server is unavailable when the runner first calls `Accept`.
- The server goes away while the runner is blocked opening the stream in `.Accept` (it's a stream, so the initial handshake is blocking). This is usually due to the server going down, but the behavior that the runner sees is the same.

Runner State from Booleans to Monotonic Ints
A core change made here is representing runner state with monotonic unsigned ints instead of a set of booleans. This retains the boolean-like behavior (anything greater than 0 is "true") while also giving us "happens after" semantics we can check. This is important because on disconnect we need to detect when we've reconnected after the disconnect. Before, we only had a true/false for whether we were connected, and it was a race to determine whether a "true" was still current, because we might not yet have observed the disconnect that made it "false." Monotonic numbers solve this.
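A minimal sketch of the idea, with hypothetical names (`connState`, `connectGen`, `reconnectedSince` are illustrative, not the runner's actual identifiers): a generation counter is bumped on every successful connect, so a waiter can record the generation it saw and later check whether a strictly newer connection exists.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// connState tracks connection state as a monotonic counter instead of a
// boolean. Any value > 0 acts like "true", and comparing two observed
// values gives "happens after" semantics.
type connState struct {
	connectGen uint64 // incremented on every successful (re)connect
}

// connected records a new successful connection.
func (s *connState) connected() { atomic.AddUint64(&s.connectGen, 1) }

// gen returns the current connection generation.
func (s *connState) gen() uint64 { return atomic.LoadUint64(&s.connectGen) }

// reconnectedSince reports whether a new connection was established after
// the given generation was observed. A plain bool cannot answer this,
// because "true" before and after a disconnect look identical.
func (s *connState) reconnectedSince(old uint64) bool {
	return atomic.LoadUint64(&s.connectGen) > old
}

func main() {
	var s connState

	s.connected()     // initial connect: gen = 1
	before := s.gen() // observe the generation before a disconnect

	fmt.Println(s.reconnectedSince(before)) // false: no reconnect yet

	s.connected() // reconnect after the server came back: gen = 2

	fmt.Println(s.reconnectedSince(before)) // true: happened after
}
```

With booleans, a goroutine that saw `connected == true` could not tell whether that was the same connection it started with or a new one; with the counter, strict inequality answers that directly.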
As a bonus, changing to this method let us remove a few cases that required special handling :)