This repository has been archived by the owner on Jan 8, 2024. It is now read-only.

internal/runner: make Accept resilient to the server going down #3097

Merged

mitchellh merged 3 commits into main from f-runner-accept-resil on Mar 15, 2022

Conversation

@mitchellh (Contributor) commented Mar 14, 2022

This starts by addressing only the initial job stream opening. This still does not add any resilience after the job stream is open (i.e. when a job is running).

I'm opening this PR early since this required some changes to how we track the state of the underlying runner config, and it is complex enough on its own that I didn't want to complicate a future PR. Even so, this PR by itself incrementally makes the runner more resilient to failure modes.

This PR builds on #3087 and adds handling for the following scenarios, which the runner can now withstand:

  • The server goes down after runner start but prior to Accept.
  • The server goes down during Accept (it's a stream, so while the initial handshake is blocking).
  • The runner registration is disconnected at some point prior to or during Accept. This is usually due to the server going down, but the behavior the runner sees is the same.
    • This last one in particular is nice to resolve since the error was not obvious: users saw "runner not found". That misleading error can no longer occur.

Runner State from Booleans to Monotonic Ints

A core change made here is the change from a set of booleans to represent state to monotonic unsigned ints. This lets us still retain the boolean-like behavior (anything greater than 0 is "true") while also giving us some "happens after" semantics we can check. This is important because on disconnect, we need to detect when we're reconnected after the disconnect. Before, we only had a true/false of whether we were connected, and it was a race to determine if the "true" was still "true" because we hadn't yet detected its "false." Numbers solve this.

As a bonus, changing to this method let us remove a few cases that required special handling :)
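The generation-counter idea can be sketched roughly as follows. This is a hypothetical `connState` type for illustration, not Waypoint's actual implementation: any generation greater than 0 behaves like the old boolean "true", and comparing generations tells a waiter whether a reconnect happened after a disconnect it observed.

```go
package main

import (
	"fmt"
	"sync"
)

// connState tracks connection state as a monotonically increasing
// generation counter instead of a bool. 0 means "never connected";
// the counter increments on every (re)connect.
type connState struct {
	mu  sync.Mutex
	gen uint64
}

// connected records a successful (re)connect and returns the new generation.
func (s *connState) connected() uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.gen++
	return s.gen
}

// current returns the latest observed generation.
func (s *connState) current() uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.gen
}

func main() {
	var s connState
	g1 := s.connected() // initial connect
	// ...server goes down, runner reconnects...
	_ = s.connected()
	// A waiter that saw generation g1 can tell a newer connection exists,
	// with no race over whether a stale "true" is still current:
	if s.current() > g1 {
		fmt.Println("reconnected since generation", g1)
	}
}
```

With booleans alone, a waiter could not distinguish "still the old connection" from "disconnected and reconnected"; the strictly increasing counter gives that happens-after ordering for free.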

@mitchellh mitchellh requested review from evanphx, izaaklauer and a team March 14, 2022 18:52
@github-actions github-actions bot added the core label Mar 14, 2022
@mitchellh mitchellh added the pr/no-changelog No automatic changelog entry required for this pull request label Mar 14, 2022
This starts by addressing only the initial job stream opening. This
required some changes to how we track the state of the underlying runner
registration. This changes the state tracking from a bool to a monotonic
integer.
@briancain (Member) left a comment


Looks pretty good to me, just a few minor requests for fixups!

internal/runner/accept.go (outdated review thread; resolved)
```go
return err
retry := false
var client pb.Waypoint_RunnerJobStreamClient
for {
```
@briancain (Member) commented:

Does this need any kind of timeout awareness to prevent infinite retries? Maybe infinite retries are by design!

@mitchellh (Contributor, Author) replied:

Due to reasons I put in the comment, this is very difficult to solve right now. I'm proposing we keep this as a TODO and address it another time. For now, it would retry forever past the AcceptTimeout setting, but I think that's okay; it only retries forever so long as the server remains down.

internal/runner/runner.go (outdated review thread; resolved)
@mitchellh mitchellh merged commit 99de91c into main Mar 15, 2022
@mitchellh mitchellh deleted the f-runner-accept-resil branch March 15, 2022 16:06
Labels
core pr/no-changelog No automatic changelog entry required for this pull request