This repository has been archived by the owner on Jan 8, 2024. It is now read-only.

internal/runner: make Accept resilient to the server going down #3097

Merged

mitchellh merged 3 commits into main from f-runner-accept-resil on Mar 15, 2022

Conversation

@mitchellh (Contributor) commented Mar 14, 2022

This starts by addressing only the initial job stream opening. This still does not add any resilience after the job stream is open (i.e. when a job is running).

I'm opening this PR early since this required some changes to how we track the state of the underlying runner config, and it is complex enough on its own that I didn't want to complicate a future PR. Even so, this PR by itself incrementally makes the runner more resilient to failure modes.

This PR builds on #3087 and adds handling for the following scenarios, which the runner can now withstand:

  • The server goes down after runner start but prior to Accept.
  • The server goes down during Accept (it's a stream, so while the initial handshake is blocking).
  • The runner registration is disconnected at some point prior to or during Accept. This is usually due to the server going down, but the behavior the runner sees is the same.
    • This last one in particular is nice to resolve since the error was not obvious: users saw "runner not found". That misleading error can no longer occur.

Runner State from Booleans to Monotonic Ints

A core change made here is the change from a set of booleans to represent state to monotonic unsigned ints. This lets us still retain the boolean-like behavior (anything greater than 0 is "true") while also giving us some "happens after" semantics we can check. This is important because on disconnect, we need to detect when we're reconnected after the disconnect. Before, we only had a true/false of whether we were connected, and it was a race to determine if the "true" was still "true" because we hadn't yet detected its "false." Numbers solve this.

As a bonus, changing to this method let us remove a few cases that required special handling :)
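The generation-counter idea can be sketched roughly as follows. This is a hypothetical `connState` type for illustration, not Waypoint's actual implementation: any generation greater than 0 behaves like the old boolean "true", and comparing generations tells a waiter whether a reconnect happened after a disconnect it observed.

```go
package main

import (
	"fmt"
	"sync"
)

// connState tracks connection state as a monotonically increasing
// generation counter instead of a bool. 0 means "never connected";
// the counter increments on every (re)connect.
type connState struct {
	mu  sync.Mutex
	gen uint64
}

// connected records a successful (re)connect and returns the new generation.
func (s *connState) connected() uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.gen++
	return s.gen
}

// current returns the latest observed generation.
func (s *connState) current() uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.gen
}

func main() {
	var s connState
	g1 := s.connected() // initial connect
	// ...server goes down, runner reconnects...
	_ = s.connected()
	// A waiter that saw generation g1 can tell a newer connection exists,
	// with no race over whether a stale "true" is still current:
	if s.current() > g1 {
		fmt.Println("reconnected since generation", g1)
	}
}
```

With booleans alone, a waiter could not distinguish "still the old connection" from "disconnected and reconnected"; the strictly increasing counter gives that happens-after ordering for free.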

@mitchellh mitchellh requested review from evanphx, izaaklauer and a team March 14, 2022 18:52
@github-actions github-actions bot added the core label Mar 14, 2022
@mitchellh mitchellh added the pr/no-changelog No automatic changelog entry required for this pull request label Mar 14, 2022
This starts by addressing only the initial job stream opening. This
required some changes to how we track the state of the underlying runner
registration. This changes the state tracking from a bool to a monotonic
integer.
@briancain (Member) left a comment


Looks pretty good to me, just a few minor requests for fixups!

internal/runner/accept.go (outdated review thread; resolved)
```go
return err
retry := false
var client pb.Waypoint_RunnerJobStreamClient
for {
```
@briancain (Member) commented:

Does this need any kind of timeout awareness to prevent infinite retries? Maybe infinite retries are by design!

@mitchellh (Contributor, Author) replied:

Due to reasons I put in the comment, this is very difficult to solve right now. I'm proposing we keep this as a TODO and address it another time. For now, it would retry forever past the AcceptTimeout setting, but I think that's okay; it only retries forever so long as the server remains down.

internal/runner/runner.go (outdated review thread; resolved)
@mitchellh mitchellh merged commit 99de91c into main Mar 15, 2022
@mitchellh mitchellh deleted the f-runner-accept-resil branch March 15, 2022 16:06
Labels
core pr/no-changelog No automatic changelog entry required for this pull request