This repository has been archived by the owner on Jan 8, 2024. It is now read-only.

Waypoint server panics when runner is forgotten before stopped #3448

Closed
demophoon opened this issue Jun 10, 2022 · 1 comment · Fixed by #3756
Labels
bug Something isn't working core/runner

Comments

@demophoon
Contributor

Describe the bug
If a Waypoint runner is forgotten before it is stopped, the Waypoint server panics when that runner is later stopped.

I believe this happens because forgetting a runner removes its record from BoltDB. Once the runner is stopped, the server notices and attempts to mark it as offline. During that process, the server attempts to fetch the runner from BoltDB.

func (s *State) runnerOffline(dbTxn *bolt.Tx, memTxn *memdb.Txn, id string) error {
	r, err := s.runnerById(dbTxn, id)
	if status.Code(err) == codes.NotFound {
		r = nil
		err = nil
	}
	if err != nil {
		return err
	}

When runnerById returns codes.NotFound, r is set to nil and the error is cleared, so r is still nil by the time we attempt to determine what Kind of runner we are dealing with.

switch r.Kind.(type) {

When r is nil, evaluating r.Kind dereferences a nil pointer, so the server panics and exits.
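The failure mode can be reproduced outside of Waypoint with a minimal sketch. The `Runner`, `isRunnerKind`, and `RunnerAdhoc` types and the `kindOf` helper below are hypothetical stand-ins, not the real generated protobuf types; the only thing that matters is a pointer with an interface-typed `Kind` field. The `recover` is there only so the demo can report the panic instead of exiting.

```go
package main

import "fmt"

// isRunnerKind, RunnerAdhoc, and Runner are hypothetical stand-ins for the
// real types; only the shape (a pointer with an interface-typed Kind field)
// matters for this demonstration.
type isRunnerKind interface{ isKind() }

type RunnerAdhoc struct{}

func (*RunnerAdhoc) isKind() {}

type Runner struct {
	Id   string
	Kind isRunnerKind
}

// kindOf mirrors the failing pattern: it reads r.Kind without checking
// whether r is nil first.
func kindOf(r *Runner) (kind string, err error) {
	defer func() {
		// Recover only so the demo can report the panic instead of exiting.
		if p := recover(); p != nil {
			err = fmt.Errorf("panic: %v", p)
		}
	}()
	switch r.Kind.(type) { // dereferences r; panics when r is nil
	case *RunnerAdhoc:
		return "adhoc", nil
	default:
		return "other", nil
	}
}

func main() {
	// A nil runner, as left behind by the NotFound branch.
	_, err := kindOf(nil)
	fmt.Println(err)

	// A non-nil runner works as expected.
	kind, err := kindOf(&Runner{Id: "testrunner", Kind: &RunnerAdhoc{}})
	fmt.Println(kind, err)
}
```

Without the deferred `recover`, the first call would terminate the process, which matches the server behavior described above.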

Steps to Reproduce

  1. Start a Waypoint server
  2. Start a preadopted Waypoint runner
     export WAYPOINT_SERVER_ADDR=localhost:9701
     export WAYPOINT_SERVER_TLS=true
     export WAYPOINT_SERVER_TLS_SKIP_VERIFY=true
     export WAYPOINT_SERVER_TOKEN=$(waypoint user token) # May need to be two commands
     waypoint runner agent -id=testrunner
  3. Forget the Waypoint runner in a new terminal
     waypoint runner forget testrunner
  4. Stop the Waypoint runner

At this point the server will have panicked.

Expected behavior
The server should continue as if the runner had always been forgotten in the first place.

Waypoint Platform Versions

  • Waypoint CLI Version: v0.8.1
  • Waypoint Server Platform and Version: Docker, v0.8.2 (the issue is not platform-specific)
@briancain briancain added bug Something isn't working and removed new labels Jun 15, 2022
@briancain
Member

briancain commented Jun 15, 2022

It would be useful to audit the runner code paths that execute when a runner is forgotten/deleted from bolt. There are likely bugs here since the adoption flow is fairly new, so there are probably places where we need to fix some bad behaviors. edit: It might also be worth seeing whether any runner logic can be extracted into something more generic outside of the boltdb server implementation.

@briancain briancain added this to the 0.8.y milestone Jun 15, 2022
@krantzinator krantzinator removed this from the 0.9.y milestone Jul 13, 2022
demophoon added a commit that referenced this issue Aug 29, 2022
Before this commit when Waypoint was determining whether or not to
remove a Runner from boltdb it was possible for runner to be nil at the
time we attempted to determine what type of Runner the Runner was. This
caused the server to panic as soon as the runner became unavailable.

This commit fixes the panic by no longer setting the runner to nil and
instead initializing an empty runner variable, so that if a runner is
not found the type can still be determined and the runner cleaned up.

Fixes #3448
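
A rough sketch of the empty-runner approach this commit message describes, under stated assumptions: `Runner`, `isRunnerKind`, `RunnerRemote`, `lookup`, and `describe` are hypothetical names, and a plain map stands in for the database. It relies on the fact that a Go type switch on a nil interface value falls through to `default` rather than panicking.

```go
package main

import "fmt"

// Hypothetical stand-ins for the real runner types.
type isRunnerKind interface{ isKind() }

type RunnerRemote struct{}

func (*RunnerRemote) isKind() {}

type Runner struct {
	Id   string
	Kind isRunnerKind
}

// lookup starts from an empty runner instead of nil, so callers always get
// a non-nil pointer even when the record is missing. The map is a
// hypothetical replacement for the database read.
func lookup(store map[string]*Runner, id string) *Runner {
	r := &Runner{}
	if found, ok := store[id]; ok && found != nil {
		r = found
	}
	return r
}

// describe can safely switch on r.Kind: for an empty runner, Kind is a nil
// interface value and the switch selects the default case.
func describe(r *Runner) string {
	switch r.Kind.(type) {
	case *RunnerRemote:
		return "remote"
	default:
		return "unknown"
	}
}

func main() {
	store := map[string]*Runner{
		"known": {Id: "known", Kind: &RunnerRemote{}},
	}
	fmt.Println(describe(lookup(store, "known")))
	fmt.Println(describe(lookup(store, "testrunner"))) // forgotten runner
}
```

The trade-off of this design is that a missing runner becomes indistinguishable from a runner with no kind set, so any later logic has to tolerate an empty record.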
@demophoon demophoon self-assigned this Aug 29, 2022
demophoon added a commit that referenced this issue Aug 30, 2022
Before this commit when Waypoint was determining whether or not to
remove a Runner from boltdb it was possible for runner to be nil at the
time we attempted to determine what type of Runner the Runner was. This
caused the server to panic as soon as the runner became unavailable.

This commit fixes the panic by checking if we received a runner from the
database before determining its type.

Fixes #3448
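
The nil check described in this commit message might look roughly like the sketch below. This is not the actual patch: `errNotFound` is a hypothetical stand-in for a gRPC `codes.NotFound` status, `store` is an in-memory replacement for the BoltDB-backed state, and the string results and the deletion of ad-hoc runners are illustrative assumptions only.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical stand-in for a gRPC codes.NotFound status.
var errNotFound = errors.New("runner not found")

// Hypothetical stand-ins for the real runner types.
type isRunnerKind interface{ isKind() }

type RunnerAdhoc struct{}

func (*RunnerAdhoc) isKind() {}

type Runner struct {
	Id   string
	Kind isRunnerKind
}

// store is a hypothetical in-memory replacement for the BoltDB-backed state.
type store map[string]*Runner

func (s store) runnerById(id string) (*Runner, error) {
	r, ok := s[id]
	if !ok {
		return nil, errNotFound
	}
	return r, nil
}

// runnerOffline reports what it did: "skipped", "deleted", or "kept".
func (s store) runnerOffline(id string) (string, error) {
	r, err := s.runnerById(id)
	if errors.Is(err, errNotFound) {
		r, err = nil, nil
	}
	if err != nil {
		return "", err
	}
	// Guard: a forgotten runner has no record, so there is nothing to
	// inspect or clean up. Return before touching r.Kind.
	if r == nil {
		return "skipped", nil
	}
	switch r.Kind.(type) {
	case *RunnerAdhoc:
		// Illustrative assumption: ad-hoc runners are removed once offline.
		delete(s, id)
		return "deleted", nil
	default:
		return "kept", nil
	}
}

func main() {
	s := store{"adhoc": {Id: "adhoc", Kind: &RunnerAdhoc{}}}
	fmt.Println(s.runnerOffline("testrunner")) // already forgotten
	fmt.Println(s.runnerOffline("adhoc"))
}
```

Returning early keeps "not found" distinct from "found but empty", which matches the expected behavior of continuing as if the runner had always been forgotten.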