Removes the instance database record when provision fails #836

bnaecker · 2022-03-30T17:58:18Z

In the case that instance provision fails in the final saga node, the
actual booting of the Propolis zone, there was previously a no-op undo
action. This leaves the instance record in the database, perpetually in
a "starting" state. It can't be moved out of that state, because that
requires a full instance-ensure request to the sled-agent, which tries
that last action again, which fails, and ...

This adds an actual undo action, which sets the instance state in the
database to failed, and then deletes. That state change is needed
because we can't delete instances in the "starting" state. State changes
are normally only made in response to the sled agent observing a state
change in the actual instance, but is valid in this case since there
is no such instance.

In the case that instance provision fails in the final saga node, the actual booting of the Propolis zone, there was previously a no-op undo action. This leaves the instance record in the database, perpetually in a "starting" state. It can't be moved out of that state, because that requires a full instance-ensure request to the sled-agent, which tries that last action again, which fails, and ... This adds an actual undo action, which sets the instance state in the database to failed, and then deletes. That state change is needed because we can't delete instances in the "starting" state. State changes are normally only made in response to the sled agent observing a state change in the actual instance, but is valid in this case since there _is_ no such instance.

bnaecker · 2022-03-30T17:59:56Z

It's not really easy to write an integration test for this, for a few reasons. I was however able to run tests using the Oxide CLI, which show that the record has in fact been removed.

bnaecker@feldspar : ~/cli $ cargo run -- instance list -o o -p p
    Finished dev [unoptimized + debuginfo] target(s) in 0.14s
     Running `target/debug/oxide instance list -o o -p p`
 id | name | description | hostname | memory | ncpus | project_id | run_state | time_created | time_modified | time_run_state_updated |
----+------+-------------+----------+--------+-------+------------+-----------+--------------+---------------+------------------------

bnaecker@feldspar : ~/cli $ cargo run -- instance create -o o -p p vm0
    Finished dev [unoptimized + debuginfo] target(s) in 0.13s
     Running `target/debug/oxide instance create -o o -p p vm0`
instance description:: vm0
instance hostname:: vm0
instance memory:: 4
instance ncpus:: 2
✘ Oxide API internal error: Internal Server Error
bnaecker@feldspar : ~/cli $ cargo run -- instance list -o o -p p
    Finished dev [unoptimized + debuginfo] target(s) in 0.15s
     Running `target/debug/oxide instance list -o o -p p`
 id | name | description | hostname | memory | ncpus | project_id | run_state | time_created | time_modified | time_run_state_updated |
----+------+-------------+----------+--------+-------+------------+-----------+--------------+---------------+------------------------

smklein

Github isn't making commenting on this specific line easy for me, but...

... shouldn't the sic_delete_instance_record undo action in CreateInstanceRecord be responsible for actually doing the deletion of the instance?

I bring this up because having an undo action for the last operation in a saga seems a little funny - my mental model for actions is that they should "do one thing in the state graph (or not, and fail)". This means that the last action generally either "happens or it doesn't", and usually there isn't anyone who can call "undo" on it.

Anyway - we mentioned this in chat, but figured I'd post it here for visibility

bnaecker · 2022-03-30T19:46:55Z

For further context -- I had basically fixed this earlier, but not rebased the branch I was working on on top of that fix. That's in #810, for the record.

bnaecker · 2022-03-30T19:48:45Z

Abandoning.

Crucible changes are: Allow read only activation with less than three downstairs (#1608) Tweaks to automatic flush (#1613) Update Rust crate twox-hash to v2 (#1547) Remove `LastFlushAck` (#1603) Correctly print 'connecting' state (#1612) Make live-repair part of invariants checks (#1610) Simplify mend region selection (#1606) Generic read test for crutest (#1609) Always remove skipped jobs from dependencies (#1604) Add libsqlite3-dev install step to Github Actions CI (#1607) Move Nexus notification to standalone task (#1584) DTrace cleanup. (#1602) Reset completed work Downstairs on a `Barrier` operation (#1601) Upstairs state machine refactoring (3/3) (#1577) Propolis changes are: Wire up initial support for AMD perf counters build: upgrade tokio to 1.40.0 (#836) build: explicitly install libsqlite3-dev in CI (#834) add JSON output format to cpuid-gen (#832)

Crucible changes are: Allow read only activation with less than three downstairs (#1608) Tweaks to automatic flush (#1613) Update Rust crate twox-hash to v2 (#1547) Remove `LastFlushAck` (#1603) Correctly print 'connecting' state (#1612) Make live-repair part of invariants checks (#1610) Simplify mend region selection (#1606) Generic read test for crutest (#1609) Always remove skipped jobs from dependencies (#1604) Add libsqlite3-dev install step to Github Actions CI (#1607) Move Nexus notification to standalone task (#1584) DTrace cleanup. (#1602) Reset completed work Downstairs on a `Barrier` operation (#1601) Upstairs state machine refactoring (3/3) (#1577) Propolis changes are: Wire up initial support for AMD perf counters build: upgrade tokio to 1.40.0 (#836) build: explicitly install libsqlite3-dev in CI (#834) add JSON output format to cpuid-gen (#832) --------- Co-authored-by: Alan Hanson <alan@oxide.computer>

bnaecker requested a review from smklein March 30, 2022 17:58

smklein reviewed Mar 30, 2022

View reviewed changes

bnaecker closed this Mar 30, 2022

bnaecker deleted the remove-instance-on-provision-failure branch March 30, 2022 19:48

leftwo mentioned this pull request Jan 27, 2025

Update Propolis and Crucible #7403

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Removes the instance database record when provision fails #836

Removes the instance database record when provision fails #836

Uh oh!

bnaecker commented Mar 30, 2022

Uh oh!

bnaecker commented Mar 30, 2022

Uh oh!

smklein left a comment

Uh oh!

bnaecker commented Mar 30, 2022

Uh oh!

bnaecker commented Mar 30, 2022

Uh oh!

Uh oh!

Removes the instance database record when provision fails #836

Removes the instance database record when provision fails #836

Uh oh!

Conversation

bnaecker commented Mar 30, 2022

Uh oh!

bnaecker commented Mar 30, 2022

Uh oh!

smklein left a comment

Choose a reason for hiding this comment

Uh oh!

bnaecker commented Mar 30, 2022

Uh oh!

bnaecker commented Mar 30, 2022

Uh oh!

Uh oh!