Removes the instance database record when provision fails #836
If instance provisioning fails in the final saga node, the actual booting of the Propolis zone, there was previously a no-op undo action. This leaves the instance record in the database, perpetually in a "starting" state. It can't be moved out of that state, because that requires a full instance-ensure request to the sled-agent, which tries that last action again, which fails, and ...
This adds an actual undo action, which sets the instance state in the database to failed and then deletes the record. That state change is needed because we can't delete instances in the "starting" state. State changes are normally only made in response to the sled agent observing a state change in the actual instance, but this one is valid since there _is_ no such instance.
It's not easy to write an integration test for this, for a few reasons. I was, however, able to run tests using the Oxide CLI, which show that the record is in fact removed.
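As a rough illustration of the undo action described above, here is a minimal, self-contained sketch. It is not the code from this PR: `Db`, `InstanceState`, `instance_ensure`, `instance_ensure_undo`, and `provision` are all hypothetical names, and an in-memory map stands in for the database. It only models the two constraints mentioned in the description: a "starting" instance cannot be deleted, so the undo first marks it failed and then deletes the record.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InstanceState {
    Starting,
    Failed,
}

/// Hypothetical stand-in for the instance table: instance id -> state.
#[derive(Default)]
struct Db {
    instances: HashMap<u32, InstanceState>,
}

impl Db {
    /// Models the constraint from the description: instances in the
    /// "starting" state cannot be deleted.
    fn delete_instance(&mut self, id: u32) -> Result<(), String> {
        match self.instances.get(&id) {
            Some(InstanceState::Starting) => {
                Err("cannot delete an instance in the starting state".to_string())
            }
            Some(_) => {
                self.instances.remove(&id);
                Ok(())
            }
            // Already gone: treat deletion as idempotent.
            None => Ok(()),
        }
    }
}

/// Hypothetical final saga node: boot the zone. `boot_ok` simulates the outcome.
fn instance_ensure(boot_ok: bool) -> Result<(), String> {
    if boot_ok {
        Ok(())
    } else {
        Err("zone failed to boot".to_string())
    }
}

/// Hypothetical undo for the final node: mark the record failed, then delete it.
/// Setting the state here is reasonable only because no real instance exists.
fn instance_ensure_undo(db: &mut Db, id: u32) -> Result<(), String> {
    if let Some(state) = db.instances.get_mut(&id) {
        *state = InstanceState::Failed;
    }
    db.delete_instance(id)
}

/// Creates the record, then tries to boot; on failure, runs the undo.
/// On success the record stays in `Starting` in this sketch, since the
/// later state updates reported by the sled agent are not modeled.
fn provision(db: &mut Db, id: u32, boot_ok: bool) -> Result<(), String> {
    db.instances.insert(id, InstanceState::Starting);
    if let Err(e) = instance_ensure(boot_ok) {
        instance_ensure_undo(db, id)?;
        return Err(e);
    }
    Ok(())
}
```

With a no-op undo in place of `instance_ensure_undo`, a failed `provision` call would leave the record stuck in `Starting`, which is the behavior this PR sets out to fix.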
GitHub isn't making commenting on this specific line easy for me, but...
... shouldn't the `sic_delete_instance_record` undo action in `CreateInstanceRecord` be responsible for actually doing the deletion of the instance?
I bring this up because having an undo action for the last operation in a saga seems a little funny - my mental model for actions is that they should "do one thing in the state graph (or not, and fail)". This means that the last action generally either "happens or it doesn't", and usually there isn't anyone who can call "undo" on it.
Anyway - we mentioned this in chat, but figured I'd post it here for visibility
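The mental model in this comment can be sketched as a toy saga executor. This is an assumption-laden illustration, not the saga framework used in this repository: `Node`, `run_saga`, and the action functions are hypothetical, and `delete_record` only plays the role that `sic_delete_instance_record` would play. When a node's action fails, the executor never calls that node's own undo; it unwinds the nodes that already completed, in reverse order.

```rust
/// Hypothetical saga node: a forward action paired with an undo action.
struct Node {
    name: &'static str,
    action: fn(&mut Vec<String>) -> Result<(), String>,
    undo: fn(&mut Vec<String>),
}

/// Runs nodes in order. If a node fails, it is assumed to have done nothing,
/// so only the previously completed nodes are undone, last to first.
fn run_saga(nodes: &[Node], log: &mut Vec<String>) -> Result<(), String> {
    for (i, node) in nodes.iter().enumerate() {
        if let Err(e) = (node.action)(log) {
            for done in nodes[..i].iter().rev() {
                (done.undo)(log);
            }
            return Err(format!("{} failed: {}", node.name, e));
        }
    }
    Ok(())
}

fn create_record(log: &mut Vec<String>) -> Result<(), String> {
    log.push("create record".to_string());
    Ok(())
}

/// Stands in for the undo action attached to the record-creation node.
fn delete_record(log: &mut Vec<String>) {
    log.push("delete record".to_string());
}

/// Simulates the final node failing to boot the zone.
fn boot_zone(_log: &mut Vec<String>) -> Result<(), String> {
    Err("zone failed to boot".to_string())
}

/// The last node has nothing to undo in this model.
fn noop(_log: &mut Vec<String>) {}

fn instance_create_saga() -> Vec<Node> {
    vec![
        Node { name: "create_instance_record", action: create_record, undo: delete_record },
        Node { name: "instance_ensure", action: boot_zone, undo: noop },
    ]
}
```

Running `run_saga(&instance_create_saga(), &mut log)` returns an error and leaves `log` holding "create record" followed by "delete record": the cleanup comes from the earlier node's undo, which is the arrangement the comment argues for.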
For further context -- I had basically fixed this earlier, but had not rebased the branch I was working on on top of that fix. That's in #810, for the record.
Abandoning.
Crucible changes are:
Allow read only activation with less than three downstairs (#1608)
Tweaks to automatic flush (#1613)
Update Rust crate twox-hash to v2 (#1547)
Remove `LastFlushAck` (#1603)
Correctly print 'connecting' state (#1612)
Make live-repair part of invariants checks (#1610)
Simplify mend region selection (#1606)
Generic read test for crutest (#1609)
Always remove skipped jobs from dependencies (#1604)
Add libsqlite3-dev install step to Github Actions CI (#1607)
Move Nexus notification to standalone task (#1584)
DTrace cleanup. (#1602)
Reset completed work Downstairs on a `Barrier` operation (#1601)
Upstairs state machine refactoring (3/3) (#1577)

Propolis changes are:
Wire up initial support for AMD perf counters
build: upgrade tokio to 1.40.0 (#836)
build: explicitly install libsqlite3-dev in CI (#834)
add JSON output format to cpuid-gen (#832)

Co-authored-by: Alan Hanson <alan@oxide.computer>