Skip to content

Removes the instance database record when provision fails #836

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

bnaecker
Copy link
Collaborator

In the case that instance provision fails in the final saga node, the
actual booting of the Propolis zone, there was previously a no-op undo
action. This leaves the instance record in the database, perpetually in
a "starting" state. It can't be moved out of that state, because that
requires a full instance-ensure request to the sled-agent, which tries
that last action again, which fails, and ...

This adds an actual undo action, which sets the instance state in the
database to failed, and then deletes. That state change is needed
because we can't delete instances in the "starting" state. State changes
are normally only made in response to the sled agent observing a state
change in the actual instance, but is valid in this case since there
is no such instance.

In the case that instance provision fails in the final saga node, the
actual booting of the Propolis zone, there was previously a no-op undo
action. This leaves the instance record in the database, perpetually in
a "starting" state. It can't be moved out of that state, because that
requires a full instance-ensure request to the sled-agent, which tries
that last action again, which fails, and ...

This adds an actual undo action, which sets the instance state in the
database to failed, and then deletes. That state change is needed
because we can't delete instances in the "starting" state. State changes
are normally only made in response to the sled agent observing a state
change in the actual instance, but is valid in this case since there
_is_ no such instance.
@bnaecker bnaecker requested a review from smklein March 30, 2022 17:58
@bnaecker
Copy link
Collaborator Author

It's not really easy to write an integration test for this, for a few reasons. I was however able to run tests using the Oxide CLI, which show that the record has in fact been removed.

bnaecker@feldspar : ~/cli $ cargo run -- instance list -o o -p p
    Finished dev [unoptimized + debuginfo] target(s) in 0.14s
     Running `target/debug/oxide instance list -o o -p p`
 id | name | description | hostname | memory | ncpus | project_id | run_state | time_created | time_modified | time_run_state_updated |
----+------+-------------+----------+--------+-------+------------+-----------+--------------+---------------+------------------------

bnaecker@feldspar : ~/cli $ cargo run -- instance create -o o -p p vm0
    Finished dev [unoptimized + debuginfo] target(s) in 0.13s
     Running `target/debug/oxide instance create -o o -p p vm0`
instance description:: vm0
instance hostname:: vm0
instance memory:: 4
instance ncpus:: 2
✘ Oxide API internal error: Internal Server Error
bnaecker@feldspar : ~/cli $ cargo run -- instance list -o o -p p
    Finished dev [unoptimized + debuginfo] target(s) in 0.15s
     Running `target/debug/oxide instance list -o o -p p`
 id | name | description | hostname | memory | ncpus | project_id | run_state | time_created | time_modified | time_run_state_updated |
----+------+-------------+----------+--------+-------+------------+-----------+--------------+---------------+------------------------

Copy link
Collaborator

@smklein smklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Github isn't making commenting on this specific line easy for me, but...

... shouldn't the sic_delete_instance_record undo action in CreateInstanceRecord be responsible for actually doing the deletion of the instance?

I bring this up because having an undo action for the last operation in a saga seems a little funny - my mental model for actions is that they should "do one thing in the state graph (or not, and fail)". This means that the last action generally either "happens or it doesn't", and usually there isn't anyone who can call "undo" on it.

Anyway - we mentioned this in chat, but figured I'd post it here for visibility

@bnaecker
Copy link
Collaborator Author

For further context -- I had basically fixed this earlier, but not rebased the branch I was working on on top of that fix. That's in #810, for the record.

@bnaecker
Copy link
Collaborator Author

Abandoning.

@bnaecker bnaecker closed this Mar 30, 2022
@bnaecker bnaecker deleted the remove-instance-on-provision-failure branch March 30, 2022 19:48
leftwo pushed a commit that referenced this pull request Jan 27, 2025
Crucible changes are:
Allow read only activation with less than three downstairs (#1608)
Tweaks to automatic flush (#1613)
Update Rust crate twox-hash to v2 (#1547)
Remove `LastFlushAck` (#1603)
Correctly print 'connecting' state (#1612)
Make live-repair part of invariants checks (#1610)
Simplify mend region selection (#1606)
Generic read test for crutest (#1609)
Always remove skipped jobs from dependencies (#1604)
Add libsqlite3-dev install step to Github Actions CI (#1607)
Move Nexus notification to standalone task (#1584)
DTrace cleanup. (#1602)
Reset completed work Downstairs on a `Barrier` operation (#1601)
Upstairs state machine refactoring (3/3) (#1577)

Propolis changes are:
Wire up initial support for AMD perf counters
build: upgrade tokio to 1.40.0 (#836)
build: explicitly install libsqlite3-dev in CI (#834)
add JSON output format to cpuid-gen (#832)
leftwo added a commit that referenced this pull request Jan 27, 2025
Crucible changes are:
Allow read only activation with less than three downstairs (#1608)
Tweaks to automatic flush (#1613)
Update Rust crate twox-hash to v2 (#1547)
Remove `LastFlushAck` (#1603)
Correctly print 'connecting' state (#1612)
Make live-repair part of invariants checks (#1610) Simplify mend region
selection (#1606)
Generic read test for crutest (#1609)
Always remove skipped jobs from dependencies (#1604) Add libsqlite3-dev
install step to Github Actions CI (#1607) Move Nexus notification to
standalone task (#1584) DTrace cleanup. (#1602)
Reset completed work Downstairs on a `Barrier` operation (#1601)
Upstairs state machine refactoring (3/3) (#1577)

Propolis changes are:
Wire up initial support for AMD perf counters
build: upgrade tokio to 1.40.0 (#836)
build: explicitly install libsqlite3-dev in CI (#834) add JSON output
format to cpuid-gen (#832)

---------

Co-authored-by: Alan Hanson <alan@oxide.computer>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants