Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[nexus] Don't fail instances in create saga unwind #7437

Merged
merged 4 commits into from
Jan 30, 2025

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Jan 30, 2025

When the instance_create saga's sic_create_instance_record action
unwinds, it executes the compensating action
sic_delete_instance_record.

This action moves the instance's state to Failed prior to actually
calling into project_delete_instance to delete it:

let runtime_state = db::model::InstanceRuntimeState {
nexus_state: db::model::InstanceState::Failed,
// Must update the generation, or the database query will fail.
//
// The runtime state of the instance record is only changed as a result
// of the successful completion of the saga, or in this action during
// saga unwinding. So we're guaranteed that the cached generation in the
// saga log is the most recent in the database.
gen: db::model::Generation::from(db_instance.runtime_state.gen.next()),
..db_instance.runtime_state
};
let updated =
datastore.instance_update_runtime(&instance_id, &runtime_state).await?;
if !updated {
warn!(
osagactx.log(),
"failed to update instance runtime state from creating to failed",
);
}
// Actually delete the record.
datastore.project_delete_instance(&opctx, &authz_instance).await?;

This is because we presently only allow instances to be deleted when
they are in the Stopped or Failed states, as noted here:

// We currently only support deleting an instance if it is stopped or
// failed, so update the state accordingly to allow deletion.

Because we must first transition the instance to Failed in order to
delete it, there is an intermediate state when the instance
record created by an unwinding saga exists but is in the Failed
state. Instances in the Failed state are eligible to be restarted by
the instance_reincarnation background task. That task queries for all
Failed`` instances and creates instance_start` sagas to attempt to
restart them.

Therefore, if the instance_reincarnation background task runs during
the window of time between when an unwinding instance_create saga
marks the instance record as Failed and when it actually attempts to
call project_delete_instance, the instance can transition to
Starting by the new instance-start saga. This results in the attempt
to delete it failing, causing the unwinding saga to get stuck. This is
not great --- it causes a test flake (see #7326), but it's actually a
real bug, as it can result in a saga unable to unwind.

This commit fixes this by moving most of project_delete_instance to a
new function, project_delete_instance_in_state, which accepts a list
of states in which the instance may be deleted as an argument.
project_delete_instance now calls that function with the "normal" list
of states, but unwinding instance-create sagas are additionally able to
allow the instance record to be deleted while it's Creating in a
single atomic database operation, avoiding the transient Failed state.
This fixes #7326.

Unfortunately, it's quite challenging to have a regression test for
this, because it would require interrupting the unwinding saga's
sic_delete_instance_record mid-activation, which we don't really have
a nice mechanism for.

There's a comment in the instance_create sagas sic_delete_instance_record` compensating action that states we should be
looking up the instance record to delete by ID rather than by project and
name. The comment references issue #1536.

While I was here changing this action, I figured I'd go ahead and change
this as well. My assumption is that the previous thing predates the
LookupPath::instance_id method?

hawkw added 2 commits January 29, 2025 16:08
When the `instance_create` saga's `sic_create_instance_record` action
unwinds, it executes the compensating action
[`sic_delete_instance_record`][1].

This action moves the instance's state to `Failed` prior to actually
calling into `project_delete_instance` to delete it:
https://github.com/oxidecomputer/omicron/blob/ec4b5dc3c0c45b667e57a52389d82382b0b59112/nexus/src/app/sagas/instance_create.rs#L1010-L1033

This is because we presently only allow instances to be deleted when
they are in the `Stopped` or `Failed` states, as noted here:
https://github.com/oxidecomputer/omicron/blob/ec4b5dc3c0c45b667e57a52389d82382b0b59112/nexus/src/app/sagas/instance_create.rs#L987-L988

Because we must first transition the instance to `Failed` in order to
delete it, there is an intermediate state when the instance
record created by an unwinding saga exists but is in the `Failed`
state. Instances in the `Failed` state are eligible to be restarted by
the `instance_reincarnation` background task. That task queries for all
`Failed`` instances and creates `instance_start` sagas to attempt to
restart them.

Therefore, if the `instance_reincarnation` background task runs during
the window of time between when an unwinding `instance_create` saga
marks the instance record as `Failed` and when it actually attempts to
call `project_delete_instance`, the instance can transition to
`Starting` by the new instance-start saga. This results in the attempt
to delete it failing, causing the unwinding saga to get stuck. This is
not great --- it causes a test flake (see #7326), but it's actually a
real bug, as it can result in a saga unable to unwind.

This commit fixes this by moving most of `project_delete_instance` to a
new function, `project_delete_instance_in_state`, which accepts a list
of states in which the instance may be deleted as an argument.
`project_delete_instance` now calls that function with the "normal" list
of states, but unwinding instance-create sagas are additionally able to
allow the instance record to be deleted while it's `Creating` in a
single atomic database operation, avoiding the transient `Failed` state.
This fixes #7326.

Unfortunately, it's quite challenging to have a regression test for
this, because it would require interrupting the unwinding saga's
`sic_delete_instance_record` mid-activation, which we don't really have
a nice mechanism for.

[1]: https://github.com/oxidecomputer/omicron/blob/ec4b5dc3c0c45b667e57a52389d82382b0b59112/nexus/src/app/sagas/instance_create.rs#L970
But, this means that there's an intermediate state when the instance
record created by an unwinding saga exists but is in the `Failed` state,
making it eligible to be restarted by the `instance_reincarnation`
background task. That task queries for all `Failed` instances and
creates `instance_start` sagas to attempt to restart them.
There's a comment in the `instance_create` saga`s
`sic_delete_instance_record` compensating action that states it would be
nicer to lookup the instance record to delete by ID rather than by
project and name. The comment references issue
#1536.

While I was here changing this action, I figured I'd go ahead and change
this as well. My assumption is that the previous thing predates the
`LookupPath::instance_id` method?
@hawkw hawkw requested review from smklein and gjcolombo January 30, 2025 00:53
let result = LookupPath::new(&opctx, &datastore)
.project_id(params.project_id)
.instance_name(&instance_name)
.instance_id(instance_id.into_untyped_uuid())
.fetch()
.await;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After your change, line 988 has a comment saying "as mentioned in the comment above, we should not be doing lookup by name" - but I think you're removing that comment.

Maybe just update this to include the bit about idempotency, but drop the bit about the name lookup?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, fixed that in 0a7601b!

@hawkw
Copy link
Member Author

hawkw commented Jan 30, 2025

Hmm, this buildomat failure (logs) looks like the Buildomat worker just...died? I'm gonna re-run it.

@hawkw hawkw merged commit d062c60 into main Jan 30, 2025
16 checks passed
@hawkw hawkw deleted the eliza/start-saga-unwound branch January 30, 2025 21:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

test failed in CI: test_action_failure_can_unwind_idempotently
3 participants