[nexus] Don't fail instances in create saga unwind #7437
When the `instance_create` saga's `sic_create_instance_record` action unwinds, it executes the compensating action [`sic_delete_instance_record`][1]. This action moves the instance's state to `Failed` before calling into `project_delete_instance` to delete it:

https://github.com/oxidecomputer/omicron/blob/ec4b5dc3c0c45b667e57a52389d82382b0b59112/nexus/src/app/sagas/instance_create.rs#L1010-L1033

This is because we presently only allow instances to be deleted when they are in the `Stopped` or `Failed` states, as noted here:

https://github.com/oxidecomputer/omicron/blob/ec4b5dc3c0c45b667e57a52389d82382b0b59112/nexus/src/app/sagas/instance_create.rs#L987-L988

Because we must first transition the instance to `Failed` in order to delete it, there is an intermediate state in which the instance record created by an unwinding saga exists but is `Failed`. Instances in the `Failed` state are eligible to be restarted by the `instance_reincarnation` background task. That task queries for all `Failed` instances and creates `instance_start` sagas to attempt to restart them.

Therefore, if the `instance_reincarnation` background task runs during the window between when an unwinding `instance_create` saga marks the instance record as `Failed` and when it actually calls `project_delete_instance`, the new instance-start saga can move the instance to `Starting`. The attempt to delete the instance then fails, and the unwinding saga gets stuck. This is not great: it causes a test flake (see #7326), but it is also a real bug, as it can leave a saga unable to unwind.

This commit fixes this by moving most of `project_delete_instance` to a new function, `project_delete_instance_in_state`, which accepts as an argument a list of states in which the instance may be deleted. `project_delete_instance` now calls that function with the "normal" list of states, but unwinding instance-create sagas additionally allow the instance record to be deleted while it's `Creating`. The state check and the delete happen in a single atomic database operation, avoiding the transient `Failed` state. This fixes #7326.

Unfortunately, it's quite challenging to write a regression test for this, because it would require interrupting the unwinding saga's `sic_delete_instance_record` mid-activation, which we don't really have a nice mechanism for.

[1]: https://github.com/oxidecomputer/omicron/blob/ec4b5dc3c0c45b667e57a52389d82382b0b59112/nexus/src/app/sagas/instance_create.rs#L970
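To make the race and the fix concrete, here is a minimal in-memory sketch. It is not the omicron code: the `Datastore` struct, the `u32` IDs, the reduced `InstanceState` enum, and the method names `delete_instance_in_state`, `delete_instance`, and `reincarnate_failed` are hypothetical stand-ins for the real database-backed `project_delete_instance_in_state`, `project_delete_instance`, and the `instance_reincarnation` task. It only shows the idea of checking the state and deleting in one step against a caller-supplied list of allowed states.

```rust
use std::collections::HashMap;

/// Reduced set of instance states (assumption: the real enum has more variants).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum InstanceState {
    Creating,
    Starting,
    Stopped,
    Failed,
}

#[derive(Debug, PartialEq, Eq)]
pub enum DeleteError {
    NotFound,
    InvalidState(InstanceState),
}

/// Hypothetical in-memory stand-in for the instance table.
#[derive(Default)]
pub struct Datastore {
    instances: HashMap<u32, InstanceState>,
}

impl Datastore {
    /// Create the record, or overwrite its state if it already exists.
    pub fn insert(&mut self, id: u32, state: InstanceState) {
        self.instances.insert(id, state);
    }

    pub fn state(&self, id: u32) -> Option<InstanceState> {
        self.instances.get(&id).copied()
    }

    /// Delete the instance only if its current state is in `ok_to_delete`.
    /// The check and the delete happen in one call, so no other task can
    /// observe an intermediate state in between.
    pub fn delete_instance_in_state(
        &mut self,
        id: u32,
        ok_to_delete: &[InstanceState],
    ) -> Result<(), DeleteError> {
        let state = *self.instances.get(&id).ok_or(DeleteError::NotFound)?;
        if !ok_to_delete.contains(&state) {
            return Err(DeleteError::InvalidState(state));
        }
        self.instances.remove(&id);
        Ok(())
    }

    /// The "normal" delete path: only `Stopped` or `Failed` instances.
    pub fn delete_instance(&mut self, id: u32) -> Result<(), DeleteError> {
        self.delete_instance_in_state(
            id,
            &[InstanceState::Stopped, InstanceState::Failed],
        )
    }

    /// Stand-in for the reincarnation task: every `Failed` instance is
    /// picked up and moved to `Starting`.
    pub fn reincarnate_failed(&mut self) {
        for state in self.instances.values_mut() {
            if *state == InstanceState::Failed {
                *state = InstanceState::Starting;
            }
        }
    }
}
```

In this model, the old unwind path corresponds to setting the record to `Failed`, letting `reincarnate_failed` run, and then calling `delete_instance`, which fails with `InvalidState(Starting)`. The new path calls `delete_instance_in_state` with `Creating` added to the allowed list, so the record never passes through `Failed` and the reincarnation step has nothing to pick up.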
There's a comment in the `instance_create` saga's `sic_delete_instance_record` compensating action that states it would be nicer to look up the instance record to delete by ID rather than by project and name. The comment references issue #1536. While I was here changing this action, I figured I'd go ahead and change this as well. My assumption is that the previous approach predates the `LookupPath::instance_id` method?
 let result = LookupPath::new(&opctx, &datastore)
-    .project_id(params.project_id)
-    .instance_name(&instance_name)
+    .instance_id(instance_id.into_untyped_uuid())
     .fetch()
     .await;
After your change, line 988 has a comment saying "as mentioned in the comment above, we should not be doing lookup by name" - but I think you're removing that comment.
Maybe just update this to include the bit about idempotency, but drop the bit about the name lookup?
Good catch, fixed that in 0a7601b!