[nexus] Don't fail instances in create saga unwind #7437

hawkw · 2025-01-30T00:53:25Z

When the instance_create saga's sic_create_instance_record action
unwinds, it executes the compensating action
sic_delete_instance_record.

This action moves the instance's state to Failed prior to actually
calling into project_delete_instance to delete it:

omicron/nexus/src/app/sagas/instance_create.rs

Lines 1010 to 1033 in ec4b5dc

    
           let runtime_state = db::model::InstanceRuntimeState { 
        
               nexus_state: db::model::InstanceState::Failed, 
        
               // Must update the generation, or the database query will fail. 
        
               // 
        
               // The runtime state of the instance record is only changed as a result 
        
               // of the successful completion of the saga, or in this action during 
        
               // saga unwinding. So we're guaranteed that the cached generation in the 
        
               // saga log is the most recent in the database. 
        
               gen: db::model::Generation::from(db_instance.runtime_state.gen.next()), 
        
               ..db_instance.runtime_state 
        
           }; 
        
           let updated = 
        
               datastore.instance_update_runtime(&instance_id, &runtime_state).await?; 
        
           if !updated { 
        
               warn!( 
        
                   osagactx.log(), 
        
                   "failed to update instance runtime state from creating to failed", 
        
               ); 
        
           } 
        
           // Actually delete the record. 
        
           datastore.project_delete_instance(&opctx, &authz_instance).await?;

This is because we presently only allow instances to be deleted when
they are in the Stopped or Failed states, as noted here:

omicron/nexus/src/app/sagas/instance_create.rs

Lines 987 to 988 in ec4b5dc

    
           // We currently only support deleting an instance if it is stopped or 
        
           // failed, so update the state accordingly to allow deletion.

Because we must first transition the instance to Failed in order to
delete it, there is an intermediate state when the instance
record created by an unwinding saga exists but is in the Failed
state. Instances in the Failed state are eligible to be restarted by
the instance_reincarnation background task. That task queries for all
Failed`` instances and creates instance_start` sagas to attempt to
restart them.

Therefore, if the instance_reincarnation background task runs during
the window of time between when an unwinding instance_create saga
marks the instance record as Failed and when it actually attempts to
call project_delete_instance, the instance can transition to
Starting by the new instance-start saga. This results in the attempt
to delete it failing, causing the unwinding saga to get stuck. This is
not great --- it causes a test flake (see #7326), but it's actually a
real bug, as it can result in a saga unable to unwind.

This commit fixes this by moving most of project_delete_instance to a
new function, project_delete_instance_in_state, which accepts a list
of states in which the instance may be deleted as an argument.
project_delete_instance now calls that function with the "normal" list
of states, but unwinding instance-create sagas are additionally able to
allow the instance record to be deleted while it's Creating in a
single atomic database operation, avoiding the transient Failed state.
This fixes #7326.

Unfortunately, it's quite challenging to have a regression test for
this, because it would require interrupting the unwinding saga's
sic_delete_instance_record mid-activation, which we don't really have
a nice mechanism for.

There's a comment in the instance_create sagas sic_delete_instance_record` compensating action that states we should be
looking up the instance record to delete by ID rather than by project and
name. The comment references issue #1536.

While I was here changing this action, I figured I'd go ahead and change
this as well. My assumption is that the previous thing predates the
LookupPath::instance_id method?

When the `instance_create` saga's `sic_create_instance_record` action unwinds, it executes the compensating action [`sic_delete_instance_record`][1]. This action moves the instance's state to `Failed` prior to actually calling into `project_delete_instance` to delete it: https://github.com/oxidecomputer/omicron/blob/ec4b5dc3c0c45b667e57a52389d82382b0b59112/nexus/src/app/sagas/instance_create.rs#L1010-L1033 This is because we presently only allow instances to be deleted when they are in the `Stopped` or `Failed` states, as noted here: https://github.com/oxidecomputer/omicron/blob/ec4b5dc3c0c45b667e57a52389d82382b0b59112/nexus/src/app/sagas/instance_create.rs#L987-L988 Because we must first transition the instance to `Failed` in order to delete it, there is an intermediate state when the instance record created by an unwinding saga exists but is in the `Failed` state. Instances in the `Failed` state are eligible to be restarted by the `instance_reincarnation` background task. That task queries for all `Failed`` instances and creates `instance_start` sagas to attempt to restart them. Therefore, if the `instance_reincarnation` background task runs during the window of time between when an unwinding `instance_create` saga marks the instance record as `Failed` and when it actually attempts to call `project_delete_instance`, the instance can transition to `Starting` by the new instance-start saga. This results in the attempt to delete it failing, causing the unwinding saga to get stuck. This is not great --- it causes a test flake (see #7326), but it's actually a real bug, as it can result in a saga unable to unwind. This commit fixes this by moving most of `project_delete_instance` to a new function, `project_delete_instance_in_state`, which accepts a list of states in which the instance may be deleted as an argument. `project_delete_instance` now calls that function with the "normal" list of states, but unwinding instance-create sagas are additionally able to allow the instance record to be deleted while it's `Creating` in a single atomic database operation, avoiding the transient `Failed` state. This fixes #7326. Unfortunately, it's quite challenging to have a regression test for this, because it would require interrupting the unwinding saga's `sic_delete_instance_record` mid-activation, which we don't really have a nice mechanism for. [1]: https://github.com/oxidecomputer/omicron/blob/ec4b5dc3c0c45b667e57a52389d82382b0b59112/nexus/src/app/sagas/instance_create.rs#L970 But, this means that there's an intermediate state when the instance record created by an unwinding saga exists but is in the `Failed` state, making it eligible to be restarted by the `instance_reincarnation` background task. That task queries for all `Failed` instances and creates `instance_start` sagas to attempt to restart them.

There's a comment in the `instance_create` saga`s `sic_delete_instance_record` compensating action that states it would be nicer to lookup the instance record to delete by ID rather than by project and name. The comment references issue #1536. While I was here changing this action, I figured I'd go ahead and change this as well. My assumption is that the previous thing predates the `LookupPath::instance_id` method?

smklein · 2025-01-30T02:08:14Z

nexus/src/app/sagas/instance_create.rs

    let result = LookupPath::new(&opctx, &datastore)
-        .project_id(params.project_id)
-        .instance_name(&instance_name)
+        .instance_id(instance_id.into_untyped_uuid())
        .fetch()
        .await;



After your change, line 988 has a comment saying "as mentioned in the comment above, we should not be doing lookup by name" - but I think you're removing that comment.

Maybe just update this to include the bit about idempotency, but drop the bit about the name lookup?

Good catch, fixed that in 0a7601b!

hawkw · 2025-01-30T19:31:25Z

Hmm, this buildomat failure (logs) looks like the Buildomat worker just...died? I'm gonna re-run it.

hawkw added 2 commits January 29, 2025 16:08

hawkw requested review from smklein and gjcolombo January 30, 2025 00:53

smklein approved these changes Jan 30, 2025

View reviewed changes

gjcolombo approved these changes Jan 30, 2025

View reviewed changes

hawkw added 2 commits January 30, 2025 09:52

remove unused import

3dfa94f

remove outdated reference in comment

0a7601b

hawkw enabled auto-merge (squash) January 30, 2025 17:57

hawkw mentioned this pull request Jan 30, 2025

instance create sagas look up multiple things by name multiple times #1536

Open

hawkw mentioned this pull request Jan 30, 2025

"worker agent experienced a fatal error; aborting job" during Omicron CI run oxidecomputer/buildomat#67

Closed

hawkw merged commit d062c60 into main Jan 30, 2025
16 checks passed

hawkw deleted the eliza/start-saga-unwound branch January 30, 2025 21:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[nexus] Don't fail instances in create saga unwind #7437

[nexus] Don't fail instances in create saga unwind #7437

hawkw commented Jan 30, 2025

smklein Jan 30, 2025

hawkw Jan 30, 2025

hawkw commented Jan 30, 2025 •

edited

Loading

	let runtime_state = db::model::InstanceRuntimeState {
	nexus_state: db::model::InstanceState::Failed,
	// Must update the generation, or the database query will fail.
	//
	// The runtime state of the instance record is only changed as a result
	// of the successful completion of the saga, or in this action during
	// saga unwinding. So we're guaranteed that the cached generation in the
	// saga log is the most recent in the database.
	gen: db::model::Generation::from(db_instance.runtime_state.gen.next()),
	..db_instance.runtime_state
	};

	let updated =
	datastore.instance_update_runtime(&instance_id, &runtime_state).await?;

	if !updated {
	warn!(
	osagactx.log(),
	"failed to update instance runtime state from creating to failed",
	);
	}

	// Actually delete the record.
	datastore.project_delete_instance(&opctx, &authz_instance).await?;

	// We currently only support deleting an instance if it is stopped or
	// failed, so update the state accordingly to allow deletion.

[nexus] Don't fail instances in create saga unwind #7437

[nexus] Don't fail instances in create saga unwind #7437

Conversation

hawkw commented Jan 30, 2025

smklein Jan 30, 2025

Choose a reason for hiding this comment

hawkw Jan 30, 2025

Choose a reason for hiding this comment

hawkw commented Jan 30, 2025 • edited Loading

hawkw commented Jan 30, 2025 •

edited

Loading