The API to `instance_update_runtime` is misleading

## Background

`instance_update_runtime` returns a boolean value meaning "updated", implying if the update operation succeeded (`true`) or if it failed, even though the object-to-be-updated does exist, because of a concurrent modification (`false`):

https://github.com/oxidecomputer/omicron/blob/3d132cc9627ff76e5bad0aeeeb1b4b687233ab61/nexus/src/db/datastore.rs#L816-L820

## Implications

The [generation numbers](https://rfd.shared.oxide.computer/rfd/0192#_transactions_ctes_sagas_generation_numbers) used within the Instance table are intended to guard access to the state object from concurrent modification. In a "read-modify-write" pattern, they can help the caller optimistically manage concurrency control.

I believe the intent behind ignoring the result was to imply "whoever was able to use the right generation number 'wins' the concurrency race, and their operation occurs". The other update operation - which does not occur - can be treated as if it happened earlier and was immediately overwritten.

This is supported by documentation in `instance_set_runtime`:

https://github.com/oxidecomputer/omicron/blob/3d132cc9627ff76e5bad0aeeeb1b4b687233ab61/nexus/src/nexus.rs#L1167-L1170

## Problem

**What if we modify more than the `state` column?**

The API for `instance_update_runtime` takes an entire `InstanceRuntimeState` object as input, and updates the entire portion of the DB row based on the generation number.

This *happens* to currently be safe with our current usage, because we exclusively use the pattern of:
- Read `Instance` row
- Modify the `State` column **exclusively**
- Write `InstanceRuntimeState` back to the `Instance` row
- If the update was skipped, that's fine - it's as if we transitioned to our state, and then transitioned immediately to the new state.

However, [`InstanceRuntimeState`](https://github.com/oxidecomputer/omicron/blob/3d132cc9627ff76e5bad0aeeeb1b4b687233ab61/nexus/src/db/model.rs#L744-L768) contains *many* fields, not just the state. It is entirely possible for a user of this API to perform the following operation:

- Read `Instance` row
- Modify the `State` column *and* another column (perhaps changing the hostname, increasing memory, etc)
- Write `InstanceRuntimeState` back to the `Instance` row
- If the update is skipped here, our associated data is lost! In most cases, this will not be acceptable, regardless of concurrent state changes.

Further, even in the case where we are *only* modifying the state, there are often cases "skipping" an intermediate state would be unacceptable. For example, setting the state of an instance to `failed` should be terminal, not skipped.

Although this is technically avoidable by reacting to the return code of the function, I'd argue that it's still dangerous and subtle, especially with the naming of the `pub` function, which doesn't imply this "conditional" aspect.

## Proposal

1. Avoid the multi-field update issue by changing the arguments. Instead of operating on an entire `InstanceRuntimeState` object, I propose this method act on the inputs of a UUID, the new desired state, and the new (expected) generation number.

2. Update the name of the function to imply that it is conditional, and only updates the state. I'd argue for a name like `instance_try_update_state`. Although all of our database operations may fail, it would be useful to be able to distinguish unconditional from conditional update functions, since the distinction always has implications for the caller.

-------------------------

### Current Usage, for Reference

- [`instance_set_runtime`](https://github.com/oxidecomputer/omicron/blob/3d132cc9627ff76e5bad0aeeeb1b4b687233ab61/nexus/src/nexus.rs#L1199-L1202)
- [`notify_instance_udpated`](https://github.com/oxidecomputer/omicron/blob/3d132cc9627ff76e5bad0aeeeb1b4b687233ab61/nexus/src/nexus.rs#L2240-L2243)
- [`sic_instance_ensure`](https://github.com/oxidecomputer/omicron/blob/3d132cc9627ff76e5bad0aeeeb1b4b687233ab61/nexus/src/sagas.rs#L304-L309)

	pub async fn instance_update_runtime(
	&self,
	instance_id: &Uuid,
	new_runtime: &InstanceRuntimeState,
	) -> Result<bool, Error> {

	* Ask the sled agent to begin the state change. Then update the
	* database to reflect the new intermediate state. If this update is
	* not the newest one, that's fine. That might just mean the sled agent
	* beat us to it.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

The API to `instance_update_runtime` is misleading #611

Background

Implications

Problem

Proposal

Current Usage, for Reference

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

The API to instance_update_runtime is misleading #611

Description

Background

Implications

Problem

Proposal

Current Usage, for Reference

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

The API to `instance_update_runtime` is misleading #611