Skip to content

The API to instance_update_runtime is misleading #611

Open
@smklein

Description

@smklein

Background

instance_update_runtime returns a boolean value meaning "updated", implying if the update operation succeeded (true) or if it failed, even though the object-to-be-updated does exist, because of a concurrent modification (false):

pub async fn instance_update_runtime(
&self,
instance_id: &Uuid,
new_runtime: &InstanceRuntimeState,
) -> Result<bool, Error> {

Implications

The generation numbers used within the Instance table are intended to guard access to the state object from concurrent modification. In a "read-modify-write" pattern, they can help the caller optimistically manage concurrency control.

I believe the intent behind ignoring the result was to imply "whoever was able to use the right generation number 'wins' the concurrency race, and their operation occurs". The other update operation - which does not occur - can be treated as if it happened earlier and was immediately overwritten.

This is supported by documentation in instance_set_runtime:

omicron/nexus/src/nexus.rs

Lines 1167 to 1170 in 3d132cc

* Ask the sled agent to begin the state change. Then update the
* database to reflect the new intermediate state. If this update is
* not the newest one, that's fine. That might just mean the sled agent
* beat us to it.

Problem

What if we modify more than the state column?

The API for instance_update_runtime takes an entire InstanceRuntimeState object as input, and updates the entire portion of the DB row based on the generation number.

This happens to currently be safe with our current usage, because we exclusively use the pattern of:

  • Read Instance row
  • Modify the State column exclusively
  • Write InstanceRuntimeState back to the Instance row
  • If the update was skipped, that's fine - it's as if we transitioned to our state, and then transitioned immediately to the new state.

However, InstanceRuntimeState contains many fields, not just the state. It is entirely possible for a user of this API to perform the following operation:

  • Read Instance row
  • Modify the State column and another column (perhaps changing the hostname, increasing memory, etc)
  • Write InstanceRuntimeState back to the Instance row
  • If the update is skipped here, our associated data is lost! In most cases, this will not be acceptable, regardless of concurrent state changes.

Further, even in the case where we are only modifying the state, there are often cases "skipping" an intermediate state would be unacceptable. For example, setting the state of an instance to failed should be terminal, not skipped.

Although this is technically avoidable by reacting to the return code of the function, I'd argue that it's still dangerous and subtle, especially with the naming of the pub function, which doesn't imply this "conditional" aspect.

Proposal

  1. Avoid the multi-field update issue by changing the arguments. Instead of operating on an entire InstanceRuntimeState object, I propose this method act on the inputs of a UUID, the new desired state, and the new (expected) generation number.

  2. Update the name of the function to imply that it is conditional, and only updates the state. I'd argue for a name like instance_try_update_state. Although all of our database operations may fail, it would be useful to be able to distinguish unconditional from conditional update functions, since the distinction always has implications for the caller.


Current Usage, for Reference

Metadata

Metadata

Assignees

No one assigned

    Labels

    apiRelated to the API.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions