Description
Background
instance_update_runtime
returns a boolean value meaning "updated", implying if the update operation succeeded (true
) or if it failed, even though the object-to-be-updated does exist, because of a concurrent modification (false
):
omicron/nexus/src/db/datastore.rs
Lines 816 to 820 in 3d132cc
Implications
The generation numbers used within the Instance table are intended to guard access to the state object from concurrent modification. In a "read-modify-write" pattern, they can help the caller optimistically manage concurrency control.
I believe the intent behind ignoring the result was to imply "whoever was able to use the right generation number 'wins' the concurrency race, and their operation occurs". The other update operation - which does not occur - can be treated as if it happened earlier and was immediately overwritten.
This is supported by documentation in instance_set_runtime
:
Lines 1167 to 1170 in 3d132cc
Problem
What if we modify more than the state
column?
The API for instance_update_runtime
takes an entire InstanceRuntimeState
object as input, and updates the entire portion of the DB row based on the generation number.
This happens to currently be safe with our current usage, because we exclusively use the pattern of:
- Read
Instance
row - Modify the
State
column exclusively - Write
InstanceRuntimeState
back to theInstance
row - If the update was skipped, that's fine - it's as if we transitioned to our state, and then transitioned immediately to the new state.
However, InstanceRuntimeState
contains many fields, not just the state. It is entirely possible for a user of this API to perform the following operation:
- Read
Instance
row - Modify the
State
column and another column (perhaps changing the hostname, increasing memory, etc) - Write
InstanceRuntimeState
back to theInstance
row - If the update is skipped here, our associated data is lost! In most cases, this will not be acceptable, regardless of concurrent state changes.
Further, even in the case where we are only modifying the state, there are often cases "skipping" an intermediate state would be unacceptable. For example, setting the state of an instance to failed
should be terminal, not skipped.
Although this is technically avoidable by reacting to the return code of the function, I'd argue that it's still dangerous and subtle, especially with the naming of the pub
function, which doesn't imply this "conditional" aspect.
Proposal
-
Avoid the multi-field update issue by changing the arguments. Instead of operating on an entire
InstanceRuntimeState
object, I propose this method act on the inputs of a UUID, the new desired state, and the new (expected) generation number. -
Update the name of the function to imply that it is conditional, and only updates the state. I'd argue for a name like
instance_try_update_state
. Although all of our database operations may fail, it would be useful to be able to distinguish unconditional from conditional update functions, since the distinction always has implications for the caller.