instance create sagas look up multiple things by name multiple times #1536

davepacheco · 2022-08-02T23:22:21Z

There are three places in the instance-create saga where we look up things by name well into the saga:

Towards the end, when we go to attach disks (or detach them, during undo), we use Nexus::instance_attach_disk()/Nexus::instance_detach_disk() and pass along the organization_name, project_name, instance_name, and disk_name from the instance create request. This causes a new lookup of the organization/project/instance/disk path. That seems bad: there's no guarantee that it will find the same disk (or instance or project or organization, for that matter). So we might attach or detach the wrong disk, or we might fail spuriously.
At the end, during instance_ensure(), we look up the instance by name in order to call Nexus::instance_set_runtime(). This could find a different instance than the one we created. We should be able to use the instance_id stored from a previous action instead.
In sic_delete_instance_record, the undo action for the (early) create-instance-record action, we look up the instance name that we created. We should use the instance id instead.

RFD 192 talks about this:

When Nexus receives an API request, we’ll usually want to resolve the name to an id up front and do most internal work using ids. Otherwise, we can wind up with nasty concurrency issues (i.e., serializability violations) if somebody renames a resource while we’re working on it. This can lead to security vulnerabilities, if we were to do an access check of some resource and then somebody renames it and we wind up operating on a different resource than the one we checked.

In terms of fixing: my first thought here is to do the lookup of organization and project once early in the saga to get the project_id. Then (again, still early in the saga) look up anything else we need (like all the disks) using LookupPath(...).project_id(...).disk_name(...) and store their ids. Thereafter, we can always use the ids.

We'll probably want to change Nexus::instance_attach_disk() and Nexus::instance_detach_disk() to accept an authz::Instance and authz::Disk (and have the lookups done in the caller).

Then I think we can remove the "instance_name" output from any saga action (so we're not ever tempted to use the instance name). We can also remove organization_name and project_name from the Params (for the same reason).

The text was updated successfully, but these errors were encountered:

davepacheco · 2022-08-02T23:38:55Z

This seems easy to get wrong. It makes me wonder if we should have the Nexus layer only accept authz resources (rather than ever accepting names) and having callers (mostly the HTTP layer, but sometimes the saga layer) use LookupPath directly. I think it'd be fair to say that the mapping of an API URL path to a resource is part of the API layer anyway, not Nexus.

I'm reminded too that right after that quote from RFD 192, it also says that some operations could be done without a separate lookup step altogether. To do that, you do want to pass the names all the way down to the datastore. Doing this correctly also requires integrating authz into the query, which feels like a long way off. So maybe we could consider this an optimization to be re-added later.

A middle ground might be to accept LookupPaths. That would avoid encouraging people to pass a name without actually preventing it. It would create its own problem, which is that if you've already got the authz object, you'll wind up needing to make a LookupPath out of it and then we'll wind up looking it up again. This too could be addressed: we could have a LookupPath::instance_value or something that, when you go do the lookup(), just returns what it was given. This approach seems like a can of worms not worth opening now. Maybe we can look into this if we decide that optimization is important.

zephraph · 2022-10-28T21:14:08Z

Once RFD-322 starts getting implemented I believe we should solve this at the API level. E.g. we should at least move everything to operate as id by default. Id or LookupPath might be a viable option depending, but I think we should restrict the usage of names (as much as possible) to go no deeper than the http_entrypoints file.

No Propolis changes other than to update Crucible Crucible changes are: Add debug/timeout to test_memory.sh (#1563) Consolidate ack checking (#1561) Rename for crutest: RegionInfo -> DiskInfo (#1562) Fix dtrace system level scripts (#1560) Remove `ackable_work`; ack immediately instead (#1552) No more New jobs, no more New jobs column (#1559) Remove delay-based backpressure in favor of explicit queue limits (#1515) Only send flushes when Downstairs is idle; send Barrier otherwise (#1505) Update Rust crate reqwest to v0.12.9 (#1536) Update Rust crate omicron-zone-package to 0.11.1 (#1535) Remove separate validation array (#1522) Remove more unnecessary `DsState` variants (#1550) Consolidate `DownstairsClient::reinitialize` (#1549) Update Rust crate uuid to v1.11.0 (#1546) Update Rust crate reedline to 0.36.0 (#1544) Update Rust crate bytes to v1.8.0 (#1541) Update Rust crate thiserror to v1.0.66 (#1539) Update Rust crate serde_json to v1.0.132 (#1538) Update Rust crate serde to v1.0.214 (#1537) Remove transient states in `DsState` (#1526) Update Rust crate libc to v0.2.161 (#1534) Update Rust crate futures to v0.3.31 (#1532) Update Rust crate clap to v4.5.20 (#1531) Update Rust crate async-trait to 0.1.83 (#1530) Update Rust crate anyhow to v1.0.92 (#1529) Remove obsolete crutest perf test (#1528) Update dependency rust to v1.82.0 (#1512) Still more updates to support Volume layer activities. (#1508) Remove remaining IOPS/bandwidth limiting code (#1525) Add unit test for VersionMismatch (#1524) Removing panic paths by only destructuring once (#1523) Update actions/checkout digest to 11bd719 (#1518) Switch to using `Duration` for times (#1520)

No Propolis changes other than to update Crucible Crucible changes are: Add debug/timeout to test_memory.sh (#1563) Consolidate ack checking (#1561) Rename for crutest: RegionInfo -> DiskInfo (#1562) Fix dtrace system level scripts (#1560) Remove `ackable_work`; ack immediately instead (#1552) No more New jobs, no more New jobs column (#1559) Remove delay-based backpressure in favor of explicit queue limits (#1515) Only send flushes when Downstairs is idle; send Barrier otherwise (#1505) Update Rust crate reqwest to v0.12.9 (#1536) Update Rust crate omicron-zone-package to 0.11.1 (#1535) Remove separate validation array (#1522) Remove more unnecessary `DsState` variants (#1550) Consolidate `DownstairsClient::reinitialize` (#1549) Update Rust crate uuid to v1.11.0 (#1546) Update Rust crate reedline to 0.36.0 (#1544) Update Rust crate bytes to v1.8.0 (#1541) Update Rust crate thiserror to v1.0.66 (#1539) Update Rust crate serde_json to v1.0.132 (#1538) Update Rust crate serde to v1.0.214 (#1537) Remove transient states in `DsState` (#1526) Update Rust crate libc to v0.2.161 (#1534) Update Rust crate futures to v0.3.31 (#1532) Update Rust crate clap to v4.5.20 (#1531) Update Rust crate async-trait to 0.1.83 (#1530) Update Rust crate anyhow to v1.0.92 (#1529) Remove obsolete crutest perf test (#1528) Update dependency rust to v1.82.0 (#1512) Still more updates to support Volume layer activities. (#1508) Remove remaining IOPS/bandwidth limiting code (#1525) Add unit test for VersionMismatch (#1524) Removing panic paths by only destructuring once (#1523) Update actions/checkout digest to 11bd719 (#1518) Switch to using `Duration` for times (#1520) Co-authored-by: Alan Hanson <alan@oxide.computer>

There's a comment in the `instance_create` saga`s `sic_delete_instance_record` compensating action that states it would be nicer to lookup the instance record to delete by ID rather than by project and name. The comment references issue #1536. While I was here changing this action, I figured I'd go ahead and change this as well. My assumption is that the previous thing predates the `LookupPath::instance_id` method?

hawkw · 2025-01-30T18:00:43Z

#7437 changes the sic_delete_instance_record action to look up the instance by ID rather than by name. The disk attach actions still look up disks by project/instance/disk name. The instance_ensure call is no longer performed by the instance_create saga, and is now part of the instance-start saga instead.

When the `instance_create` saga's `sic_create_instance_record` action unwinds, it executes the compensating action [`sic_delete_instance_record`][1]. This action moves the instance's state to `Failed` prior to actually calling into `project_delete_instance` to delete it, [here][2]. This is because we presently only allow instances to be deleted when they are in the `Stopped` or `Failed` states, as noted [here][3]. Because we must first transition the instance to `Failed` in order to delete it, there is an intermediate state when the instance record created by an unwinding saga exists but is in the `Failed` state. Instances in the `Failed` state are eligible to be restarted by the `instance_reincarnation` background task. That task queries for all `Failed`` instances and creates `instance_start` sagas to attempt to restart them. Therefore, if the `instance_reincarnation` background task runs during the window of time between when an unwinding `instance_create` saga marks the instance record as `Failed` and when it actually attempts to call `project_delete_instance`, the instance can transition to `Starting` by the new instance-start saga. This results in the attempt to delete it failing, causing the unwinding saga to get stuck. This is not great --- it causes a test flake (see #7326), but it's actually a real bug, as it can result in a saga unable to unwind. This commit fixes this by moving most of `project_delete_instance` to a new function, `project_delete_instance_in_state`, which accepts a list of states in which the instance may be deleted as an argument. `project_delete_instance` now calls that function with the "normal" list of states, but unwinding instance-create sagas are additionally able to allow the instance record to be deleted while it's `Creating` in a single atomic database operation, avoiding the transient `Failed` state. This fixes #7326. Unfortunately, it's quite challenging to have a regression test for this, because it would require interrupting the unwinding saga's `sic_delete_instance_record` mid-activation, which we don't really have a nice mechanism for. Also, there was a comment in the `sic_delete_instance_record` action that stated we should be looking up the instance record to delete by ID rather than by project and name. The comment references issue #1536. While I was here changing this action, I figured I'd go ahead and change this as well. My assumption is that the previous thing predates the `LookupPath::instance_id` method? [1]: https://github.com/oxidecomputer/omicron/blob/ec4b5dc3c0c45b667e57a52389d82382b0b59112/nexus/src/app/sagas/instance_create.rs#L970 [2]: https://github.com/oxidecomputer/omicron/blob/ec4b5dc3c0c45b667e57a52389d82382b0b59112/nexus/src/app/sagas/instance_create.rs#L1010-L1033 [3]: https://github.com/oxidecomputer/omicron/blob/ec4b5dc3c0c45b667e57a52389d82382b0b59112/nexus/src/app/sagas/instance_create.rs#L987-L988

davepacheco added the security Related to security. label Aug 2, 2022

davepacheco mentioned this issue Aug 2, 2022

update to steno 0.2.0 #1532

Merged

5 tasks

davepacheco mentioned this issue Dec 9, 2022

v1 instance APIs as per RFD-322 #1957

Merged

davepacheco added this to the MVP milestone Feb 2, 2023

gjcolombo mentioned this issue Apr 28, 2023

instance creation & the create saga should avoid refetching DB records where possible #2945

Open

leftwo mentioned this issue Nov 19, 2024

Update Crucible to latest #7101

Merged

hawkw mentioned this issue Jan 30, 2025

[nexus] Don't fail instances in create saga unwind #7437

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

instance create sagas look up multiple things by name multiple times #1536

instance create sagas look up multiple things by name multiple times #1536

davepacheco commented Aug 2, 2022

davepacheco commented Aug 2, 2022

zephraph commented Oct 28, 2022

hawkw commented Jan 30, 2025

instance create sagas look up multiple things by name multiple times #1536

instance create sagas look up multiple things by name multiple times #1536

Comments

davepacheco commented Aug 2, 2022

davepacheco commented Aug 2, 2022

zephraph commented Oct 28, 2022

hawkw commented Jan 30, 2025