-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
instance create sagas look up multiple things by name multiple times #1536
Comments
This seems easy to get wrong. It makes me wonder if we should have the Nexus layer only accept authz resources (rather than ever accepting names) and having callers (mostly the HTTP layer, but sometimes the saga layer) use LookupPath directly. I think it'd be fair to say that the mapping of an API URL path to a resource is part of the API layer anyway, not Nexus. I'm reminded too that right after that quote from RFD 192, it also says that some operations could be done without a separate lookup step altogether. To do that, you do want to pass the names all the way down to the datastore. Doing this correctly also requires integrating authz into the query, which feels like a long way off. So maybe we could consider this an optimization to be re-added later. A middle ground might be to accept |
Once RFD-322 starts getting implemented I believe we should solve this at the API level. E.g. we should at least move everything to operate as id by default. Id or |
No Propolis changes other than to update Crucible Crucible changes are: Add debug/timeout to test_memory.sh (#1563) Consolidate ack checking (#1561) Rename for crutest: RegionInfo -> DiskInfo (#1562) Fix dtrace system level scripts (#1560) Remove `ackable_work`; ack immediately instead (#1552) No more New jobs, no more New jobs column (#1559) Remove delay-based backpressure in favor of explicit queue limits (#1515) Only send flushes when Downstairs is idle; send Barrier otherwise (#1505) Update Rust crate reqwest to v0.12.9 (#1536) Update Rust crate omicron-zone-package to 0.11.1 (#1535) Remove separate validation array (#1522) Remove more unnecessary `DsState` variants (#1550) Consolidate `DownstairsClient::reinitialize` (#1549) Update Rust crate uuid to v1.11.0 (#1546) Update Rust crate reedline to 0.36.0 (#1544) Update Rust crate bytes to v1.8.0 (#1541) Update Rust crate thiserror to v1.0.66 (#1539) Update Rust crate serde_json to v1.0.132 (#1538) Update Rust crate serde to v1.0.214 (#1537) Remove transient states in `DsState` (#1526) Update Rust crate libc to v0.2.161 (#1534) Update Rust crate futures to v0.3.31 (#1532) Update Rust crate clap to v4.5.20 (#1531) Update Rust crate async-trait to 0.1.83 (#1530) Update Rust crate anyhow to v1.0.92 (#1529) Remove obsolete crutest perf test (#1528) Update dependency rust to v1.82.0 (#1512) Still more updates to support Volume layer activities. (#1508) Remove remaining IOPS/bandwidth limiting code (#1525) Add unit test for VersionMismatch (#1524) Removing panic paths by only destructuring once (#1523) Update actions/checkout digest to 11bd719 (#1518) Switch to using `Duration` for times (#1520)
No Propolis changes other than to update Crucible Crucible changes are: Add debug/timeout to test_memory.sh (#1563) Consolidate ack checking (#1561) Rename for crutest: RegionInfo -> DiskInfo (#1562) Fix dtrace system level scripts (#1560) Remove `ackable_work`; ack immediately instead (#1552) No more New jobs, no more New jobs column (#1559) Remove delay-based backpressure in favor of explicit queue limits (#1515) Only send flushes when Downstairs is idle; send Barrier otherwise (#1505) Update Rust crate reqwest to v0.12.9 (#1536) Update Rust crate omicron-zone-package to 0.11.1 (#1535) Remove separate validation array (#1522) Remove more unnecessary `DsState` variants (#1550) Consolidate `DownstairsClient::reinitialize` (#1549) Update Rust crate uuid to v1.11.0 (#1546) Update Rust crate reedline to 0.36.0 (#1544) Update Rust crate bytes to v1.8.0 (#1541) Update Rust crate thiserror to v1.0.66 (#1539) Update Rust crate serde_json to v1.0.132 (#1538) Update Rust crate serde to v1.0.214 (#1537) Remove transient states in `DsState` (#1526) Update Rust crate libc to v0.2.161 (#1534) Update Rust crate futures to v0.3.31 (#1532) Update Rust crate clap to v4.5.20 (#1531) Update Rust crate async-trait to 0.1.83 (#1530) Update Rust crate anyhow to v1.0.92 (#1529) Remove obsolete crutest perf test (#1528) Update dependency rust to v1.82.0 (#1512) Still more updates to support Volume layer activities. (#1508) Remove remaining IOPS/bandwidth limiting code (#1525) Add unit test for VersionMismatch (#1524) Removing panic paths by only destructuring once (#1523) Update actions/checkout digest to 11bd719 (#1518) Switch to using `Duration` for times (#1520) Co-authored-by: Alan Hanson <alan@oxide.computer>
There's a comment in the `instance_create` saga`s `sic_delete_instance_record` compensating action that states it would be nicer to lookup the instance record to delete by ID rather than by project and name. The comment references issue #1536. While I was here changing this action, I figured I'd go ahead and change this as well. My assumption is that the previous thing predates the `LookupPath::instance_id` method?
#7437 changes the |
When the `instance_create` saga's `sic_create_instance_record` action unwinds, it executes the compensating action [`sic_delete_instance_record`][1]. This action moves the instance's state to `Failed` prior to actually calling into `project_delete_instance` to delete it, [here][2]. This is because we presently only allow instances to be deleted when they are in the `Stopped` or `Failed` states, as noted [here][3]. Because we must first transition the instance to `Failed` in order to delete it, there is an intermediate state when the instance record created by an unwinding saga exists but is in the `Failed` state. Instances in the `Failed` state are eligible to be restarted by the `instance_reincarnation` background task. That task queries for all `Failed`` instances and creates `instance_start` sagas to attempt to restart them. Therefore, if the `instance_reincarnation` background task runs during the window of time between when an unwinding `instance_create` saga marks the instance record as `Failed` and when it actually attempts to call `project_delete_instance`, the instance can transition to `Starting` by the new instance-start saga. This results in the attempt to delete it failing, causing the unwinding saga to get stuck. This is not great --- it causes a test flake (see #7326), but it's actually a real bug, as it can result in a saga unable to unwind. This commit fixes this by moving most of `project_delete_instance` to a new function, `project_delete_instance_in_state`, which accepts a list of states in which the instance may be deleted as an argument. `project_delete_instance` now calls that function with the "normal" list of states, but unwinding instance-create sagas are additionally able to allow the instance record to be deleted while it's `Creating` in a single atomic database operation, avoiding the transient `Failed` state. This fixes #7326. Unfortunately, it's quite challenging to have a regression test for this, because it would require interrupting the unwinding saga's `sic_delete_instance_record` mid-activation, which we don't really have a nice mechanism for. Also, there was a comment in the `sic_delete_instance_record` action that stated we should be looking up the instance record to delete by ID rather than by project and name. The comment references issue #1536. While I was here changing this action, I figured I'd go ahead and change this as well. My assumption is that the previous thing predates the `LookupPath::instance_id` method? [1]: https://github.com/oxidecomputer/omicron/blob/ec4b5dc3c0c45b667e57a52389d82382b0b59112/nexus/src/app/sagas/instance_create.rs#L970 [2]: https://github.com/oxidecomputer/omicron/blob/ec4b5dc3c0c45b667e57a52389d82382b0b59112/nexus/src/app/sagas/instance_create.rs#L1010-L1033 [3]: https://github.com/oxidecomputer/omicron/blob/ec4b5dc3c0c45b667e57a52389d82382b0b59112/nexus/src/app/sagas/instance_create.rs#L987-L988
There are three places in the instance-create saga where we look up things by name well into the saga:
Nexus::instance_attach_disk()
/Nexus::instance_detach_disk()
and pass along the organization_name, project_name, instance_name, and disk_name from the instance create request. This causes a new lookup of the organization/project/instance/disk path. That seems bad: there's no guarantee that it will find the same disk (or instance or project or organization, for that matter). So we might attach or detach the wrong disk, or we might fail spuriously.instance_ensure()
, we look up the instance by name in order to callNexus::instance_set_runtime()
. This could find a different instance than the one we created. We should be able to use the instance_id stored from a previous action instead.sic_delete_instance_record
, the undo action for the (early) create-instance-record action, we look up the instance name that we created. We should use the instance id instead.RFD 192 talks about this:
In terms of fixing: my first thought here is to do the lookup of organization and project once early in the saga to get the project_id. Then (again, still early in the saga) look up anything else we need (like all the disks) using
LookupPath(...).project_id(...).disk_name(...)
and store their ids. Thereafter, we can always use the ids.We'll probably want to change
Nexus::instance_attach_disk()
andNexus::instance_detach_disk()
to accept anauthz::Instance
andauthz::Disk
(and have the lookups done in the caller).Then I think we can remove the "instance_name" output from any saga action (so we're not ever tempted to use the instance name). We can also remove
organization_name
andproject_name
from theParams
(for the same reason).The text was updated successfully, but these errors were encountered: