Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix TOCTOU when adding network interfaces to an instance #1222

Merged
merged 4 commits into from
Jun 22, 2022
Merged

Conversation

bnaecker
Copy link
Collaborator

  • Fixes a TOCTOU bug, where we checked if an instance is stopped and
    then issue a separate query to actually create/attach a network
    interface to it. Done by adding an additional subquery which verifies
    the instance state in an additional CTE, at both create and delete
    time.
  • Adds additional tests for checking the new instance verification
  • Improves instance integration test coverage by using the endpoint to
    list interfaces both when the expected list is empty and non-empty

- Fixes a TOCTOU bug, where we checked if an instance is stopped and
  then issue a separate query to actually create/attach a network
  interface to it. Done by adding an additional subquery which verifies
  the instance state in an additional CTE, at both create and delete
  time.
- Adds additional tests for checking the new instance verification
- Improves instance integration test coverage by using the endpoint to
  list interfaces both when the expected list is empty and non-empty
@bnaecker
Copy link
Collaborator Author

Resolves #1134

@smklein smklein self-assigned this Jun 21, 2022
// NOTE: We need to lookup the VPC and VPC Subnet, since we need both
// IDs for creating the network interface.
//
// TODO-correctness: There are additional races here. The VPC and VPC
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it isn't one, it's another, eh?

(Semi-related to this PR, I really want better testing / analysis to help us suss these issues out. DB-related raciness is an area with a lot of potential for unsafety)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it isn't one, it's another, eh?

Indeed :)

I really want better testing / analysis to help us suss these issues out

Yeah, that'd be nice. I'm not sure how to force ourselves to hit this, since the flow would be: fetch the VPC Subnet successfully, delete the VPC Subnet, then try to run this query. How would we do that? Maybe part of a set of stress tests?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. I can think of a view options:

  • Random-walk stress testing: ideally, having a "set of operations" and "set of objects/names" that can all be acted upon, and doing some random-walk style operations would be neat.
  • Permutation testing: we could try to integrate something like https://github.com/tokio-rs/loom to invasively test the DB-related ops. There would still be reliance on CRDB, however, so these tests would probably not run as a part of the CI check - they might be delegated to "once daily" jobs.
  • Model checking: we could attempt to model the DB interactions externally (e.g., through TLA+) and try to prove safety properties about them. I'm skeptical about the cost/benefit of this approach; it seems like a more expensive option.

Copy link
Collaborator Author

@bnaecker bnaecker Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the second approach. The first is probably easiest, but I'd be worried about baking in our assumptions about how things should be used. I agree the formal verification approach is likely to be too heavyweight.

nexus/src/db/queries/network_interface.rs Show resolved Hide resolved
nexus/src/db/queries/network_interface.rs Outdated Show resolved Hide resolved
nexus/src/db/queries/network_interface.rs Outdated Show resolved Hide resolved
@smklein smklein assigned bnaecker and unassigned smklein Jun 21, 2022
- Reduce const visibility
- SQL formatting
- Refactor tests for specific behavior
@bnaecker bnaecker enabled auto-merge (squash) June 21, 2022 22:28
@bnaecker bnaecker merged commit 1d52a71 into main Jun 22, 2022
@bnaecker bnaecker deleted the no-racy-racy branch June 22, 2022 01:48
leftwo added a commit that referenced this pull request Mar 29, 2024
Propolis changes:
Add `IntrPin::import_state` and migrate LPC UART pin states (#669)
Attempt to set WCE for raw file backends
Fix clippy/lint nits for rust 1.77.0

Crucible changes:
Correctly (and robustly) count bytes (#1237)
test-replay.sh fix name of DTrace script (#1235)
BlockReq -> BlockOp (#1234)
Simplify `BlockReq` (#1218)
DTrace, cmon, cleanup, retry downstairs connections at 10 seconds.
(#1231)
Remove `MAX_ACTIVE_COUNT` flow control system (#1217)

Crucible changes that were in Omicron but not in Propolis before this commit.
Return *410 Gone* if volume is inactive (#1232)
Update Rust crate opentelemetry to 0.22.0 (#1224)
Update Rust crate base64 to 0.22.0 (#1222)
Update Rust crate async-recursion to 1.1.0 (#1221)
Minor cleanups to extent implementations (#1230)
Update Rust crate http to 0.2.12 (#1220)
Update Rust crate reedline to 0.30.0 (#1227)
Update Rust crate rayon to 1.9.0 (#1226)
Update Rust crate nix to 0.28 (#1223)
Update Rust crate async-trait to 0.1.78 (#1219)
Various buffer optimizations (#1211)
Add low-level test for message encoding (#1214)
Don't let df failures ruin the buildomat tests (#1213)
Activate the NBD server's psuedo file (#1209)

---------

Co-authored-by: Alan Hanson <alan@oxide.computer>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants