Skip to content

Disk Attach has some race conditions #1073

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
smklein opened this issue May 16, 2022 · 7 comments · Fixed by #1106
Closed

Disk Attach has some race conditions #1073

smklein opened this issue May 16, 2022 · 7 comments · Fixed by #1106
Labels
bug Something that isn't working. nexus Related to nexus

Comments

@smklein
Copy link
Collaborator

smklein commented May 16, 2022

Following-up on @plotnick 's comment here, I think there may be some deeper issues with disk attaching / detaching.

Background

The following steps attempt to roughly map out the disk attach process:

Issues

  • Between (2) and (3), other disks may be concurrently attached, bypassing the check. This is a TOCTTOU.
  • Between (4) and (6a-1), the disk state may be modified before the sled agent request is made. This could result in the sled agent attaching a disk that has been deleted or attached to a different instance.
  • Between (5) and (6), the instance state may be modified. The instance has a "state_generation" value for optimistic concurrency control, but it is not being checked / modified here.
@smklein smklein added bug Something that isn't working. nexus Related to nexus labels May 16, 2022
@smklein
Copy link
Collaborator Author

smklein commented May 16, 2022

I believe we can likely re-purpose the collection_insert CTE logic to be usable for an "update" statement. This would allow us to safely:

  • Check the state of the instance/disk
  • Perform a database operation validating that the generation number (on both) has not been incremented, which retroactively validates those checks.

@smklein
Copy link
Collaborator Author

smklein commented May 16, 2022

I'm happy to experiment with the CTE - if anyone reading this issue has spare cycles and can contribute, I'd appreciate help writing tests to reproduce the aforementioned race condition issues.

@smklein
Copy link
Collaborator Author

smklein commented May 17, 2022

Here's what I've experimented with so far - using cockroachdb's movr dataset as an example.

I did this to make things reproducible - if you copy and paste the following sql to a test.sql file, you can invoke:

$ cockroach demo -f test.sql

To see how this CTE works.

My approach is basically the following:

  1. Check for existence of both rows (by ID), to improve error reporting regardless of the update.
  2. Check for existence of both rows (by conditions). This is equivalent to checking the "state" of the instance / disk.
  3. Perform the updates conditional on both rows existing with the conditions we want. Ideally, they should both be updated or neither should be updated.
  4. Collect the results and return them to the caller. If the update failed, identify whether or not the rows existed at all. If the update succeeds, return the rows we want.
/* Setup - if you EXPLAIN the `WITH` statement, you'll see no full table scans with these indices */
CREATE INDEX on public.users (id);                                                                                                                                                                                                          
CREATE INDEX on public.vehicles (id); 

WITH
  /* Check for the existence of the tables */
  vehicle_by_id as (
    SELECT public.vehicles.id
    FROM public.vehicles
    WHERE public.vehicles.id='bbbbbbbb-bbbb-4800-8000-00000000000b'
  ),
  owner_by_id as (
    SELECT public.users.id
    FROM public.users
    WHERE public.users.id='bd70a3d7-0a3d-4000-8000-000000000025'
  ),

  /* Check for the WHERE clauses we want to validate. */
  vehicle_info as (
    SELECT *
    FROM public.vehicles
    WHERE (
      public.vehicles.id='bbbbbbbb-bbbb-4800-8000-00000000000b' AND
      public.vehicles.status='available'
    )
    FOR UPDATE
  ),
  owner_info as (
    SELECT *
    FROM public.users
    WHERE (
      public.users.id='bd70a3d7-0a3d-4000-8000-000000000025' AND
      public.users.city='amsterdam'
    )
    FOR UPDATE
  ),

  /* Conditionally update both tables */
  updated_owner as (
    UPDATE public.users
      /* This change is nonsensical, but it just shows we *can* do an update here */
      SET credit_card = REVERSE(credit_card)
    WHERE (
      public.users.id='bd70a3d7-0a3d-4000-8000-000000000025' AND
      EXISTS(SELECT id from vehicle_info) AND
      EXISTS(SELECT id from owner_info)
    )
    RETURNING *
  ),
  updated_vehicle as (
    UPDATE public.vehicles
      /* This change is nonsensical, but it just shows we *can* do an update here */
      SET status = REVERSE(status)
    WHERE (
      public.vehicles.id='bbbbbbbb-bbbb-4800-8000-00000000000b' AND
      EXISTS(SELECT id from vehicle_info) AND
      EXISTS(SELECT id from owner_info)
    )
    RETURNING *
  ),

  /* Return both tables */
  result as (
    SELECT
      IF(EXISTS(SELECT id FROM vehicle_by_id), 1, 0) as vehicle_exists,
      IF(EXISTS(SELECT id FROM owner_by_id), 1, 0) as owner_exists,
      COALESCE((SELECT * FROM updated_owner), NULL) as updated_owner,
      COALESCE((SELECT * FROM updated_vehicle), NULL) as updated_vehicle
  )
SELECT * from result;

@bnaecker
Copy link
Collaborator

bnaecker commented May 17, 2022

I've not looked at the query in depth, but I think there's some simplification to be had in the final SELECT clause. The IF(EXISTS(...), 1, 0) pattern could probably just be EXISTS(...), unless you're really trying to return an integer. The COALESCE clause could similarly just be the inner SELECT part. COALESCE returns the first non-NULL element, or NULL if all of its elements are NULL. So you'll be returning the result of SELECT if it's non-NULL, and NULL otherwise, both with and without COALESCE.

@smklein
Copy link
Collaborator Author

smklein commented May 17, 2022

@bnaecker : I can definitely remove the IF - parsing the "does it exist" value as a BOOL works fine directly from EXISTS.

Also, I can definitely remove the NULL argument to COALESCE - it's redundant. But I've noticed:

    result as (                                                                                                                                                                                                                               
      SELECT                                                                                                                                                                                                                                  
        EXISTS(SELECT id FROM vehicle_by_id) as vehicle_exists,
        EXISTS(SELECT id FROM owner_by_id) as owner_exists,
        COALESCE((SELECT * FROM updated_owner)) as updated_owner,
        COALESCE((SELECT * FROM updated_vehicle)) as updated_vehicle
    )     

For valid data, produces:

  vehicle_exists | owner_exists |  updated_owner  | updated_vehicle
-----------------+--------------+-----------------+--------------------
   true          |  true        | (<USER DATA>)   | (<VEHICLE DATA>)

And for invalid data (e.g., user does not exist) produces:

  vehicle_exists | owner_exists |  updated_owner  | updated_vehicle
-----------------+--------------+-----------------+--------------------
   true          |  false       | NULL            | NULL

However, if I try the following (same SQL, but no COALESCE):

    result as (                                                                                                                                                                                                                               
      SELECT                                                                                                                                                                                                                                  
        EXISTS(SELECT id FROM vehicle_by_id) as vehicle_exists,
        EXISTS(SELECT id FROM owner_by_id) as owner_exists,
        (SELECT * FROM updated_owner) as updated_owner,
        (SELECT * FROM updated_vehicle) as updated_vehicle
    )     

Then the query returns the following error:

ERROR: subquery must return only one column, found 5
SQLSTATE: 42601

@bnaecker
Copy link
Collaborator

bnaecker commented May 17, 2022

Ah, right. This isn't super well-documented, but the return value of COALESCE appears to be a tuple in this circumstance. That means in the result that works, the column updated_owner is a tuple where each element is a column from the record.

That is, the entry for the updated_owner column of your result record is something like:

(id, city, type, ...)

That might be OK for you, since you could deserialize that tuple in Rust. But you might need (or want) to return a row with each element as a scalar. One way to do that is to name each column explicitly in the last part:

...,
updated_vehicle AS (...)
SELECT
    EXISTS(SELECT id FROM vehicle_by_id) AS vehicle_exists,
    EXISTS(SELECT id FROM owner_by_id) AS owner_exists,
    id FROM update_vehicle,
    city FROM updated_vehicle,
    type FROM updated_vehicle,
    ...

Note that there's no result common table expression here, just selecting from the others.

That's definitely more verbose, but I've found writing out the columns to be helpful when developing these complicated queries. There's also probably a way to use CRDB's "tuple unpacking" operation, which looks like SELECT (tuple).*, but I'm not sure how to stuff that into this query in a way that makes it any better.

Hope that helps, didn't mean to sidetrack the more critical parts of the query.

@bnaecker
Copy link
Collaborator

Actually, maybe a better idea is to use a JOIN here rather than another CTE:

updated_vehicle AS (....)
SELECT * FROM
    (SELECT EXISTS(SELECT id FROM vehicle_by_id) AS vehicle_exists),
    (SELECT EXISTS(SELECT id FROM owner_by_id) AS owner_exists),
    updated_owner,
    updated_vehicle
WHERE
    (updated_owner.id = updated_vehicle.owner_id)

You'll get rows like:

  vehicle_exists | owner_exists | id | city | name | address | credit_card | id | city | type | owner_id | creation_time | status | current_location | ext
-----------------+--------------+----+------+------+---------+-------------+----+------+------+----------+---------------+--------+------------------+------
(0 rows)

leftwo pushed a commit that referenced this issue Jan 10, 2024
Propolis changes since the last update:
Gripe when using non-raw block device
Update zerocopy dependency
nvme: Wire up GetFeatures command
Make Viona more robust in the face of errors
bump softnpu (#577)
Modernize 16550 UART

Crucible changes since the last update:
Don't check ROP if the scrub is done (#1093)
Allow crutest cli to be quiet on generic test (#1070)
Offload write encryption (#1066)
Simplify handling of BlockReq at program exit (#1085)
Update Rust crate byte-unit to v5 (#1054)
Remove unused fields in match statements, downstairs edition (#1084)
Remove unused fields in match statements and consolidate (#1083)
Add logger to Guest (#1082)
Drive hash / decrypt tests from Upstairs::apply
Wait to reconnect if auto_promote is false
Change guest work id from u64 -> GuestWorkId
remove BlockOp::Commit (#1072)
Various clippy fixes (#1071)
Don't panic if tasks are destroyed out of order
Update Rust crate reedline to 0.27.1 (#1074)
Update Rust crate async-trait to 0.1.75 (#1073)
Buffer should destructure to Vec when single-referenced
Don't fail to make unencrypted regions (#1067)
Fix shadowing in downstairs (#1063)
Single-task refactoring (#1058)
Update Rust crate tokio to 1.35 (#1052)
Update Rust crate openapiv3 to 2.0.0 (#1050)
Update Rust crate libc to 0.2.151 (#1049)
Update Rust crate rusqlite to 0.30 (#1035)
leftwo added a commit that referenced this issue Jan 11, 2024
Propolis changes since the last update:
Gripe when using non-raw block device
Update zerocopy dependency
nvme: Wire up GetFeatures command
Make Viona more robust in the face of errors
bump softnpu (#577)
Modernize 16550 UART

Crucible changes since the last update:
Don't check ROP if the scrub is done (#1093)
Allow crutest cli to be quiet on generic test (#1070)
Offload write encryption (#1066)
Simplify handling of BlockReq at program exit (#1085)
Update Rust crate byte-unit to v5 (#1054)
Remove unused fields in match statements, downstairs edition (#1084)
Remove unused fields in match statements and consolidate (#1083)
Add logger to Guest (#1082)
Drive hash / decrypt tests from Upstairs::apply
Wait to reconnect if auto_promote is false
Change guest work id from u64 -> GuestWorkId
remove BlockOp::Commit (#1072)
Various clippy fixes (#1071)
Don't panic if tasks are destroyed out of order
Update Rust crate reedline to 0.27.1 (#1074)
Update Rust crate async-trait to 0.1.75 (#1073)
Buffer should destructure to Vec when single-referenced
Don't fail to make unencrypted regions (#1067)
Fix shadowing in downstairs (#1063)
Single-task refactoring (#1058)
Update Rust crate tokio to 1.35 (#1052)
Update Rust crate openapiv3 to 2.0.0 (#1050)
Update Rust crate libc to 0.2.151 (#1049)
Update Rust crate rusqlite to 0.30 (#1035)

---------

Co-authored-by: Alan Hanson <alan@oxide.computer>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that isn't working. nexus Related to nexus
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants