Disk Attach has some race conditions #1073

smklein · 2022-05-16T13:02:27Z

Following-up on @plotnick 's comment here, I think there may be some deeper issues with disk attaching / detaching.

Background

The following steps attempt to roughly map out the disk attach process:

1) instance_attach_disk - in app/instance.rs - is invoked.
2) instance_list_disks is invoked
3) The number of attached disks is compared against the maximum permitted for the instance.
4) The disk state is checked, only permitting attaching if it is in a valid state.
5) The instance state is checked, only permitting attaching if it is in a valid state.
6a) If the instance is running...
6a-1) A request is sent through the Sled Agent to update the disk state. This attaches the disk to a running instance first...
6a-2) ... then updates the database.
6b) If the instance is not running...
6b-1) The database is updated with the new runtime.

Issues

Between (2) and (3), other disks may be concurrently attached, bypassing the check. This is a TOCTTOU.
Between (4) and (6a-1), the disk state may be modified before the sled agent request is made. This could result in the sled agent attaching a disk that has been deleted or attached to a different instance.
Between (5) and (6), the instance state may be modified. The instance has a "state_generation" value for optimistic concurrency control, but it is not being checked / modified here.

The text was updated successfully, but these errors were encountered:

smklein · 2022-05-16T14:35:49Z

I believe we can likely re-purpose the collection_insert CTE logic to be usable for an "update" statement. This would allow us to safely:

Check the state of the instance/disk
Perform a database operation validating that the generation number (on both) has not been incremented, which retroactively validates those checks.

smklein · 2022-05-16T14:37:38Z

I'm happy to experiment with the CTE - if anyone reading this issue has spare cycles and can contribute, I'd appreciate help writing tests to reproduce the aforementioned race condition issues.

smklein · 2022-05-17T15:30:24Z

Here's what I've experimented with so far - using cockroachdb's movr dataset as an example.

I did this to make things reproducible - if you copy and paste the following sql to a test.sql file, you can invoke:

$ cockroach demo -f test.sql

To see how this CTE works.

My approach is basically the following:

Check for existence of both rows (by ID), to improve error reporting regardless of the update.
Check for existence of both rows (by conditions). This is equivalent to checking the "state" of the instance / disk.
Perform the updates conditional on both rows existing with the conditions we want. Ideally, they should both be updated or neither should be updated.
Collect the results and return them to the caller. If the update failed, identify whether or not the rows existed at all. If the update succeeds, return the rows we want.

/* Setup - if you EXPLAIN the `WITH` statement, you'll see no full table scans with these indices */
CREATE INDEX on public.users (id);                                                                                                                                                                                                          
CREATE INDEX on public.vehicles (id); 

WITH
  /* Check for the existence of the tables */
  vehicle_by_id as (
    SELECT public.vehicles.id
    FROM public.vehicles
    WHERE public.vehicles.id='bbbbbbbb-bbbb-4800-8000-00000000000b'
  ),
  owner_by_id as (
    SELECT public.users.id
    FROM public.users
    WHERE public.users.id='bd70a3d7-0a3d-4000-8000-000000000025'
  ),

  /* Check for the WHERE clauses we want to validate. */
  vehicle_info as (
    SELECT *
    FROM public.vehicles
    WHERE (
      public.vehicles.id='bbbbbbbb-bbbb-4800-8000-00000000000b' AND
      public.vehicles.status='available'
    )
    FOR UPDATE
  ),
  owner_info as (
    SELECT *
    FROM public.users
    WHERE (
      public.users.id='bd70a3d7-0a3d-4000-8000-000000000025' AND
      public.users.city='amsterdam'
    )
    FOR UPDATE
  ),

  /* Conditionally update both tables */
  updated_owner as (
    UPDATE public.users
      /* This change is nonsensical, but it just shows we *can* do an update here */
      SET credit_card = REVERSE(credit_card)
    WHERE (
      public.users.id='bd70a3d7-0a3d-4000-8000-000000000025' AND
      EXISTS(SELECT id from vehicle_info) AND
      EXISTS(SELECT id from owner_info)
    )
    RETURNING *
  ),
  updated_vehicle as (
    UPDATE public.vehicles
      /* This change is nonsensical, but it just shows we *can* do an update here */
      SET status = REVERSE(status)
    WHERE (
      public.vehicles.id='bbbbbbbb-bbbb-4800-8000-00000000000b' AND
      EXISTS(SELECT id from vehicle_info) AND
      EXISTS(SELECT id from owner_info)
    )
    RETURNING *
  ),

  /* Return both tables */
  result as (
    SELECT
      IF(EXISTS(SELECT id FROM vehicle_by_id), 1, 0) as vehicle_exists,
      IF(EXISTS(SELECT id FROM owner_by_id), 1, 0) as owner_exists,
      COALESCE((SELECT * FROM updated_owner), NULL) as updated_owner,
      COALESCE((SELECT * FROM updated_vehicle), NULL) as updated_vehicle
  )
SELECT * from result;

bnaecker · 2022-05-17T16:00:41Z

I've not looked at the query in depth, but I think there's some simplification to be had in the final SELECT clause. The IF(EXISTS(...), 1, 0) pattern could probably just be EXISTS(...), unless you're really trying to return an integer. The COALESCE clause could similarly just be the inner SELECT part. COALESCE returns the first non-NULL element, or NULL if all of its elements are NULL. So you'll be returning the result of SELECT if it's non-NULL, and NULL otherwise, both with and without COALESCE.

smklein · 2022-05-17T18:58:28Z

@bnaecker : I can definitely remove the IF - parsing the "does it exist" value as a BOOL works fine directly from EXISTS.

Also, I can definitely remove the NULL argument to COALESCE - it's redundant. But I've noticed:

    result as (                                                                                                                                                                                                                               
      SELECT                                                                                                                                                                                                                                  
        EXISTS(SELECT id FROM vehicle_by_id) as vehicle_exists,
        EXISTS(SELECT id FROM owner_by_id) as owner_exists,
        COALESCE((SELECT * FROM updated_owner)) as updated_owner,
        COALESCE((SELECT * FROM updated_vehicle)) as updated_vehicle
    )

For valid data, produces:

  vehicle_exists | owner_exists |  updated_owner  | updated_vehicle
-----------------+--------------+-----------------+--------------------
   true          |  true        | (<USER DATA>)   | (<VEHICLE DATA>)

And for invalid data (e.g., user does not exist) produces:

  vehicle_exists | owner_exists |  updated_owner  | updated_vehicle
-----------------+--------------+-----------------+--------------------
   true          |  false       | NULL            | NULL

However, if I try the following (same SQL, but no COALESCE):

    result as (                                                                                                                                                                                                                               
      SELECT                                                                                                                                                                                                                                  
        EXISTS(SELECT id FROM vehicle_by_id) as vehicle_exists,
        EXISTS(SELECT id FROM owner_by_id) as owner_exists,
        (SELECT * FROM updated_owner) as updated_owner,
        (SELECT * FROM updated_vehicle) as updated_vehicle
    )

Then the query returns the following error:

ERROR: subquery must return only one column, found 5
SQLSTATE: 42601

bnaecker · 2022-05-17T19:40:14Z

Ah, right. This isn't super well-documented, but the return value of COALESCE appears to be a tuple in this circumstance. That means in the result that works, the column updated_owner is a tuple where each element is a column from the record.

That is, the entry for the updated_owner column of your result record is something like:

(id, city, type, ...)

That might be OK for you, since you could deserialize that tuple in Rust. But you might need (or want) to return a row with each element as a scalar. One way to do that is to name each column explicitly in the last part:

...,
updated_vehicle AS (...)
SELECT
    EXISTS(SELECT id FROM vehicle_by_id) AS vehicle_exists,
    EXISTS(SELECT id FROM owner_by_id) AS owner_exists,
    id FROM update_vehicle,
    city FROM updated_vehicle,
    type FROM updated_vehicle,
    ...

Note that there's no result common table expression here, just selecting from the others.

That's definitely more verbose, but I've found writing out the columns to be helpful when developing these complicated queries. There's also probably a way to use CRDB's "tuple unpacking" operation, which looks like SELECT (tuple).*, but I'm not sure how to stuff that into this query in a way that makes it any better.

Hope that helps, didn't mean to sidetrack the more critical parts of the query.

bnaecker · 2022-05-17T22:44:07Z

Actually, maybe a better idea is to use a JOIN here rather than another CTE:

updated_vehicle AS (....)
SELECT * FROM
    (SELECT EXISTS(SELECT id FROM vehicle_by_id) AS vehicle_exists),
    (SELECT EXISTS(SELECT id FROM owner_by_id) AS owner_exists),
    updated_owner,
    updated_vehicle
WHERE
    (updated_owner.id = updated_vehicle.owner_id)

You'll get rows like:

  vehicle_exists | owner_exists | id | city | name | address | credit_card | id | city | type | owner_id | creation_time | status | current_location | ext
-----------------+--------------+----+------+------+---------+-------------+----+------+------+----------+---------------+--------+------------------+------
(0 rows)

Propolis changes since the last update: Gripe when using non-raw block device Update zerocopy dependency nvme: Wire up GetFeatures command Make Viona more robust in the face of errors bump softnpu (#577) Modernize 16550 UART Crucible changes since the last update: Don't check ROP if the scrub is done (#1093) Allow crutest cli to be quiet on generic test (#1070) Offload write encryption (#1066) Simplify handling of BlockReq at program exit (#1085) Update Rust crate byte-unit to v5 (#1054) Remove unused fields in match statements, downstairs edition (#1084) Remove unused fields in match statements and consolidate (#1083) Add logger to Guest (#1082) Drive hash / decrypt tests from Upstairs::apply Wait to reconnect if auto_promote is false Change guest work id from u64 -> GuestWorkId remove BlockOp::Commit (#1072) Various clippy fixes (#1071) Don't panic if tasks are destroyed out of order Update Rust crate reedline to 0.27.1 (#1074) Update Rust crate async-trait to 0.1.75 (#1073) Buffer should destructure to Vec when single-referenced Don't fail to make unencrypted regions (#1067) Fix shadowing in downstairs (#1063) Single-task refactoring (#1058) Update Rust crate tokio to 1.35 (#1052) Update Rust crate openapiv3 to 2.0.0 (#1050) Update Rust crate libc to 0.2.151 (#1049) Update Rust crate rusqlite to 0.30 (#1035)

Propolis changes since the last update: Gripe when using non-raw block device Update zerocopy dependency nvme: Wire up GetFeatures command Make Viona more robust in the face of errors bump softnpu (#577) Modernize 16550 UART Crucible changes since the last update: Don't check ROP if the scrub is done (#1093) Allow crutest cli to be quiet on generic test (#1070) Offload write encryption (#1066) Simplify handling of BlockReq at program exit (#1085) Update Rust crate byte-unit to v5 (#1054) Remove unused fields in match statements, downstairs edition (#1084) Remove unused fields in match statements and consolidate (#1083) Add logger to Guest (#1082) Drive hash / decrypt tests from Upstairs::apply Wait to reconnect if auto_promote is false Change guest work id from u64 -> GuestWorkId remove BlockOp::Commit (#1072) Various clippy fixes (#1071) Don't panic if tasks are destroyed out of order Update Rust crate reedline to 0.27.1 (#1074) Update Rust crate async-trait to 0.1.75 (#1073) Buffer should destructure to Vec when single-referenced Don't fail to make unencrypted regions (#1067) Fix shadowing in downstairs (#1063) Single-task refactoring (#1058) Update Rust crate tokio to 1.35 (#1052) Update Rust crate openapiv3 to 2.0.0 (#1050) Update Rust crate libc to 0.2.151 (#1049) Update Rust crate rusqlite to 0.30 (#1035) --------- Co-authored-by: Alan Hanson <alan@oxide.computer>

smklein added bug Something that isn't working. nexus Related to nexus labels May 16, 2022

smklein mentioned this issue May 23, 2022

[nexus] Re-implement disk attach/detach #1106

Merged

4 tasks

smklein closed this as completed in #1106 May 27, 2022

leftwo mentioned this issue Jan 10, 2024

Update Crucible and Propolis versions #4795

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Disk Attach has some race conditions #1073

Disk Attach has some race conditions #1073

smklein commented May 16, 2022 •

edited

Loading

smklein commented May 16, 2022

Uh oh!

smklein commented May 16, 2022

Uh oh!

smklein commented May 17, 2022 •

edited

Loading

Uh oh!

bnaecker commented May 17, 2022 •

edited

Loading

Uh oh!

smklein commented May 17, 2022 •

edited

Loading

Uh oh!

bnaecker commented May 17, 2022 •

edited

Loading

Uh oh!

bnaecker commented May 17, 2022

Uh oh!

Disk Attach has some race conditions #1073

Disk Attach has some race conditions #1073

Comments

smklein commented May 16, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Background

Issues

smklein commented May 16, 2022

Uh oh!

smklein commented May 16, 2022

Uh oh!

smklein commented May 17, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bnaecker commented May 17, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

smklein commented May 17, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bnaecker commented May 17, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bnaecker commented May 17, 2022

Uh oh!

smklein commented May 16, 2022 •

edited

Loading

smklein commented May 17, 2022 •

edited

Loading

bnaecker commented May 17, 2022 •

edited

Loading

smklein commented May 17, 2022 •

edited

Loading

bnaecker commented May 17, 2022 •

edited

Loading