
Region snapshot replacement for deleted snapshots #7862

Open
wants to merge 2 commits into main

Conversation

jmpesp
Contributor

@jmpesp jmpesp commented Mar 24, 2025


There were many tests for when a snapshot is deleted _after_ a region
snapshot replacement was requested and the process had started, but not
for when one was deleted _before_ any of the associated sagas ran.
Deleting the snapshot caused the snapshot volume to be hard-deleted,
which caused the region snapshot replacement machinery to believe it
couldn't do its job.

My first thought was that Nexus needed to _not_ hard-delete the snapshot
volume, and this would let the region snapshot replacement proceed.
Nexus would also need to reconstruct any of the snapshot volumes that
were hard-deleted through some arduous process of finding all the copies
of the snapshot volume in the read-only parent fields of other volumes.

But then I started diving into the code relevant to the early stages of
region snapshot replacement. When inserting a region snapshot
replacement request, the snapshot volume ID was required so that a
volume repair record can "lock" the volume. The comments here state that
the repair record isn't needed (as snapshot volumes are never directly
constructed and the lock is meant to serialize access to a volume so
that an associated Upstairs won't receive multiple replacement requests)
and that the lock was done out of an abundance of caution.

Looking at the "region snapshot replacement start" saga, it doesn't need
the contents of the volume either: the snapshot volume ID is used to
find other resources, but the volume contents themselves are never read.

Digging further, it became apparent that not only was this lock not
required (narrator: spoiler alert, it actually was), the snapshot volume
wasn't either!

The first step in realizing this was to stop creating the volume repair
record with the snapshot's (or read-only region's) volume ID when
inserting region snapshot replacement requests into the DB. This
required changing a bunch of unit tests, but those unit tests were
creating 100% blank volumes to pass along for this purpose, further
proving that the contents of the volume were not actually required:

```patch
@@ -1675,21 +1484,6 @@ mod test {
         let region_id = Uuid::new_v4();
         let snapshot_id = Uuid::new_v4();

-        let volume_id = VolumeUuid::new_v4();
-
-        datastore
-            .volume_create(
-                volume_id,
-                VolumeConstructionRequest::Volume {
-                    id: Uuid::new_v4(), // not required to match!
-                    block_size: 512,
-                    sub_volumes: vec![], // nothing needed here
-                    read_only_parent: None,
-                },
-            )
-            .await
-            .unwrap();
-
         let request = RegionSnapshotReplacement::new_from_region_snapshot(
             dataset_id,
             region_id,
```

The next step in realizing that the snapshot volume was not required
was noticing that no other region snapshot replacement related code had
to change!

_Most_ of the test suite passed after this change; however, the
integration test I added for the issue seen in #7790 (plus some others)
was intermittently failing.

Was this the wrong move?

Nope, it just revealed a case where serialization _was_ required:

- without a volume repair record on the snapshot volume, multiple region
  snapshot replacement requests could now concurrently run for the same
  snapshot volume

- this is ok! the swapping of read-only targets will be serialized by
  the transaction in `volume_replace_snapshot`

- but: caller A and caller B both attempting to

  1. get the number of currently allocated regions for a snapshot volume
  2. allocate a single read-only replacement region for the snapshot
     volume using a redundancy level increased by 1

  will, with a certain interleaving, _both_ think that they've increased
  the number of allocated regions for a snapshot volume by 1, but both
  see the same new region. Classic TOCTOU! (See the sketch after this
  list.)

- two region snapshot replacement requests for different region
  snapshots that both have the _same_ new replacement region will result
  in (after the two `volume_replace_snapshot` calls) a snapshot volume
  that has duplicate targets (this was the motivation for the check in
  #7846, "Validate Volume region sets have unique targets").
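
To make that interleaving concrete, here is a minimal, self-contained
sketch of the race. Everything in it (`FakeDatastore`, `Region`,
`allocate_to_redundancy`) is a hypothetical stand-in for illustration,
not the real Nexus datastore API:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Region(u32);

#[derive(Default)]
struct FakeDatastore {
    allocated: Vec<Region>, // regions currently allocated for one snapshot volume
    next_id: u32,
}

impl FakeDatastore {
    // The "check": read how many regions the snapshot volume currently has.
    fn count_allocated(&self) -> usize {
        self.allocated.len()
    }

    // The "use": allocate until `redundancy` regions exist, returning the
    // full set. Like "allocate with the redundancy level increased by 1",
    // this targets a *count*, so two callers asking for the same target
    // converge on the same final region set.
    fn allocate_to_redundancy(&mut self, redundancy: usize) -> Vec<Region> {
        while self.allocated.len() < redundancy {
            self.allocated.push(Region(self.next_id));
            self.next_id += 1;
        }
        self.allocated.clone()
    }
}

fn main() {
    let mut ds = FakeDatastore::default();
    ds.allocate_to_redundancy(3); // the snapshot volume starts with 3 regions

    // The bad interleaving: callers A and B both read the count before
    // either of them allocates.
    let before_a = ds.count_allocated(); // 3
    let before_b = ds.count_allocated(); // 3

    let after_a = ds.allocate_to_redundancy(before_a + 1); // creates Region(3)
    let after_b = ds.allocate_to_redundancy(before_b + 1); // 4 already exist: creates nothing

    // Each caller assumes the regions beyond its earlier count are "its" new ones.
    let new_a = &after_a[before_a..];
    let new_b = &after_b[before_b..];
    assert_eq!(new_a, new_b); // both callers picked Region(3) as their replacement
    println!("caller A got {new_a:?}, caller B got {new_b:?}");
}
```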

Concurrently getting the number of currently allocated regions, then
calling `arbitrary_region_allocate` in this way is not safe, so Nexus is
required to serialize these callers. The general way to do this is...
using the volume repair record to "lock" the volume.

Before this commit, creating a volume repair record for a hard-deleted
volume was not possible: the creation routine would return an error,
saying that it didn't make sense to lock a non-existent volume.

Nexus is faced with the following scenario:

- users can delete snapshots at any time, which will hard-delete the
  snapshot volume

- getting the current number of allocated regions for the hard-deleted
  snapshot volume and calling `arbitrary_region_allocate` needs to be
  exclusive

So this commit relaxes that restriction: volume repair records can now
be created for either not-created-yet or hard-deleted volumes. This
means that region snapshot replacements will still be serialized even
for hard-deleted snapshot volumes.
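
As a rough schematic of the relaxed rule, here is a sketch where the
repair record is modeled as nothing more than an entry in a set keyed by
volume ID. `MiniDatastore`, the method names, and the error variant are
made up for illustration and are not the real Nexus datastore API:

```rust
use std::collections::HashSet;

type VolumeUuid = u128; // stand-in for the real typed volume UUID

#[derive(Debug, PartialEq)]
enum RepairError {
    AlreadyLocked(VolumeUuid),
}

#[derive(Default)]
struct MiniDatastore {
    // One repair record per volume ID acts as the "lock". Note there is
    // deliberately no check against a table of existing volumes: the lock
    // can be taken for a volume that was hard-deleted or never created.
    volume_repair: HashSet<VolumeUuid>,
}

impl MiniDatastore {
    fn create_repair_record(&mut self, id: VolumeUuid) -> Result<(), RepairError> {
        if !self.volume_repair.insert(id) {
            return Err(RepairError::AlreadyLocked(id));
        }
        Ok(())
    }

    fn delete_repair_record(&mut self, id: VolumeUuid) {
        self.volume_repair.remove(&id);
    }
}

fn main() {
    let mut ds = MiniDatastore::default();
    let snapshot_volume: VolumeUuid = 7;

    // The snapshot volume was hard-deleted, but a region snapshot
    // replacement request can still take the lock...
    assert_eq!(ds.create_repair_record(snapshot_volume), Ok(()));

    // ...and a second replacement request for the same (deleted) snapshot
    // volume is still serialized behind the first.
    assert_eq!(
        ds.create_repair_record(snapshot_volume),
        Err(RepairError::AlreadyLocked(snapshot_volume))
    );

    ds.delete_repair_record(snapshot_volume);
    assert_eq!(ds.create_repair_record(snapshot_volume), Ok(()));
}
```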

The alternative to this would be either some other form of exclusivity
(another lock), or spending cycles changing the "get the number of
currently allocated regions, then allocate an additional one" pattern to
be safe for multiple callers. In the name of urgency, the existing
volume repair record is used without the aforementioned restriction.
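
For completeness, one possible shape of that alternative (hypothetical,
not what this PR does) would be to collapse the check and the allocation
into a single step that hands the caller back exactly the region created
for it:

```rust
use std::sync::Mutex;

#[derive(Clone, Copy, Debug, PartialEq)]
struct Region(u32);

#[derive(Default)]
struct AtomicRegionAllocator {
    // (regions allocated for the snapshot volume, next region id)
    state: Mutex<(Vec<Region>, u32)>,
}

impl AtomicRegionAllocator {
    // Check and use collapsed into one critical section: each call hands
    // back a distinct region, even for concurrent callers, because the
    // Mutex covers both the count and the allocation.
    fn allocate_one_more(&self) -> Region {
        let mut state = self.state.lock().unwrap();
        let region = Region(state.1);
        state.1 += 1;
        state.0.push(region);
        region
    }
}

fn main() {
    let allocator = AtomicRegionAllocator::default();
    let a = allocator.allocate_one_more();
    let b = allocator.allocate_one_more();
    assert_ne!(a, b); // two callers can never be handed the same new region
    println!("caller A: {a:?}, caller B: {b:?}");
}
```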

Fixes #7790
@jmpesp jmpesp requested a review from leftwo March 24, 2025 21:10
```patch
@@ -122,7 +122,7 @@ impl slog::KV for ReadOnlyTargetReplacement {
         }

         ReadOnlyTargetReplacement::ReadOnlyRegion { region_id } => {
-            serializer.emit_str("type".into(), "region_snapshot")?;
+            serializer.emit_str("type".into(), "read_only_region")?;
```
Contributor

Where does this string appear?

Contributor Author

Any time that enum is put into the slog KV section, mainly in the associated background tasks I think.

"you can boot any image, as long as it's alpine",
),
},
source: params::ImageSource::YouCanBootAnythingAsLongAsItsAlpine,
Contributor

Is this still around? I still see it in a bunch of places, so I don't know if only the external side got removed and the internal stuff is different?

Contributor Author

Yeah, I think it was only removed from the openapi json.

@jmpesp
Contributor Author

jmpesp commented Apr 8, 2025

It's been a while, so I merged with main and pushed just now.

Development

Successfully merging this pull request may close these issues.

post sled expungement, not all volumes were repaired