release-21.1: migration,keys: fix incorrect range descriptor iteration #67366

irfansharif · 2021-07-08T15:04:03Z

The long-running migrations infrastructure exposes an
IterateRangeDescriptors interface to paginate through all the ranges
in the system. It was previously doing this (erroneously) by scanning
through the meta2 range. It was thus possible to entirely miss
descriptors for the meta ranges themselves. This could only happen for
large enough clusters with splits in the meta range. This issue was
caught during the careful review of #66445.

We remedy the situation by scanning over the entire [MetaMin, MetaMax)
span instead. One thing to take care of is the possibility of finding
the same range descriptor in both the meta1 and meta2 range. This is
possible when the first range in the system includes part of the user
keyspace (typically [/Min, /System/NodeLiveness)). We de-dup these
descriptors away by maintaining an in-memory map of the range IDs seen
in meta1.

What does this mean for existing long running migrations? There was only
one that made use of this descriptor iteration API: the truncated state
migration. Fortunately that migration was a below-raft one, so as part
of it we purged all extant ranges that hadn't seen the corresponding
Migrate command (see postTruncatedStateMigration and
PurgeOutdatedReplicas; the purge attempt reaches out to every store
and ensures every replica has been migrated, and if not, has been GC-ed).

If we had found unmigrated ranges (possible for the meta ranges as
described above), the migration would be blocked. We haven't seen this
happen with users running 21.1, which makes sense since this bug is only
possible in clusters with enough ranges to necessitate a split in the
meta range (very unlikely with our large default range size of 512 MB).

Still, we'll backport this PR to 21.1. Should users run into this
stalled migration, we'll recommend upgrading to whatever patch release
this will be part of.

Release note: None
Release justification: bug fixes and low-risk updates to new functionality

cockroach-teamcity · 2021-07-08T15:04:12Z

This change is

When executing a split, it's surprisingly tricky to learn what the resulting left-hand and right-hand side is. This is because when you retrieve it "after the fact", other operations may have changed the split already (for example, the split could have been merged, or additional splits added) and while you would get descriptors back, they wouldn't be meaningfully connected to the split any more in all cases. Really one would want to return the descriptors from the split txn itself, but AdminSplit is a no-op when the split already exists and so we would need special logic that performs a range lookup on the left neighbor. It could all be done, but does not seem worth it. There's still a nice simplification here that lets us remove the ad-hoc code, and we'll just accept that when there are concurrent splits the return values may not exactly line up with the split. This came out of cockroachdb#64060 (comment). Release note: None

The long-running migrations infrastructure exposes an `IterateRangeDescriptors` interface to paginate through all the ranges in the system. It was previously doing this (erroneously) by scanning through the meta2 range. It was thus possible to entirely miss descriptors for the meta ranges themselves. This could only happen for large enough clusters with splits in the meta range. This issue was caught during the careful review of cockroachdb#66445. We remedy the situation by scanning over the entire `[MetaMin, MetaMax)` span instead. One thing to take care of is the possibility of finding the same range descriptor in both the `meta1` and `meta2` range. This is possible when the first range in the system includes part of the user keyspace (typically `[/Min, /System/NodeLiveness)`). We de-dup these descriptors away by maintaining an in-memory map of the range IDs seen in meta1. --- What does this mean for existing long running migrations? There was only one that made use of this descriptor iteration API: the truncated state migration. Fortunately that migration was a below-raft one, so as part of it we purged all extant ranges that hadn't seen the corresponding `Migrate` command (see `postTruncatedStateMigration` and `PurgeOutdatedReplicas`; the purge attempt reaches out to every store and ensures every replica has been migrated, and if not, has been GC-ed). If we had found unmigrated ranges (possible for the meta ranges as described above), the migration would be blocked. We haven't seen this happen with users running 21.1, which makes sense since this bug is only possible in clusters with enough ranges to necessitate a split in the meta range (very unlikely with our large default range size of 512 MB). Still, we'll backport this PR to 21.1. Should users run into this stalled migration, we'll recommend upgrading to whatever patch release this will be part of. Release note: None Release justification: bug fixes and low-risk updates to new functionality

irfansharif · 2021-07-12T17:11:23Z

Huh, CI in the release-21.1 branch fails with what looks like buggy testserver code: client_test.go:49: could not look up left-hand side descriptor: end key /Meta1/"\x00" must be greater than start /Meta2/"". I see @tbg cleaned up this area in #64155, which I've now pre-prepended to this PR (should now pass CI just fine -- I'll wait till green).

irfansharif requested review from tbg and nvanbenschoten July 8, 2021 15:04

irfansharif changed the title ~~migration,keys: fix incorrect range descriptor iteration~~ release-21.1: migration,keys: fix incorrect range descriptor iteration Jul 8, 2021

tbg approved these changes Jul 8, 2021

View reviewed changes

irfansharif force-pushed the backport21.1-67315 branch 2 times, most recently from 5dcb2aa to ce5f69b Compare July 12, 2021 16:06

tbg and others added 2 commits July 12, 2021 13:04

irfansharif force-pushed the backport21.1-67315 branch from ce5f69b to 8885e3c Compare July 12, 2021 17:10

irfansharif merged commit 03b83e0 into cockroachdb:release-21.1 Jul 12, 2021

irfansharif deleted the backport21.1-67315 branch July 12, 2021 18:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

release-21.1: migration,keys: fix incorrect range descriptor iteration #67366

release-21.1: migration,keys: fix incorrect range descriptor iteration #67366

irfansharif commented Jul 8, 2021 •

edited

Loading

cockroach-teamcity commented Jul 8, 2021

irfansharif commented Jul 12, 2021

release-21.1: migration,keys: fix incorrect range descriptor iteration #67366

release-21.1: migration,keys: fix incorrect range descriptor iteration #67366

Conversation

irfansharif commented Jul 8, 2021 • edited Loading

cockroach-teamcity commented Jul 8, 2021

irfansharif commented Jul 12, 2021

irfansharif commented Jul 8, 2021 •

edited

Loading