migration: introduce primitives for below-raft migrations #58088
Conversation
Looks pretty good! I didn't see anything substantial, so I think this is just a question of some more ardent polishing.
Reviewed 57 of 57 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)
pkg/kv/db.go, line 685 at r1 (raw file):
// Migrate proactively forces ranges overlapping with the provided keyspace to // transition out of any legacy modes of operation (as defined by the target
the phrasing of the comments around Migrate could use a clarification of its semantics. This is given a target version, so you'd think that it would somehow run from whatever version it's at to the target version, but it doesn't do that (because it's a primitive used in executing a single version step). So it's really more a way to attach some below-raft code to a particular version.
pkg/kv/kvserver/client_migration_test.go, line 79 at r1 (raw file):
desc := tc.RemoveVotersOrFatal(t, k, tc.Target(n2)) unregister := batcheval.TestingRegisterMigrationInterceptor(migrationVersion, func() {})
add a comment why this is needed.
pkg/kv/kvserver/client_migration_test.go, line 101 at r1 (raw file):
// Check to see that the replica was purged from n2. require.Nil(t, store.LookupReplica(roachpb.RKey(k)))
nice.
pkg/kv/kvserver/client_migration_test.go, line 115 at r1 (raw file):
blockSnapshotsCh := make(chan struct{}) knobs, ltk := makeReplicationTestKnobs() ltk.storeKnobs.ReceiveSnapshot = func(h *kvserver.SnapshotRequest_Header) error {
make this closure idempotent, so that a double-close here won't happen (as it would crash the whole pkg)
There's no way to make the snapshot we're waiting for uniquely identifiable, right? We could at least check that it's for the right range.
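A minimal sketch of the idempotency suggestion, assuming the knob's job is to unblock the test and then hold the snapshot (the channel names come from the test snippets quoted above, the blocking half of the original knob is assumed, and this needs the sync import):

```go
// Guard the close with sync.Once so that a second incoming snapshot can't
// double-close the channel and panic the whole package.
var once sync.Once
ltk.storeKnobs.ReceiveSnapshot = func(h *kvserver.SnapshotRequest_Header) error {
	once.Do(func() { close(blockUntilSnapshotCh) })
	<-blockSnapshotsCh // hold the snapshot until the test releases it
	return nil
}
```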
pkg/kv/kvserver/client_migration_test.go, line 143 at r1 (raw file):
// added. <-blockUntilSnapshotCh desc := tc.LookupRangeOrFatal(t, k)
this uses a consistent kv lookup, right? Otherwise it could be racy.
pkg/kv/kvserver/client_migration_test.go, line 152 at r1 (raw file):
// Migrate the scratch range. We expect this to hang given the in-flight // snapshot is held up. err := tc.Server(n1).DB().Migrate(ctx, desc.StartKey, desc.EndKey, migrationVersion)
How long does it hang?
pkg/kv/kvserver/client_migration_test.go, line 153 at r1 (raw file):
// snapshot is held up. err := tc.Server(n1).DB().Migrate(ctx, desc.StartKey, desc.EndKey, migrationVersion) require.Error(t, err)
Check the type of the error or, if we can't get this to work, string match. (But I think the type stuff should work now thanks to EncodedError).
pkg/kv/kvserver/client_migration_test.go, line 236 at r1 (raw file):
// command application on n3. err := tc.Server(n1).DB().Migrate(ctx, desc.StartKey, desc.EndKey, endV) require.Error(t, err)
ditto
pkg/kv/kvserver/client_replica_test.go, line 3610 at r1 (raw file):
key := tc.ScratchRange(t) require.NoError(t, tc.WaitForSplitAndInitialization(key))
Huh, never saw this one before. We don't want to fold this into ScratchRange?
pkg/kv/kvserver/replica_application_result.go, line 297 at r1 (raw file):
func (r *Replica) handleVersionResult(ctx context.Context, version *roachpb.Version) { if (*version == roachpb.Version{}) {
not nil?
pkg/kv/kvserver/replica_proposal.go, line 776 at r1 (raw file):
r.mu.RUnlock() if !usingAppliedStateKey { // The range applied state was introduced in v21.1. The cluster
It was not introduced in v21.1, but sometime earlier.
pkg/kv/kvserver/store.go, line 2827 at r1 (raw file):
alloc, err := qp.Acquire(ctx, 1) if err != nil { retErr = err
use g.GoCtx(func(ctx context.Context) error { return err }) and then you don't have to bother with retErr?
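A sketch of that suggestion, assuming g is the surrounding ctxgroup.Group from the quoted snippet; the acquisition error is surfaced through the group rather than a separate retErr variable:

```go
alloc, err := qp.Acquire(ctx, 1)
if err != nil {
	// Let the group report the acquisition error; no retErr bookkeeping
	// needed. Returning false stops visiting further replicas.
	g.GoCtx(func(ctx context.Context) error { return err })
	return false
}
```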
pkg/kv/kvserver/store.go, line 2834 at r1 (raw file):
defer alloc.Release() processed, err := s.replicaGCQueue.process(ctx, repl, cfg)
process doesn't even use the system config (it's forced into it only because of queueImpl). I think we should make use of that and pass nil here to avoid a gossip dependency.
pkg/kv/kvserver/store.go, line 2836 at r1 (raw file):
processed, err := s.replicaGCQueue.process(ctx, repl, cfg) if err != nil { return err
errors.Wrapf(err, "on %s", repl.Desc())
pkg/kv/kvserver/store_create_replica.go, line 188 at r1 (raw file):
if s.ClusterSettings().Version.IsActive(ctx, clusterversion.ReplicaVersions) { version := s.ClusterSettings().Version.ActiveVersion(ctx).Version repl.mu.state.Version = &version
This strikes me as problematic. Anything in repl.mu.state must match on-disk state, and here there's no on-disk state. I think the version should be unspecified until a snapshot arrives. This is going to be mildly awkward, but we can probably apply this idea here in spirit whenever we look at the replica's versions, which will then be the right place to do so. If we need to do this multiple times, it should be easy enough to come up with a helper.
pkg/kv/kvserver/store_snapshot.go, line 627 at r1 (raw file):
} // We gate any incoming snapshots that were generated from a replica, that
s/gate/reject/
pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 1007 at r1 (raw file):
var replicaVersion roachpb.Version if rec.ClusterSettings().Version.IsActive(ctx, clusterversion.ReplicaVersions) { replicaVersion = rec.ClusterSettings().Version.ActiveVersion(ctx).Version
I don't think this is right. The version needs to be that of the LHS (and the split needs to declare a read latch on the version, and the migrate cmd a write). As is, we could conceivably split a v10 LHS that has passed the migrate check and create a v9 RHS, and then bump the active version to v10 even though we didn't upgrade the v9 rhs.
pkg/kv/kvserver/batcheval/cmd_migrate.go, line 41 at r1 (raw file):
// cumbersome. We could spruce up the migration type and allow authors to // define the allow authors for specific set of keys each migration needs to // grab latches and locks over.
needs rw latch for range version key.
Also, read latch on the range descriptor. You're accessing the rangeID which technically comes from the descriptor.
pkg/kv/kvserver/batcheval/cmd_migrate.go, line 61 at r1 (raw file):
} // Migrate executes the below-raft migration corresponding to the given version.
put a reference to the exhaustive comment (not sure where the best place to put it is, I like MigrateRequest)
pkg/kv/kvserver/kvserverpb/state.proto, line 89 at r1 (raw file):
// initialized (either during bootstrap, or through snapshots, or splits), // it's annotated with a replica version. During bootstrap, replicas source // the active cluster version at the time. For replicas created through
Will need an update as I commented on that code elsewhere.
pkg/kv/kvserver/stateloader/initial.go, line 105 at r1 (raw file):
replicaVersion roachpb.Version, ) error { initialTruncStateType := TruncatedStateUnreplicated
nit: const
pkg/kv/kvserver/stateloader/initial.go, line 111 at r1 (raw file):
// WriteInitialRangeStateWithTruncatedState is the same as // WriteInitialRangeState, but allows the caller to override the truncated // state.
type
pkg/migration/manager.go, line 126 at r1 (raw file):
// upgrades) are running binaries that know about the version. // Bumping the version gates only after running the migration itself is
This makes it seem as though we migrate-before-bump only to accommodate below-raft migrations, but it's because of the general invariant we want: if version X is active, it is fully active. If we roll out the migration after activating the version, then what good is the version? You would have to wait for X+1 instead, so you arrive at the same place but with an annoying one-off. You're really just explaining the details of below-raft migrations here, which is useful, but seems like the wrong place. At a high level, it seems that on line 105 above, you could instead write
// First, run the actual migration. The cluster version bump will be rolled out below. This
// means that when a migration for version X first starts running and for some time after
// its completion, parts of the cluster might correspond to version X-1, and others to version X.
// However, once version X has fully been rolled out, there is certainty that all data has been
// migrated, and in particular any legacy behavior is no longer exercised.
and put the detail on below-raft migrations in a more appropriate place (the helpers related to below-raft migrations)?
pkg/migration/manager.go, line 128 at r1 (raw file):
// Bumping the version gates only after running the migration itself is // important for below-raft migrations. Below-raft migrations mutate // replica state, making use of the Migrate(version=V) primitive when
which they issue against the entire keyspace.
pkg/migration/migrations.go, line 81 at r1 (raw file):
// TODO(irfansharif): Introduce a `system.migrations` table, and populate it here. func (m *Migration) Run(ctx context.Context, h *Helper) (err error) { ctx = logtags.AddTag(ctx, fmt.Sprintf("migration=%s", h.ClusterVersion()), nil)
🤔 why not AddTag(ctx, "migration", h.ClusterVersion())?
pkg/migration/migrations.go, line 90 at r1 (raw file):
} // defaultPageSize controls how many ranges are paged in by default when
range descriptors
pkg/migration/migrations.go, line 121 at r1 (raw file):
// TODO(irfansharif): Instead of logging this to the debug log, we // should insert these into a `system.migrations` table for external
I'm not even so sure about that any more! I would be curious how @knz thinks this should ultimately work. A special migrations logging channel?
pkg/migration/migrations.go, line 135 at r1 (raw file):
// Make sure that all stores have synced. Given we're a below-raft // migrations, this ensures that the applied state is flushed to disk.
Hrm, unfortunately, the following could happen:
- replica X applies the change, we return from the migration above
- node w/ replica X restarts, unapplying the change (it wasn't synced)
- the flush request below syncs, but the cmd isn't applied yet
I think we just have to run this whole method in the context of an everynode.
pkg/migration/migrations.go, line 155 at r1 (raw file):
op := fmt.Sprintf("purge-outdated-replicas=%s", req.Version) if err := h.EveryNode(ctx, op, func(ctx context.Context, client serverpb.MigrationClient) error { _, err := client.PurgeOutdatedReplicas(ctx, req)
I guess this one needs to flush too, and does need to do so in the same EveryNode. Makes me wonder if EveryNode is still the right name, since we appear to need its stabilizing property all around. h.UntilClusterStable?
pkg/migration/migrations_test.go, line 74 at r1 (raw file):
s.VisitReplicas(func(repl *kvserver.Replica) bool { err = f(repl) return err == nil
// wantMore
pkg/migration/migrations_test.go, line 241 at r1 (raw file):
}) // Register the below raft migration.
Is it still ok to register these now? I thought it would have to be done before starting up the system.
pkg/migration/migrations_test.go, line 247 at r1 (raw file):
// Register the top-level migration. migrated := false unregister := migration.TestingRegisterMigrationInterceptor(endCV, func(ctx context.Context, h *migration.Helper) error {
ditto
pkg/migration/migrations_test.go, line 264 at r1 (raw file):
} if err := g.Wait(); !testutils.IsError(err, fmt.Sprintf("snapshot generated at %s, only accepting versions >= %s", startCV, endCV)) {
can you make this error typed and assert on the type (errors.Is)? I believe it should round-trip now.
pkg/migration/migrations_test.go, line 331 at r1 (raw file):
require.NoError(t, err) if got := repl.Version(); got != endCV.Version {
This is flaky in theory if repl happens to be a follower.
pkg/roachpb/api.go, line 1329 at r1 (raw file):
return isWrite | isRange | isAlone | isUnsplittable | canBackpressure } func (*MigrateRequest) flags() int { return isWrite | isRange | isAlone }
yep, there's the isAlone.
pkg/server/migration.go, line 173 at r1 (raw file):
ctx = logtags.AddTag(ctx, "purge-outdated-replicas", nil) if err := m.server.node.stores.VisitStores(func(s *kvserver.Store) error {
Awkward complication: stores can be added asynchronously, so we have to find a way to determine that the async bootstrap has happened. I hate that this is necessary but it is.
pkg/server/migration_test.go, line 231 at r1 (raw file):
} s.Stopper().Stop(context.Background())
why not defer a few lines up? You're leaking the stopper on fatals now.
pkg/server/serverpb/migration.proto, line 74 at r1 (raw file):
// FlushAllEngines is used to instruct the target node to flush all its // engines. rpc FlushAllEngines (FlushAllEnginesRequest) returns (FlushAllEnginesResponse) { }
do we actually want to flush? I think we just want to sync?
Thanks for the super detailed review, that was really good ha. I think I've addressed all of it. The one flaking test was from our snapshot gating mechanism, but turns out that's not actually needed (explained in comments).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)
pkg/kv/db.go, line 685 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
the phrasing of the comments around Migrate could use a clarification of its semantics. This is given a target version, so you'd think that it would somehow run from whatever version it's at to the target version, but it doesn't do that (because it's a primitive used in executing a single version step). So it's really more a way to attach some below-raft code to a particular version.
Done.
pkg/kv/kvserver/client_migration_test.go, line 79 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
add a comment why this is needed.
Done.
pkg/kv/kvserver/client_migration_test.go, line 115 at r1 (raw file):
make this closure idempotent
Done. I don't think we need to check that it's for the right range, right? We're using the base.ReplicationManual mode, so there are no snapshots other than the one we're generating. Also, I'm not sure how to cleanly check for it being the right range when we'll only know the descriptor below, when creating the scratch range.
pkg/kv/kvserver/client_migration_test.go, line 143 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
this uses a consistent kv lookup, right? Otherwise it could be racy.
Yup.
pkg/kv/kvserver/client_migration_test.go, line 152 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
How long does it hang?
Too long, changed it out to 50ms (cargo-culting from what you did here).
pkg/kv/kvserver/client_migration_test.go, line 153 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Check the type of the error or, if we can't get this to work, string match. (But I think the type stuff should work now thanks to EncodedError).
Done (the type didn't match context.DeadlineExceeded 🤷♀️)
pkg/kv/kvserver/client_migration_test.go, line 236 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
ditto
Done.
pkg/kv/kvserver/client_replica_test.go, line 3610 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Huh, never saw this one before. We don't want to fold this into ScratchRange?
Don't actually need it here (was just cargo culting as usual). I'm not sure why it isn't already folded in. Maybe for tests that disable the split queues?
pkg/kv/kvserver/replica_application_result.go, line 297 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
not nil?
Only called when non-nil, structurally identical to the other callsites within handleNonTrivialReplicatedEvalResult. This is actually dead-code, we'll only ever have non-nil versions passed in as part of the replica state when there's a below-raft migration happening (so with a non-zero version attached). I had it written this way to convince myself that we're never setting in-memory version states unless the cluster version gating the use of replica.Version was active (and so it was being funneled in from elsewhere where we presumably checked for that condition). Do you prefer to skip it entirely? Add a comment? Assertion?
pkg/kv/kvserver/replica_proposal.go, line 776 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
It was not introduced in v21.1, but sometime earlier.
Done.
pkg/kv/kvserver/store.go, line 2827 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
use g.GoCtx(func(ctx context.Context) error { return err }) and then you don't have to bother with retErr?
Done.
pkg/kv/kvserver/store.go, line 2834 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
process doesn't even use the system config (it's forced into it only because of queueImpl). I think we should make use of that and pass nil here to avoid a gossip dependency.
Done.
pkg/kv/kvserver/store.go, line 2836 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
errors.Wrapf(err, "on %s", repl.Desc())
Done.
pkg/kv/kvserver/store_create_replica.go, line 188 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This strikes me as problematic. Anything in repl.mu.state must match on-disk state, and here there's no on-disk state. I think the version should be unspecified until a snapshot arrives. This is going to be mildly awkward, but we can probably apply this idea here in spirit whenever we look at the replica's versions, which will then be the right place to do so. If we need to do this multiple times, it should be easy enough to come up with a helper.
Done, and makes sense. I only had this here to be able to gate incoming snapshots instantiating outdated replicas (so we could compare the snapshot's version with the store's version), but we could've been checking for the store's active cluster version there anyway, so I'm not sure anymore why I had it. Removed.
pkg/kv/kvserver/store_snapshot.go, line 627 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
s/gate/reject/
No longer applicable.
pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 1007 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I don't think this is right. The version needs to be that of the LHS (and the split needs to declare a read latch on the version, and the migrate cmd a write). As is, we could conceivably split a v10 LHS that has passed the migrate check and create a v9 RHS, and then bump the active version to v10 even though we didn't upgrade the v9 rhs.
Done, thanks for catching this. The split already declares a read span over all replicated range-ID keys (of which the version key is one).
pkg/kv/kvserver/batcheval/cmd_migrate.go, line 41 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
needs rw latch for range version key.
Also, read latch on the range descriptor. You're accessing the rangeID which technically comes from the descriptor.
Done, but I'm not sure I follow why declaring a read latch over the range descriptor is necessary. What's the possible hazard if we don't declare it?
pkg/kv/kvserver/batcheval/cmd_migrate.go, line 61 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
put a reference to the exhaustive comment (not sure where the best place to put it is, I like MigrateRequest)
Done.
pkg/kv/kvserver/kvserverpb/state.proto, line 89 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Will need an update as I commented on that code elsewhere.
Done.
pkg/kv/kvserver/stateloader/initial.go, line 105 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: const
Done.
pkg/kv/kvserver/stateloader/initial.go, line 111 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
type
Done.
pkg/migration/manager.go, line 126 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This makes it seem as though we migrate-before-bump only to accommodate below-raft migrations, but it's because of the general invariant we want: if version X is active, it is fully active. If we roll out the migration after activating the version, then what good is the version? You would have to wait for X+1 instead, so you arrive at the same place but with an annoying one-off. You're really just explaining the details of below-raft migrations here, which is useful, but seems like the wrong place. At a high level, it seems that on line 105 above, you could instead write
// First, run the actual migration. The cluster version bump will be rolled out below. This
// means that when a migration for version X first starts running and for some time after
// its completion, parts of the cluster might correspond to version X-1, and others to version X.
// However, once version X has fully been rolled out, there is certainty that all data has been
// migrated, and in particular any legacy behavior is no longer exercised.
and put the detail on below-raft migrations in a more appropriate place (the helpers related to below-raft migrations)?
Done. As for moving the comment block, I'm not sure I'm finding a better place for it. On the one hand it's a bit odd here, but it's also now a single page with full commentary on the entire machinery for long-running migrations, which is nice.
pkg/migration/manager.go, line 128 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
which they issue against the entire keyspace.
Done.
pkg/migration/migrations.go, line 81 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
🤔 why not AddTag(ctx, "migration", h.ClusterVersion())?
Doh, done.
pkg/migration/migrations.go, line 90 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
range descriptors
Done.
pkg/migration/migrations.go, line 121 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I'm not even so sure about that any more! I would be curious how @knz thinks this should ultimately work. A special migrations logging channel?
Referring to #58183 now instead.
pkg/migration/migrations.go, line 135 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Hrm, unfortunately, the following could happen:
- replica X applies the change, we return from the migration above
- node w/ replica X restarts, unapplying the change (it wasn't synced)
- the flush request below syncs, but the cmd isn't applied yet
I think we just have to run this whole method in the context of an everynode.
Done, PTAL.
pkg/migration/migrations.go, line 155 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I guess this one needs to flush too, and does need to do so in the same EveryNode. Makes me wonder if EveryNode is still the right name, since we appear to need its stabilizing property all around. h.UntilClusterStable?
Done.
pkg/migration/migrations_test.go, line 74 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
// wantMore
Done (with the named return type).
pkg/migration/migrations_test.go, line 241 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Is it still ok to register these now? I thought it would have to be done before starting up the system.
Yup, but it's kind of a shitty pattern to use because we're mutating global state. If there are multiple tests running all registering to the same version, there'll be a confusing panic. I haven't done anything about it in this review pass, but I think plumbing a testing knob in is in order.
pkg/migration/migrations_test.go, line 247 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
ditto
See above.
pkg/migration/migrations_test.go, line 264 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
can you make this error typed and assert on the type (errors.Is)? I believe it should round-trip now.
No longer applicable.
pkg/migration/migrations_test.go, line 331 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is flaky in theory if repl happens to be a follower.
Shouldn't be, the migration will bump the version on all followers. But this test was flaky for another reason: we need to ensure all nodes have heartbeat their liveness records at least once before attempting a migration, lest we consider them unavailable.
pkg/server/migration.go, line 173 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Awkward complication: stores can be added asynchronously, so we have to find a way to determine that the async bootstrap has happened. I hate that this is necessary but it is.
Done. Ha, I'm learning a lot from how paranoid your reviews are. I should be less sloppy.
pkg/server/migration_test.go, line 231 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
why not defer a few lines up? You're leaking the stopper on fatals now.
No good reason, done.
pkg/server/serverpb/migration.proto, line 74 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
do we actually want to flush? I think we just want to sync?
Good point, done.
245553a to f0c8afd
Reviewed 25 of 25 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif and @nvanbenschoten)
pkg/kv/kvserver/client_migration_test.go, line 153 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Done (the type didn't match context.DeadlineExceeded 🤷♀️)
Ah, it's a wrapped grpc error with code codes.DeadlineExceeded, but status.Code(err) doesn't know how to unwrap through to the wrapped grpc error. It's not worth improving; this is fine.
pkg/kv/kvserver/replica_application_result.go, line 297 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Only called when non-nil, structurally identical to the other callsites within handleNonTrivialReplicatedEvalResult. This is actually dead-code, we'll only ever have non-nil versions passed in as part of the replica state when there's a below-raft migration happening (so with a non-zero version attached). I had it written this way to convince myself that we're never setting in-memory version states unless the cluster version gating the use of replica.Version was active (and so it was being funneled in from elsewhere where we presumably checked for that condition). Do you prefer to skip it entirely? Add a comment? Assertion?
Assertion sounds right.
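A hedged sketch of what such an assertion could look like in handleVersionResult (the locking and field assignment mirror the snippet quoted earlier; not necessarily the merged code):

```go
func (r *Replica) handleVersionResult(ctx context.Context, version *roachpb.Version) {
	if (*version == roachpb.Version{}) {
		// Should be unreachable: below-raft migrations only ever attach a
		// non-zero version to the replicated eval result.
		log.Fatalf(ctx, "not expecting empty replica version downstream of raft")
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	r.mu.state.Version = version
}
```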
pkg/kv/kvserver/replica_proposal.go, line 776 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Done.
Doesn't look done - I meant that the range applied state was not introduced in v21.1. It was introduced a lot earlier. It is first enforced on all ranges in 21.1.
pkg/kv/kvserver/replica_write.go, line 243 at r3 (raw file):
// [1]: See PurgeOutdatedReplicas from the Migration service. // [2]: pkg/migration desc := r.Desc()
Add something to the comment about the races. It is possible that between the proposal returning and the call to r.Desc() below, the descriptor has changed. But the only thing that matters is that r.Desc() is at least as up to date as the descriptor the command applied on previously. If a replica got removed - fine, waitForApplication will fail; we will have to cope with that. If one got added - it was either already a learner when we migrated (in which case waitForApplication will get it) or, thanks to Migrate read latching on the range descriptor, the replication change serialized after the migration and so the snapshot will also include the migration (by virtue of being started after a replica change txn).
Hmm, in a cluster that is constantly changing its replica sets, I wonder if we can get into a situation in which a Migrate command never manages to complete - all it takes is a single range in each attempt to throw things off. I wonder if an error on waitForApplication should lead to a retry of just that RPC instead of propagating an error for the whole migrate invocation.
pkg/kv/kvserver/store.go, line 2824 at r3 (raw file):
return err }) return false
Can you explain the return false here? On errors below, you keep going. What does it mean if the quota pool doesn't let us acquire?
pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 974 at r3 (raw file):
} replicaVersion, err := MakeStateLoader(rec).LoadVersion(ctx, rec.Engine())
This is the third (at least) stateloader in this code; are there any issues with taking the opportunity to declare it once? This also makes sure that my TODO above, wondering why this is using rec.Engine(), is not needed multiple times.
pkg/kv/kvserver/batcheval/cmd_migrate.go, line 41 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Done, but I'm not sure I follow why declaring a read latch over the range descriptor is necessary. What's the possible hazard if we don't declare it?
Assume we run a migration that declares no keys at all. It could run concurrently with a split. So you could be reading rangeID 15 here but the split finishes first, and propagates the old version to the RHS. Yet, you return claiming to have migrated the whole range, but the right hand side of the split has the old version still (and isn't migrated).
I have to admit that I don't 100% feel sure that I completely grok this. Usually, a span-based command (i.e. one that uses EndKey) declares its span [Key, EndKey) and so automatically it's not going to interact weirdly with a split. But here we are intentionally not doing that, so we need another mechanism that makes sure that we have migrated every range that touches our key span, because that is what we promise to do. A read latch on the range descriptor makes sure that nobody changes the keyspan while we migrate.
I wouldn't mind if @nvanbenschoten came back to this comment and sanity checked me here at some point.
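For concreteness, a rough sketch of the latches under discussion, with a deliberately simplified declare-keys signature (the real hook takes more arguments): a write latch on the range-local version key plus a read latch on the range descriptor.

```go
// Sketch only: Migrate writes the range version key, so it takes a write
// latch there; the read latch on the range descriptor keeps a concurrent
// split from carving out an un-migrated RHS underneath the migration.
func declareKeysMigrate(desc *roachpb.RangeDescriptor, latchSpans *spanset.SpanSet) {
	latchSpans.AddNonMVCC(spanset.SpanReadWrite, roachpb.Span{
		Key: keys.RangeVersionKey(desc.RangeID),
	})
	latchSpans.AddNonMVCC(spanset.SpanReadOnly, roachpb.Span{
		Key: keys.RangeDescriptorKey(desc.StartKey),
	})
}
```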
pkg/kv/kvserver/kvserverpb/state.proto, line 87 at r3 (raw file):
// that all replica versions are bumped along side it. For everything else, // the coupling is neither necessary nor enforced. When a new replica is being // initialized (either during bootstrap, or through snapshots, or splits),
I'm not sure we refer to any of this as bootstrap. Let's explicitly say that replicas are either created at cluster creation time (in which case they source the binary version, i.e. the cluster version of that new cluster), or they inherit the version from either the snapshot or the LHS of the split.
pkg/kv/kvserver/kvserverpb/state.proto, line 89 at r3 (raw file):
// initialized (either during bootstrap, or through snapshots, or splits), // it's annotated with a replica version. During bootstrap, replicas source // the binary version. For replicas created through snapshots or splits, they
(i.e. the version the cluster is bootstrapped with)
pkg/migration/helper.go, line 124 at r3 (raw file):
// (b) We'll invoke the closure // (c) We'll retrieve the list of node IDs again to account for the // possibility of a new node being added during (b)
or a node restarting
pkg/migration/manager.go, line 146 at r3 (raw file):
// forward. // // That still leaves rooms for replicas in the GC queue to evade
replica GC queue
pkg/migration/migrations.go, line 150 at r3 (raw file):
// truncated state and the range applied state. We're sure to also durably // persist any changes made in the same closure. Doing so in separate // UntilClusterStable closure runs the (small) risk that a node might have
nit: s/runs/would run/
pkg/server/migration.go, line 161 at r3 (raw file):
for _, eng := range m.server.engines { batch := eng.NewBatch() if err := batch.Commit(true); err != nil {
Can someone from storage check that this is doing the right thing? Seems like the kind of operation that might accidentally be optimized into a noop. @sumeerbhola or @jbowens maybe?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif, @nvanbenschoten, and @sumeerbhola)
pkg/server/migration.go, line 161 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Can someone from storage check that this is doing the right thing? Seems like the kind of operation that might accidentally be optimized into a noop. @sumeerbhola or @jbowens maybe?
Good call @tbg. An empty batch skips the WAL: https://github.com/cockroachdb/pebble/blob/108cd0260986c5972d20bd807a4d7192d912c97a/commit.go#L241-L243
You can use batch.LogData(nil) to add a log-only entry to the batch and ensure the WAL is synced.
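A minimal sketch of that suggestion against the storage.Engine interface (the helper name is illustrative, not the PR's code):

```go
// syncEngineWAL commits a batch carrying a log-only entry. An entirely empty
// batch would skip the WAL, so LogData(nil) guarantees there's something for
// Commit(true /* sync */) to actually sync.
func syncEngineWAL(eng storage.Engine) error {
	batch := eng.NewBatch()
	defer batch.Close()
	if err := batch.LogData(nil); err != nil {
		return err
	}
	return batch.Commit(true /* sync */)
}
```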
f0c8afd to 512875f
Sweet, thanks for the review Tobi! Will merge on green 🤞
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/kv/kvserver/replica_write.go, line 243 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Add something to the comment about the races. It is possible that between the proposal returning and the call to r.Desc() below, the descriptor has changed. But the only thing that matters is that r.Desc() is at least as up to date as the descriptor the command applied on previously. If a replica got removed - fine, waitForApplication will fail; we will have to cope with that. If one got added - it was either already a learner when we migrated (in which case waitForApplication will get it) or, thanks to Migrate read latching on the range descriptor, the replication change serialized after the migration and so the snapshot will also include the migration (by virtue of being started after a replica change txn).
Hmm, in a cluster that is constantly changing its replica sets, I wonder if we can get into a situation in which a Migrate command never manages to complete - all it takes is a single range in each attempt to throw things off. I wonder if an error on waitForApplication should lead to a retry of just that RPC instead of propagating an error for the whole migrate invocation.
Done, and added a TODO here. I expect a fair bit of shaking out to happen as we onboard more migrations in earnest, and I'll whittle it down then.
pkg/kv/kvserver/store.go, line 2824 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Can you explain the return false here? On errors below, you keep going. What does it mean if the quota pool doesn't let us acquire?
Errors from the quota pool, quickly skimming through it, seem to only come about when the pool is already closed (we're not doing that here) or configured with zero total capacity (also not applicable). Seeing as how it's "dead code", short of asserting on no errors, terminating early seems appropriate. As for the errors below, they'll only come about when individually processing replicas, which I figured shouldn't interfere much with the processing of other replicas.
pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 974 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is the third (at least) stateloader in this code; are there any issues with taking the opportunity to declare it once? This also makes sure that my TODO above, wondering why this is using rec.Engine(), is not needed multiple times.
Done.
pkg/kv/kvserver/kvserverpb/state.proto, line 87 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I'm not sure we refer to any of this as bootstrap. Let's explicitly say that replicas are either created at cluster creation time (in which case they source the binary version, i.e. the cluster version of that new cluster), or they inherit the version from either the snapshot or the LHS of the split.
Done.
pkg/kv/kvserver/kvserverpb/state.proto, line 89 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
(i.e. the version the cluster is bootstrapped with)
Done.
pkg/migration/helper.go, line 124 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
or a node restarting
Done.
pkg/migration/manager.go, line 146 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
replica GC queue
Done.
pkg/migration/migrations.go, line 150 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: s/runs/would run/
Done.
pkg/server/migration.go, line 161 at r3 (raw file):
Previously, jbowens (Jackson Owens) wrote…
Good call @tbg. An empty batch skips the WAL: https://github.com/cockroachdb/pebble/blob/108cd0260986c5972d20bd807a4d7192d912c97a/commit.go#L241-L243
You can use batch.LogData(nil) to add a log-only entry to the batch and ensure the WAL is synced.
Done, thanks for verifying Jackson.
512875f to 1d1d5a8
and onboard the first long-running migration (for the truncated state, see below). The primitives here include:

- The KV ranged `Migrate` command. This command forces all ranges overlapping with the request spans to execute the (below-raft) migrations corresponding to the specific, stated version. This has the effect of moving the ranges out of any legacy modes of operation they may currently be in. KV waits for this command to durably apply on all the replicas before returning, guaranteeing to the caller that all pre-migration state has been completely purged from the system.

- The `SyncAllEngines` RPC. This is used to instruct the target node to persist relevant in-memory state to disk. Like we mentioned above, KV currently waits for the `Migrate` command to have applied on all replicas before returning. With the applied state, there's no strict necessity to durably persist it (the representative version is already stored in the raft log). Out of an abundance of caution, and to really ensure that no pre-migrated state is ever seen in the system, we provide the migration manager a mechanism to flush out all in-memory state to disk. This will let us guarantee that by the time a specific cluster version is bumped, all pre-migrated state from prior to a certain version will have been fully purged from the system. We'll also use it in conjunction with `PurgeOutdatedReplicas` below.

- The `PurgeOutdatedReplicas` RPC. This too comes up in the context of wanting to ensure that ranges over which we've executed a ranged `Migrate` command have no way of ever surfacing pre-migrated state. This can happen with older replicas in the replica GC queue and with applied state that is not yet persisted. Currently we wait for the `Migrate` command to have applied on all replicas of a range before returning to the caller. This does not include earlier incarnations of the range, possibly sitting idle in the replica GC queue. These replicas can still request leases, and go through the request evaluation paths, possibly tripping up assertions that check that no pre-migrated state is found. The `PurgeOutdatedReplicas` RPC lets the migration manager do exactly as the name suggests, ensuring all "outdated" replicas are processed before declaring the specific cluster version bump complete.

- The concept of a "replica state version". This is what's used to construct the migration manager's view of what's "outdated", telling us which migrations can be assumed to have run against a particular replica. When we introduce backwards-incompatible changes to the replica state (for example using the unreplicated truncated state instead of the replicated variant), the version informs us whether, for a given replica, we should expect a state representation prior to, or after, the migration (in our example this corresponds to whether or not we can assume an unreplicated truncated state).

As part of this commit, we also re-order the steps taken by the migration manager so that it executes a given migration first, before bumping version gates cluster-wide. This is because we want authors of migrations to ascertain that their own migrations have run to completion, instead of attaching that check to the next version.

---

This PR motivates all of the above by also onboarding the TruncatedAndRangeAppliedState migration, which lets us do the following:

i. Use the RangeAppliedState on all ranges
ii. Use the unreplicated TruncatedState on all ranges

In 21.2 we'll finally be able to delete holdover code that knows how to handle the legacy replicated truncated state.

Release note (general change): Cluster version upgrades, as initiated by SET CLUSTER SETTING version = <major>-<minor>, now perform internal maintenance duties that will delay how long it takes for the command to complete. The delay is proportional to the amount of data currently stored in the cluster. The cluster will also experience a small amount of additional load during this period while the upgrade is being finalized.

---

The ideas here follow from our original prototype in cockroachdb#57445.
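Pieced together from the snippets quoted in this review, a hedged sketch of how these primitives compose inside a single migration step (the helper surface is approximated from the quoted test and migration code, EveryNode was later discussed as UntilClusterStable, and this is not the merged implementation):

```go
func runBelowRaftMigration(
	ctx context.Context,
	h *migration.Helper,
	db *kv.DB,
	sp roachpb.Span,
	v roachpb.Version,
	req *serverpb.PurgeOutdatedReplicasRequest,
) error {
	// Force every range overlapping sp to execute the below-raft migration
	// for v; KV returns only once the command has applied on all replicas.
	if err := db.Migrate(ctx, sp.Key, sp.EndKey, v); err != nil {
		return err
	}
	// Then, on every node, purge replicas still carrying pre-migration state
	// (e.g. un-GC'ed ones) and sync the engines, so that pre-migrated state
	// can't resurface after a restart.
	op := fmt.Sprintf("purge-outdated-replicas=%s", v)
	return h.EveryNode(ctx, op, func(ctx context.Context, client serverpb.MigrationClient) error {
		_, err := client.PurgeOutdatedReplicas(ctx, req)
		return err
	})
}
```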
1d1d5a8 to 89781b1
bors r+
Build succeeded.
Fixes cockroachdb#58378. Fixes cockroachdb#62267.

Previously it was possible for us to have replicas in-memory, with pre-migrated state, even after a migration was finalized. This led to the kind of badness we were observing in cockroachdb#62267, where it appeared that a replica was not using the applied state key despite us having migrated into it (see TruncatedAndRangeAppliedState, introduced in cockroachdb#58088).

---

To see how, consider the following set of events:
- Say r42 starts off on n1, n2, and n3
- n3 flaps and so we place a replica for r42 on n4
- n3's replica, r42/3, is now GC-able, but still un-GC-ed
- We run the applied state migration, first migrating all ranges into it and then purging outdated replicas
- Well, we should want to purge r42/3, because it's un-migrated and evaluating anything on it (say a lease request) is unsound, given we've bumped version gates that tell the kvserver to always expect post-migration state
- What happens when we try to purge r42/3? Previous to this PR, if it didn't have a replica version, we'd skip over it (!)
- Was it possible for r42/3 to not have a replica version? Shouldn't it have been accounted for when we migrated all ranges? No, and that's precisely why the migration infrastructure purges outdated replicas. The migrate request only returns once it's applied on all followers; in our example that wouldn't include r42/3 since it was no longer one
- The stop-gap in cockroachdb#60429 made it so that we didn't GC r42/3, when we should've been doing the opposite. When iterating over a store's replicas for purging purposes, an empty replica version is fine and expected; we should interpret that as a signal that we're dealing with a replica that was obviously never migrated (to even start using replica versions in the first place). Because it didn't have a valid replica version installed, we can infer that it's soon to be GC-ed (else we wouldn't have been able to finalize the applied state + replica version migration)
- The conditions above made it possible for us to evaluate requests on replicas with migration state out-of-date relative to the store's version
- Boom

Release note: None
60835: kv: bump timestamp cache to Pushee.MinTimestamp on PUSH_ABORT r=nvanbenschoten a=nvanbenschoten

Fixes #60779. Fixes #60580.

We were only checking that the batch header timestamp was equal to or greater than this pushee's min timestamp, so this is as far as we can bump the timestamp cache.

62832: geo: minor performance improvement for looping over edges r=otan a=andyyang890

This patch slightly improves the performance of many spatial builtins by storing the number of edges used in the loop conditions of for loops into a variable. We discovered this was taking a lot of time when profiling the point-in-polygon optimization.

Release note: None

62838: kvserver: purge gc-able, unmigrated replicas during migrations r=irfansharif a=irfansharif

Fixes #58378. Fixes #62267.

Previously it was possible for us to have replicas in-memory, with pre-migrated state, even after a migration was finalized. This led to the kind of badness we were observing in #62267, where it appeared that a replica was not using the applied state key despite us having migrated into it (see TruncatedAndRangeAppliedState, introduced in #58088). (The full sequence of events is spelled out in the commit message quoted above.)

Release note: None

62839: zonepb: make subzone DiffWithZone more accurate r=ajstorm a=otan

* Subzones may be defined in a different order. We did not take this into account, which can cause bugs when e.g. ADD REGION adds a subzone at the end rather than in the old "expected" location in the subzones array. This has been fixed by comparing subzones using an unordered map.
* The ApplyZoneConfig we previously did overwrote subzone fields on the original subzone array element, meaning that if there was a mismatch it would not be reported through validation. This is now fixed by applying the expected zone config to *zonepb.NewZoneConfig() instead.
* Added logic to only check that the zone config matches subzones from active subzone IDs.
* Improved the error messaging when a subzone config is mismatching - namely, add index and partitioning information and differentiate between missing fields and missing / extraneous zone configs.

Resolves #62790

Release note (bug fix): Fixed validation bugs during ALTER TABLE ... SET LOCALITY / crdb_internal.validate_multi_region_zone_config where validation errors could occur when the database of a REGIONAL BY ROW table has a new region added. Also fixed a validation bug where partition zone config mismatches were not caught.

62872: build: use -json for RandomSyntax test r=otan a=rafiss

I'm hoping this will help out with an issue where the test failures seem to be missing helpful logs.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Andy Yang <ayang@cockroachlabs.com>
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Fixes cockroachdb#66544.

We fully migrated away from the replicated truncated state and the old range applied state keys in cockroachdb#58088. We're now guaranteed to never see the legacy truncated+applied state representations.

Release justification: cluster version cleanup
Release note: None
Fixes cockroachdb#66544.

We fully migrated away from the replicated truncated state and the old range applied state keys (RaftAppliedIndexLegacyKey, LeaseAppliedIndexLegacyKey, RangeStatsLegacyKey) in cockroachdb#58088. We're now guaranteed to never see the legacy truncated+applied state representations, making this code (finally) safe to delete.

Release justification: cluster version cleanup
Release note: None
Fixes cockroachdb#66544.

We fully migrated away from the replicated truncated state and the old range applied state keys (RaftAppliedIndexLegacyKey, LeaseAppliedIndexLegacyKey, and RangeStatsLegacyKey) in cockroachdb#58088. We're now guaranteed to never see the legacy truncated+applied state representations, making this code (finally) safe to delete.

Release justification: cluster version cleanup
Release note: None

---

This turned out to be pretty hairy and invasive. I'll wait for a few CI runs and inspect the diffs here a bit more closely. Given the surface area here, I'm curious if we feel this is worth getting in during stability. Some questions to consider:
- The various legacy keys that were removed were removed in their entirety. Our CLI to scan over store keys will no longer recognize these keys. Is that ok?
- I don't think we need any checks to make sure we're not re-using some of these keys in the future -- but I'd like other opinions.
- Looks like we re-purposed the LocalRaftTruncatedStateLegacySuffix for the unreplicated truncated key as well, so just renamed it more appropriately.
- v21.1 nodes have to interop with v20.2 nodes where the legacy representation was possible, and it identified such cases through booleans in proto messages (SnapshotRequest.Header.UseUnreplicatedTruncatedState for example). This means that we still need to explicitly set them in v21.2 code. Crucially we'll stop reading these fields in v21.2 code, letting us remove these fields entirely in v22.1.
- The treatment above does not extend to ReplicaState.UsingAppliedStateKey; that field was only being used to trigger an idempotent migration to start using the newer applied state key representation, not as a way to express that the replica should expect the legacy representation. Since cockroachdb#58088 has already migrated every range in the system, we no longer need to trigger anything, and this field can be removed.
Fixes cockroachdb#66544. Fixes cockroachdb#66834.

We fully migrated away from the replicated truncated state and the old range applied state keys (RaftAppliedIndexLegacyKey, LeaseAppliedIndexLegacyKey, and RangeStatsLegacyKey) in cockroachdb#58088. We're now guaranteed to never see the legacy truncated+applied state representations, making this code (finally) safe to delete.

Release justification: cluster version cleanup
Release note: None

---

This turned out to be pretty hairy and invasive. I'll wait for a few CI runs and inspect the diffs here a bit more closely. Given the surface area here, I'm curious if we feel this is worth getting in during stability. Some questions to consider:
- The various legacy keys that were removed were removed in their entirety. Our CLI to scan over store keys will no longer recognize these keys. Is that ok?
- I don't think we need any checks to make sure we're not re-using some of these keys in the future -- but I'd like other opinions.
- Looks like we re-purposed the LocalRaftTruncatedStateLegacySuffix for the unreplicated truncated key as well, so just renamed it more appropriately.
- v21.1 nodes have to interop with v20.2 nodes where the legacy representation was possible, and it identified such cases through booleans in proto messages (SnapshotRequest.Header.UseUnreplicatedTruncatedState for example). This means that we still need to explicitly set them in v21.2 code. Crucially we'll stop reading these fields in v21.2 code, letting us remove these fields entirely in v22.1.
- The treatment above does not extend to ReplicaState.UsingAppliedStateKey; that field was only being used to trigger an idempotent migration to start using the newer applied state key representation, not as a way to express that the replica should expect the legacy representation. Since cockroachdb#58088 has already migrated every range in the system, we no longer need to trigger anything, and this field can be removed.
67737: opt: zigzag scan hints r=cucaroach a=cucaroach

Fixes #35814

Enable table@{FORCE_ZIGZAG} and table@{FORCE_ZIGZAG=index} hints to lower the cost of zigzag plans so that they will be preferred over other plans. Note that this is a breaking change: if a customer has an index called FORCE_ZIGZAG and is hinting with table@{FORCE_ZIGZAG}, that will no longer do what it did before. Not sure what the ramifications of that are.

Release note (sql change): Extend index scan hint to allow zigzag joins to be preferred.

69887: kv,migration: rm code handling legacy raft truncated state r=irfansharif a=irfansharif

Fixes #66544. Fixes #66834.

We fully migrated away from the replicated truncated state and the old range applied state keys (RaftAppliedIndexLegacyKey, LeaseAppliedIndexLegacyKey, and RangeStatsLegacyKey) in #58088. We're now guaranteed to never see the legacy truncated+applied state representations, making this code (finally) safe to delete.

Release justification: cluster version cleanup
Release note: None

---

This turned out to be pretty hairy and invasive. I'll wait for a few CI runs and inspect the diffs here a bit more closely. Given the surface area here, I'm curious if we feel this is worth getting in during stability. Some questions to consider:
- The various legacy keys that were removed were removed in their entirety. Our CLI to scan over store keys will no longer recognize these keys. Is that ok?
- I don't think we need any checks to make sure we're not re-using some of these keys in the future -- but I'd like other opinions.
- Looks like we re-purposed the LocalRaftTruncatedStateLegacySuffix for the unreplicated truncated key as well, so just renamed it more appropriately.
- v21.1 nodes have to interop with v20.2 nodes where the legacy representation was possible, and it identified such cases through booleans in proto messages (SnapshotRequest.Header.UseUnreplicatedTruncatedState for example). This means that we still need to explicitly set them in v21.2 code. Crucially we'll stop reading these fields in v21.2 code, letting us remove these fields entirely in v22.1.
- The treatment above does not extend to ReplicaState.UsingAppliedStateKey; that field was only being used to trigger an idempotent migration to start using the newer applied state key representation, not as a way to express that the replica should expect the legacy representation. Since #58088 has already migrated every range in the system, we no longer need to trigger anything, and this field can be removed.

70059: roachtest: make --workload optional r=mgartner a=mgartner

This commit makes the `--workload` flag optional because some roachtests do not require a workload binary.

Release note: None

70120: opt: match ScalarGroupBy in EliminateIndexJoinOrProjectInsideGroupBy r=mgartner a=mgartner

This commit extends EliminateIndexJoinOrProjectInsideGroupBy so that ScalarGroupBy expressions are also matched. This allows the rule to eliminate unnecessary index joins in more cases.

The primary motivation for this change was to make partial index validation queries more efficient. These queries always have ScalarGroupBy expressions because they are in the form:

    SELECT count(1) FROM table@partial_index WHERE predicate

Prior to this change, an index join was planned for these queries which would operate on every row in the partial index. This could be extremely expensive for large partial indexes.

Fixes #70116

Release note (performance improvement): A limitation has been fixed that made creating partial indexes inefficient.

70162: ui: fix default behaviours of columns on stmt page on cc console r=maryliag a=maryliag

When a CC Console user opens the Statement page for the first time (no cache was created for the column selector), this commit makes sure that the default columns will be displayed.

Fixes: #70160

Release justification: Category 4
Release note (bug fix): Default columns being displayed on Statements page on CC console when the user never made any selection

Co-authored-by: Tommy Reilly <treilly@cockroachlabs.com>
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Marylia Gutierrez <marylia@cockroachlabs.com>
72132: coldata: fix BenchmarkAppend so that it could be run with multiple count r=yuzefovich a=yuzefovich

Previously, we could get an index out of bounds because we'd allocate a larger `Bytes.data` slice than the `int32` offset can support.

Release note: None

72239: kvserver: resurrect using_applied_state_key in ReplicaState r=irfansharif a=irfansharif

Fixes #72029.

`using_applied_state_key` was previously used to check whether the Range had upgraded to begin using the `RangeAppliedState` key. When set to true, replicas receiving the state (through a `ReplicatedEvalResult`) knew to begin using the new key. In #58088 (in 21.1) we introduced a migration to iterate through all ranges in the system and have them start using the `RangeAppliedState` key. In 21.2, this field was always set to true -- 21.2 nodes were already using the `RangeAppliedState` key, or receiving messages from 21.1 nodes that were also using the `RangeAppliedState` key. In 21.2 (and in 21.1 for that matter) we didn't need to trigger the "if set to true in an incoming message, start using the `RangeAppliedState` key" code path.

When looking to get rid of this field in 22.1 (#70464), we observed that there was an unintentional read of this field in 21.2 nodes (see #72029 and #72222); the saga is as follows:
- Removing this field in 22.1 meant it was set as false when received at 21.2 nodes.
- This should've been fine! We weren't using this field to trigger any upgrade code paths (like it was originally intended for).
- It turns out that in 21.2 we were using the `ReplicaState` from the incoming snapshot to update our in-memory replica state.
- Because the proto field was being phased out, there was now a divergence between the on-disk state (field set to true, from earlier 21.2 operations) and the in-memory state (field set to false, because sent from a version that attempted to get rid of this field).

Removing proto fields from the replica state is not possible until we stop using the protobuf copy of the replica state when applying a snapshot (#72222). Once that's done, we should be able to stop sending the replica state as part of the snapshot in the subsequent release.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
72310: kvserver: remove unused snapshot log entry code r=erikgrinaker a=tbg

We used to have to send the raft log along with snapshots as a result of (in the olden days) requiring all replicas to agree on the truncated state. This hasn't been (generally) true as of v19.1 (#35701), though it was still a possibility until v22.1 (#58088). This commit removes the corresponding code.

Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
These include:
- The KV ranged `Migrate` command. This command forces all ranges overlapping with the request spans to execute the (below-raft) migrations corresponding to the specific, stated version. This has the effect of moving the ranges out of any legacy modes of operation they may currently be in. KV waits for this command to durably apply on all the replicas before returning, guaranteeing to the caller that all pre-migration state has been completely purged from the system. (Sketches illustrating these primitives appear after this list.)
- The `FlushAllEngines` RPC. This is used to instruct the target node to persist all in-memory state to disk. As mentioned above, KV currently waits for the `Migrate` command to have applied on all replicas before returning. With the applied state, there's no necessity to durably persist it (the representative version is already stored in the raft log). Out of an abundance of caution, and to really ensure that no pre-migrated state is ever seen in the system, we provide the migration manager a mechanism to flush out all in-memory state to disk. This lets us guarantee that by the time a specific cluster version is bumped, all pre-migrated state from prior versions will have been fully purged from the system.
- The `PurgeOutdatedReplicas` RPC. This too comes up in the context of wanting to ensure that ranges over which we've executed a ranged `Migrate` command have no way of ever surfacing pre-migrated state. This can happen with older replicas in the replica GC queue and with applied state that is not yet persisted. Currently we wait for the `Migrate` command to have applied on all replicas of a range before returning to the caller. This does not include earlier incarnations of the range, possibly sitting idle in the replica GC queue. These replicas can still request leases, and go through the request evaluation paths, possibly tripping up assertions that check that no pre-migrated state is found. The `PurgeOutdatedReplicas` RPC lets the migration manager do exactly as the name suggests, ensuring all "outdated" replicas are processed before declaring the specific cluster version bump complete.
- The concept of a "replica state version". This is what's used to construct the migration manager's view of what's "outdated", telling us which migrations can be assumed to have run against a particular replica. When we introduce backwards-incompatible changes to the replica state (for example using the unreplicated truncated state instead of the replicated variant), the version informs us whether, for a given replica, we should expect the state representation from before or after the migration (in our example this corresponds to whether or not we can assume an unreplicated truncated state).
- A gating mechanism for "outdated" snapshots. Below-raft migrations want to rely on the invariant that there are no extant replicas in the system that haven't seen the specific `Migrate` command. This is made difficult by the fact that it's possible for in-flight snapshots to instantiate replicas with pre-migrated state. To guard against this, we gate any incoming snapshots that were generated from a replica that (at the time) had a version less than our store's current active version.
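To make the interplay concrete, here is a minimal sketch in Go. It deliberately uses hypothetical, simplified types (`Version`, `KVDB`, `NodeClient`, `runBelowRaftMigration`) rather than the actual kv/kvserver/migration APIs; the signatures are illustrative only, but the ordering shown (migrate the keyspace, then purge outdated replicas and flush engines on every node) is the idea described above.

```go
package migrationsketch

import (
	"context"
	"fmt"
)

// Version is a hypothetical stand-in for a cluster version.
type Version struct {
	Major, Minor, Internal int32
}

// KVDB models just the piece of the KV client relevant here: the ranged
// Migrate request (illustrative signature, not the real one).
type KVDB interface {
	// Migrate forces all ranges overlapping [start, end) to apply the
	// below-raft migration associated with version, returning only once it
	// has durably applied on all replicas.
	Migrate(ctx context.Context, start, end []byte, version Version) error
}

// NodeClient models the per-node RPCs described above (again illustrative).
type NodeClient interface {
	PurgeOutdatedReplicas(ctx context.Context, version Version) error
	FlushAllEngines(ctx context.Context) error
}

// runBelowRaftMigration sketches the overall shape: migrate the whole
// keyspace first, then make sure no outdated replica or unflushed in-memory
// state can resurface a pre-migrated representation.
func runBelowRaftMigration(
	ctx context.Context, db KVDB, nodes []NodeClient, version Version, start, end []byte,
) error {
	if err := db.Migrate(ctx, start, end, version); err != nil {
		return fmt.Errorf("migrating ranges: %w", err)
	}
	for _, n := range nodes {
		if err := n.PurgeOutdatedReplicas(ctx, version); err != nil {
			return fmt.Errorf("purging outdated replicas: %w", err)
		}
		if err := n.FlushAllEngines(ctx); err != nil {
			return fmt.Errorf("flushing engines: %w", err)
		}
	}
	return nil
}
```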
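Along the same lines, a sketch of the snapshot gate, continuing the same hypothetical file (it reuses the `Version` type and `fmt` import from the sketch above); `SnapshotHeader` and `checkSnapshotVersion` are made-up names, but the check mirrors the last bullet: a snapshot generated at a replica version older than the receiving store's active version is refused.

```go
// SnapshotHeader is a simplified stand-in for the metadata a snapshot
// carries, including the sender replica's version at generation time.
type SnapshotHeader struct {
	RangeID        int64
	ReplicaVersion Version
}

// Less reports whether v is an older version than o.
func (v Version) Less(o Version) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	if v.Minor != o.Minor {
		return v.Minor < o.Minor
	}
	return v.Internal < o.Internal
}

// checkSnapshotVersion gates incoming snapshots: one generated from a replica
// whose version lags the store's active version could instantiate a replica
// with pre-migrated state, so it is rejected.
func checkSnapshotVersion(h SnapshotHeader, storeActiveVersion Version) error {
	if h.ReplicaVersion.Less(storeActiveVersion) {
		return fmt.Errorf(
			"rejecting snapshot for r%d: generated at replica version %v, store is at %v",
			h.RangeID, h.ReplicaVersion, storeActiveVersion,
		)
	}
	return nil
}
```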
As part of this change, we re-order the steps taken by the migration manager so that it executes a given migration first, before bumping version gates cluster wide. This is in consideration of the fact that the migration process can be arbitrarily long-running, and it'd be undesirable to prevent rebalancing/repair mechanisms from functioning during this time (which is what disallowing snapshots from older replica versions effectively does).
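A rough sketch of that re-ordered loop, continuing with the hypothetical types from the sketches above (`versionStep`, `clusterGate`, and `migrateTo` are illustrative names, not the real manager API): each step's migration runs to completion before the corresponding version gate is bumped cluster wide, so snapshot gating only kicks in after the long-running work is done.

```go
// migrationFn is the body of one (possibly long-running) migration.
type migrationFn func(ctx context.Context) error

// versionStep pairs a cluster version with the migration that must complete
// before that version can be activated; Migration may be nil.
type versionStep struct {
	Version   Version
	Migration migrationFn
}

// clusterGate abstracts bumping the cluster-wide version gate.
type clusterGate interface {
	BumpClusterVersion(ctx context.Context, v Version) error
}

// migrateTo walks the version steps in order. The migration for a step runs
// first; only once it completes is the gate bumped, which is what starts
// rejecting snapshots from older replica versions.
func migrateTo(ctx context.Context, gate clusterGate, steps []versionStep) error {
	for _, step := range steps {
		if step.Migration != nil {
			if err := step.Migration(ctx); err != nil {
				return err
			}
		}
		if err := gate.BumpClusterVersion(ctx, step.Version); err != nil {
			return err
		}
	}
	return nil
}
```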
This PR motivates all of the above by also onboarding the `TruncatedAndRangeAppliedState` migration, which lets us do the following:
i. Use the `RangeAppliedState` on all ranges
ii. Use the unreplicated `TruncatedState` on all ranges
In 21.2 we'll finally be able to delete holdover code that knows how to handle the legacy replicated truncated state.
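As a purely illustrative example (hypothetical version number and helper names, not the actual registration code), onboarding such a migration then amounts to pairing a version gate with a body that drives the primitives sketched earlier over the whole keyspace:

```go
// keyMin and keyMax stand in for the boundaries of the full keyspace.
var keyMin, keyMax = []byte{0x00}, []byte{0xff, 0xff}

// truncatedAndRangeAppliedStateStep shows how the TruncatedAndRangeAppliedState
// migration could be expressed with the hypothetical helpers above: every
// range executes the below-raft migration for this version, after which no
// replica can be using the replicated truncated state or legacy applied state.
func truncatedAndRangeAppliedStateStep(db KVDB, nodes []NodeClient) versionStep {
	v := Version{Major: 20, Minor: 2, Internal: 22} // hypothetical version number
	return versionStep{
		Version: v,
		Migration: func(ctx context.Context) error {
			return runBelowRaftMigration(ctx, db, nodes, v, keyMin, keyMax)
		},
	}
}
```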
Release note (general change): Cluster version upgrades, as initiated by SET CLUSTER SETTING version = -, now perform internal maintenance duties that will extend how long it takes for the command to complete. The delay is proportional to the amount of data currently stored in the cluster. The cluster will also experience a small amount of additional load during this period while the upgrade is being finalized.
The ideas here follow from our original prototype in #57445. This PR needs to be broken down into a few more piecemeal ones:
- `Migrate` (stand-alone commit)
- `FlushAllEngines`/`PurgeOutdatedReplicas` (stand-alone commit)
- `Manager.Migrate` changes (stand-alone commit)