
kvserver: Update snapshot strategy for disaggregated ingests #103028

Closed
itsbilal opened this issue May 10, 2023 · 3 comments · Fixed by #105839
Assignee: itsbilal
Labels:
- A-storage: Relating to our storage engine (Pebble) on-disk storage.
- C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception).
- T-storage: Storage Team.

Comments


itsbilal commented May 10, 2023

The kvBatchSnapshotStrategy needs to be updated to recognize when the sending and recipient nodes are both using the same shared.Storage and, if so, to use the new skip-shared iteration mode in db.ScanInternal to produce a list of shared sstable metadata as well as the flat non-shared sstables that sit on top of them.

On the recipient side, these shared sstable metadata structs can be ingested into Pebble using the new IngestAndExcise command (see cockroachdb/pebble#2520, which this issue depends on).
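
For illustration, here is a minimal, self-contained sketch of that end-to-end flow. Every name in it (`sharedSSTMeta`, `scanInternalSkipShared`, `ingestAndExcise`, `fakeDB`) is a hypothetical stand-in for the Pebble APIs referenced above; the real signatures live in github.com/cockroachdb/pebble and differ in detail.

```go
package main

import "fmt"

// sharedSSTMeta stands in for Pebble's shared-sstable metadata: just enough
// for a recipient to reference the file in shared storage without copying it.
type sharedSSTMeta struct {
	fileNum           uint64
	smallest, largest []byte
}

// snapshotPayload is what the sender streams: flat key/value pairs from the
// non-shared (higher) levels, plus metadata for files in shared levels.
type snapshotPayload struct {
	localKVs    map[string]string // keys found above the shared levels
	sharedFiles []sharedSSTMeta   // metadata-only references to shared files
}

// buildSharedSnapshot mimics the sender side: a skip-shared scan visits point
// keys above the shared levels and hands back shared-file metadata instead of
// iterating into the shared levels themselves.
func buildSharedSnapshot(db *fakeDB, lo, hi []byte) snapshotPayload {
	p := snapshotPayload{localKVs: map[string]string{}}
	db.scanInternalSkipShared(lo, hi,
		func(k, v string) { p.localKVs[k] = v },
		func(m sharedSSTMeta) { p.sharedFiles = append(p.sharedFiles, m) },
	)
	return p
}

// applySharedSnapshot mimics the recipient side: the span is excised and the
// shared metadata plus local keys are ingested in one atomic step.
func applySharedSnapshot(db *fakeDB, lo, hi []byte, p snapshotPayload) {
	db.ingestAndExcise(lo, hi, p.localKVs, p.sharedFiles)
}

// fakeDB is a toy engine so the sketch compiles and runs on its own.
type fakeDB struct {
	kvs    map[string]string
	shared []sharedSSTMeta
}

func (d *fakeDB) scanInternalSkipShared(
	lo, hi []byte, visitKV func(k, v string), visitShared func(sharedSSTMeta),
) {
	for k, v := range d.kvs {
		visitKV(k, v) // non-shared levels: surface every key/value
	}
	for _, m := range d.shared {
		visitShared(m) // shared levels: surface file metadata only
	}
}

func (d *fakeDB) ingestAndExcise(
	lo, hi []byte, kvs map[string]string, shared []sharedSSTMeta,
) {
	d.kvs, d.shared = kvs, shared // excise-then-ingest, atomic in real life
}

func main() {
	sender := &fakeDB{
		kvs:    map[string]string{"a": "1"},
		shared: []sharedSSTMeta{{fileNum: 7, smallest: []byte("a"), largest: []byte("z")}},
	}
	recipient := &fakeDB{kvs: map[string]string{}}
	payload := buildSharedSnapshot(sender, []byte("a"), []byte("z"))
	applySharedSnapshot(recipient, []byte("a"), []byte("z"), payload)
	fmt.Printf("recipient now references %d shared file(s)\n", len(recipient.shared))
}
```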

Jira issue: CRDB-27800

itsbilal added the C-enhancement, A-storage, and T-storage labels on May 10, 2023
itsbilal self-assigned this on May 30, 2023

itsbilal commented Jun 1, 2023

As mentioned in cockroachdb/pebble#2538, we should call the new IngestAndExcise even if there are no shared files, to avoid the need to write a range tombstone.
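
A sketch of the receiver-side choice this implies, assuming illustrative stand-in types and methods rather than the actual Pebble signatures:

```go
package sketch

type keyRange struct{ lo, hi []byte }

// engine abstracts the two ingestion paths; all three methods are
// illustrative stand-ins, not real Pebble method signatures.
type engine interface {
	ingestAndExcise(ssts []string, span keyRange) error
	writeRangeDel(span keyRange) error
	ingest(ssts []string) error
}

func applySnapshot(db engine, span keyRange, ssts []string, useExcise bool) error {
	if useExcise {
		// One atomic operation: excise [span.lo, span.hi) and ingest the new
		// sstables. No range tombstone is ever written, even when the
		// snapshot carries no shared files at all.
		return db.ingestAndExcise(ssts, span)
	}
	// Legacy path: clear the span with a range tombstone, then ingest. The
	// tombstone costs write bandwidth now and compaction work later.
	if err := db.writeRangeDel(span); err != nil {
		return err
	}
	return db.ingest(ssts)
}
```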

sumeerbhola commented

Do we have a list of all the users of Pebble snapshots and how they will be affected by excise removing data under them? Copying some text from cockroachdb/pebble#2424, where we started listing these and also discussing whether the use of Pebble snapshots was necessary.

Pebble snapshots will only be valid if we have not excised a keyspan from the LSM, which we will be doing when dropping a CockroachDB range. So the current uses of Pebble snapshots will need to ensure that the CockroachDB range is still present.
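
That requirement can be sketched as a guard; `rangeStillPresent` and `readFn` below are hypothetical callbacks, not real CockroachDB APIs:

```go
package sketch

import "errors"

// errRangeRemoved is a stand-in for whatever error the caller surfaces when
// the CockroachDB range was dropped (and its keyspan possibly excised).
var errRangeRemoved = errors.New("range removed; pebble snapshot no longer valid")

// readUnderSnapshot sketches the guard: an excise under a Pebble snapshot
// silently removes data beneath it, so the snapshot is only trustworthy
// while the CockroachDB range is still present.
func readUnderSnapshot(rangeStillPresent func() bool, readFn func() error) error {
	if !rangeStillPresent() {
		return errRangeRemoved
	}
	return readFn()
}
```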

itsbilal added a commit to cockroachdb/pebble that referenced this issue Jun 14, 2023
This change updates the ScanInternal function signature
to not use the internal keyspan.Key type. Instead we use
the rangekey.Key type alias that exports it.

Also update the pointCollapsingIter to just call SeekGE
in NextPrefix instead of panicking.

Necessary for cockroachdb/cockroach#103028.
itsbilal commented

@sumeerbhola Thanks for linking that! It looks like the TS maintenance queue no longer uses a snapshot. The use of a snapshot to send a KV snapshot is likely okay for the purposes of this issue, as we are guaranteed that we won't clear the replica until that replication is complete.

The only other use of a snapshot I can see is the GC queue; I couldn't find anything else in an audit of the code.

itsbilal added a commit to itsbilal/cockroach that referenced this issue Jun 29, 2023
If the sender node was created with a SharedStorage, switch to
fast ingestion where we ScanInternal() the keys not in shared
levels, and just share the metadata for files in shared levels.
The sender of the snapshot specifies in the Header that it
is using this ability, and the receiver rejects the snapshot if
it cannot accept shared snapshots.

If ScanInternal() returns an `ErrInvalidSkipSharedIteration`,
we switch back to old-style snapshots where the entirety
of the range is sent over the stream as SnapshotRequests.

Future changes will add better support for detection of when
different nodes point to different blob storage buckets / shared
storage locations, and incorporate that in rebalancing.

Fixes cockroachdb#103028.
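
The fallback described in that message might look roughly like this; `errInvalidSkipSharedIteration` is a stand-in for Pebble's `ErrInvalidSkipSharedIteration`, and both send helpers are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for pebble.ErrInvalidSkipSharedIteration.
var errInvalidSkipSharedIteration = errors.New("skip-shared iteration not possible")

// sendSnapshot tries the fast shared-sstable path first and, only on the
// sentinel error, falls back to streaming every key as ordinary
// SnapshotRequests.
func sendSnapshot(sendShared, sendFull func() error) error {
	err := sendShared()
	if errors.Is(err, errInvalidSkipSharedIteration) {
		// The LSM (or the receiver) cannot satisfy skip-shared iteration;
		// resend the whole range the old-style way.
		return sendFull()
	}
	return err
}

func main() {
	err := sendSnapshot(
		func() error { return errInvalidSkipSharedIteration },
		func() error { fmt.Println("fell back to full snapshot"); return nil },
	)
	fmt.Println("err:", err)
}
```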
itsbilal added a commit to itsbilal/cockroach that referenced this issue Jul 24, 2023
This change adds the ability to select for just the replicated
span in rditer.Select and rditer.IterateReplicaKeySpans. Also
adds a new rditer.IterateReplicaKeySpansShared that does a
ScanInternal on just the user key span, to be able to collect
metadata of shared sstables as well as any internal keys above
them.

We only use skip-shared iteration for the replicated user key span
of a range, and in practice, only if it's a non-system range.

Part of cockroachdb#103028.

Epic: none

Release note: None
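
A rough shape of that entry point, with hypothetical types and signature (the real function lives in pkg/kv/kvserver/rditer and takes CockroachDB descriptor and engine types):

```go
package sketch

import "errors"

// Simplified stand-ins for the CockroachDB types involved.
type rangeDescriptor struct {
	startKey, endKey []byte
	isSystemRange    bool
}

type sharedSSTMeta struct{ fileNum uint64 }

var errUseOrdinaryIteration = errors.New("fall back to IterateReplicaKeySpans")

// iterateReplicaKeySpansShared sketches the new rditer entry point: it only
// ever scans the replicated user keyspan, and only for non-system ranges.
func iterateReplicaKeySpansShared(
	desc *rangeDescriptor,
	scanInternal func(lo, hi []byte, visitShared func(sharedSSTMeta) error) error,
	visitShared func(sharedSSTMeta) error,
) error {
	if desc.isSystemRange {
		return errUseOrdinaryIteration
	}
	return scanInternal(desc.startKey, desc.endKey, visitShared)
}
```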
itsbilal added a commit to itsbilal/cockroach that referenced this issue Aug 1, 2023
This change updates pkg/storage interfaces and implementations to allow
the use of ScanInternal in skip-shared iteration mode as well as
writing/reading of internal point keys, range dels and range keys.
Replication / snapshot code will soon rely on these changes to
be able to replicate internal keys in higher levels plus metadata
of shared sstables in lower levels, as opposed to just observed
user keys.

Part of cockroachdb#103028

Epic: none

Release note: None
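
Conceptually, the interface additions amount to something like this sketch; all names are illustrative rather than the actual pkg/storage methods:

```go
package sketch

// internalKey is a simplified stand-in for an LSM internal key: a user key
// plus the sequence number and kind that user-level iteration normally hides.
type internalKey struct {
	userKey []byte
	seqNum  uint64
	kind    byte // point set, range del, range key set, ...
}

// internalReader extends an engine reader with skip-shared internal scans:
// internal keys above the shared levels, file metadata below them.
type internalReader interface {
	scanInternal(lo, hi []byte,
		visitPoint func(key internalKey, value []byte) error,
		visitRangeDel func(lo, hi []byte, seqNum uint64) error,
		visitSharedFile func(fileNum uint64) error,
	) error
}

// internalWriter extends a writer so the snapshot receiver can reproduce
// internal keys verbatim rather than rewriting them as user-level writes.
type internalWriter interface {
	putInternalPointKey(key internalKey, value []byte) error
	putInternalRangeDel(lo, hi []byte, seqNum uint64) error
}
```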
craig bot pushed a commit that referenced this issue Aug 10, 2023
107297: storage,kvserver: Foundational changes for disaggregated ingestions r=sumeerbhola a=itsbilal

This change contains two commits (split off from the original mega-PR, #105839). The first is a pkg/storage change to add new interface methods to call pebble's db.ScanInternal as well as implement related helper methods in sstable writers/batch readers/writers to be able to do disaggregated snapshot ingestion. The second is a kvserver/rditer change to allow finer-grained control on what replicated spans we iterate on, as well as to be able to specifically opt into skip-shared iteration over the user key span through the use of `ScanInternal`.

Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
itsbilal added a commit to itsbilal/cockroach that referenced this issue Aug 10, 2023
If the sender node was created with a SharedStorage, switch to
fast ingestion where we ScanInternal() the keys not in shared
levels, and just share the metadata for files in shared levels.
The sender of the snapshot specifies in the Header that it
is using this ability, and the receiver rejects the snapshot if
it cannot accept shared snapshots.

If ScanInternal() returns an `ErrInvalidSkipSharedIteration`,
we switch back to old-style snapshots where the entirety
of the range is sent over the stream as SnapshotRequests.

Future changes will add better support for detection of when
different nodes point to different blob storage buckets / shared
storage locations, and incorporate that in rebalancing.

Fixes cockroachdb#103028.

Release note (general change): Takes advantage of new CLI option,
`--experimental-shared-storage` to rebalance faster from node to node.
craig bot pushed a commit that referenced this issue Aug 16, 2023
105839: kvserver,storage: Update snapshot strategy to use shared storage r=sumeerbhola a=itsbilal

If the sender node was created with a SharedStorage, switch to fast ingestion where we ScanInternal() the keys not in shared levels, and just share the metadata for files in shared levels. The sender of the snapshot specifies in the Header that it is using this ability, and the receiver rejects the snapshot if it cannot accept shared snapshots.

If ScanInternal() returns an `ErrInvalidSkipSharedIteration`, we switch back to old-style snapshots where the entirety of the range is sent over the stream as SnapshotRequests.

Future changes will add better support for detection of when different nodes point to different blob storage buckets / shared storage locations, and incorporate that in rebalancing.

Fixes #103028.

Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
craig bot closed this as completed in 02a4b7e on Aug 16, 2023