db: reduce Ingest I/O with manifest+db locks #2112

jbowens · 2022-11-09T19:03:30Z

To identify the target level of M ingested files in a LSM with N levels, we may need to seek in O(N × M) sstables to look for data overlap. This is all performed during application while holding DB.mu and the manifest lock, which prevents concurrent flushes and compactions from scheduling or completing while the mutex is held.

Ingest performs these data overlap checks while holding the manifest lock for a consistent view of the LSM. However, for the purposes of detecting data overlap, ingest could be optimistic about its view of the LSM.

When Ingest acquires the manifest lock during the apply step, Ingest can exclude any sstables or memtables with largest sequence numbers already observed during data overlap calculations. Ingest only needs to search in any new sstables for more recent overlap introduced since its optimistic scan.

Related to #25 and this TODO to reduce overlap checks by using L0 sublevels.

Also related to #1683 which will remove the strict requirement of preventing file boundary overlap, making target level a function of data overlap exclusively.

Jira issue: PEBBLE-173

The text was updated successfully, but these errors were encountered:

jbowens · 2023-03-31T21:40:51Z

This may be used to improve the estimate of ApproxIngestedIntoL0Bytes for flushable ingests. The precomputed data overlap can be used to determine which sstables overlapped L0 before entering the commit pipeline.

The IngestWithStats ingestion entrypoint returnrs statistics about the ingest. Cockroach uses these statistics to inform admission control. During a flushable ingest, at commit time the number of bytes that will be ingested into L0 is unknown. This commit adds an approximation based on which tables overlapped the flushable queue. This required adjusting the memtable overlap logic to avoid short circuiting as soon as overlap is found. This approximation is rough and may be improved with future work, such as cockroachdb#2112. Close cockroachdb#2421. Informs cockroachdb#2112.

The IngestWithStats ingestion entrypoint returnrs statistics about the ingest. Cockroach uses these statistics to inform admission control. During a flushable ingest, at commit time the number of bytes that will be ingested into L0 is unknown. This commit adds an approximation based on which tables overlapped the flushable queue. This required adjusting the memtable overlap logic to avoid short circuiting as soon as overlap is found. This approximation is rough and may be improved with future work, such as #2112. Close #2421. Informs #2112.

The IngestWithStats ingestion entrypoint returnrs statistics about the ingest. Cockroach uses these statistics to inform admission control. During a flushable ingest, at commit time the number of bytes that will be ingested into L0 is unknown. This commit adds an approximation based on which tables overlapped the flushable queue. This required adjusting the memtable overlap logic to avoid short circuiting as soon as overlap is found. This approximation is rough and may be improved with future work, such as cockroachdb#2112. Close cockroachdb#2421. Informs cockroachdb#2112.

The IngestWithStats ingestion entrypoint returnrs statistics about the ingest. Cockroach uses these statistics to inform admission control. During a flushable ingest, at commit time the number of bytes that will be ingested into L0 is unknown. This commit adds an approximation based on which tables overlapped the flushable queue. This required adjusting the memtable overlap logic to avoid short circuiting as soon as overlap is found. This approximation is rough and may be improved with future work, such as #2112. Close #2421. Informs #2112.

jbowens · 2023-05-12T19:53:21Z

I suspect in these graphs we see ingestions holding up the commit pipeline, causing a divergence between the WAL fsync latency and the raft log commit latency.

When there exist no keys beneath a compaction's key range, Pebble performs sequence number zeroing. This is an optimization that allows for easier detection of new user keys during iteration. This commit introduces a new FileMetadata field LargestSeqNumAbsolute that provides an upper bound on the sequence numbers of a sstables' keys before they were zeroed. This is useful as a stable upper bound on the recency of an sstable's keys. In this commit we use this new upper bound to provide an alternative solution to the interaction between delete-only compactions and sequence number zeroing. Previously any compaction that zeroed sequence numbers and overlapped a delete-only compaction hint was required to clear the conflicting hints to ensure a delete-only compaction did not accidentally drop a table containing keys more recent than any of the hint's constituent tombstones. This interaction was a bit indirect and subtle. Encoding the pre-zeroing sequence number on the file metadata is more direct, and will allow us to use the sequence number for recency ordering of sstables' keys more generally, including in cockroachdb#2112.

When there exist no keys beneath a compaction's key range, Pebble performs sequence number zeroing. This is an optimization that allows for easier detection of new user keys during iteration. This commit introduces a new FileMetadata field LargestSeqNumAbsolute that provides an upper bound on the sequence numbers of a sstables' keys before they were zeroed. This is useful as a stable upper bound on the recency of an sstable's keys. In this commit we use this new upper bound to provide an alternative solution to the interaction between delete-only compactions and sequence number zeroing. Previously any compaction that zeroed sequence numbers and overlapped a delete-only compaction hint was required to clear the conflicting hints to ensure a delete-only compaction did not accidentally drop a table containing keys more recent than any of the hint's constituent tombstones. This interaction was a bit indirect and subtle. Encoding the pre-zeroing sequence number on the file metadata is more direct, and will allow us to use the sequence number for recency ordering of sstables' keys more generally, including in cockroachdb#2112. When the database is closed and re-opened, the new field LargestSeqNumAbsolute is initialized to LargestSeqNum for all existing sstables. This means that LargestSeqNumAbsolute only provides an upper bound of a sstables' keys' sequence numbers over the lifetime of the database instance in the current process. This is sufficient for many use cases, including delete-only compaction accounting. The reason this is sufficient in the delete-only compaction use case is subtle. The problem we're seeking to avoid is a range tombstone [start,end)#n deleting a key k#m such that s ≤ k < e and m ≥ n. Because of the sequence number invariant, the range tombstone can never fall beneath the key k that it does not delete within the LSM. However, our in-memory delete-only compaction hints are not atomically updated with transformations of the LSM. They represent the state of the LSM at a single instant when the table stats collector observed range deletion(s) within a particular file. This stale view of the LSM is what necessitates a mechanism like LargestSeqNumAbsolute to avoid erroroneous applications of a deletion hint. After a database restart, none of the previous instance's in-memory delete-only compactions hints exist. The table stats collector must re-populate the hints by scanning range deletions in sstables in the background. However, because the sequence number invariant prevents inversion of sequence numbers across process restarts, any hints we construct from the LSM will be correct with respect to that view of the LSM.

When there exist no keys beneath a compaction's key range, Pebble performs sequence number zeroing. This is an optimization that allows for easier detection of new user keys during iteration. This commit introduces a new FileMetadata field LargestSeqNumAbsolute that provides an upper bound on the sequence numbers of a sstables' keys before they were zeroed. This is useful as a stable upper bound on the recency of an sstable's keys. In this commit we use this new upper bound to provide an alternative solution to the interaction between delete-only compactions and sequence number zeroing. Previously any compaction that zeroed sequence numbers and overlapped a delete-only compaction hint was required to clear the conflicting hints to ensure a delete-only compaction did not accidentally drop a table containing keys more recent than any of the hint's constituent tombstones. This interaction was a bit indirect and subtle. Encoding the pre-zeroing sequence number on the file metadata is more direct, and will allow us to use the sequence number for recency ordering of sstables' keys more generally, including in cockroachdb#2112. When the database is closed and re-opened, the new field LargestSeqNumAbsolute is initialized to LargestSeqNum for all existing sstables. This means that LargestSeqNumAbsolute only provides an upper bound of a sstables' keys' sequence numbers over the lifetime of the database instance in the current process. This is sufficient for many use cases, including delete-only compaction accounting. The reason this is sufficient in the delete-only compaction use case is subtle. The problem we're seeking to avoid is a range tombstone [start,end)#n deleting a key k#m such that s ≤ k < e and m ≥ n. Because of the LSM sequence number invariant, the range tombstone can never fall into a lower level than k. However, our in-memory delete-only compaction hints are not atomically updated with transformations of the LSM. They represent the state of the LSM at a single instant when the table stats collector observed range deletion(s) within a particular file. This stale view of the LSM is what necessitates a mechanism like LargestSeqNumAbsolute to avoid erroroneous applications of a deletion hint. After a database restart, none of the previous instance's in-memory delete-only compactions hints exist. The table stats collector must re-populate the hints by scanning range deletions in sstables in the background. However, because the sequence number invariant prevents inversion of sequence numbers across process restarts, any hints we construct from the LSM will be correct with respect to that view of the LSM.

When there exist no keys beneath a compaction's key range, Pebble performs sequence number zeroing. This is an optimization that allows for easier detection of new user keys during iteration. This commit introduces a new FileMetadata field LargestSeqNumAbsolute that provides an upper bound on the sequence numbers of a sstables' keys before they were zeroed. This is useful as a stable upper bound on the recency of an sstable's keys. In this commit we use this new upper bound to provide an alternative solution to the interaction between delete-only compactions and sequence number zeroing. Previously any compaction that zeroed sequence numbers and overlapped a delete-only compaction hint was required to clear the conflicting hints to ensure a delete-only compaction did not accidentally drop a table containing keys more recent than any of the hint's constituent tombstones. This interaction was a bit indirect and subtle. Encoding the pre-zeroing sequence number on the file metadata is more direct, and will allow us to use the sequence number for recency ordering of sstables' keys more generally, including in cockroachdb#2112.

Refactor the overlap checking to avoid using the levelIter to populate the range deletion iterator. This was subtle and confusing. This refactor will also benefit the implementation of cockroachdb#2112. Informs cockroachdb#2112. Informs cockroachdb#2863.

Refactor the overlap checking to avoid using the levelIter to populate the range deletion iterator. This was subtle and confusing. This refactor will also benefit the implementation of #2112. Informs #2112. Informs #2863.

When there exist no keys beneath a compaction's key range, Pebble performs sequence number zeroing. This is an optimization that allows for easier detection of new user keys during iteration. This commit introduces a new FileMetadata field LargestSeqNumAbsolute that provides an upper bound on the sequence numbers of a sstables' keys before they were zeroed. This is useful as a stable upper bound on the recency of an sstable's keys. In this commit we use this new upper bound to provide an alternative solution to the interaction between delete-only compactions and sequence number zeroing. Previously any compaction that zeroed sequence numbers and overlapped a delete-only compaction hint was required to clear the conflicting hints to ensure a delete-only compaction did not accidentally drop a table containing keys more recent than any of the hint's constituent tombstones. This interaction was a bit indirect and subtle. Encoding the pre-zeroing sequence number on the file metadata is more direct, and will allow us to use the sequence number for recency ordering of sstables' keys more generally, including in #2112. When the database is closed and re-opened, the new field LargestSeqNumAbsolute is initialized to LargestSeqNum for all existing sstables. This means that LargestSeqNumAbsolute only provides an upper bound of a sstables' keys' sequence numbers over the lifetime of the database instance in the current process. This is sufficient for many use cases, including delete-only compaction accounting. The reason this is sufficient in the delete-only compaction use case is subtle. The problem we're seeking to avoid is a range tombstone [start,end)#n deleting a key k#m such that s ≤ k < e and m ≥ n. Because of the LSM sequence number invariant, the range tombstone can never fall into a lower level than k. However, our in-memory delete-only compaction hints are not atomically updated with transformations of the LSM. They represent the state of the LSM at a single instant when the table stats collector observed range deletion(s) within a particular file. This stale view of the LSM is what necessitates a mechanism like LargestSeqNumAbsolute to avoid erroroneous applications of a deletion hint. After a database restart, none of the previous instance's in-memory delete-only compactions hints exist. The table stats collector must re-populate the hints by scanning range deletions in sstables in the background. However, because the sequence number invariant prevents inversion of sequence numbers across process restarts, any hints we construct from the LSM will be correct with respect to that view of the LSM.

RaduBerinde · 2024-05-15T01:30:13Z

This is my current thinking:

refactor the code to stop using levelIter. We can just find the overlapping files and check for overlap with those.
store an "overlap cache" in each FileMetadata. The overlap cache consists of a small number of non-overlapping UserKeyBounds; each one is either a known empty region of the file, or a known "full" region of the file (either just a point, or a range corresponding to a range del or range key).
when we check with overlap with a FileMetadata, we first take a look at the overlap cache and see if it can give us an answer (i.e. if we are completely inside of a "known empty" region, or if we intersect with any "known full" region). If not, we check for overlap and add one or two regions in the cache (one for any empty region we find, and one for any full region we find).

When restoring a table, we can expect many external ingestions in the same area so the cache should help across ingestions too. In any case, to minimize I/O while holding mutexes, we will do an optimistic "dry run" without the mutex; then, if the version has changed, we run it again under the mutex, and it will benefit from all the cached regions from the dry run.

jbowens · 2024-05-15T16:25:41Z

We can just find the overlapping files and check for overlap with those.

If a file's bounds are wholly contained within the ingested sstable bounds, I think we should also assume the file is non-empty and has data overlap without checking. During imports, ingested sstables can be wide in keyspace and we wouldn't want to actually look for data overlap in every overlapping file.

jbowens added the C-performance label Nov 9, 2022

blathers-crl bot added A-storage T-storage labels Nov 9, 2022

jbowens mentioned this issue Mar 31, 2023

db: populate return statistics for flushable ingests #2425

Merged

jbowens mentioned this issue Apr 3, 2023

crl-release-23.1: db: populate return statistics for flushable ingests #2429

Merged

itsbilal mentioned this issue Jun 5, 2023

*: add IngestAndExcise operation #2538

Merged

jbowens mentioned this issue Jun 16, 2023

db: investigate elevated batch commit tail latencies #2646

Closed

jbowens mentioned this issue Aug 9, 2023

*: implement EventuallyFileOnlySnapshots #2775

Merged

itsbilal mentioned this issue Aug 14, 2023

snapshot: avoid memtable iteration when holding db mutex during EFOS creation #2824

Open

jbowens self-assigned this May 1, 2024

jbowens mentioned this issue May 2, 2024

db: track files' pre-zeroing largest sequence numbers #3576

Merged

jbowens mentioned this issue May 2, 2024

db: refactor ingest overlap checking #3580

Merged

RaduBerinde self-assigned this May 14, 2024

RaduBerinde mentioned this issue May 16, 2024

ingest: perform LSM overlap checks outside of DB.mu #3626

Merged

RaduBerinde unassigned jbowens Jun 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

db: reduce Ingest I/O with manifest+db locks #2112

db: reduce Ingest I/O with manifest+db locks #2112

jbowens commented Nov 9, 2022 •

edited by cockroach-jira-scripts

Loading

jbowens commented Mar 31, 2023

jbowens commented May 12, 2023

RaduBerinde commented May 15, 2024

jbowens commented May 15, 2024

db: reduce Ingest I/O with manifest+db locks #2112

db: reduce Ingest I/O with manifest+db locks #2112

Comments

jbowens commented Nov 9, 2022 • edited by cockroach-jira-scripts Loading

jbowens commented Mar 31, 2023

jbowens commented May 12, 2023

RaduBerinde commented May 15, 2024

jbowens commented May 15, 2024

jbowens commented Nov 9, 2022 •

edited by cockroach-jira-scripts

Loading