Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

db: reduce Ingest I/O with manifest+db locks #2112

Open
jbowens opened this issue Nov 9, 2022 · 4 comments
Open

db: reduce Ingest I/O with manifest+db locks #2112

jbowens opened this issue Nov 9, 2022 · 4 comments

Comments

@jbowens
Copy link
Collaborator

jbowens commented Nov 9, 2022

To identify the target level of M ingested files in a LSM with N levels, we may need to seek in O(N × M) sstables to look for data overlap. This is all performed during application while holding DB.mu and the manifest lock, which prevents concurrent flushes and compactions from scheduling or completing while the mutex is held.

Ingest performs these data overlap checks while holding the manifest lock for a consistent view of the LSM. However, for the purposes of detecting data overlap, ingest could be optimistic about its view of the LSM.

When Ingest acquires the manifest lock during the apply step, Ingest can exclude any sstables or memtables with largest sequence numbers already observed during data overlap calculations. Ingest only needs to search in any new sstables for more recent overlap introduced since its optimistic scan.

Related to #25 and this TODO to reduce overlap checks by using L0 sublevels.

Also related to #1683 which will remove the strict requirement of preventing file boundary overlap, making target level a function of data overlap exclusively.

Jira issue: PEBBLE-173

@jbowens
Copy link
Collaborator Author

jbowens commented Mar 31, 2023

This may be used to improve the estimate of ApproxIngestedIntoL0Bytes for flushable ingests. The precomputed data overlap can be used to determine which sstables overlapped L0 before entering the commit pipeline.

jbowens added a commit to jbowens/pebble that referenced this issue Mar 31, 2023
The IngestWithStats ingestion entrypoint returnrs statistics about the ingest.
Cockroach uses these statistics to inform admission control. During a flushable
ingest, at commit time the number of bytes that will be ingested into L0 is
unknown. This commit adds an approximation based on which tables overlapped the
flushable queue. This required adjusting the memtable overlap logic to avoid
short circuiting as soon as overlap is found.

This approximation is rough and may be improved with future work, such as cockroachdb#2112.

Close cockroachdb#2421.
Informs cockroachdb#2112.
jbowens added a commit to jbowens/pebble that referenced this issue Apr 3, 2023
The IngestWithStats ingestion entrypoint returnrs statistics about the ingest.
Cockroach uses these statistics to inform admission control. During a flushable
ingest, at commit time the number of bytes that will be ingested into L0 is
unknown. This commit adds an approximation based on which tables overlapped the
flushable queue. This required adjusting the memtable overlap logic to avoid
short circuiting as soon as overlap is found.

This approximation is rough and may be improved with future work, such as cockroachdb#2112.

Close cockroachdb#2421.
Informs cockroachdb#2112.
jbowens added a commit that referenced this issue Apr 3, 2023
The IngestWithStats ingestion entrypoint returnrs statistics about the ingest.
Cockroach uses these statistics to inform admission control. During a flushable
ingest, at commit time the number of bytes that will be ingested into L0 is
unknown. This commit adds an approximation based on which tables overlapped the
flushable queue. This required adjusting the memtable overlap logic to avoid
short circuiting as soon as overlap is found.

This approximation is rough and may be improved with future work, such as #2112.

Close #2421.
Informs #2112.
jbowens added a commit to jbowens/pebble that referenced this issue Apr 3, 2023
The IngestWithStats ingestion entrypoint returnrs statistics about the ingest.
Cockroach uses these statistics to inform admission control. During a flushable
ingest, at commit time the number of bytes that will be ingested into L0 is
unknown. This commit adds an approximation based on which tables overlapped the
flushable queue. This required adjusting the memtable overlap logic to avoid
short circuiting as soon as overlap is found.

This approximation is rough and may be improved with future work, such as cockroachdb#2112.

Close cockroachdb#2421.
Informs cockroachdb#2112.
jbowens added a commit that referenced this issue Apr 3, 2023
The IngestWithStats ingestion entrypoint returnrs statistics about the ingest.
Cockroach uses these statistics to inform admission control. During a flushable
ingest, at commit time the number of bytes that will be ingested into L0 is
unknown. This commit adds an approximation based on which tables overlapped the
flushable queue. This required adjusting the memtable overlap logic to avoid
short circuiting as soon as overlap is found.

This approximation is rough and may be improved with future work, such as #2112.

Close #2421.
Informs #2112.
@jbowens
Copy link
Collaborator Author

jbowens commented May 12, 2023

Screenshot 2023-05-12 at 3 33 53 PM

I suspect in these graphs we see ingestions holding up the commit pipeline, causing a divergence between the WAL fsync latency and the raft log commit latency.

@jbowens jbowens self-assigned this May 1, 2024
jbowens added a commit to jbowens/pebble that referenced this issue May 2, 2024
When there exist no keys beneath a compaction's key range, Pebble performs
sequence number zeroing. This is an optimization that allows for easier
detection of new user keys during iteration. This commit introduces a new
FileMetadata field LargestSeqNumAbsolute that provides an upper bound on the
sequence numbers of a sstables' keys before they were zeroed. This is useful as
a stable upper bound on the recency of an sstable's keys.

In this commit we use this new upper bound to provide an alternative solution
to the interaction between delete-only compactions and sequence number zeroing.
Previously any compaction that zeroed sequence numbers and overlapped a
delete-only compaction hint was required to clear the conflicting hints to
ensure a delete-only compaction did not accidentally drop a table containing
keys more recent than any of the hint's constituent tombstones. This
interaction was a bit indirect and subtle. Encoding the pre-zeroing sequence
number on the file metadata is more direct, and will allow us to use the
sequence number for recency ordering of sstables' keys more generally,
including in cockroachdb#2112.
jbowens added a commit to jbowens/pebble that referenced this issue May 2, 2024
When there exist no keys beneath a compaction's key range, Pebble performs
sequence number zeroing. This is an optimization that allows for easier
detection of new user keys during iteration. This commit introduces a new
FileMetadata field LargestSeqNumAbsolute that provides an upper bound on the
sequence numbers of a sstables' keys before they were zeroed. This is useful as
a stable upper bound on the recency of an sstable's keys.

In this commit we use this new upper bound to provide an alternative solution
to the interaction between delete-only compactions and sequence number zeroing.
Previously any compaction that zeroed sequence numbers and overlapped a
delete-only compaction hint was required to clear the conflicting hints to
ensure a delete-only compaction did not accidentally drop a table containing
keys more recent than any of the hint's constituent tombstones. This
interaction was a bit indirect and subtle. Encoding the pre-zeroing sequence
number on the file metadata is more direct, and will allow us to use the
sequence number for recency ordering of sstables' keys more generally,
including in cockroachdb#2112.
jbowens added a commit to jbowens/pebble that referenced this issue May 2, 2024
When there exist no keys beneath a compaction's key range, Pebble performs
sequence number zeroing. This is an optimization that allows for easier
detection of new user keys during iteration. This commit introduces a new
FileMetadata field LargestSeqNumAbsolute that provides an upper bound on the
sequence numbers of a sstables' keys before they were zeroed. This is useful as
a stable upper bound on the recency of an sstable's keys.

In this commit we use this new upper bound to provide an alternative solution
to the interaction between delete-only compactions and sequence number zeroing.
Previously any compaction that zeroed sequence numbers and overlapped a
delete-only compaction hint was required to clear the conflicting hints to
ensure a delete-only compaction did not accidentally drop a table containing
keys more recent than any of the hint's constituent tombstones. This
interaction was a bit indirect and subtle. Encoding the pre-zeroing sequence
number on the file metadata is more direct, and will allow us to use the
sequence number for recency ordering of sstables' keys more generally,
including in cockroachdb#2112.

When the database is closed and re-opened, the new field LargestSeqNumAbsolute
is initialized to LargestSeqNum for all existing sstables. This means that
LargestSeqNumAbsolute only provides an upper bound of a sstables' keys'
sequence numbers over the lifetime of the database instance in the current
process. This is sufficient for many use cases, including delete-only
compaction accounting.

The reason this is sufficient in the delete-only compaction use case is subtle.
The problem we're seeking to avoid is a range tombstone [start,end)#n deleting
a key k#m such that s ≤ k < e and m ≥ n. Because of the sequence number
invariant, the range tombstone can never fall beneath the key k that it does
not delete within the LSM. However, our in-memory delete-only compaction hints
are not atomically updated with transformations of the LSM. They represent the
state of the LSM at a single instant when the table stats collector observed
range deletion(s) within a particular file. This stale view of the LSM is what
necessitates a mechanism like LargestSeqNumAbsolute to avoid erroroneous
applications of a deletion hint. After a database restart, none of the previous
instance's in-memory delete-only compactions hints exist. The table stats
collector must re-populate the hints by scanning range deletions in sstables in
the background. However, because the sequence number invariant prevents
inversion of sequence numbers across process restarts, any hints we construct
from the LSM will be correct with respect to that view of the LSM.
jbowens added a commit to jbowens/pebble that referenced this issue May 2, 2024
When there exist no keys beneath a compaction's key range, Pebble performs
sequence number zeroing. This is an optimization that allows for easier
detection of new user keys during iteration. This commit introduces a new
FileMetadata field LargestSeqNumAbsolute that provides an upper bound on the
sequence numbers of a sstables' keys before they were zeroed. This is useful as
a stable upper bound on the recency of an sstable's keys.

In this commit we use this new upper bound to provide an alternative solution
to the interaction between delete-only compactions and sequence number zeroing.
Previously any compaction that zeroed sequence numbers and overlapped a
delete-only compaction hint was required to clear the conflicting hints to
ensure a delete-only compaction did not accidentally drop a table containing
keys more recent than any of the hint's constituent tombstones. This
interaction was a bit indirect and subtle. Encoding the pre-zeroing sequence
number on the file metadata is more direct, and will allow us to use the
sequence number for recency ordering of sstables' keys more generally,
including in cockroachdb#2112.

When the database is closed and re-opened, the new field LargestSeqNumAbsolute
is initialized to LargestSeqNum for all existing sstables. This means that
LargestSeqNumAbsolute only provides an upper bound of a sstables' keys'
sequence numbers over the lifetime of the database instance in the current
process. This is sufficient for many use cases, including delete-only
compaction accounting.

The reason this is sufficient in the delete-only compaction use case is subtle.
The problem we're seeking to avoid is a range tombstone [start,end)#n deleting
a key k#m such that s ≤ k < e and m ≥ n. Because of the LSM sequence number
invariant, the range tombstone can never fall into a lower level than k.
However, our in-memory delete-only compaction hints are not atomically updated
with transformations of the LSM. They represent the state of the LSM at a
single instant when the table stats collector observed range deletion(s) within
a particular file. This stale view of the LSM is what necessitates a mechanism
like LargestSeqNumAbsolute to avoid erroroneous applications of a deletion
hint. After a database restart, none of the previous instance's in-memory
delete-only compactions hints exist. The table stats collector must re-populate
the hints by scanning range deletions in sstables in the background. However,
because the sequence number invariant prevents inversion of sequence numbers
across process restarts, any hints we construct from the LSM will be correct
with respect to that view of the LSM.
jbowens added a commit to jbowens/pebble that referenced this issue May 2, 2024
When there exist no keys beneath a compaction's key range, Pebble performs
sequence number zeroing. This is an optimization that allows for easier
detection of new user keys during iteration. This commit introduces a new
FileMetadata field LargestSeqNumAbsolute that provides an upper bound on the
sequence numbers of a sstables' keys before they were zeroed. This is useful as
a stable upper bound on the recency of an sstable's keys.

In this commit we use this new upper bound to provide an alternative solution
to the interaction between delete-only compactions and sequence number zeroing.
Previously any compaction that zeroed sequence numbers and overlapped a
delete-only compaction hint was required to clear the conflicting hints to
ensure a delete-only compaction did not accidentally drop a table containing
keys more recent than any of the hint's constituent tombstones. This
interaction was a bit indirect and subtle. Encoding the pre-zeroing sequence
number on the file metadata is more direct, and will allow us to use the
sequence number for recency ordering of sstables' keys more generally,
including in cockroachdb#2112.
jbowens added a commit to jbowens/pebble that referenced this issue May 2, 2024
Refactor the overlap checking to avoid using the levelIter to populate the
range deletion iterator. This was subtle and confusing. This refactor will also
benefit the implementation of cockroachdb#2112.

Informs cockroachdb#2112.
Informs cockroachdb#2863.
jbowens added a commit to jbowens/pebble that referenced this issue May 2, 2024
Refactor the overlap checking to avoid using the levelIter to populate the
range deletion iterator. This was subtle and confusing. This refactor will also
benefit the implementation of cockroachdb#2112.

Informs cockroachdb#2112.
Informs cockroachdb#2863.
jbowens added a commit to jbowens/pebble that referenced this issue May 3, 2024
Refactor the overlap checking to avoid using the levelIter to populate the
range deletion iterator. This was subtle and confusing. This refactor will also
benefit the implementation of cockroachdb#2112.

Informs cockroachdb#2112.
Informs cockroachdb#2863.
jbowens added a commit that referenced this issue May 3, 2024
Refactor the overlap checking to avoid using the levelIter to populate the
range deletion iterator. This was subtle and confusing. This refactor will also
benefit the implementation of #2112.

Informs #2112.
Informs #2863.
jbowens added a commit that referenced this issue May 3, 2024
When there exist no keys beneath a compaction's key range, Pebble performs
sequence number zeroing. This is an optimization that allows for easier
detection of new user keys during iteration. This commit introduces a new
FileMetadata field LargestSeqNumAbsolute that provides an upper bound on the
sequence numbers of a sstables' keys before they were zeroed. This is useful as
a stable upper bound on the recency of an sstable's keys.

In this commit we use this new upper bound to provide an alternative solution
to the interaction between delete-only compactions and sequence number zeroing.
Previously any compaction that zeroed sequence numbers and overlapped a
delete-only compaction hint was required to clear the conflicting hints to
ensure a delete-only compaction did not accidentally drop a table containing
keys more recent than any of the hint's constituent tombstones. This
interaction was a bit indirect and subtle. Encoding the pre-zeroing sequence
number on the file metadata is more direct, and will allow us to use the
sequence number for recency ordering of sstables' keys more generally,
including in #2112.

When the database is closed and re-opened, the new field LargestSeqNumAbsolute
is initialized to LargestSeqNum for all existing sstables. This means that
LargestSeqNumAbsolute only provides an upper bound of a sstables' keys'
sequence numbers over the lifetime of the database instance in the current
process. This is sufficient for many use cases, including delete-only
compaction accounting.

The reason this is sufficient in the delete-only compaction use case is subtle.
The problem we're seeking to avoid is a range tombstone [start,end)#n deleting
a key k#m such that s ≤ k < e and m ≥ n. Because of the LSM sequence number
invariant, the range tombstone can never fall into a lower level than k.
However, our in-memory delete-only compaction hints are not atomically updated
with transformations of the LSM. They represent the state of the LSM at a
single instant when the table stats collector observed range deletion(s) within
a particular file. This stale view of the LSM is what necessitates a mechanism
like LargestSeqNumAbsolute to avoid erroroneous applications of a deletion
hint. After a database restart, none of the previous instance's in-memory
delete-only compactions hints exist. The table stats collector must re-populate
the hints by scanning range deletions in sstables in the background. However,
because the sequence number invariant prevents inversion of sequence numbers
across process restarts, any hints we construct from the LSM will be correct
with respect to that view of the LSM.
@RaduBerinde RaduBerinde self-assigned this May 14, 2024
@RaduBerinde
Copy link
Member

This is my current thinking:

  • refactor the code to stop using levelIter. We can just find the overlapping files and check for overlap with those.
  • store an "overlap cache" in each FileMetadata. The overlap cache consists of a small number of non-overlapping UserKeyBounds; each one is either a known empty region of the file, or a known "full" region of the file (either just a point, or a range corresponding to a range del or range key).
  • when we check with overlap with a FileMetadata, we first take a look at the overlap cache and see if it can give us an answer (i.e. if we are completely inside of a "known empty" region, or if we intersect with any "known full" region). If not, we check for overlap and add one or two regions in the cache (one for any empty region we find, and one for any full region we find).

When restoring a table, we can expect many external ingestions in the same area so the cache should help across ingestions too. In any case, to minimize I/O while holding mutexes, we will do an optimistic "dry run" without the mutex; then, if the version has changed, we run it again under the mutex, and it will benefit from all the cached regions from the dry run.

@jbowens
Copy link
Collaborator Author

jbowens commented May 15, 2024

We can just find the overlapping files and check for overlap with those.

If a file's bounds are wholly contained within the ingested sstable bounds, I think we should also assume the file is non-empty and has data overlap without checking. During imports, ingested sstables can be wide in keyspace and we wouldn't want to actually look for data overlap in every overlapping file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Backlog
Development

No branches or pull requests

2 participants