-
Notifications
You must be signed in to change notification settings - Fork 451
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
db: reduce Ingest I/O with manifest+db locks #2112
Comments
This may be used to improve the estimate of |
The IngestWithStats ingestion entrypoint returnrs statistics about the ingest. Cockroach uses these statistics to inform admission control. During a flushable ingest, at commit time the number of bytes that will be ingested into L0 is unknown. This commit adds an approximation based on which tables overlapped the flushable queue. This required adjusting the memtable overlap logic to avoid short circuiting as soon as overlap is found. This approximation is rough and may be improved with future work, such as cockroachdb#2112. Close cockroachdb#2421. Informs cockroachdb#2112.
The IngestWithStats ingestion entrypoint returnrs statistics about the ingest. Cockroach uses these statistics to inform admission control. During a flushable ingest, at commit time the number of bytes that will be ingested into L0 is unknown. This commit adds an approximation based on which tables overlapped the flushable queue. This required adjusting the memtable overlap logic to avoid short circuiting as soon as overlap is found. This approximation is rough and may be improved with future work, such as cockroachdb#2112. Close cockroachdb#2421. Informs cockroachdb#2112.
The IngestWithStats ingestion entrypoint returnrs statistics about the ingest. Cockroach uses these statistics to inform admission control. During a flushable ingest, at commit time the number of bytes that will be ingested into L0 is unknown. This commit adds an approximation based on which tables overlapped the flushable queue. This required adjusting the memtable overlap logic to avoid short circuiting as soon as overlap is found. This approximation is rough and may be improved with future work, such as #2112. Close #2421. Informs #2112.
The IngestWithStats ingestion entrypoint returnrs statistics about the ingest. Cockroach uses these statistics to inform admission control. During a flushable ingest, at commit time the number of bytes that will be ingested into L0 is unknown. This commit adds an approximation based on which tables overlapped the flushable queue. This required adjusting the memtable overlap logic to avoid short circuiting as soon as overlap is found. This approximation is rough and may be improved with future work, such as cockroachdb#2112. Close cockroachdb#2421. Informs cockroachdb#2112.
The IngestWithStats ingestion entrypoint returnrs statistics about the ingest. Cockroach uses these statistics to inform admission control. During a flushable ingest, at commit time the number of bytes that will be ingested into L0 is unknown. This commit adds an approximation based on which tables overlapped the flushable queue. This required adjusting the memtable overlap logic to avoid short circuiting as soon as overlap is found. This approximation is rough and may be improved with future work, such as #2112. Close #2421. Informs #2112.
When there exist no keys beneath a compaction's key range, Pebble performs sequence number zeroing. This is an optimization that allows for easier detection of new user keys during iteration. This commit introduces a new FileMetadata field LargestSeqNumAbsolute that provides an upper bound on the sequence numbers of a sstables' keys before they were zeroed. This is useful as a stable upper bound on the recency of an sstable's keys. In this commit we use this new upper bound to provide an alternative solution to the interaction between delete-only compactions and sequence number zeroing. Previously any compaction that zeroed sequence numbers and overlapped a delete-only compaction hint was required to clear the conflicting hints to ensure a delete-only compaction did not accidentally drop a table containing keys more recent than any of the hint's constituent tombstones. This interaction was a bit indirect and subtle. Encoding the pre-zeroing sequence number on the file metadata is more direct, and will allow us to use the sequence number for recency ordering of sstables' keys more generally, including in cockroachdb#2112.
When there exist no keys beneath a compaction's key range, Pebble performs sequence number zeroing. This is an optimization that allows for easier detection of new user keys during iteration. This commit introduces a new FileMetadata field LargestSeqNumAbsolute that provides an upper bound on the sequence numbers of a sstables' keys before they were zeroed. This is useful as a stable upper bound on the recency of an sstable's keys. In this commit we use this new upper bound to provide an alternative solution to the interaction between delete-only compactions and sequence number zeroing. Previously any compaction that zeroed sequence numbers and overlapped a delete-only compaction hint was required to clear the conflicting hints to ensure a delete-only compaction did not accidentally drop a table containing keys more recent than any of the hint's constituent tombstones. This interaction was a bit indirect and subtle. Encoding the pre-zeroing sequence number on the file metadata is more direct, and will allow us to use the sequence number for recency ordering of sstables' keys more generally, including in cockroachdb#2112.
When there exist no keys beneath a compaction's key range, Pebble performs sequence number zeroing. This is an optimization that allows for easier detection of new user keys during iteration. This commit introduces a new FileMetadata field LargestSeqNumAbsolute that provides an upper bound on the sequence numbers of a sstables' keys before they were zeroed. This is useful as a stable upper bound on the recency of an sstable's keys. In this commit we use this new upper bound to provide an alternative solution to the interaction between delete-only compactions and sequence number zeroing. Previously any compaction that zeroed sequence numbers and overlapped a delete-only compaction hint was required to clear the conflicting hints to ensure a delete-only compaction did not accidentally drop a table containing keys more recent than any of the hint's constituent tombstones. This interaction was a bit indirect and subtle. Encoding the pre-zeroing sequence number on the file metadata is more direct, and will allow us to use the sequence number for recency ordering of sstables' keys more generally, including in cockroachdb#2112. When the database is closed and re-opened, the new field LargestSeqNumAbsolute is initialized to LargestSeqNum for all existing sstables. This means that LargestSeqNumAbsolute only provides an upper bound of a sstables' keys' sequence numbers over the lifetime of the database instance in the current process. This is sufficient for many use cases, including delete-only compaction accounting. The reason this is sufficient in the delete-only compaction use case is subtle. The problem we're seeking to avoid is a range tombstone [start,end)#n deleting a key k#m such that s ≤ k < e and m ≥ n. Because of the sequence number invariant, the range tombstone can never fall beneath the key k that it does not delete within the LSM. However, our in-memory delete-only compaction hints are not atomically updated with transformations of the LSM. They represent the state of the LSM at a single instant when the table stats collector observed range deletion(s) within a particular file. This stale view of the LSM is what necessitates a mechanism like LargestSeqNumAbsolute to avoid erroroneous applications of a deletion hint. After a database restart, none of the previous instance's in-memory delete-only compactions hints exist. The table stats collector must re-populate the hints by scanning range deletions in sstables in the background. However, because the sequence number invariant prevents inversion of sequence numbers across process restarts, any hints we construct from the LSM will be correct with respect to that view of the LSM.
When there exist no keys beneath a compaction's key range, Pebble performs sequence number zeroing. This is an optimization that allows for easier detection of new user keys during iteration. This commit introduces a new FileMetadata field LargestSeqNumAbsolute that provides an upper bound on the sequence numbers of a sstables' keys before they were zeroed. This is useful as a stable upper bound on the recency of an sstable's keys. In this commit we use this new upper bound to provide an alternative solution to the interaction between delete-only compactions and sequence number zeroing. Previously any compaction that zeroed sequence numbers and overlapped a delete-only compaction hint was required to clear the conflicting hints to ensure a delete-only compaction did not accidentally drop a table containing keys more recent than any of the hint's constituent tombstones. This interaction was a bit indirect and subtle. Encoding the pre-zeroing sequence number on the file metadata is more direct, and will allow us to use the sequence number for recency ordering of sstables' keys more generally, including in cockroachdb#2112. When the database is closed and re-opened, the new field LargestSeqNumAbsolute is initialized to LargestSeqNum for all existing sstables. This means that LargestSeqNumAbsolute only provides an upper bound of a sstables' keys' sequence numbers over the lifetime of the database instance in the current process. This is sufficient for many use cases, including delete-only compaction accounting. The reason this is sufficient in the delete-only compaction use case is subtle. The problem we're seeking to avoid is a range tombstone [start,end)#n deleting a key k#m such that s ≤ k < e and m ≥ n. Because of the LSM sequence number invariant, the range tombstone can never fall into a lower level than k. However, our in-memory delete-only compaction hints are not atomically updated with transformations of the LSM. They represent the state of the LSM at a single instant when the table stats collector observed range deletion(s) within a particular file. This stale view of the LSM is what necessitates a mechanism like LargestSeqNumAbsolute to avoid erroroneous applications of a deletion hint. After a database restart, none of the previous instance's in-memory delete-only compactions hints exist. The table stats collector must re-populate the hints by scanning range deletions in sstables in the background. However, because the sequence number invariant prevents inversion of sequence numbers across process restarts, any hints we construct from the LSM will be correct with respect to that view of the LSM.
When there exist no keys beneath a compaction's key range, Pebble performs sequence number zeroing. This is an optimization that allows for easier detection of new user keys during iteration. This commit introduces a new FileMetadata field LargestSeqNumAbsolute that provides an upper bound on the sequence numbers of a sstables' keys before they were zeroed. This is useful as a stable upper bound on the recency of an sstable's keys. In this commit we use this new upper bound to provide an alternative solution to the interaction between delete-only compactions and sequence number zeroing. Previously any compaction that zeroed sequence numbers and overlapped a delete-only compaction hint was required to clear the conflicting hints to ensure a delete-only compaction did not accidentally drop a table containing keys more recent than any of the hint's constituent tombstones. This interaction was a bit indirect and subtle. Encoding the pre-zeroing sequence number on the file metadata is more direct, and will allow us to use the sequence number for recency ordering of sstables' keys more generally, including in cockroachdb#2112.
Refactor the overlap checking to avoid using the levelIter to populate the range deletion iterator. This was subtle and confusing. This refactor will also benefit the implementation of cockroachdb#2112. Informs cockroachdb#2112. Informs cockroachdb#2863.
Refactor the overlap checking to avoid using the levelIter to populate the range deletion iterator. This was subtle and confusing. This refactor will also benefit the implementation of cockroachdb#2112. Informs cockroachdb#2112. Informs cockroachdb#2863.
Refactor the overlap checking to avoid using the levelIter to populate the range deletion iterator. This was subtle and confusing. This refactor will also benefit the implementation of cockroachdb#2112. Informs cockroachdb#2112. Informs cockroachdb#2863.
When there exist no keys beneath a compaction's key range, Pebble performs sequence number zeroing. This is an optimization that allows for easier detection of new user keys during iteration. This commit introduces a new FileMetadata field LargestSeqNumAbsolute that provides an upper bound on the sequence numbers of a sstables' keys before they were zeroed. This is useful as a stable upper bound on the recency of an sstable's keys. In this commit we use this new upper bound to provide an alternative solution to the interaction between delete-only compactions and sequence number zeroing. Previously any compaction that zeroed sequence numbers and overlapped a delete-only compaction hint was required to clear the conflicting hints to ensure a delete-only compaction did not accidentally drop a table containing keys more recent than any of the hint's constituent tombstones. This interaction was a bit indirect and subtle. Encoding the pre-zeroing sequence number on the file metadata is more direct, and will allow us to use the sequence number for recency ordering of sstables' keys more generally, including in #2112. When the database is closed and re-opened, the new field LargestSeqNumAbsolute is initialized to LargestSeqNum for all existing sstables. This means that LargestSeqNumAbsolute only provides an upper bound of a sstables' keys' sequence numbers over the lifetime of the database instance in the current process. This is sufficient for many use cases, including delete-only compaction accounting. The reason this is sufficient in the delete-only compaction use case is subtle. The problem we're seeking to avoid is a range tombstone [start,end)#n deleting a key k#m such that s ≤ k < e and m ≥ n. Because of the LSM sequence number invariant, the range tombstone can never fall into a lower level than k. However, our in-memory delete-only compaction hints are not atomically updated with transformations of the LSM. They represent the state of the LSM at a single instant when the table stats collector observed range deletion(s) within a particular file. This stale view of the LSM is what necessitates a mechanism like LargestSeqNumAbsolute to avoid erroroneous applications of a deletion hint. After a database restart, none of the previous instance's in-memory delete-only compactions hints exist. The table stats collector must re-populate the hints by scanning range deletions in sstables in the background. However, because the sequence number invariant prevents inversion of sequence numbers across process restarts, any hints we construct from the LSM will be correct with respect to that view of the LSM.
This is my current thinking:
When restoring a table, we can expect many external ingestions in the same area so the cache should help across ingestions too. In any case, to minimize I/O while holding mutexes, we will do an optimistic "dry run" without the mutex; then, if the version has changed, we run it again under the mutex, and it will benefit from all the cached regions from the dry run. |
If a file's bounds are wholly contained within the ingested sstable bounds, I think we should also assume the file is non-empty and has data overlap without checking. During imports, ingested sstables can be wide in keyspace and we wouldn't want to actually look for data overlap in every overlapping file. |
To identify the target level of M ingested files in a LSM with N levels, we may need to seek in O(N × M) sstables to look for data overlap. This is all performed during application while holding
DB.mu
and the manifest lock, which prevents concurrent flushes and compactions from scheduling or completing while the mutex is held.Ingest performs these data overlap checks while holding the manifest lock for a consistent view of the LSM. However, for the purposes of detecting data overlap, ingest could be optimistic about its view of the LSM.
When Ingest acquires the manifest lock during the apply step, Ingest can exclude any sstables or memtables with largest sequence numbers already observed during data overlap calculations. Ingest only needs to search in any new sstables for more recent overlap introduced since its optimistic scan.
Related to #25 and this TODO to reduce overlap checks by using L0 sublevels.
Also related to #1683 which will remove the strict requirement of preventing file boundary overlap, making target level a function of data overlap exclusively.
Jira issue: PEBBLE-173
The text was updated successfully, but these errors were encountered: