
db: efficient skipping of points deleted by RANGEDEL in the same file #2424

Closed

sumeerbhola opened this issue Mar 31, 2023 · 7 comments

@sumeerbhola (Collaborator)

When a RANGEDEL deletes points in a lower level but the physical deletion has not yet happened, mergingIter seeks to the end key of the RANGEDEL, which efficiently skips the deleted points during iteration.

This efficient skipping does not work if the RANGEDEL is in the same file as the points. We have seen customer issues where the RANGEDEL has fallen down to L6 and deletes a large number of points, but because of an open Pebble snapshot, the compaction is unable to physically delete those points. This happens in CockroachDB despite its rare use of Pebble snapshots (for GC and for sending range snapshots). The result is sequential iteration over all the deleted points, checking whether each point is covered by the RANGEDEL, which is very inefficient and has led to customer escalations about slow queries.

A solution is to use block property filters to efficiently skip blocks whose key range and seqnum range are fully covered by the RANGEDEL. The key range bound for a block is already known from the index block (not tight, but should be good enough). The seqnum interval can be collected using a new seqnum block property collector. Intervals are delta-encoded and varint-encoded, so they should be cheap to store in the index block entry. Seqnum zeroing can sometimes make this cheaper, if the whole block has zero seqnums. If only part of the block has zeroed seqnums, we potentially change an interval [S1, S2) to [0, S2), which could have similar cost (the first will be encoded as varint(S1), varint(S2-S1) and the second as varint(S2)).
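To make the encoding cost concrete, here is a minimal Go sketch of the delta+varint interval encoding described above (the function name is illustrative; this is not Pebble's block property collector API):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeSeqNumInterval encodes the half-open seqnum interval [lo, hi)
// as varint(lo) followed by varint(hi-lo), per the scheme sketched
// above. Illustrative only, not Pebble's actual encoding.
func encodeSeqNumInterval(lo, hi uint64) []byte {
	buf := make([]byte, 0, 2*binary.MaxVarintLen64)
	buf = binary.AppendUvarint(buf, lo)
	buf = binary.AppendUvarint(buf, hi-lo)
	return buf
}

func main() {
	// A typical interval: a few bytes in the index block entry.
	fmt.Println(len(encodeSeqNumInterval(1_000_000, 1_000_050))) // 4 bytes
	// With partial seqnum zeroing, the interval widens to [0, S2);
	// varint(0)+varint(S2) costs about the same here.
	fmt.Println(len(encodeSeqNumInterval(0, 1_000_050))) // 4 bytes
}
```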

Alternatives: We have discussed eliminating all uses of Pebble snapshots in CockroachDB, which would remove this problem in the CockroachDB context. With the upcoming disaggregated storage:

  • Pebble snapshots will only be valid if we have not excised a keyspan from the LSM, which we will be doing when dropping a CockroachDB range. So the current uses of Pebble snapshots will need to ensure that the CockroachDB range is still present. This presents an opportunity to rethink whether Pebble snapshots are needed.
  • Sending range snapshots will be cheaper since most of the data sits in shared sstables and does not need to be iterated over, so using a Pebble snapshot is unnecessary. We have also discussed that even without disaggregated storage we should simply copy sstables (or whole sstable blocks) to avoid the CPU cost of iterating over each key-value pair -- this too implies no need for Pebble snapshots when sending range snapshots.

The counter argument to this alternative is that we are planning to make ranges much larger, and if we need a consistent storage view of the state machine, a Pebble snapshot is very useful (and ranges are not moved frequently, so the probability of an excision is low).

Next steps:

  • Understand whether mvccGCQueue needs a consistent view via a snapshot, or can afford to use multiple iterators when scanning a range.
  • The complexity here is in the filtering, since it is a dynamic filter that depends on the current RANGEDEL. The dynamism is akin to range key masking, where the code is complex. The RANGEDEL is known to the mergingIter, so maybe we can localize the changes to mergingIter, levelIter and the sstable iterators. We should try to prototype this to see if the complexity is warranted; a rough sketch of the coverage check follows this list.
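A rough, hedged sketch of the coverage check such a dynamic filter would perform, assuming the sstable iterator is handed the active RANGEDEL from mergingIter (all type and function names below are hypothetical, not Pebble's actual API):

```go
package sketch

import "bytes"

// blockMeta is what an index-block entry would expose per data block:
// a (possibly loose) key range plus the seqnum interval gathered by
// the proposed seqnum block property collector.
type blockMeta struct {
	smallest, largest []byte // key range from the index entry (not tight)
	seqLo, seqHi      uint64 // seqnums in the block lie in [seqLo, seqHi)
}

// rangeDel is the active tombstone the iterator is positioned within.
type rangeDel struct {
	start, end []byte // deletes keys in [start, end)
	seqNum     uint64 // deletes points with seqnum < seqNum
}

// coveredByRangeDel reports whether the data block can be skipped
// outright: its key range sits inside the tombstone's span and every
// point in it is older than the tombstone.
func coveredByRangeDel(b blockMeta, rd rangeDel) bool {
	if bytes.Compare(b.smallest, rd.start) < 0 || bytes.Compare(b.largest, rd.end) >= 0 {
		return false
	}
	// [seqLo, seqHi) is half-open, so seqHi <= rd.seqNum means the
	// largest seqnum in the block is rd.seqNum-1 or less. Note that
	// seqnum-zeroed points (seqnum 0) still satisfy this.
	return b.seqHi <= rd.seqNum
}
```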
@andrewbaptist

For the mvccGCQueue, it needs a snapshot that covers the same span as the MVCCStats it maintains. One consideration is to split these into "sub-range" stats, each covering a smaller span (say 16MB or so). Then we could hold a snapshot for a shorter time while scanning and updating the stats for that sub-range. This would be done as part of the work for larger ranges. We are still in the exploratory phase as part of #98488.

Without disaggregated stores, we will still need the ability to hold Pebble snapshots open longer for range snapshots. With delegated snapshots, it is less likely that the leaseholder of a range will hold a long snapshot, and we could tune this to almost never be the leaseholder if it made a difference.

So I don't think it is possible to remove all usages, but we could limit them so that the MVCCStats scan holds a snapshot for a much shorter time and range snapshots are normally taken only on followers.

How long in practice do we see snapshots open for? I would be surprised if it is ever over 1 minute today. Is that long enough to meaningfully hold up compactions or are we seeing them held open longer?

@sumeerbhola (Collaborator, Author)

> For the mvccGCQueue, it needs a snapshot that covers the same span as the MVCCStats it maintains.

Say the mvccGCQueue has decided to process a range and called gc.Run. This method already paginates by constructing a sequence of GCRequests. Can't we close the iterator, remember the resumption key, and open a new iterator for constructing the next GCRequest? And if enough time has elapsed, even close it within the construction of a single GCRequest (as long as we don't do it in the middle of a roachpb.Key)? Any consistency of stats will be handled by the delta stats computation when the GCRequest is executed, so it should not concern the GC queue.
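A minimal sketch of that pagination shape, under the assumption that a fresh iterator can be opened at the resumption key (the iterator interface, gcRequest type, and pageSize below are hypothetical, not CockroachDB's actual API):

```go
package sketch

// iterator is a hypothetical minimal storage iterator.
type iterator interface {
	SeekGE(key []byte) bool // position at first key >= key; false if exhausted
	Next() bool
	Key() []byte
	Close()
}

type gcRequest struct{ keys [][]byte }

const pageSize = 1024 // keys per GCRequest (illustrative)

// buildGCRequests paginates a range scan without holding one iterator
// (or snapshot) across the whole range: each page opens a fresh
// iterator at the remembered resumption key and closes it afterward.
func buildGCRequests(open func() iterator, start, end []byte, less func(a, b []byte) bool) []gcRequest {
	var reqs []gcRequest
	resume := start
	for resume != nil {
		it := open() // fresh iterator: nothing pinned across pages
		var req gcRequest
		ok := it.SeekGE(resume)
		for ok && less(it.Key(), end) && len(req.keys) < pageSize {
			req.keys = append(req.keys, append([]byte(nil), it.Key()...))
			ok = it.Next()
		}
		if ok && less(it.Key(), end) {
			resume = append([]byte(nil), it.Key()...) // resumption key for next page
		} else {
			resume = nil // scanned to end
		}
		it.Close() // release before building the next GCRequest
		reqs = append(reqs, req)
	}
	return reqs
}
```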

> How long in practice do we see snapshots open for? I would be surprised if it is ever over 1 minute today.

I don't know. But yes, it should be much less than 1 minute for a 512MB range (assuming a good cache hit rate).

> Is that long enough to meaningfully hold up compactions or are we seeing them held open longer?

Compactions are not held up by snapshots or open iterators (which is the right design choice). But compactions need to consider open snapshots and not delete data needed by a snapshot, which has caused the problems noted in this issue.
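To illustrate the constraint, here is a hedged Go sketch of why an open snapshot forces retention: an older version of a key can be dropped only if no snapshot separates it from a newer version. The visibility convention (seqnum <= snapshot) is assumed for illustration and may differ from Pebble's exact rule:

```go
package sketch

import "sort"

// canElideOlderVersion reports whether the older of two versions of the
// same user key (seqnums older < newer) may be dropped by a compaction,
// given the ascending seqnums of open snapshots. If some snapshot sees
// the older version but not the newer one, the older version must be
// retained. Visibility here is assumed to be seqnum <= snapshot; this
// is an illustrative convention, not Pebble's exact rule.
func canElideOlderVersion(older, newer uint64, snapshots []uint64) bool {
	// Find the first snapshot that sees the older version.
	i := sort.Search(len(snapshots), func(i int) bool { return snapshots[i] >= older })
	// Elision is safe only if no snapshot falls in [older, newer),
	// i.e. every snapshot seeing the older version also sees the newer.
	return i == len(snapshots) || snapshots[i] >= newer
}
```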

@jbowens (Collaborator) commented Apr 3, 2023

Related to #1070.

@jbowens (Collaborator) commented Apr 14, 2023

Good catch @nicktrav! That use of a snapshot looks wholly unnecessary. (*ts.DB).MaintainTimeSeries passes it into (*ts.DB).findTimeSeries, which uses it once to create an iterator to find the names of the time series.

@sumeerbhola (Collaborator, Author)

Replica.computeChecksumPostApply also uses a Pebble snapshot, and it can be costly for writes, as discussed in https://cockroachlabs.slack.com/archives/C057ULDSKC0/p1688150445571159?thread_ts=1688137716.597589&cid=C057ULDSKC0
This use of snapshots is hard to replace with an iterator since (a) the snapshot is taken at the same Raft log index during replicated state machine application across all replicas, and (b) the "consistency checker has been paced at 8MB/s for the past few releases. So it takes up to 512/8=64s to run", so it is long-lived and will get longer as we increase the size of a range.

@sumeerbhola (Collaborator, Author)

The work here has been subsumed by a combination of the obsolete bit in sstables (#2465), which provides for efficient skipping, and EventuallyFileOnlySnapshot (#2740), to which we are slowly moving all CRDB snapshot use cases, and which obviates the need for compactions to retain obsolete points needed by snapshots.
