
[WIP] storage/engine: attempt to do best-effort GC in compactions (failed) #42514

Open
wants to merge 1 commit into base: master

Conversation

sumeerbhola
Collaborator


Release note: None

@cockroach-teamcity
Member

This change is Reviewable

Collaborator Author

@sumeerbhola sumeerbhola left a comment

In addition to being very incomplete, this doesn't seem viable -- see the DANGER comment. Sending this out mainly for discussion in case folks have other ideas.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajkr, @ajwerner, @itsbilal, and @petermattis)

Contributor

@ajkr ajkr left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajkr, @ajwerner, @itsbilal, @petermattis, and @sumeerbhola)


pkg/storage/engine/pebble.go, line 195 at r1 (raw file):

//   the former has moved to a lower level without the latter. If we considered sstables to be
//   overlapping for compaction purposes if these real keys overlapped it would enforce the stronger
//   invariant.

This is more difficult than I previously would have guessed. Thanks for the clear explanation of the problem.

Contributor

@ajkr ajkr left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajkr, @ajwerner, @itsbilal, @petermattis, and @sumeerbhola)


pkg/storage/engine/pebble.go, line 195 at r1 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

This is more difficult than I previously would have guessed. Thanks for the clear explanation of the problem.

Would it be feasible in the future when intents are stored separately, and presumably we can read from that separate store cheaply during compaction?

Collaborator Author

@sumeerbhola sumeerbhola left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajkr, @ajwerner, @itsbilal, and @petermattis)


pkg/storage/engine/pebble.go, line 195 at r1 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

Would it be feasible in the future when intents are stored separately, and presumably we can read from that separate store cheaply during compaction?

That may indeed be necessary, since in the future envisioned by #41720 there can be multiple provisional values that are not deleted from Pebble's perspective. At most one of these will actually be provisional; the rest will be aborted or committed, but the state of what is aborted/committed may not even be persisted in the intent/lock metadata (it may only be in memory on the leaseholder, since it can be recovered using persistent information distributed across potentially multiple nodes).
If reading this lock state requires reading from Pebble, we may need to think carefully about the circularity.

Collaborator

@petermattis petermattis left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajkr, @ajwerner, @itsbilal, and @petermattis)


pkg/storage/engine/pebble.go, line 195 at r1 (raw file):

Previously, sumeerbhola wrote…

That may indeed be necessary, since in the future envisioned by #41720 there can be multiple provisional values that are not deleted from Pebble's perspective. At most one of these will actually be provisional; the rest will be aborted or committed, but the state of what is aborted/committed may not even be persisted in the intent/lock metadata (it may only be in memory on the leaseholder, since it can be recovered using persistent information distributed across potentially multiple nodes).
If reading this lock state requires reading from Pebble, we may need to think carefully about the circularity.

What if we did this GC only for bottommost level compactions and only did the GC optimistically? That is, we'd only do the GC for SET operations at the bottom level where we can prove they are not provisional. So we'd need to either see the intent, or a newer value for the key.

We'd need some sort of fallback mechanism for scenarios where the values for a key are split across sstables. Perhaps something could be done by looking at the sstable boundaries for L6 and doing some sort of slow-path GC (i.e. not during compaction).
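To make that concrete, here is a rough sketch of the per-key state such an optimistic pass could keep (hypothetical types and names; this is not the real Pebble compaction-filter interface, and the intent-timestamp check is glossed over):

```go
// Hypothetical sketch of the optimistic bottommost-level GC idea described
// above; none of these types correspond to the real Pebble or storage/engine
// APIs, and intent handling is deliberately glossed over.
package gcsketch

import "bytes"

// optimisticGC tracks per-user-key state while a bottommost compaction
// streams versions of a key from newest to oldest.
type optimisticGC struct {
	userKey       []byte
	newerVersions int // versions of this user key already passed over
}

// canDrop reports whether the version (userKey, ts) may be GC'd: it must be
// below the GC threshold and provably not the latest/provisional version,
// which this sketch approximates by requiring that a newer version of the
// same user key was already seen in this compaction.
func (g *optimisticGC) canDrop(userKey []byte, ts, gcThreshold int64) bool {
	if !bytes.Equal(g.userKey, userKey) {
		// First version of a new user key: reset state; never drop it.
		g.userKey = append(g.userKey[:0], userKey...)
		g.newerVersions = 0
	}
	drop := ts < gcThreshold && g.newerVersions > 0
	g.newerVersions++
	return drop
}
```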

Collaborator Author

@sumeerbhola sumeerbhola left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajkr, @ajwerner, and @itsbilal)


pkg/storage/engine/pebble.go, line 187 at r1 (raw file):

//   The SET a@12#5 could be at level 5 while the DEL a@12#19 is at level 2. Now the SET a#30#25
//   could sink down to level 4 (the DEL could still be at level 2 since they are different user
//   keys from Pebble's perspective). Now both SET a@30#25, SET a@0#31 are involved in a compaction

Typo s/SET a@0#31/SET a@12#5/


pkg/storage/engine/pebble.go, line 195 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

What if we did this GC only for bottommost level compactions and only did the GC optimistically? That is, we'd only do the GC for SET operations at the bottom level where we can prove they are not provisional. So we'd need to either see the intent, or a newer value for the key.

We'd need some sort of fallback mechanism for scenarios where the values for a key are split across sstables. Perhaps something could be done by looking at the sstable boundaries for L6 and doing some sort of slow-path GC (i.e. not during compaction).

I don't think the bottommost level compaction can rely on any extra invariant to be true that helps here. In the example I had constructed, both a@30#25 and a@12#5 could be at the bottommost level in different files and be compacted together while the DEL a@12#19 is sitting at a higher level. And even if we use the fact that these two points were in separate input files to avoid GCing a in this compaction, in the output of the compaction they are in the same file, so if they are later involved in a compaction at the lowest level we don't know what is safe.
I think external information about the provisional values of a will be necessary.

Collaborator

@petermattis petermattis left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @itsbilal, and @tbg)


pkg/storage/engine/pebble.go, line 195 at r1 (raw file):

Previously, sumeerbhola wrote…

I don't think the bottommost level compaction can rely on any extra invariant to be true that helps here. In the example I had constructed, both a@30#25 and a@12#5 could be at the bottommost level in different files and be compacted together while the DEL a@12#19 is sitting at a higher level. And even if we use the fact that these two points were in separate input files to avoid GCing a in this compaction, in the output of the compaction they are in the same file, so if they are later involved in a compaction at the lowest level we don't know what is safe.
I think external information about the provisional values of a will be necessary.

In your example, if a@30#25.SET and a@12#5.SET are compacted to the bottommost level we have two possibilities. The first is that they are in the same sstable and the GC policy can kick in because it knows a@12#5.SET is non-provisional. a@30#25.SET may be provisional or it may be the latest value and thus need to be retained. The other possibility is that a@30#25.SET and a@12#5.SET reside in different sstables in the bottommost level, but we're guaranteed those sstables are adjacent. I think we can look at the boundary keys for the bottommost sstable to see this and have a separate GC path for keys which straddle sstables.

Btw, in case you're not aware: an MVCCDelete operation results in a RocksDB/Pebble SET operation where the value is empty. I'm not sure we'd ever have the case of interleaved SET and DEL as in your example. But even if we did, it seems safe to GC a@12#5.SET.
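For context, a toy illustration of that point (simplified stand-in types, not the actual storage/engine MVCCKey or value encoding): an MVCC deletion is just another version of the key whose raw value is empty, so at the Pebble layer it looks like an ordinary SET:

```go
// Toy illustration of MVCCDelete writing an empty-valued SET; these are
// simplified stand-ins, not the real storage/engine types.
package main

import "fmt"

type mvccKey struct {
	key string
	ts  int64
}

type write struct {
	key   mvccKey
	value []byte // empty value == MVCC deletion tombstone
}

func main() {
	history := []write{
		{mvccKey{"a", 12}, []byte("v1")}, // MVCCPut: a Pebble SET with a value
		{mvccKey{"a", 30}, nil},          // MVCCDelete: still a Pebble SET, empty value
	}
	for _, w := range history {
		fmt.Printf("SET %s@%d value=%q (deletion=%t)\n",
			w.key.key, w.key.ts, w.value, len(w.value) == 0)
	}
}
```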

There is another challenge that this PR doesn't touch on that I think is larger than the actual GC itself: keeping the range MVCCStats up to date. This is a gigantic problem. Search for MVCCStats in storage/engine/mvcc.go to realize what kind of corner we've painted ourselves into with that design.

Cc @tbg as he may have thought about the MVCCStats problem with compaction-based GC.

Collaborator Author

@sumeerbhola sumeerbhola left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @itsbilal, and @tbg)


pkg/storage/engine/pebble.go, line 195 at r1 (raw file):

In your example, if a@30#25.SET and a@12#5.SET are compacted to the bottommost level we have two possibilities. The first is that they are in the same sstable and the GC policy can kick in because it knows a@12#5.SET is non-provisional.

One can't know whether it is provisional, or, in the example above, that it is deleted, without finding the a@12#19.DEL, which can be in some other sstable at a higher level.

Btw, in case you're not aware: an MVCCDelete operation results in a RocksDB/Pebble SET operation where the value is empty. I'm not sure we'd ever have the case of interleaved SET and DEL as in your example. But even if we did, it seems safe to GC a@12#5.SET.

In this example it was intents that were rolled back. MVCCDeletes don't cause problems since the delete is just another version of the value: if one sees a@30 and a@12 in a compaction but not a@18, it is OK to remove a@12, since whether or not there is another version in between does not affect correctness. The rollbacks are really what cause problems (the rest of the problems can be eliminated with the segregated lock table).

I think GC in compactions can't be done without treating sstables as overlapping if they contain the same MVCCKey.Key, which would be an incompatible change from RocksDB. That will ensure that if there is a sequence of changes of the form a@t1#s1.SET, a@t1#s2.DEL, a@t2#s3.SET, a@t2#s4.DEL, a@t3#s5.SET, a@t3#s6.DEL, ... where t1 < t2 < t3 < ... and s1 < s2 < s3 < ..., then any compaction will see a contiguous subsequence. From a performance perspective this doesn't seem bad: MVCCMetadata will be segregated and not in this sequence, and GC can happen more aggressively, so having multiple versions be part of the same stack should not result in very fat sstables. But I am inclined towards this not being a 20.2 item -- we should wait until Pebble is the only option for users and the segregated lock table is deployed.
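As a rough illustration of what "overlapping if they contain the same MVCCKey.Key" would mean mechanically (hypothetical helper and simplified types, not the real Pebble boundary logic), boundary determination would have to compare only the bare user-key prefix at sstable boundaries, ignoring the timestamp suffix:

```go
// Hypothetical sketch of the "overlapping if they share an MVCCKey.Key" rule;
// simplified types, not the real Pebble boundary logic.
package main

import "bytes"

type mvccKey struct {
	key []byte
	ts  int64
}

type sstBounds struct {
	smallest, largest mvccKey
}

// mustTreatAsOverlapping reports whether two boundary-adjacent sstables would
// have to be considered overlapping for compaction purposes under the proposed
// rule: their boundary keys share the same bare MVCCKey.Key, so all versions
// of that key (and any interleaved rollback DELs) stay together.
func mustTreatAsOverlapping(left, right sstBounds) bool {
	return bytes.Equal(left.largest.key, right.smallest.key)
}

func main() {
	// left ends at a@12, right starts at a@5: same user key "a", so the two
	// sstables must be compacted together under the proposed rule.
	left := sstBounds{largest: mvccKey{key: []byte("a"), ts: 12}}
	right := sstBounds{smallest: mvccKey{key: []byte("a"), ts: 5}}
	_ = mustTreatAsOverlapping(left, right) // true
}
```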

Collaborator Author

@sumeerbhola sumeerbhola left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @itsbilal, and @tbg)


pkg/storage/engine/pebble.go, line 195 at r1 (raw file):

I think GC in compactions can't be done without treating sstables as overlapping if they contain the same MVCCKey.Key, which would be an incompatible change from RocksDB. That will ensure that if there is a sequence of changes of the form a@t1#s1.SET, a@t1#s2.DEL, a@t2#s3.SET, a@t2#s4.DEL, a@t3#s5.SET, a@t3#s6.DEL, ... where t1 < t2 < t3 < ... and s1 < s2 < s3 < ..., then any compaction will see a contiguous subsequence.

It turns out that this is not a sufficient condition: seeing a contiguous subsequence still means that the youngest version in the subsequence could have been deleted, and the compaction may not see the delete for it. But it does allow for GCing all but the 2 youngest versions, since the youngest one may be deleted but the second youngest surely is not. This does place a limit on what can be GC'd via compactions.

The same issue applies if we were to segregate the latest and older versions into separate files in each level: the "latest" sstable needs to keep the two most recent versions of the key, since the most recent one may be deleted.
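A minimal sketch of that weaker rule (hypothetical filter state, not the real compaction-filter interface; DELs are ignored for simplicity): while streaming one user key's versions from newest to oldest, a version below the GC threshold may be dropped only after two newer versions have already been seen:

```go
// Hypothetical sketch of the "keep the two youngest versions" rule discussed
// above; not the real Pebble compaction-filter API, and DEL handling is
// ignored for simplicity.
package gcsketch

import "bytes"

type twoYoungest struct {
	userKey []byte
	seen    int // SET versions of this user key already passed over
}

// drop reports whether the version (userKey, ts) can be GC'd: it must be below
// the GC threshold, and at least two newer versions must already have been
// seen in this compaction. The youngest visible version might be a rolled-back
// provisional value whose DEL sits in a higher level, so the second youngest
// has to be retained as the potential live latest version; everything older
// and below the threshold can go.
func (f *twoYoungest) drop(userKey []byte, ts, gcThreshold int64) bool {
	if !bytes.Equal(f.userKey, userKey) {
		f.userKey = append(f.userKey[:0], userKey...)
		f.seen = 0
	}
	ok := ts < gcThreshold && f.seen >= 2
	f.seen++
	return ok
}
```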

Member

@nvanbenschoten nvanbenschoten left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @itsbilal, and @tbg)


pkg/storage/engine/pebble.go, line 195 at r1 (raw file):

It turns out that this is not a sufficient condition

Yeah, I think the smallest example that demonstrates this is a@10#1.SET, a@20#2.SET, a@20#3.DEL. If the DEL is split off from the other two values in some higher level of the LSM, the compaction won't be able to determine whether a@20 shadows a@10 or not because it may or may not be rolled back.

Having just spent a while thinking about replication-level closed timestamps and "flushing through" changes to establish a lagging immutable timeline, I'm wondering if we can do something similar here. Specifically, what if we used the GC timestamp not just as permission to GC below a threshold, but as an actual promise that values won't change below that threshold? To do so, we'd need to treat the GC threshold as a property of a given SST in the LSM. For instance, a GC threshold of 25 in a given SST containing a@20 in the LSM would mean that an a@20#3.DEL could not exist at a higher level. We could then pass this promise in at the top of the LSM from a Replica and let it propagate its way down, compacting as it goes.

One prerequisite for this would be to guarantee that the GC threshold for a Range never exceeds the timestamp for an active intent. We don't currently make this guarantee because we bump the GC threshold before scanning for intents in the gcQueue. Being able to cheaply scan through all intents in a range (e.g. the lock-table) would make this guarantee reasonable to provide.

That's all super handwavy and avoids a lot of questions around a) enforcing this promise above Pebble, b) propagating this promise through compactions, c) making this all work while multiplexing Ranges onto a single LSM. But I think this would at least allow you to avoid the need to make sstables overlap if they have the same MVCCKey.

@petermattis
Collaborator

That's all super handwavy and avoids a lot of questions around a) enforcing this promise above Pebble, b) propagating this promise through compactions, c) making this all work while multiplexing Ranges onto a single LSM. But I think this would at least allow you to avoid the need to make sstables overlap if they have the same MVCCKey.

Range tombstones are an existing mechanism that restricts an operation (a deletion) to a range of keys. Compactions have to know how to split range tombstones into the output sstables. Your description sounds like some sort of "conditional range delete" operation. The condition would be "delete obsolete values older than this timestamp". There would be a lot of details to figure out here (e.g. how to store the condition information on the range tombstone, how to provide a hook back into CRDB code to process the condition, etc.), but it doesn't seem impossible. A "conditional range delete" operation would not be backwards compatible with RocksDB or with the 20.2 version of Pebble, which is another headache.
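Purely as a strawman for the shape of such an operation (nothing below exists in Pebble or RocksDB, and every name is hypothetical; the encoding and the CRDB hook are exactly the open details mentioned above):

```go
// Strawman sketch of a "conditional range delete"; no such operation exists in
// Pebble or RocksDB today, and all names here are hypothetical.
package gcsketch

import "bytes"

// conditionalRangeDelete would be stored and split across compaction outputs
// like a range tombstone covering [start, end) at seqnum seq, but instead of
// deleting everything beneath it, it carries a condition for compactions to
// evaluate per covered key.
type conditionalRangeDelete struct {
	start, end  []byte
	seq         uint64
	gcThreshold int64 // "delete obsolete MVCC versions older than this timestamp"
}

// gcCondition is the hook CRDB would have to provide so the compaction can use
// MVCC-level knowledge to decide whether a covered key/value is an obsolete
// version below the threshold.
type gcCondition func(userKey, value []byte, gcThreshold int64) (drop bool)

// maybeDrop shows where the condition would be evaluated during a compaction:
// only keys covered by the tombstone and older than it (smaller seqnum) are
// candidates, and the final decision is delegated to the CRDB hook.
func maybeDrop(t conditionalRangeDelete, keySeq uint64, userKey, value []byte, cond gcCondition) bool {
	covered := keySeq < t.seq &&
		bytes.Compare(userKey, t.start) >= 0 &&
		bytes.Compare(userKey, t.end) < 0
	return covered && cond(userKey, value, t.gcThreshold)
}
```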

@nvanbenschoten
Member

Your description sounds like some sort of "conditional range delete" operation. The condition would be "delete obsolete values older than this timestamp".

That's a very powerful insight. I was thinking of this as a piece of metadata attached to an SST, but a "conditional range delete" stored within the SST's data is much closer to what we actually want, given the propagation rules for range deletions. This also avoids the complexity of flushing the memtable to properly sequence a GC with not-yet-durable writes. So replica GC would simplify down to the process of laying down one of these conditional range deletion tombstones and letting compactions asynchronously do the rest.

@petermattis
Collaborator

So replica GC would simplify down to the process of laying down one of these conditional range deletion tombstones and letting compactions asynchronously do the rest.

Thinking about this more, I'm not certain we'd want to model this as a "range delete". I would expect the semantics of a "conditional range delete" to affect reads, but that might be fairly hard to achieve as we'd have to disable all of the fancy logic for skipping over swaths of keys covered by one of these "conditional range tombstones". If we don't need the GC to affect read operations life becomes a lot simpler. Perhaps we can define some "range property" operation that is then visible to compaction filters.

@nvanbenschoten
Member

Having the conditional range delete affect reads sounds tricky, given that its condition needs to consider multiple user keys to determine which should be visible. I would suggest avoiding that kind of requirement and making this only visible to compaction filters.

If this "range property" did not impact read operations then KV would once again need to be careful to ignore inconsistencies between replicas below the GC threshold, but doing so doesn't seem complex or controversial to me.

@petermattis
Collaborator

If this "range property" did not impact read operations then KV would once again need to be careful to ignore inconsistencies between replicas below the GC threshold, but doing so doesn't seem complex or controversial to me.

Yeah, it is probably easier to trim off records below the GC threshold in MVCCScan than to do it in Pebble, as there is only a single threshold per range, while Pebble would need to handle the arbitrary case of different thresholds for different ranges of keys.
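A toy sketch of what that trimming could mean (simplified types; the real MVCCScan in storage/engine/mvcc.go is far more involved): given one key's versions newest-first and the range's GC threshold, drop any version that is shadowed by a newer version at or below the threshold, since compaction-based GC is allowed to have already removed it on some replicas:

```go
// Toy sketch of trimming MVCC versions below the GC threshold at scan time;
// simplified types, not the real storage/engine MVCCScan.
package gcsketch

type mvccVersion struct {
	ts    int64
	value []byte
}

// trimBelowGCThreshold filters one key's versions (ordered newest first) down
// to those that every replica is guaranteed to still contain at threshold
// gcTS: a version is dropped if a newer version also sits at or below the
// threshold, because compaction-based GC may already have removed it elsewhere.
func trimBelowGCThreshold(versions []mvccVersion, gcTS int64) []mvccVersion {
	out := make([]mvccVersion, 0, len(versions))
	keptAtOrBelow := false
	for _, v := range versions { // newest -> oldest
		if v.ts <= gcTS {
			if keptAtOrBelow {
				continue // shadowed below the threshold: trim it
			}
			keptAtOrBelow = true
		}
		out = append(out, v)
	}
	return out
}
```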

@erikgrinaker
Contributor

erikgrinaker commented Nov 4, 2021

Perhaps we can define some "range property" operation that is then visible to compaction filters.

@nvanbenschoten mentioned in a call yesterday that the generalized range keys being designed in cockroachdb/pebble#1341 might be a good fit here, if we exposed them to compaction filters.
