
compaction: excessively large L6 sstables #1181

Closed
jbowens opened this issue Jul 1, 2021 · 9 comments · Fixed by #1182
Labels
C-bug Something isn't working

Comments

@jbowens
Collaborator

jbowens commented Jul 1, 2021

In cockroachlabs/support#1063, we saw two instances of excessively large sstables, both in L6: 5020623 (300 G) and 7432597 (133 G).

The compaction of 5020623 took 92 minutes, and the compaction of 7432597 took 45 minutes. These two compactions overlapped in time, reducing the compaction concurrency available elsewhere during their entire duration. It seems there is a bug somewhere in the compaction output splitting logic.

jbowens added the C-bug (Something isn't working) label on Jul 1, 2021
@jbowens
Collaborator Author

jbowens commented Jul 1, 2021

This manual compaction test demonstrates the behavior. On master, this test case produces a single output file rather than the two shown in this test's expected output:

define target-file-sizes=(1, 1, 1)
L1
  a.RANGEDEL.10:z
  b.SET.11:foo
  c.SET.11:foo
L2
  c.SET.0:foo
  d.SET.0:foo
----
1:
  000004:[a-z]
2:
  000005:[c-d]

compact a-z L1
----
2:
  000006:[b#0,SET-b#0,SET]
  000007:[c#0,SET-c#0,SET]

One observation here is that in this test case, a.RANGEDEL.10:z will be elided, so the nonzero-seqnum logic is being overly protective. In this case it would be okay to split outputs, because the range deletion will not be part of the sstable or its bounds.
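
To make that concrete, here is a minimal sketch of the relaxed condition, with invented names rather than Pebble's actual splitter types: a pending range deletion only needs to block an output split if it will survive into the output table.

type pendingTombstone struct {
    start, end []byte
    elided     bool // deletes nothing visible and the compaction reaches the bottom of the LSM
}

// tombstonesBlockSplit reports whether any pending tombstone in the fragmenter
// still constrains output splitting. Elided tombstones contribute neither keys
// nor bounds to the output, so they never need to block a split.
func tombstonesBlockSplit(pending []pendingTombstone) bool {
    for _, t := range pending {
        if !t.elided {
            // A surviving tombstone must be wholly contained within a single
            // output's bounds, so we cannot split in the middle of its span.
            return true
        }
    }
    return false
}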

Maybe a more interesting test case is this one:

define target-file-sizes=(1, 1, 1) snapshots=(12)
L1
  a.RANGEDEL.13:z
  b.SET.11:foo
  c.SET.11:foo
L2
  c.SET.0:foo
  d.SET.0:foo
----
1:
  000004:[a-z]
2:
  000005:[c-d]

compact a-z L1
----
# ???

If there's an open snapshot between the range deletion and all its covered data, the output splitter is forced to put all the keys in the same sstable. I think it makes sense to avoid zeroing sequence numbers while the rangedel fragmenter contains a range tombstone not in the last snapshot stripe.
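
A rough sketch of that condition, with invented names (the real logic lives in Pebble's compaction iterator):

// shouldZeroSeqNum reports whether the current key's sequence number may be
// zeroed. If the rangedel fragmenter still holds a tombstone that covers this
// key but couldn't drop it because of an open snapshot, we skip zeroing so the
// splitter remains free to cut a new output even though the wide tombstone
// will be written to the table.
func shouldZeroSeqNum(keyVisibleToAllSnapshots, coveredByPendingTombstone bool) bool {
    if coveredByPendingTombstone {
        return false
    }
    return keyVisibleToAllSnapshots
}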

jbowens added a commit to jbowens/pebble that referenced this issue Jul 2, 2021
This commit contains two changes related to cockroachdb#1181, fixing two ways
excessively large output tables may be created.

The first adjusts the nonZeroSeqNumSplitter to allow splitting if the
compaction is allowing zeroing of sequence numbers and all of the
pending range deletions will be elided. This avoids the problematic case
where a very wide range deletion at a low sequence number sits in the
rangedel fragmenter, preventing splitting sstables. Previously if the
compaction was actively zeroing sequence numbers, we were unable to
split compaction output until the range deletion's end.

The second adjusts the logic around when to zero sequence numbers,
preventing zeroing if there's a range deletion tombstone in the
fragmenter that covers the current key but didn't drop it because of
an open snapshot. This ensures that we can split the compaction output
even if a wide range deletion must be added to the output table.

Fix cockroachdb#1181.
@jbowens
Collaborator Author

jbowens commented Jul 3, 2021

Neither of the fixes in #1182 accounts for the scenario where multiple sstables have already had all their keys' sequence numbers zeroed, and then they're compacted with an sstable containing just a single wide range tombstone while a snapshot prevents elision. In that case, the compaction would be forced to join all the sstables together into a single large sstable. I'm not sure what we can do about that other than try to pick other compactions (e.g. #872).

The grandparent splitting should help us some in this case too, by limiting the range deletion's overlap to maxGrandparentOverlapBytes (~10x the target file size).
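
For reference, a simplified sketch of how grandparent-based splitting bounds this (illustrative names and accounting, not Pebble's actual implementation): as the current output's key range grows past grandparent (output level + 1) files, we accumulate their sizes and cut a new output once the estimate exceeds the limit.

type grandparentSplitter struct {
    maxOverlapBytes uint64 // e.g. maxGrandparentOverlapBytes, ~10x the target file size
    overlappedBytes uint64 // grandparent bytes overlapped by the current output so far
}

// shouldSplitBefore is called each time the output's key range extends past
// another grandparent file of the given size.
func (s *grandparentSplitter) shouldSplitBefore(grandparentFileSize uint64) bool {
    s.overlappedBytes += grandparentFileSize
    if s.overlappedBytes > s.maxOverlapBytes {
        s.overlappedBytes = 0 // start accounting for the next output table
        return true
    }
    return false
}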

@sumeerbhola
Collaborator

two instances of excessively large sstables, both in L6: 5020623 (300 G) and 7432597 (133 G).

Aren't all the range tombstones written in the CockroachDB context limited to a CockroachDB range, which is limited in size to 512MB and split when exceeded? I am trying to understand if these fixes, which are well worth doing, are expected to solve the observed issue.

@jbowens
Collaborator Author

jbowens commented Jul 6, 2021

Aren't all the range tombstones written in the CockroachDB context limited to a CockroachDB range, which is limited in size to 512MB and split when exceeded? I am trying to understand if these fixes, which are well worth doing, are expected to solve the observed issue.

Yeah, that's true. I suspect there were many range tombstones at lower sequence numbers covering the entire keyspace that got consolidated into the file. This can happen if the node received its data through snapshot rebalancing, because we include a tombstone covering the range's keyspan in the ingested sstables (cockroachdb/cockroach#44048).

The compaction continually iterated over a range tombstone, then a span of zero-seqnum keys, over and over.

@jbowens
Collaborator Author

jbowens commented Jul 6, 2021

In cockroachlabs/support#1063, at times there were many compactions from L5 into L6 and zero compactions out of L0, all the while L0 read-amplification climbed steadily. By level sizes, L5 was less than 1/10th the total database size.

I think this is a side effect of the large files, but I'd like some verification that my logic makes sense. In this instance, the node had just been offline for several hours for the manual compaction. When it was brought back up, it needed to clear all of its replicas. Range tombstones were written range-by-range, and eventually enough were written to L5 to trigger the compaction into the ~100 GB L6 file. This large compaction took 45 minutes, and all the while new range tombstones were added to L5, increasing L5's compensated size. Many of the tombstones driving up L5's score couldn't be compacted yet because of the massive in-progress compaction over a conflicting keyspace, so we pursued less productive compactions that did not significantly reduce L5's score.
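
For anyone following along, a back-of-the-envelope sketch of the feedback loop (invented names; Pebble's real accounting is more involved): a file's compensated size adds an estimate of the bytes its tombstones will reclaim below it, so a level that accumulates fresh range tombstones looks much "larger" than its on-disk size and keeps attracting compaction priority.

// compensatedSize inflates a file's size by the bytes its deletion tombstones
// are estimated to reclaim in lower levels.
func compensatedSize(fileSize, estimatedReclaimedBytes uint64) uint64 {
    return fileSize + estimatedReclaimedBytes
}

// levelScore is the level's total compensated size relative to its target size.
func levelScore(totalCompensatedSize, levelMaxBytes uint64) float64 {
    return float64(totalCompensatedSize) / float64(levelMaxBytes)
}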

@sumeerbhola
Collaborator

at times there were many compactions from L5 into L6 and zero compactions out of L0, all the while L0 read-amplification climbed steadily. By level sizes, L5 was less than 1/10th the total database size.
... while new range tombstones were added to L5 which increased L5's compensated size. Many of these tombstones increasing the score of L5 couldn't be compacted yet because of the massive in-progress compaction over a conflicting keyspace, so we pursued less productive compactions ...

  • This sounds like a reasonable theory. Can we verify that these couldn't be compacted by looking at the key range of the huge L6 file and checking for L5 files that have range tombstones as one of their end keys?
  • Regarding "less productive compactions": I am not convinced of the scoring adjustment made in https://github.com/cockroachdb/pebble/blame/31f3ea68e61c41e601e53c9501d406f7678e2d5f/compaction_picker.go#L731-L765, which was introduced in commit 730159f. I think that commit's claim that "This trade-off seems ok" (trading a reduction in write amp from 7.4 to 6.3 for an increase in read amp from 26 to 54) is open to debate; we often have CPU and IOPs to spare when these bad LSMs happen, so we can afford some increase in write amp. I'm still partial to the idea that an inverted LSM is not inherently terrible. The original L0 sublevels prototype had the idea that as long as size(level i)/size(level i-1) did not become higher than 10, it was ok to not prioritize compactions out of level i => level i+1 and to allow level i-1 to continue compacting into level i. One option is to keep doing the adjustment as currently done for all levels other than L0 since they have a read amp of 1, and gate that adjustment for L0 on sub-level-count < A || size(Lbase)/size(L0) > B, where B = 10 and A is some read-amp that we are willing to tolerate, say 5 (a sketch of this gating follows below). Thresholds introduce a sudden change in behavior, so we may want something else that achieves a similar objective.
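
A sketch of that L0 gating, with illustrative thresholds and names (not an actual Pebble implementation):

const (
    maxTolerableSublevels = 5  // "A": a read amp we are willing to tolerate
    minLbaseToL0Ratio     = 10 // "B": size(Lbase)/size(L0) threshold
)

// applyL0ScoreAdjustment reports whether the score-division adjustment should
// be applied to L0. Levels other than L0 would keep the adjustment
// unconditionally, since their read amp is 1.
func applyL0ScoreAdjustment(sublevelCount int, lbaseSize, l0Size uint64) bool {
    if l0Size == 0 {
        return true // empty L0: the ratio is effectively unbounded
    }
    return sublevelCount < maxTolerableSublevels || lbaseSize/l0Size > minLbaseToL0Ratio
}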

@jbowens
Collaborator Author

jbowens commented Jul 7, 2021

Can we verify that these couldn't be compacted by looking at the key range of the huge L6 file and checking for L5 files that have range tombstones as one of their end keys?

I hacked a little script together, and it checks out. There were 398 L5 sstables that overlapped the huge L6 file, and all 398 had range tombstone end keys. The uncompensated file size sum of all these L5 sstables was 12 GB.
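
Roughly, the script did something like the following (hypothetical stand-in types, not Pebble's manifest API; bytes.Compare serves as cmp):

type fileMeta struct {
    smallest, largest    []byte
    largestIsRangeDelKey bool // end boundary comes from a range tombstone
    size                 uint64
}

// overlappingRangeDelBounded counts the L5 files whose key range overlaps the
// huge L6 file and whose largest key is a range tombstone boundary, and sums
// their uncompensated sizes.
func overlappingRangeDelBounded(l5 []fileMeta, l6 fileMeta, cmp func(a, b []byte) int) (n int, totalSize uint64) {
    for _, f := range l5 {
        overlaps := cmp(f.smallest, l6.largest) <= 0 && cmp(f.largest, l6.smallest) >= 0
        if overlaps && f.largestIsRangeDelKey {
            n++
            totalSize += f.size
        }
    }
    return n, totalSize
}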

I am not convinced of the scoring adjustment made in ...

I have been eyeing that adjustment too, but in the context of compensated sizes. I'm not sure it makes sense to adjust higher levels' scores by a lower level's score that is elevated because of pending range deletions. One alternative, if we keep the adjustment, is to compute both compensated and uncompensated scores, using the uncompensated score as the divisor in the adjustment.
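
A sketch of that alternative (illustrative only, not the current compaction_picker code): keep ranking levels by their compensated scores, but divide by the next level's uncompensated score so that pending range deletions in a lower level don't suppress the scores of the levels above it.

type levelScores struct {
    compensated, uncompensated float64
}

// adjustedScore divides a level's compensated score by the next level's
// uncompensated score, but only when the next level itself warrants compaction.
func adjustedScore(level, next levelScores) float64 {
    if next.uncompensated < 1.0 {
        return level.compensated
    }
    return level.compensated / next.uncompensated
}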

One option is to keep doing the adjustment as currently done for all levels other than L0 since they have a read amp of 1, and gate that adjustment for L0 on sub-level-count < A || size(Lbase)/size(L0) > B, where B = 10 and A is some read-amp that we are willing to tolerate, say 5. Thresholds introduce a sudden change in behavior, so we may want something else that achieves a similar objective.

That makes sense to me. Have we previously explored more generally using sublevel count as an input into L0's score? Combing through these manifests, I noticed many tiny (1 KB) sstables with rangedel end boundaries ingested into L0. These files can increase read amplification, don't obey flush splits, and don't significantly increase the byte size of L0.

@sumeerbhola
Collaborator

sumeerbhola commented Jul 8, 2021

Have we previously more generally explored using sublevel-count as an input into L0's score?

Unlike the lower levels, L0 is using sublevel count for its score, and not the byte size (see calculateL0Score).
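
In simplified form (not the actual calculateL0Score code), the distinction is roughly:

// l0Score scores L0 by its sublevel count (read amplification), normalized by
// the configured compaction threshold, rather than by byte size.
func l0Score(sublevelCount, l0CompactionThreshold int) float64 {
    return float64(sublevelCount) / float64(l0CompactionThreshold)
}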

@sumeerbhola
Collaborator

I have been eyeing that adjustment too, but in the context of compensated sizes. I'm not sure it makes sense to adjust higher levels' scores by a lower level's score that is elevated because of pending range deletions. One alternative, if we keep the adjustment, is to compute both compensated and uncompensated scores, using the uncompensated score as the divisor in the adjustment.

That sounds like a very reasonable change to make, and, being less radical than throwing out the whole score-division heuristic, it would probably have fewer unintended regressions.

jbowens added a commit to jbowens/pebble that referenced this issue Jul 9, 2021
Remove the `nonZeroSeqNumSplitter`. The logic preventing splitting at
nonzero-seqnum keys was responsible for the creation of arbitrarily
large files (cockroachdb#1181). The condition that the `nonZeroSeqNumSplitter`
guards against is no longer possible.

Fix cockroachdb#1181.