
compaction: excessively large L6 sstables #1181

Closed
jbowens opened this issue Jul 1, 2021 · 9 comments · Fixed by #1182
Labels
C-bug Something isn't working

Comments

@jbowens
Collaborator

jbowens commented Jul 1, 2021

In cockroachlabs/support#1063, we saw two instances of excessively large sstables, both in L6: 5020623 (300 G) and 7432597 (133 G).

The compaction of 5020623 took 92 minutes, and the compaction of 7432597 took 45 minutes. These two compactions overlapped in time, reducing the compaction concurrency available elsewhere during their entire duration. It seems there is a bug somewhere in the compaction output splitting logic.

jbowens added the C-bug (Something isn't working) label on Jul 1, 2021
@jbowens
Collaborator Author

jbowens commented Jul 1, 2021

This manual compaction test demonstrates the behavior. On master, this test case produces a single output file rather than the two shown in this test's expected output:

define target-file-sizes=(1, 1, 1)
L1
  a.RANGEDEL.10:z
  b.SET.11:foo
  c.SET.11:foo
L2
  c.SET.0:foo
  d.SET.0:foo
----
1:
  000004:[a-z]
2:
  000005:[c-d]

compact a-z L1
----
2:
  000006:[b#0,SET-b#0,SET]
  000007:[c#0,SET-c#0,SET]

One observation here is that in this test case, a.RANGEDEL.10:z will be elided, so the nonzero-seqnum logic is being overly protective. In this case it would be okay to split outputs, because the range deletion will not be part of the sstable or its bounds.
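
To make that concrete, here is a minimal sketch of the relaxed condition, with invented names rather than Pebble's actual splitter types: a pending range deletion only needs to block an output split if it will survive into the output table.

type pendingTombstone struct {
    start, end []byte
    elided     bool // deletes nothing visible and the compaction reaches the bottom of the LSM
}

// tombstonesBlockSplit reports whether any pending tombstone in the fragmenter
// still constrains output splitting. Elided tombstones contribute neither keys
// nor bounds to the output, so they never need to block a split.
func tombstonesBlockSplit(pending []pendingTombstone) bool {
    for _, t := range pending {
        if !t.elided {
            // A surviving tombstone must be wholly contained within a single
            // output's bounds, so we cannot split in the middle of its span.
            return true
        }
    }
    return false
}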

Maybe a more interesting test case is this one:

define target-file-sizes=(1, 1, 1) snapshots=(12)
L1
  a.RANGEDEL.13:z
  b.SET.11:foo
  c.SET.11:foo
L2
  c.SET.0:foo
  d.SET.0:foo
----
1:
  000004:[a-z]
2:
  000005:[c-d]

compact a-z L1
----
# ???

If there's an open snapshot between the range deletion and all its covered data, the output splitter is forced to put all the keys in the same sstable. I think it makes sense to avoid zeroing sequence numbers while the rangedel fragmenter contains a range tombstone not in the last snapshot stripe.
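
A rough sketch of that condition, with invented names (the real logic lives in Pebble's compaction iterator):

// shouldZeroSeqNum reports whether the current key's sequence number may be
// zeroed. If the rangedel fragmenter still holds a tombstone that covers this
// key but couldn't drop it because of an open snapshot, we skip zeroing so the
// splitter remains free to cut a new output even though the wide tombstone
// will be written to the table.
func shouldZeroSeqNum(keyVisibleToAllSnapshots, coveredByPendingTombstone bool) bool {
    if coveredByPendingTombstone {
        return false
    }
    return keyVisibleToAllSnapshots
}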

jbowens added a commit to jbowens/pebble that referenced this issue Jul 2, 2021
This commit contains two changes related to cockroachdb#1181, fixing two ways
excessively large output tables may be created.

The first adjusts the nonZeroSeqNumSplitter to allow splitting if the
compaction is allowing zeroing of sequence numbers and all of the
pending range deletions will be elided. This avoids the problematic case
where a very wide range deletion at a low sequence number sits in the
rangedel fragmenter, preventing splitting sstables. Previously if the
compaction was actively zeroing sequence numbers, we were unable to
split compaction output until the range deletion's end.

The second adjusts the logic around when to zero sequence numbers,
preventing zeroing if there's a range deletion tombstone in the
fragmenter that covers the current key but didn't drop it because of
an open snapshot. This ensures that we can split the compaction output
even if a wide range deletion must be added to the output table.

Fix cockroachdb#1181.
@jbowens
Collaborator Author

jbowens commented Jul 3, 2021

Neither of the fixes in #1182 accounts for the scenario where multiple sstables have already had all their keys' sequence numbers zeroed, and then they're compacted with an sstable containing just a single wide range tombstone while a snapshot prevents elision. In that case, the compaction would be forced to join all the sstables together into a single large sstable. I'm not sure what we can do about that other than try to pick other compactions (e.g. #872).

The grandparent splitting should help us some in this case too, by limiting the range deletion's overlap to maxGrandparentOverlapBytes (~10x the target file size).
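
For reference, a simplified sketch of how grandparent-based splitting bounds this (illustrative names and accounting, not Pebble's actual implementation): as the current output's key range grows past grandparent (output level + 1) files, we accumulate their sizes and cut a new output once the estimate exceeds the limit.

type grandparentSplitter struct {
    maxOverlapBytes uint64 // e.g. maxGrandparentOverlapBytes, ~10x the target file size
    overlappedBytes uint64 // grandparent bytes overlapped by the current output so far
}

// shouldSplitBefore is called each time the output's key range extends past
// another grandparent file of the given size.
func (s *grandparentSplitter) shouldSplitBefore(grandparentFileSize uint64) bool {
    s.overlappedBytes += grandparentFileSize
    if s.overlappedBytes > s.maxOverlapBytes {
        s.overlappedBytes = 0 // start accounting for the next output table
        return true
    }
    return false
}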

@sumeerbhola
Collaborator

two instances of excessively large sstables, both in L6: 5020623 (300 G) and 7432597 (133 G).

Aren't all the range tombstones written in the CockroachDB context limited to a CockroachDB range, which is limited in size to 512MB and split when exceeded? I am trying to understand if these fixes, which are well worth doing, are expected to solve the observed issue.

@jbowens
Collaborator Author

jbowens commented Jul 6, 2021

Aren't all the range tombstones written in the CockroachDB context limited to a CockroachDB range, which is limited in size to 512MB and split when exceeded? I am trying to understand if these fixes, which are well worth doing, are expected to solve the observed issue.

Yeah, that's true. I suspect there were many range tombstones at lower sequence numbers covering the entire keyspace that got consolidated into the file. This can happen if the node received its data through snapshot rebalancing, because we include a tombstone covering the range's keyspan in the ingested sstables (cockroachdb/cockroach#44048).

The compaction continually iterated over a range tombstone, then a span of zero-seqnum keys, over and over.

@jbowens
Collaborator Author

jbowens commented Jul 6, 2021

In cockroachlabs/support#1063, at times there were many compactions from L5 into L6 and zero compactions out of L0, all the while L0 read-amplification climbed steadily. By level sizes, L5 was less than 1/10th the total database size.

I think this is a side effect of the large files, but I'd like some verification that my logic makes sense. In this instance, the node had just been offline for several hours for the manual compaction. When it was brought back up, it needed to clear all of its replicas. Range tombstones were written range-by-range, and eventually enough were written to L5 to trigger the compaction into the ~100 GB L6 file. This large compaction took 45 minutes, and all the while new range tombstones were added to L5, increasing L5's compensated size. Many of the tombstones driving up L5's score couldn't be compacted yet because of the massive in-progress compaction over a conflicting keyspace, so we pursued less productive compactions that did not significantly reduce L5's score.
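
For anyone following along, a back-of-the-envelope sketch of the feedback loop (invented names; Pebble's real accounting is more involved): a file's compensated size adds an estimate of the bytes its tombstones will reclaim below it, so a level that accumulates fresh range tombstones looks much "larger" than its on-disk size and keeps attracting compaction priority.

// compensatedSize inflates a file's size by the bytes its deletion tombstones
// are estimated to reclaim in lower levels.
func compensatedSize(fileSize, estimatedReclaimedBytes uint64) uint64 {
    return fileSize + estimatedReclaimedBytes
}

// levelScore is the level's total compensated size relative to its target size.
func levelScore(totalCompensatedSize, levelMaxBytes uint64) float64 {
    return float64(totalCompensatedSize) / float64(levelMaxBytes)
}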

@sumeerbhola
Collaborator

at times there were many compactions from L5 into L6 and zero compactions out of L0, all the while L0 read-amplification climbed steadily. By level sizes, L5 was less than 1/10th the total database size.
... while new range tombstones were added to L5 which increased L5's compensated size. Many of these tombstones increasing the score of L5 couldn't be compacted yet because of the massive in-progress compaction over a conflicting keyspace, so we pursued less productive compactions ...

  • This sounds like a reasonable theory. Can we verify that these couldn't be compacted by looking at the key range of the huge L6 file and checking for L5 files that have range tombstones as one of their end keys?
  • Regarding "less productive compactions": I am not convinced of the scoring adjustment made in https://github.com/cockroachdb/pebble/blame/31f3ea68e61c41e601e53c9501d406f7678e2d5f/compaction_picker.go#L731-L765, which was introduced in commit 730159f. I think that commit's claim that "This trade-off seems ok" (trading a reduction in write amp from 7.4 to 6.3 for an increase in read amp from 26 to 54) is open to debate; we often have CPU and IOPs to spare when these bad LSMs happen, so we can afford some increase in write amp. I'm still partial to the idea that an inverted LSM is not inherently terrible. The original L0 sublevels prototype had the idea that as long as size(level i)/size(level i-1) did not become higher than 10, it was ok to not prioritize compactions out of level i => level i+1 and to allow level i-1 to continue compacting into level i. One option is to keep doing the adjustment as currently done for all levels other than L0 since they have a read amp of 1, and gate that adjustment for L0 on sub-level-count < A || size(Lbase)/size(L0) > B, where B = 10 and A is some read-amp that we are willing to tolerate, say 5 (a sketch of this gating follows below). Thresholds introduce a sudden change in behavior, so we may want something else that achieves a similar objective.
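
A sketch of that L0 gating, with illustrative thresholds and names (not an actual Pebble implementation):

const (
    maxTolerableSublevels = 5  // "A": a read amp we are willing to tolerate
    minLbaseToL0Ratio     = 10 // "B": size(Lbase)/size(L0) threshold
)

// applyL0ScoreAdjustment reports whether the score-division adjustment should
// be applied to L0. Levels other than L0 would keep the adjustment
// unconditionally, since their read amp is 1.
func applyL0ScoreAdjustment(sublevelCount int, lbaseSize, l0Size uint64) bool {
    if l0Size == 0 {
        return true // empty L0: the ratio is effectively unbounded
    }
    return sublevelCount < maxTolerableSublevels || lbaseSize/l0Size > minLbaseToL0Ratio
}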

@jbowens
Collaborator Author

jbowens commented Jul 7, 2021

Can we verify that these couldn't be compacted by looking at the key range of the huge L6 file and checking for L5 files that have range tombstones as one of their end keys?

I hacked a little script together, and it checks out. There were 398 L5 sstables that overlapped the huge L6 file, and all 398 had range tombstone end keys. The uncompensated file size sum of all these L5 sstables was 12 GB.
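
Roughly, the script did something like the following (hypothetical stand-in types, not Pebble's manifest API; bytes.Compare serves as cmp):

type fileMeta struct {
    smallest, largest    []byte
    largestIsRangeDelKey bool // end boundary comes from a range tombstone
    size                 uint64
}

// overlappingRangeDelBounded counts the L5 files whose key range overlaps the
// huge L6 file and whose largest key is a range tombstone boundary, and sums
// their uncompensated sizes.
func overlappingRangeDelBounded(l5 []fileMeta, l6 fileMeta, cmp func(a, b []byte) int) (n int, totalSize uint64) {
    for _, f := range l5 {
        overlaps := cmp(f.smallest, l6.largest) <= 0 && cmp(f.largest, l6.smallest) >= 0
        if overlaps && f.largestIsRangeDelKey {
            n++
            totalSize += f.size
        }
    }
    return n, totalSize
}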

I am not convinced of the scoring adjustment made in ...

I have been eyeing that adjustment too, but in the context of compensated sizes. I'm not sure it makes sense to adjust higher levels' scores by a lower level's score that is elevated because of pending range deletions. One alternative, if we keep the adjustment, is to compute both compensated and uncompensated scores, using the uncompensated score as the divisor in the adjustment.
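
A sketch of that alternative (illustrative only, not the current compaction_picker code): keep ranking levels by their compensated scores, but divide by the next level's uncompensated score so that pending range deletions in a lower level don't suppress the scores of the levels above it.

type levelScores struct {
    compensated, uncompensated float64
}

// adjustedScore divides a level's compensated score by the next level's
// uncompensated score, but only when the next level itself warrants compaction.
func adjustedScore(level, next levelScores) float64 {
    if next.uncompensated < 1.0 {
        return level.compensated
    }
    return level.compensated / next.uncompensated
}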

One option is to keep doing the adjustment as currently done for all levels other than L0 since they have a read amp of 1, and gate that adjustment for L0 on sub-level-count < A || size(Lbase)/size(L0) > B, where B = 10 and A is some read-amp that we are willing to tolerate, say 5. Thresholds introduce a sudden change in behavior, so we may want something else that achieves a similar objective.

That makes sense to me. Have we previously explored more generally using sublevel count as an input into L0's score? Combing through these manifests, I noticed many tiny (1 KB) sstables with rangedel end boundaries ingested into L0. These files can increase read amplification, don't obey flush splits, and don't significantly increase the byte size of L0.

@sumeerbhola
Collaborator

sumeerbhola commented Jul 8, 2021

Have we previously more generally explored using sublevel-count as an input into L0's score?

Unlike the lower levels, L0 is using sublevel count for its score, and not the byte size (see calculateL0Score).
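
In simplified form (not the actual calculateL0Score code), the distinction is roughly:

// l0Score scores L0 by its sublevel count (read amplification), normalized by
// the configured compaction threshold, rather than by byte size.
func l0Score(sublevelCount, l0CompactionThreshold int) float64 {
    return float64(sublevelCount) / float64(l0CompactionThreshold)
}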

@sumeerbhola
Collaborator

I have been eyeing that adjustment too, but in the context of compensated sizes. I'm not sure it makes sense to adjust higher levels' scores by a lower level's score that is elevated because of pending range deletions. One alternative, if we keep the adjustment, is to compute both compensated and uncompensated scores, using the uncompensated score as the divisor in the adjustment.

That sounds like a very reasonable change to make, and, being less radical than throwing out the whole score-division heuristic, it would probably have fewer unintended regressions.

jbowens added a commit to jbowens/pebble that referenced this issue Jul 9, 2021
Remove the `nonZeroSeqNumSplitter`. The logic preventing splitting at
nonzero-seqnum keys was responsible for the creation of arbitrarily
large files (cockroachdb#1181). The condition that the `nonZeroSeqNumSplitter`
guards against is no longer possible.

Fix cockroachdb#1181.