compaction: excessively large L6 sstables #1181
This manual compaction test demonstrates the behavior. On master, this test case produces a single output file rather than the two in this test's expectation:
One observation here is that in this test case, […]. Maybe a more interesting test case is this one:
If there's an open snapshot between the range deletion and all its covered data, the output splitter is forced to put all the keys in the same sstable. I think it makes sense to avoid zeroing sequence numbers while the rangedel fragmenter contains a range tombstone not in the last snapshot stripe.
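The rule being proposed can be sketched as a small predicate. This is an illustrative sketch only; `tombstone`, `canZeroSeqNums`, and `earliestOpenSnapshot` are hypothetical names, not pebble's actual API:

```go
package main

import "fmt"

// tombstone is a hypothetical stand-in for a pending range deletion held in
// the rangedel fragmenter.
type tombstone struct {
	seqNum uint64 // sequence number of the pending range deletion
}

// canZeroSeqNums reports whether zeroing sequence numbers is safe: every
// pending tombstone must sit below the earliest open snapshot (i.e. in the
// last snapshot stripe). A tombstone at or above that boundary pins its
// covered keys, so their sequence numbers must be kept intact.
func canZeroSeqNums(pending []tombstone, earliestOpenSnapshot uint64) bool {
	for _, t := range pending {
		if t.seqNum >= earliestOpenSnapshot {
			// Tombstone is not in the last snapshot stripe: keys it covers
			// at lower seqnums must survive for snapshot reads.
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(canZeroSeqNums([]tombstone{{seqNum: 3}}, 10))  // true
	fmt.Println(canZeroSeqNums([]tombstone{{seqNum: 12}}, 10)) // false
}
```

With no tombstone straddling a snapshot boundary, zeroing (and therefore output splitting) proceeds as usual.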
This commit contains two changes related to cockroachdb#1181, fixing two ways excessively large output tables may be created. The first adjusts the nonZeroSeqNumSplitter to allow splitting if the compaction is allowing zeroing of sequence numbers and all of the pending range deletions will be elided. This avoids the problematic case where a very wide range deletion at a low sequence number sits in the rangedel fragmenter, preventing splitting sstables. Previously if the compaction was actively zeroing sequence numbers, we were unable to split compaction output until the range deletion's end. The second adjusts the logic around when to zero sequence numbers, preventing zeroing if there's a range deletion tombstone in the fragmenter that covers the current key but didn't drop it because of an open snapshot. This ensures that we can split the compaction output even if a wide range deletion must be added to the output table. Fix cockroachdb#1181.
Neither of the fixes in #1182 accounts for the scenario where multiple sstables have already had all their keys' sequence numbers zeroed, and they're then compacted with an sstable containing just a single wide range tombstone while a snapshot prevents elision. In that case, the compaction would be forced to join all the sstables together into a single large sstable. I'm not sure what we can do about that other than try to pick other compactions (e.g. #872). The grandparent splitting should help us some in this case too, by limiting the range deletion's overlap.
Aren't all the range tombstones written in the CockroachDB context limited to a single CockroachDB range, which is limited in size to 512 MB and split when exceeded? I am trying to understand whether these fixes, which are well worth doing, are expected to solve the observed issue.
Yeah, that's true. I suspect that there were many range tombstones at lower sequence numbers covering the entire key space that were consolidated into a file. This can happen if the node received its data through snapshot rebalancing, because we include a tombstone covering the range's keyspan in the ingested sstables (cockroachdb/cockroach#44048). The compaction continually iterated over a range tombstone, followed by a span of zero-seqnum keys, and repeated.
In cockroachlabs/support#1063, at times there were many compactions from L5 into L6 and zero compactions out of L0, all the while L0 read-amplification climbed steadily. By level sizes, L5 was less than 1/10th the total database size. I think this is a side effect of the large files, but I'd like some verification that my logic makes sense. In this instance, the node had just been offline for several hours for the manual compaction. When it was brought back up, it needed to clear all of its replicas. Range tombstones were written range-by-range, and eventually enough were written to L5 to trigger the compaction into the ~100 GB L6 file. This large compaction took 45 minutes, during which new range tombstones were added to L5, increasing L5's compensated size. Many of the tombstones raising L5's score couldn't be compacted yet because of the massive in-progress compaction over a conflicting keyspace, so we pursued less productive compactions that did not significantly reduce L5's score.
I hacked a little script together, and it checks out. There were 398 L5 sstables that overlapped the huge L6 file, and all 398 had range tombstone end keys. The uncompensated file size sum of all these L5 sstables was 12 GB.
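A hypothetical reconstruction of what such a script might do, for readers following along: given L5 file metadata, count the files whose key range overlaps the large L6 file, how many end at a range-tombstone boundary, and their uncompensated size sum. The `fileMeta` shape and all names here are invented for illustration, not pebble's manifest format:

```go
package main

import "fmt"

// fileMeta is a minimal, invented stand-in for sstable metadata.
type fileMeta struct {
	smallest, largest string // user key bounds
	size              uint64 // uncompensated file size in bytes
	rangeDelEnd       bool   // largest bound comes from a range tombstone
}

// overlapStats reports how many L5 files overlap the [l6Smallest, l6Largest]
// key range, how many of those have range-tombstone end boundaries, and
// their total uncompensated size.
func overlapStats(l5 []fileMeta, l6Smallest, l6Largest string) (n, withRangeDelEnd int, totalSize uint64) {
	for _, f := range l5 {
		if f.largest < l6Smallest || f.smallest > l6Largest {
			continue // disjoint key ranges
		}
		n++
		totalSize += f.size
		if f.rangeDelEnd {
			withRangeDelEnd++
		}
	}
	return n, withRangeDelEnd, totalSize
}

func main() {
	l5 := []fileMeta{
		{"a", "c", 1 << 20, true},
		{"d", "f", 2 << 20, true},
		{"x", "z", 4 << 20, false}, // outside the L6 range, excluded
	}
	n, rd, sz := overlapStats(l5, "a", "m")
	fmt.Println(n, rd, sz) // 2 2 3145728
}
```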
I have been eyeing that adjustment too, but in the context of compensated sizes. I'm not sure it makes sense to adjust a higher level's score by a lower level's score that is elevated because of pending range deletions. One alternative, if we keep the adjustment, is to compute both compensated and uncompensated scores, using the uncompensated score as the divisor in the adjustment.
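The alternative suggested above can be sketched as follows. This is a simplified model under assumed semantics (score = size over target size, adjustment = divide by the next level's score when it exceeds 1); the function names are illustrative, not pebble's:

```go
package main

import "fmt"

// levelScore computes a raw score as the ratio of a level's size to its
// target size. Passing a compensated size (size plus an estimate of bytes
// pending range deletions will reclaim) yields a compensated score; passing
// the raw size yields an uncompensated score.
func levelScore(sizeBytes, targetBytes uint64) float64 {
	return float64(sizeBytes) / float64(targetBytes)
}

// adjustScore divides a level's compensated score by the NEXT level's
// uncompensated score (when above 1). Using the uncompensated score as the
// divisor means range-deletion compensation in the lower level no longer
// suppresses the upper level's compaction priority.
func adjustScore(compensated, nextUncompensated float64) float64 {
	if nextUncompensated > 1.0 {
		return compensated / nextUncompensated
	}
	return compensated
}

func main() {
	l5 := levelScore(200<<20, 100<<20)      // compensated score 2.0
	l6Uncomp := levelScore(50<<20, 100<<20) // uncompensated score 0.5
	// L6's pending range deletions may inflate its compensated score, but
	// its uncompensated score is below 1, so L5's score is left at 2.0.
	fmt.Println(adjustScore(l5, l6Uncomp))
}
```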
That makes sense to me. Have we previously explored more generally using sublevel count as an input into L0's score? Combing through these manifests, I noticed many tiny (1 KB) sstables with rangedel end boundaries ingested into L0. These files can increase read-amplification, don't obey flush splits, and don't significantly increase the byte size of L0.
Unlike the lower levels, L0 is already using sublevel count for its score, not the byte size (see …).
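For readers unfamiliar with the distinction, the point can be illustrated with a toy model (names and the threshold value are illustrative, not pebble's actual scoring code):

```go
package main

import "fmt"

// l0Score models a sublevel-count-based score: the number of L0 sublevels
// divided by a compaction-triggering threshold. Many tiny ingested sstables
// that each form their own sublevel drive this score (and read
// amplification) up even though they add almost no bytes to L0.
func l0Score(sublevels, compactionThreshold int) float64 {
	return float64(sublevels) / float64(compactionThreshold)
}

func main() {
	// Eight 1 KB ingested files, each in its own sublevel, score the same
	// as eight sublevels of large flushed sstables would.
	fmt.Println(l0Score(8, 4)) // 2
}
```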
That sounds like a very reasonable change to make, and being less radical than throwing out the whole score-division heuristic, it would probably have fewer unintended regressions.
Remove the `nonZeroSeqNumSplitter`. The logic preventing splitting at nonzero-seqnum keys was responsible for the creation of arbitrarily large files (cockroachdb#1181). The condition that the `nonZeroSeqNumSplitter` guards against is no longer possible. Fix cockroachdb#1181.
In cockroachlabs/support#1063, we saw two instances of excessively large sstables, both in L6: 5020623 (300 GB) and 7432597 (133 GB).
The compaction of 5020623 took 92 minutes, and the compaction of 7432597 took 45 minutes. These two compactions overlapped in time, reducing the compaction concurrency available elsewhere during their entire duration. It seems there is a bug somewhere in the compaction output splitting logic.