
perf: don't prioritize compaction of pinned range tombstones #872

Closed
jbowens opened this issue Sep 1, 2020 · 9 comments

Comments

@jbowens
Collaborator

jbowens commented Sep 1, 2020

With the introduction of the min-overlapping-ratio heuristic (#707), Pebble started prioritizing compaction of range tombstones by inflating start-level file sizes by an estimate of the data covered by their range tombstones.

Prioritizing the compaction of range tombstones has a few benefits:

  1. Disk space is reclaimed promptly.
  2. These compactions suffer less write amplification than their uncompensated input file sizes / overlapping ratio suggest.
  3. Moving broad tombstones into lower levels allows new sstables to be ingested into lower levels (storage: avoid excessively wide range tombstones during Raft snapshot reception cockroach#44048).

But if an open snapshot prevents a range tombstone from dropping keys, the first two benefits do not apply. Additionally, if the output level is L6, these compactions may have the negative effect of cementing tombstones into the bottommost level (#517 (comment)), where they're only cleared by low-priority elision-only compactions.

We might want to improve the prioritization of elision-only compactions. However, since these compactions add write amplification that could otherwise have been avoided, maybe we should instead avoid prioritizing compactions that are unlikely to reclaim disk space. This could be done by using uncompensated file sizes during compaction picking under some conditions, for example when a start-level file's largest sequence number does not fall in the last snapshot stripe.
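For concreteness, here's a minimal Go sketch of that gating. This is not Pebble's actual compaction-picking code; the struct fields and helper names below are hypothetical stand-ins.

```go
// fileMetadata is a trimmed-down, hypothetical stand-in for Pebble's file
// metadata; RangeDelEstimate models the estimate of data covered by the
// file's range tombstones.
type fileMetadata struct {
	Size             uint64 // physical size of the sstable
	RangeDelEstimate uint64 // estimated bytes covered by its range tombstones
	LargestSeqNum    uint64
}

// compensatedSize returns the size the compaction picker would use. The
// tombstone compensation is applied only when the file's keys are older than
// the oldest open snapshot, i.e. when its tombstones can actually drop data.
func compensatedSize(f *fileMetadata, oldestSnapshotSeqNum uint64) uint64 {
	if f.LargestSeqNum < oldestSnapshotSeqNum {
		return f.Size + f.RangeDelEstimate
	}
	// A pinned tombstone can't reclaim space yet, so don't inflate priority.
	return f.Size
}
```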

@jbowens jbowens changed the title perf: don't incentivize compaction of pinned range tombstones perf: don't prioritize compaction of pinned range tombstones Sep 2, 2020
@petermattis
Collaborator

This could be done through using uncompensated file sizes during compaction picking under some conditions, like when a start level file's largest sequence number does not fall in the last snapshot stripe.

So the compensated size would only apply if the range tombstone can actually have an effect on the records in lower level sstables? This sounds similar to the logic needed for deletion-only compactions. I think this would be a worthwhile enhancement.

@jbowens
Collaborator Author

jbowens commented Sep 3, 2020

So the compensated size would only apply if the range tombstone can actually have an effect on the records in lower level sstables? This sounds similar to the logic needed for deletion-only compactions. I think this would be a worthwhile enhancement.

I wonder how frequently, in practice, range tombstones fall into the same snapshot stripe as the keys they drop but not into the last snapshot stripe. It would be a minimal change to only compensate files whose sequence numbers are less than the oldest snapshot. If we wanted finer granularity, we could add accounting similar to deletion-only hints to the file metadata's table stats.
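As a rough illustration of the finer-grained variant, table stats could split the covered-bytes estimate into a droppable portion and a pinned portion. The fields below are hypothetical (loosely modeled on deletion-only hints), not stats Pebble maintains today.

```go
// tableStats is a hypothetical sketch of per-file stats; these fields do not
// exist in Pebble today.
type tableStats struct {
	// RangeDeletionsBytesEstimate is the total estimate of data covered by
	// the file's range tombstones.
	RangeDeletionsBytesEstimate uint64
	// PinnedBytesEstimate is the portion of that estimate that an open
	// snapshot currently prevents from being dropped.
	PinnedBytesEstimate uint64
}

// compensation returns only the portion of the estimate that a compaction
// could actually reclaim right now.
func compensation(s tableStats) uint64 {
	return s.RangeDeletionsBytesEstimate - s.PinnedBytesEstimate
}
```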

@petermattis
Collaborator

It would be a minimal change to only compensate files whose sequence numbers are less than the oldest snapshot.

Yeah. It would be interesting to experiment with that and see if it has any negative effect on the clearrange roachtest.

@jbowens
Collaborator Author

jbowens commented Sep 11, 2020

I gave it a try with a clearrange roachtest. The graphs look similar for both master and 40ec60c94b4f5c6c3b81d7da45e21f9519347797. However, looking at one random node's LSM in the logs, you can see the tradeoff: ingestions are forced into higher levels during the IMPORT (causing higher write amplification), versus lower write amplification during the DROP TABLE when there are few ingestions to obstruct. I want to try another run of clearrange on master, this time logging the number of keys that get elided but don't fall in the final snapshot stripe.

master:
[two screenshot graphs]

40ec60c94b4f5c6c3b81d7da45e21f9519347797:
[two screenshot graphs]


post-IMPORT

master:

‹__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp›
‹    WAL         1    12 M       -   524 M       -       -       -       -   527 M       -       -       -     1.0›
‹      0         0     0 B    0.00   515 M   296 M     212     0 B       0    30 M      85     0 B       0     0.1›
‹      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0›
‹      2         6    15 M    0.99    28 M     0 B       0   698 K       2    73 M      41    78 M       1     2.6›
‹      3         8    56 M    0.15   141 M   248 M      20    24 M       8   338 M     137   346 M       1     2.4›
‹      4       226   3.1 G    1.07   348 M   3.4 G     236    77 M      22   489 M     112   490 M       1     1.4›
‹      5      1514    20 G    1.08   337 M    69 G   5.0 K   447 M      84   448 M      38   448 M       1     1.3›
‹      6      5842   128 G       -    51 G    82 G   5.6 K   535 M      65   114 G   4.4 K   115 G       1     2.3›
‹  total      7596   151 G       -   155 G   154 G    11 K   1.1 G     181   271 G   4.9 K   116 G       5     1.7›
‹  flush        75›
‹compact      3566   4.9 G          (size == estimated-debt)›
‹ memtbl         1    64 M›
‹zmemtbl         0     0 B›
‹   ztbl         0     0 B›
‹ bcache     246 K   3.3 G   54.2%  (score == hit-rate)›
‹ tcache     2.2 K   1.3 M   99.4%  (score == hit-rate)›
‹ titers         2›
‹ filter         -       -   97.3%  (score == utility)›

40ec60c94b4f5c6c3b81d7da45e21f9519347797:

‹__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp›
‹    WAL         1    45 M       -   404 M       -       -       -       -   407 M       -       -       -     1.0›
‹      0         0     0 B    0.00   362 M   165 M     119     0 B       0    18 M      75     0 B       0     0.1›
‹      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0›
‹      2        26    81 M    1.13    90 M     0 B       0   282 K       1   122 M      53   125 M       1     1.4›
‹      3        41   603 M    0.99    34 M   6.7 G     476   6.5 M       4    95 M      56   129 M       1     2.8›
‹      4       370   4.1 G    1.23   771 M    36 G   2.6 K   5.3 G     401   1.2 G     173   1.2 G       1     1.5›
‹      5      1076    23 G    1.23    36 G    59 G   4.2 K   1.7 G     162    65 G   3.8 K    65 G       1     1.8›
‹      6      4479   128 G       -    76 G    56 G   3.8 K    90 M       6   233 G   6.3 K   233 G       1     3.1›
‹  total      5992   156 G       -   158 G   157 G    11 K   7.1 G     574   458 G    10 K   300 G       5     2.9›
‹  flush        63›
‹compact      6365    34 G          (size == estimated-debt)›
‹ memtbl         1    64 M›
‹zmemtbl         0     0 B›
‹   ztbl         9   422 M›
‹ bcache     241 K   3.2 G   25.7%  (score == hit-rate)›
‹ tcache     1.7 K  1006 K   98.2%  (score == hit-rate)›
‹ titers         4›
‹ filter         -       -   97.2%  (score == utility)›

post-DROP TABLE

master:

‹__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp›
‹    WAL         1   8.7 M       -   266 M       -       -       -       -   270 M       -       -       -     1.0›
‹      0         0     0 B    0.00   261 M   309 M     969     0 B       0    31 M     413     0 B       0     0.1›
‹      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0›
‹      2         0     0 B    0.00   329 M    79 M      27   4.4 M      43   506 M     443   511 M       0     1.5›
‹      3         0     0 B    0.00   424 M    70 M      83    14 M      21   786 M     265   795 M       0     1.9›
‹      4         1   992 K    0.09   303 M   457 M      57   277 M      53   1.2 G     770   1.4 G       1     4.1›
‹      5        81   502 K    0.10   1.1 G   773 K     241   1.1 G     179   6.0 G     434   6.9 G       1     5.7›
‹      6         2   2.8 M       -    11 G   3.2 G     327   1.7 G     152    14 G     172    29 G       1     1.3›
‹  total        84   4.2 M       -   4.3 G   4.1 G   1.7 K   3.0 G     448    27 G   2.5 K    38 G       3     6.3›
‹  flush       410›
‹compact      2059     0 B          (size == estimated-debt)›
‹ memtbl         1    64 M›
‹zmemtbl         0     0 B›
‹   ztbl         0     0 B›
‹ bcache       705   7.0 M   72.9%  (score == hit-rate)›
‹ tcache        84    50 K   99.0%  (score == hit-rate)›
‹ titers         0›
‹ filter         -       -   81.1%  (score == utility)›

40ec60c94b4f5c6c3b81d7da45e21f9519347797:

‹__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp›
‹    WAL         1   2.4 M       -   385 M       -       -       -       -   390 M       -       -       -     1.0›
‹      0         0     0 B    0.00   388 M   3.2 M     783     0 B       0    32 M     316     0 B       0     0.1›
‹      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0›
‹      2         0     0 B    0.00    19 M   315 M       6   4.3 M      16   295 M     269   301 M       0    15.8›
‹      3         0     0 B    0.00   342 M    24 K      14    53 M      24   657 M     208   666 M       0     1.9›
‹      4         0     0 B    0.00   168 M   174 M     119   672 M      80   287 M      71   316 M       0     1.7›
‹      5         1   2.4 M    0.11   669 M   588 M     120   310 M      45   2.0 G     340   2.1 G       1     3.0›
‹      6         2    12 M       -   5.1 G   3.2 G     246    61 M      20    13 G     137    16 G       1     2.5›
‹  total         3    14 M       -   4.7 G   4.3 G   1.3 K   1.1 G     185    21 G   1.3 K    20 G       2     4.4›
‹  flush       315›
‹compact      1202     0 B          (size == estimated-debt)›
‹ memtbl         1    64 M›
‹zmemtbl         0     0 B›
‹   ztbl         0     0 B›
‹ bcache       925    29 M   76.5%  (score == hit-rate)›
‹ tcache         3   1.8 K   99.2%  (score == hit-rate)›
‹ titers         0›
‹ filter         -       -   86.5%  (score == utility)›

@jbowens
Collaborator Author

jbowens commented Sep 14, 2020

When replicating a range, are all the keys that Cockroach reads under a range prefix? Have we ever considered Pebble-level 'range snapshots' to provide snapshot semantics for only a key range? When initializing a compaction's snapshot list, we could skip any 'range snapshots' that have ranges disjoint from the compaction's key space.

The goal would be to drop range tombstones and the data they cover faster during the rebalancing that happens during IMPORT, in hopes of being able to ingest more tables directly into L6.
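A rough sketch of what skipping disjoint range-scoped snapshots might look like when building a compaction's snapshot list. The rangeSnapshot type and bounds semantics are hypothetical, since Pebble's snapshots are not range-scoped today.

```go
// rangeSnapshot is a hypothetical range-scoped snapshot protecting keys in
// [lower, upper) at the given sequence number.
type rangeSnapshot struct {
	seqNum       uint64
	lower, upper []byte
}

// compactionSnapshots returns the sequence numbers a compaction over
// [cLower, cUpper) must preserve: snapshots whose bounds are disjoint from
// the compaction's key space are skipped.
func compactionSnapshots(
	snaps []rangeSnapshot, cLower, cUpper []byte, cmp func(a, b []byte) int,
) []uint64 {
	var pinned []uint64
	for _, s := range snaps {
		if cmp(s.upper, cLower) <= 0 || cmp(cUpper, s.lower) <= 0 {
			continue // no overlap with the compaction's key space
		}
		pinned = append(pinned, s.seqNum)
	}
	return pinned
}
```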

@petermattis
Collaborator

When replicating a range, are all the keys that Cockroach reads under a range prefix?

Unfortunately no. Each CRDB range comes with 3 distinct key ranges: the range-id local key range, the range local key range, and the user key range. See rditer/replica_data_iter.go for more details.

Have we ever considered Pebble-level 'range snapshots' to provide snapshot semantics for only a key range? When initializing a compaction's snapshot list, we could skip any 'range snapshots' that have ranges disjoint from the compaction's key space.

I've had this idea, though I never explored it in detail. I'm not even sure if I've ever written it down. The part I worried about was what to do for iteration on a "range" snapshot: what semantics would apply if you iterate outside of the range the snapshot was defined on? One possibility is that the "range" snapshot provides an implicit Lower and Upper bound on iteration. This idea might be a non-starter due to the 3 key ranges CRDB needs to read from to replicate a Range.

@jbowens
Collaborator Author

jbowens commented Jul 3, 2021

I've had this idea, though I never explored it in detail. I'm not even sure if I've ever written it down. The part I worried about was what to do for iteration on a "range" snapshot: what semantics would apply if you iterate outside of the range the snapshot was defined on? One possibility is that the "range" snapshot provides an implicit Lower and Upper bound on iteration. This idea might be a non-starter due to the 3 key ranges CRDB needs to read from to replicate a Range.

When you create a pebble iterator from a snapshot, you could be required to provide one of the ranges specified when the snapshot was created. You would create one pebble iterator per snapshotted range. In CockroachDB, we could hide that detail by writing an iterator implementation that steps between the individual pebble iterators (sketched below). It seems like most of the benefit of snapshotting just a key range is in the user key range, where the bulk of the data lives. If the extra pebble iterators pose a problem, we could snapshot all of the local key ranges and just the necessary part of the user key range?
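A minimal sketch of that wrapper idea, assuming the per-span iterators are ordered and disjoint. The iter interface here is a simplified stand-in, not the real Pebble or CockroachDB iterator interface.

```go
// iter is a simplified stand-in for a forward iterator.
type iter interface {
	First() bool // position at the first key; reports validity
	Next() bool  // advance; reports validity
	Key() []byte
}

// spanIter exposes several per-span iterators as a single forward iterator,
// moving on to the next span's iterator once the current one is exhausted.
type spanIter struct {
	iters []iter
	idx   int
}

func (s *spanIter) First() bool {
	for s.idx = 0; s.idx < len(s.iters); s.idx++ {
		if s.iters[s.idx].First() {
			return true
		}
	}
	return false
}

func (s *spanIter) Next() bool {
	if s.idx >= len(s.iters) {
		return false
	}
	if s.iters[s.idx].Next() {
		return true
	}
	for s.idx++; s.idx < len(s.iters); s.idx++ {
		if s.iters[s.idx].First() {
			return true
		}
	}
	return false
}

// Key returns the current key; valid only after First/Next reported true.
func (s *spanIter) Key() []byte { return s.iters[s.idx].Key() }
```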

@jbowens
Collaborator Author

jbowens commented Aug 8, 2022

#1189


We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to Pebble!
