
perf: don't prioritize compaction of pinned range tombstones #872

Closed
jbowens opened this issue Sep 1, 2020 · 9 comments

Comments

@jbowens
Collaborator

jbowens commented Sep 1, 2020

With the introduction of the min-overlapping-ratio heuristic (#707), Pebble started prioritizing compaction of range tombstones by inflating start-level file sizes by an estimate of the data covered by their range tombstones.

Prioritizing the compaction of range tombstones has a few benefits:

  1. Disk space is reclaimed promptly.
  2. These compactions suffer less write amplification than their uncompensated input file sizes / overlapping ratio suggest.
  3. Moving broad tombstones into lower levels allows new sstables to be ingested into lower levels (storage: avoid excessively wide range tombstones during Raft snapshot reception cockroach#44048).

But if an open snapshot prevents a range tombstone from dropping keys, the first two benefits do not apply. Additionally, if the output level is L6, these compactions may have the negative effect of cementing tombstones into the bottommost level (#517 (comment)), where they're only cleared by low-priority elision-only compactions.

We might want to improve the prioritization of elision-only compactions. However, since these compactions add write amplification that could otherwise have been avoided, maybe we should instead avoid prioritizing compactions that are unlikely to reclaim disk space. This could be done by using uncompensated file sizes during compaction picking under some conditions, for example when a start-level file's largest sequence number does not fall in the last snapshot stripe.
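For concreteness, here's a minimal Go sketch of that gating. This is not Pebble's actual compaction-picking code; the struct fields and helper names below are hypothetical stand-ins.

```go
// fileMetadata is a trimmed-down, hypothetical stand-in for Pebble's file
// metadata; RangeDelEstimate models the estimate of data covered by the
// file's range tombstones.
type fileMetadata struct {
	Size             uint64 // physical size of the sstable
	RangeDelEstimate uint64 // estimated bytes covered by its range tombstones
	LargestSeqNum    uint64
}

// compensatedSize returns the size the compaction picker would use. The
// tombstone compensation is applied only when the file's keys are older than
// the oldest open snapshot, i.e. when its tombstones can actually drop data.
func compensatedSize(f *fileMetadata, oldestSnapshotSeqNum uint64) uint64 {
	if f.LargestSeqNum < oldestSnapshotSeqNum {
		return f.Size + f.RangeDelEstimate
	}
	// A pinned tombstone can't reclaim space yet, so don't inflate priority.
	return f.Size
}
```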

@jbowens jbowens changed the title perf: don't incentivize compaction of pinned range tombstones perf: don't prioritize compaction of pinned range tombstones Sep 2, 2020
@petermattis
Collaborator

This could be done through using uncompensated file sizes during compaction picking under some conditions, like when a start level file's largest sequence number does not fall in the last snapshot stripe.

So the compensated size would only apply if the range tombstone can actually have an effect on the records in lower level sstables? This sounds similar to the logic needed for deletion-only compactions. I think this would be a worthwhile enhancement.

@jbowens
Collaborator Author

jbowens commented Sep 3, 2020

So the compensated size would only apply if the range tombstone can actually have an effect on the records in lower level sstables? This sounds similar to the logic needed for deletion-only compactions. I think this would be a worthwhile enhancement.

I wonder how frequently, in practice, range tombstones fall into the same snapshot stripe as the keys they drop but not into the last snapshot stripe. It would be a minimal change to only compensate files whose sequence numbers are less than the oldest snapshot. If we wanted finer granularity, we could add accounting similar to deletion-only hints to the file metadata's table stats.
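As a rough illustration of the finer-grained variant, table stats could split the covered-bytes estimate into a droppable portion and a pinned portion. The fields below are hypothetical (loosely modeled on deletion-only hints), not stats Pebble maintains today.

```go
// tableStats is a hypothetical sketch of per-file stats; these fields do not
// exist in Pebble today.
type tableStats struct {
	// RangeDeletionsBytesEstimate is the total estimate of data covered by
	// the file's range tombstones.
	RangeDeletionsBytesEstimate uint64
	// PinnedBytesEstimate is the portion of that estimate that an open
	// snapshot currently prevents from being dropped.
	PinnedBytesEstimate uint64
}

// compensation returns only the portion of the estimate that a compaction
// could actually reclaim right now.
func compensation(s tableStats) uint64 {
	return s.RangeDeletionsBytesEstimate - s.PinnedBytesEstimate
}
```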

@petermattis
Collaborator

It would be a minimal change to only compensate files whose sequence numbers are less than the oldest snapshot.

Yeah. It would be interesting to experiment with that and see if it has any negative effect on the clearrange roachtest.

@jbowens
Collaborator Author

jbowens commented Sep 11, 2020

I gave it a try with a clearrange roachtest. The graphs look similar for both master and 40ec60c94b4f5c6c3b81d7da45e21f9519347797. However, looking at one random node's LSM in the logs, you can see the tradeoff: ingestions are forced into higher levels during the IMPORT (causing higher write amplification), versus lower write amplification during the DROP TABLE when there are few ingestions to obstruct. I want to try another run of clearrange on master, this time logging the number of keys that get elided but don't fall in the final snapshot stripe.

master:
[two screenshot graphs]

40ec60c94b4f5c6c3b81d7da45e21f9519347797:
[two screenshot graphs]


post-IMPORT

master:

‹__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp›
‹    WAL         1    12 M       -   524 M       -       -       -       -   527 M       -       -       -     1.0›
‹      0         0     0 B    0.00   515 M   296 M     212     0 B       0    30 M      85     0 B       0     0.1›
‹      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0›
‹      2         6    15 M    0.99    28 M     0 B       0   698 K       2    73 M      41    78 M       1     2.6›
‹      3         8    56 M    0.15   141 M   248 M      20    24 M       8   338 M     137   346 M       1     2.4›
‹      4       226   3.1 G    1.07   348 M   3.4 G     236    77 M      22   489 M     112   490 M       1     1.4›
‹      5      1514    20 G    1.08   337 M    69 G   5.0 K   447 M      84   448 M      38   448 M       1     1.3›
‹      6      5842   128 G       -    51 G    82 G   5.6 K   535 M      65   114 G   4.4 K   115 G       1     2.3›
‹  total      7596   151 G       -   155 G   154 G    11 K   1.1 G     181   271 G   4.9 K   116 G       5     1.7›
‹  flush        75›
‹compact      3566   4.9 G          (size == estimated-debt)›
‹ memtbl         1    64 M›
‹zmemtbl         0     0 B›
‹   ztbl         0     0 B›
‹ bcache     246 K   3.3 G   54.2%  (score == hit-rate)›
‹ tcache     2.2 K   1.3 M   99.4%  (score == hit-rate)›
‹ titers         2›
‹ filter         -       -   97.3%  (score == utility)›

40ec60c94b4f5c6c3b81d7da45e21f9519347797:

‹__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp›
‹    WAL         1    45 M       -   404 M       -       -       -       -   407 M       -       -       -     1.0›
‹      0         0     0 B    0.00   362 M   165 M     119     0 B       0    18 M      75     0 B       0     0.1›
‹      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0›
‹      2        26    81 M    1.13    90 M     0 B       0   282 K       1   122 M      53   125 M       1     1.4›
‹      3        41   603 M    0.99    34 M   6.7 G     476   6.5 M       4    95 M      56   129 M       1     2.8›
‹      4       370   4.1 G    1.23   771 M    36 G   2.6 K   5.3 G     401   1.2 G     173   1.2 G       1     1.5›
‹      5      1076    23 G    1.23    36 G    59 G   4.2 K   1.7 G     162    65 G   3.8 K    65 G       1     1.8›
‹      6      4479   128 G       -    76 G    56 G   3.8 K    90 M       6   233 G   6.3 K   233 G       1     3.1›
‹  total      5992   156 G       -   158 G   157 G    11 K   7.1 G     574   458 G    10 K   300 G       5     2.9›
‹  flush        63›
‹compact      6365    34 G          (size == estimated-debt)›
‹ memtbl         1    64 M›
‹zmemtbl         0     0 B›
‹   ztbl         9   422 M›
‹ bcache     241 K   3.2 G   25.7%  (score == hit-rate)›
‹ tcache     1.7 K  1006 K   98.2%  (score == hit-rate)›
‹ titers         4›
‹ filter         -       -   97.2%  (score == utility)›

post-DROP TABLE

master:

‹__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp›
‹    WAL         1   8.7 M       -   266 M       -       -       -       -   270 M       -       -       -     1.0›
‹      0         0     0 B    0.00   261 M   309 M     969     0 B       0    31 M     413     0 B       0     0.1›
‹      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0›
‹      2         0     0 B    0.00   329 M    79 M      27   4.4 M      43   506 M     443   511 M       0     1.5›
‹      3         0     0 B    0.00   424 M    70 M      83    14 M      21   786 M     265   795 M       0     1.9›
‹      4         1   992 K    0.09   303 M   457 M      57   277 M      53   1.2 G     770   1.4 G       1     4.1›
‹      5        81   502 K    0.10   1.1 G   773 K     241   1.1 G     179   6.0 G     434   6.9 G       1     5.7›
‹      6         2   2.8 M       -    11 G   3.2 G     327   1.7 G     152    14 G     172    29 G       1     1.3›
‹  total        84   4.2 M       -   4.3 G   4.1 G   1.7 K   3.0 G     448    27 G   2.5 K    38 G       3     6.3›
‹  flush       410›
‹compact      2059     0 B          (size == estimated-debt)›
‹ memtbl         1    64 M›
‹zmemtbl         0     0 B›
‹   ztbl         0     0 B›
‹ bcache       705   7.0 M   72.9%  (score == hit-rate)›
‹ tcache        84    50 K   99.0%  (score == hit-rate)›
‹ titers         0›
‹ filter         -       -   81.1%  (score == utility)›

40ec60c94b4f5c6c3b81d7da45e21f9519347797:

‹__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp›
‹    WAL         1   2.4 M       -   385 M       -       -       -       -   390 M       -       -       -     1.0›
‹      0         0     0 B    0.00   388 M   3.2 M     783     0 B       0    32 M     316     0 B       0     0.1›
‹      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0›
‹      2         0     0 B    0.00    19 M   315 M       6   4.3 M      16   295 M     269   301 M       0    15.8›
‹      3         0     0 B    0.00   342 M    24 K      14    53 M      24   657 M     208   666 M       0     1.9›
‹      4         0     0 B    0.00   168 M   174 M     119   672 M      80   287 M      71   316 M       0     1.7›
‹      5         1   2.4 M    0.11   669 M   588 M     120   310 M      45   2.0 G     340   2.1 G       1     3.0›
‹      6         2    12 M       -   5.1 G   3.2 G     246    61 M      20    13 G     137    16 G       1     2.5›
‹  total         3    14 M       -   4.7 G   4.3 G   1.3 K   1.1 G     185    21 G   1.3 K    20 G       2     4.4›
‹  flush       315›
‹compact      1202     0 B          (size == estimated-debt)›
‹ memtbl         1    64 M›
‹zmemtbl         0     0 B›
‹   ztbl         0     0 B›
‹ bcache       925    29 M   76.5%  (score == hit-rate)›
‹ tcache         3   1.8 K   99.2%  (score == hit-rate)›
‹ titers         0›
‹ filter         -       -   86.5%  (score == utility)›

@jbowens
Collaborator Author

jbowens commented Sep 14, 2020

When replicating a range, are all the keys that Cockroach reads under a range prefix? Have we ever considered Pebble-level 'range snapshots' to provide snapshot semantics for only a key range? When initializing a compaction's snapshot list, we could skip any 'range snapshots' that have ranges disjoint from the compaction's key space.

The goal would be to drop range tombstones and the data they cover faster during the rebalancing that happens during IMPORT, in hopes of being able to ingest more tables directly into L6.
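A rough sketch of what skipping disjoint range-scoped snapshots might look like when building a compaction's snapshot list. The rangeSnapshot type and bounds semantics are hypothetical, since Pebble's snapshots are not range-scoped today.

```go
// rangeSnapshot is a hypothetical range-scoped snapshot protecting keys in
// [lower, upper) at the given sequence number.
type rangeSnapshot struct {
	seqNum       uint64
	lower, upper []byte
}

// compactionSnapshots returns the sequence numbers a compaction over
// [cLower, cUpper) must preserve: snapshots whose bounds are disjoint from
// the compaction's key space are skipped.
func compactionSnapshots(
	snaps []rangeSnapshot, cLower, cUpper []byte, cmp func(a, b []byte) int,
) []uint64 {
	var pinned []uint64
	for _, s := range snaps {
		if cmp(s.upper, cLower) <= 0 || cmp(cUpper, s.lower) <= 0 {
			continue // no overlap with the compaction's key space
		}
		pinned = append(pinned, s.seqNum)
	}
	return pinned
}
```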

@petermattis
Collaborator

When replicating a range, are all the keys that Cockroach reads under a range prefix?

Unfortunately no. Each CRDB range comes with 3 distinct key ranges: the range-id local key range, the range local key range, and the user key range. See rditer/replica_data_iter.go for more details.

Have we ever considered Pebble-level 'range snapshots' to provide snapshot semantics for only a key range? When initializing a compaction's snapshot list, we could skip any 'range snapshots' that have ranges disjoint from the compaction's key space.

I've had this idea, though I never explored it in detail. I'm not even sure if I've ever written it down. The part I worried about was what to do for iteration on a "range" snapshot: what semantics would apply if you iterate outside of the range the snapshot was defined on? One possibility is that the "range" snapshot provides an implicit Lower and Upper bound on iteration. This idea might be a non-starter due to the 3 key ranges CRDB needs to read from to replicate a Range.

@jbowens
Collaborator Author

jbowens commented Jul 3, 2021

I've had this idea, though I never explored it in detail. I'm not even sure if I've ever written it down. The part I worried about was what to do for iteration on a "range" snapshot: what semantics would apply if you iterate outside of the range the snapshot was defined on? One possibility is that the "range" snapshot provides an implicit Lower and Upper bound on iteration. This idea might be a non-starter due to the 3 key ranges CRDB needs to read from to replicate a Range.

When you create a pebble iterator from a snapshot, you could be required to provide one of the ranges specified when the snapshot was created. You would create one pebble iterator per snapshotted range. In CockroachDB, we could hide that detail by writing an iterator implementation that steps between the individual pebble iterators (sketched below). It seems like most of the benefit of snapshotting just a key range is in the user key range, where the bulk of the data lives. If the extra pebble iterators pose a problem, we could snapshot all of the local key ranges and just the necessary part of the user key range?
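A minimal sketch of that wrapper idea, assuming the per-span iterators are ordered and disjoint. The iter interface here is a simplified stand-in, not the real Pebble or CockroachDB iterator interface.

```go
// iter is a simplified stand-in for a forward iterator.
type iter interface {
	First() bool // position at the first key; reports validity
	Next() bool  // advance; reports validity
	Key() []byte
}

// spanIter exposes several per-span iterators as a single forward iterator,
// moving on to the next span's iterator once the current one is exhausted.
type spanIter struct {
	iters []iter
	idx   int
}

func (s *spanIter) First() bool {
	for s.idx = 0; s.idx < len(s.iters); s.idx++ {
		if s.iters[s.idx].First() {
			return true
		}
	}
	return false
}

func (s *spanIter) Next() bool {
	if s.idx >= len(s.iters) {
		return false
	}
	if s.iters[s.idx].Next() {
		return true
	}
	for s.idx++; s.idx < len(s.iters); s.idx++ {
		if s.iters[s.idx].First() {
			return true
		}
	}
	return false
}

// Key returns the current key; valid only after First/Next reported true.
func (s *spanIter) Key() []byte { return s.iters[s.idx].Key() }
```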

@jbowens
Collaborator Author

jbowens commented Aug 8, 2022

#1189


We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to Pebble!
