perf: support user-defined "guards" for sstable boundaries #517
Most operations involve a local and a global key, right? Every raft command updates LeaseAppliedState. (This doesn't change anything about the problem or your suggested solution.)
Yes. How much a change here would affect normal operation is a different question. Sstables below L0 are partitioned, and I would expect that partitioning to reduce the impact of wide-spanning tables. Partitioning L0 at the local/global boundary would allow concurrent L0->Lbase compactions in situations where they are currently blocked. This is something @sumeerbhola has mentioned recently. I think this is a fruitful area for exploration, but I'm tempering my hopes.
Add an `Options.TablePartitioner` hook which allows the user to specify a required partition between two user keys. `TablePartitioner` is called during flush and compaction before outputting a new key to an sstable which already contains at least one key. TODO: One complication from table partitioning is that it can create L0 tables which overlap in seqnum space. In order to support a partitioned L0, we'd have to relax the invariant checks in `manifest.CheckOrdering`. Doing so will make Pebble incompatible with RocksDB. In order for partitioning to not naively increase read amplification, we'll want to provide some sort of partitioned view of the sstables in `Version.Files[0]`. `mergingIter` will then need to be made aware of the partitioning. We may want to adjust the compaction-picking heuristics to not expand compaction inputs across the partition boundary. See #517
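A minimal sketch of the proposed hook's shape (the `TablePartitioner` name and its call sites come from the description above; the exact signature is an assumption):

```go
package pebble // hypothetical placement; this sketches the proposed API only

// TablePartitioner is the hook proposed above: it would be invoked
// during flush and compaction, just before writing nextUserKey to an
// output sstable that already contains prevUserKey (and possibly more
// keys). Returning true forces the current sstable to be finished so
// that nextUserKey starts a new one.
type TablePartitioner func(prevUserKey, nextUserKey []byte) bool
```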
Would it make sense to define guards at range boundaries, or are those too ephemeral? The problem of excessively wide sstables also applies to lower levels. Here's the LSM of a 'normal' node:
A little less than half of the data was ingested directly into L6. Here's the LSM of the node with higher write I/O:
Only 25 G were ingested into L6, forcing the rest of the ingestions into L5 and L4. More than 12 hours after the IMPORT finished, this node is still compacting to reshape the LSM.
Maybe user-defined guards aren't a solution for that scenario. We could try to dynamically detect these low-level sstables that are very broad relative to the distribution of keys in higher levels and split them up, anticipating that doing so will allow cheap move compactions.
The last time I saw something similar happen, the problem was an sstable containing an overly wide range tombstone ingested into L6 that was preventing future ingestions into L6. See cockroachdb/cockroach#44048.
I confirmed that there are range tombstones in L6, although I'm not sure how to confirm their breadth.
You can use …
Stepping back for a moment to the original motivation for this issue, now that we split flushes based on both L0 FlushSplitKeys and Lbase (grandparent) limits, I wonder why we need explicit guards. We control width well inside Pebble, at least in theory, but ingested sstables are outside our control. Should we provide split points to the ingestion logic to ensure it does not produce wide sstables?
I'm not at all sure explicit guards are necessary.
CRDB does split ingested sstables at the local/global keyspace boundary. That is hardcoded, though. It is almost like we want an abstraction on top of …
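For illustration, such an abstraction might look like the following sketch, which generalizes the hardcoded local/global split into arbitrary caller-supplied split points (all names here are hypothetical; this is not CRDB or Pebble code):

```go
package ingestsplit

import "bytes"

// splitKeysForIngest partitions a sorted slice of keys into per-sstable
// groups, closing a group each time the keys cross one of the sorted
// split points. CRDB's current behavior corresponds to a single split
// point at the local/global boundary.
func splitKeysForIngest(keys, splitPoints [][]byte) [][][]byte {
	var groups [][][]byte
	var cur [][]byte
	i := 0
	for _, k := range keys {
		// Close the current group each time we cross a split point.
		for i < len(splitPoints) && bytes.Compare(splitPoints[i], k) <= 0 {
			if len(cur) > 0 {
				groups = append(groups, cur)
				cur = nil
			}
			i++
		}
		cur = append(cur, k)
	}
	if len(cur) > 0 {
		groups = append(groups, cur)
	}
	return groups
}
```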
This is a good point. @jbowens did add logic recently to perform elision compactions to recompact L6 files that contain tombstones, so I think the mark-for-compaction bit is essentially done. It is possible we should also be prioritizing compactions of L6 files containing tombstones, because normally they would be lower priority than other compactions and thus get starved.
Good point. I dumped the manifest on that problematic node; L6 has a lot of range tombstones interspersed with live data.
When a range rebalances away, we take a snapshot for sending and write a range tombstone, right? Maybe we're too eager to compact these range tombstones that are incapable of reclaiming disk space because of open snapshots. The 'compensated sizes' for these sstables are weighted as if they'll drop everything, but in reality we'll just rewrite the sstables and possibly prevent cheaper delete-only compactions.

If we disincentivized these ineffectual compactions, we could cheaply drop some of the L6 sstables with delete-only compactions, but we'd still be left with the sstable containing the range tombstone. If it no longer overlaps with L6, it might be move-compacted into L6 with the broad range tombstone that blocks future ingestion into L6. Like @petermattis said, the L6 elision-only compactions are low priority right now and wouldn't trigger during an IMPORT where there's lots of automatic compaction.

Maybe this isn't a huge problem though, because if the bulk of the data was already dropped from L6, the min-overlapping-ratio heuristic will prioritize compacting into the smaller L6 range tombstone sstable as soon as there is any significant amount of overlapping data in L5. When that happens, the broad tombstone would be elided.
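For reference, the compensated-size idea mentioned above amounts to something like this sketch (parameter names and the exact formula are assumptions, not Pebble's actual implementation):

```go
package heuristics

// compensatedSize inflates a file's raw size by an estimate of the
// bytes its tombstones would reclaim, making tombstone-heavy files
// attractive to compact. As discussed above, this over-weights
// tombstones that cannot actually drop anything yet because open
// snapshots pin the underlying data.
func compensatedSize(fileSize, rangeDelReclaimEstimate, pointDelReclaimEstimate uint64) uint64 {
	return fileSize + rangeDelReclaimEstimate + pointDelReclaimEstimate
}
```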
When a range rebalances away, the leaseholder replica creates a Pebble snapshot and sends the data to the new replica using the snapshot. Only after the data has been sent is one of the existing replicas considered for deletion. I don't think the scenario you describe would happen exactly as described. But because Pebble snapshots are global, we could be seeing something like it due to snapshots being sent for other ranges on the same node.
I wonder if guards at SQL table boundaries might help with concurrent pre-ordered IMPORT performance. If there are no overlapping sstables in higher levels, ingested sstables slot directly into L6. I think this is typical behavior for the …

As I understand it, during an ordered IMPORT, most of the data ends up being ingested. But sometimes chunks of imported data are written through normal writes if the produced sstable is smaller than …

In this scenario, I think SQL table boundary guards would allow most ingested sstables to slot directly into L6. If the source data for a table is split into multiple files, you can still have normal writes that end up overlapping ingested sstables, but maybe in that world it's better to drop the …

Here's a code comment regarding that …
We have marked this issue as stale because it has been inactive for …
I think it's worth keeping open for now. I can imagine a few specific uses that might benefit from explicit guards but be challenging to handle with a generalized approach like #2366. |
When flushing and compacting sstables, Pebble needs to decide how to break up the resulting output into sstables. We currently use two signals: sstable size, and the boundaries of sstables in the "grandparent" level (i.e. the level below the output level). These signals are reasonable for randomly distributed writes, but can cause problems for skewed writes. Consider the following LSM structure:
L0 contains a single sstable spanning `[a,z]`. L1 contains 9 sstables. A compaction from L0 to L1 will need to include all of the L1 sstables. It is possible that L0 only contains 2 keys: `a` and `z`. This problem is easy to construct if `a` and `z` are written to the memtable near each other in time and then flushed, as flushing currently always produces a single sstable.

This situation appears to arise in practice within CRDB due to the presence of "local" vs "global" keys. Most operations involve only "global" keys. When a "local" key is operated on, it will end up generating an L0 sstable that spans a large swath of the global key space.
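For context, the two existing splitting signals described above combine roughly as in this sketch (illustrative names and logic, not Pebble's actual code):

```go
package compaction

// shouldFinishOutput sketches the current output-splitting decision:
// finish the current output sstable once it reaches the target size,
// or once it overlaps enough bytes in the grandparent level that a
// future compaction of this file would be expensive. Note that neither
// signal helps with the skewed-write case above, where a tiny L0
// sstable can still span nearly the entire key space.
func shouldFinishOutput(outputBytes, grandparentOverlapBytes, targetFileSize, maxGrandparentOverlapBytes uint64) bool {
	return outputBytes >= targetFileSize ||
		grandparentOverlapBytes >= maxGrandparentOverlapBytes
}
```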
The idea behind "guards" is to allow the user (i.e. CockroachDB) some control over sstable boundaries. CockroachDB would define guards at the boundary between the local and global keyspace. It may also define guards at the boundary between SQL tables. Such guards would ensure that we're segregating sstables along the guard boundaries so that an L0 sstable can't cover a huge fraction of the key space.
Note that "guards" would almost certainly be specified via a callback and not upfront. That is, we'd want to allow the user to specify a callback which can return true if there should be an sstable boundary between two user keys.
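A sketch of what that callback could look like, together with a CockroachDB-flavored guard at the local/global boundary (the `GuardFunc` name, `localMax`, and the boundary value are assumptions; `localMax` stands in for CRDB's `keys.LocalMax`):

```go
package pebble // hypothetical placement; sketching the callback shape only

import "bytes"

// GuardFunc follows the description above: it returns true if there
// should be an sstable boundary between adjacent user keys a and b.
type GuardFunc func(a, b []byte) bool

// localMax is an assumed stand-in for CRDB's keys.LocalMax: keys below
// it are "local", keys at or above it are "global".
var localMax = []byte{0x02}

// localGlobalGuard forces a boundary whenever two adjacent keys
// straddle the local/global split, so a single sstable can never span
// both keyspaces.
func localGlobalGuard(a, b []byte) bool {
	isLocalA := bytes.Compare(a, localMax) < 0
	isLocalB := bytes.Compare(b, localMax) < 0
	return isLocalA != isLocalB
}
```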
Jira issue: PEBBLE-178