
kv/sql: index backfill and truncate temporarily crater write throughput on large tables #62672

Closed
nvanbenschoten opened this issue Mar 26, 2021 · 27 comments
Labels: A-kv-distribution, A-schema-changes, C-enhancement, C-investigation, T-kv

Comments

@nvanbenschoten (Member) commented Mar 26, 2021

Note: This issue started off as an investigation into the disruptive impact of a TRUNCATE on foreground traffic. However, we later observed that this problem generalizes to index backfills as well. See #62672 (comment) below.

We've seen that TRUNCATE operations on large, write-heavy tables are very disruptive. They can cause a dramatic drop in throughput that takes tens of minutes to recover from.

During a recent investigation, we determined that this was due to how TRUNCATE impacts the load distribution in KV. TRUNCATE replaces the existing table keyspace with a new one, initially backed by a single range. This causes all load that had been previously well-distributed over the cluster to be directed at a single leaseholder. This leaseholder struggles to serve the load while also splitting into multiple ranges (due to range size and load) and balancing these new ranges across the cluster.

While this hotspot persists, throughput is severely impacted.
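
One way to watch the range count collapse and then recover during the repro below is the same SHOW RANGES idiom used later in this thread, for example:

-- run periodically before and after the TRUNCATE to watch the table's range count
SELECT count(*) FROM [SHOW RANGES FROM TABLE kv.kv];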

Reproduction

roachprod create ${USER}-truncate --nodes 20 --clouds=aws --aws-machine-type-ssd=m5d.8xlarge
roachprod stage  ${USER}-truncate release v20.2.6
roachprod start  ${USER}-truncate:1-16 --store-count=2
roachprod ssh ${USER}-truncate:17 --    './cockroach workload init kv {pgurl:1}'
roachprod ssh ${USER}-truncate:17-20 -- './cockroach workload run kv --concurrency=256 {pgurl:1-16}'
# after 10 minutes in other terminal
roachprod ssh ${USER}-truncate:1 -- './cockroach sql --insecure -e "TRUNCATE TABLE kv.kv"'

This causes the following effect:

Screen Shot 2021-03-26 at 1 03 43 PM

Notice that throughput does not return to its pre-truncate level for at least 10 minutes. Things are likely even worse with larger clusters.

Next steps

  • Determine why load-based splitting is not kicking in immediately after the truncate to create many ranges.
  • Determine why load-based rebalancing is not kicking in immediately after the truncate to spread out the many ranges. Could the wave of range merges due to the loss of load on the old keyspace be starving out new rebalancing? (see storage: merge less aggressively after load-based splitting #41317)
  • Determine whether there's something better we can do to avoid creating a single-range hotspot in the first place. Can we use table statistics to prepare a more appropriate keyspace to switch over to? If we could pre-split and pre-scatter the keyspace before hitting it with load, we'd be in a much better position.

gz#8170

Epic: CRDB-2625

gz#8886

gz#9032

@nvanbenschoten added the C-enhancement, C-investigation, A-kv-distribution, and A-schema-changes labels on Mar 26, 2021
@ajwerner (Contributor)

Have you tried on master?

@nvanbenschoten (Member, Author)

No, I have not, but I can. What do we expect to be different on master?

@ajwerner (Contributor)

I could have sworn that we mapped this slowdown to cockroachdb/pebble#981 months ago.

@nvanbenschoten (Member, Author)

Here is the behavior on 8b552c7. Looks mostly the same:

Screen Shot 2021-03-26 at 2 17 40 PM

During this reproduction, I explored why ranges were being split after the TRUNCATE, as there had been some confusion over whether this was due to size or load. It does look like ranges are being split due to load (which is good!), but at a rate of only about 2 splits per range per minute.

I210326 18:07:19.110863 8560363 kv/kvserver/replica_command.go:398 ⋮ [n10,split,s19,r85/3:‹/{Table/53/1/9…-Max}›] 698  initiating a split of this range at key ‹/Table/53/2/-1186540398063376634› [r154] (‹load at key /Table/53/2/-1186540398063376634 (41252.18 splitQPS, 1609.92 batches/sec, 1417.86 raft mutations/sec)›)‹›
I210326 18:07:30.323017 9009701 kv/kvserver/replica_command.go:398 ⋮ [n10,split,s19,r154/3:‹/{Table/53/2/-…-Max}›] 835  initiating a split of this range at key ‹/Table/53/2/4882772393790623368› [r155] (‹load at key /Table/53/2/4882772393790623368 (50996.76 splitQPS, 23954.83 batches/sec, 1318.69 raft mutations/sec)›)‹; r154/4 is being probed (may or may not need a Raft snapshot)›
I210326 18:07:41.469750 9515716 kv/kvserver/replica_command.go:398 ⋮ [n10,split,s19,r155/3:‹/{Table/53/2/4…-Max}›] 940  initiating a split of this range at key ‹/Table/53/2/7071725178704415234› [r156] (‹load at key /Table/53/2/7071725178704415234 (23063.61 splitQPS, 11357.11 batches/sec, 935.31 raft mutations/sec)›)‹; r155/6 is waiting for a Raft snapshot›
I210326 18:07:41.772243 9531429 kv/kvserver/replica_command.go:398 ⋮ [n10,split,s19,r154/3:‹/Table/53/2/{-11865…-488277…}›] 942  initiating a split of this range at key ‹/Table/53/2/1964751474247208420› [r437] (‹load at key /Table/53/2/1964751474247208420 (27728.04 splitQPS, 15743.88 batches/sec, 1058.71 raft mutations/sec)›)‹›
I210326 18:07:43.768370 9621888 kv/kvserver/replica_command.go:398 ⋮ [n10,split,s19,r85/3:‹/Table/53/{1/91650…-2/-1186…}›] 959  initiating a split of this range at key ‹/Table/53/2/-5890102151327250091› [r438] (‹load at key /Table/53/2/-5890102151327250091 (43029.29 splitQPS, 19624.21 batches/sec, 1779.29 raft mutations/sec)›)‹›
I210326 18:07:52.533886 10161357 kv/kvserver/replica_command.go:398 ⋮ [n10,split,s19,r156/3:‹/{Table/53/2/7…-Max}›] 996  initiating a split of this range at key ‹/Table/53/2/8189182208545922397› [r439] (‹load at key /Table/53/2/8189182208545922397 (15987.60 splitQPS, 7103.84 batches/sec, 633.06 raft mutations/sec)›)‹; r156/4 is waiting for a Raft snapshot›
I210326 18:07:52.864334 10183632 kv/kvserver/replica_command.go:398 ⋮ [n10,split,s19,r437/3:‹/Table/53/2/{196475…-488277…}›] 1004  initiating a split of this range at key ‹/Table/53/2/3378926683732238032› [r440] (‹load at key /Table/53/2/3378926683732238032 (21990.95 splitQPS, 9876.53 batches/sec, 762.01 raft mutations/sec)›)‹›
I210326 18:07:55.058714 10303293 kv/kvserver/replica_command.go:398 ⋮ [n10,split,s19,r438/3:‹/Table/53/2/-{589010…-118654…}›] 1014  initiating a split of this range at key ‹/Table/53/2/-3419204553766014128› [r441] (‹load at key /Table/53/2/-3419204553766014128 (25009.98 splitQPS, 16445.23 batches/sec, 1273.94 raft mutations/sec)›)‹›
I210326 18:07:55.775786 10339893 kv/kvserver/replica_command.go:398 ⋮ [n10,split,s19,r155/3:‹/Table/53/2/{488277…-707172…}›] 1024  initiating a split of this range at key ‹/Table/53/2/6014787348318689137› [r442] (‹load at key /Table/53/2/6014787348318689137 (13105.03 splitQPS, 7185.43 batches/sec, 683.46 raft mutations/sec)›)‹; r155/6 is waiting for a Raft snapshot›
I210326 18:07:57.774800 10468756 kv/kvserver/replica_command.go:398 ⋮ [n10,split,s19,r154/3:‹/Table/53/2/{-11865…-196475…}›] 1043  initiating a split of this range at key ‹/Table/53/2/386339293792052538› [r443] (‹load at key /Table/53/2/386339293792052538 (27119.77 splitQPS, 10744.07 batches/sec, 893.66 raft mutations/sec)›)‹›
I210326 18:08:03.618528 10781201 kv/kvserver/replica_command.go:398 ⋮ [n10,split,s19,r439/3:‹/{Table/53/2/8…-Max}›] 1067  initiating a split of this range at key ‹/Table/53/2/8705645444285876232› [r444] (‹load at key /Table/53/2/8705645444285876232 (10842.10 splitQPS, 4383.45 batches/sec, 415.41 raft mutations/sec)›)‹›
I210326 18:08:03.954046 10798192 kv/kvserver/replica_command.go:398 ⋮ [n10,split,s19,r440/3:‹/Table/53/2/{337892…-488277…}›] 1071  initiating a split of this range at key ‹/Table/53/2/4157848410948475527› [r445] (‹load at key /Table/53/2/4157848410948475527 (15798.49 splitQPS, 6536.19 batches/sec, 530.17 raft mutations/sec)›)‹›
I210326 18:08:06.244836 10909176 kv/kvserver/replica_command.go:398 ⋮ [n10,split,s19,r441/3:‹/Table/53/2/-{341920…-118654…}›] 1074  initiating a split of this range at key ‹/Table/53/2/-2180130495874161662› [r446] (‹load at key /Table/53/2/-2180130495874161662 (23021.75 splitQPS, 10670.59 batches/sec, 880.91 raft mutations/sec)›)‹›
I210326 18:08:06.854375 10937864 kv/kvserver/replica_command.go:398 ⋮ [n10,split,s19,r442/3:‹/Table/53/2/{601478…-707172…}›] 1077  initiating a split of this range at key ‹/Table/53/2/6540827195747335702› [r777] (‹load at key /Table/53/2/6540827195747335702 (10540.73 splitQPS, 5180.68 batches/sec, 458.39 raft mutations/sec)›)‹›
I210326 18:08:15.019081 11277085 kv/kvserver/replica_command.go:398 ⋮ [n10,split,s19,r445/3:‹/Table/53/2/4{157848…-882772…}›] 1124  initiating a split of this range at key ‹/Table/53/2/4512809824461552487› [r778] (‹load at key /Table/53/2/4512809824461552487 (8171.23 splitQPS, 3895.43 batches/sec, 351.14 raft mutations/sec)›)‹›
I210326 18:08:17.907134 11391792 kv/kvserver/replica_command.go:398 ⋮ [n10,split,s19,r777/3:‹/Table/53/2/{654082…-707172…}›] 1145  initiating a split of this range at key ‹/Table/53/2/6795481456936418537› [r779] (‹load at key /Table/53/2/6795481456936418537 (5932.83 splitQPS, 2942.06 batches/sec, 293.04 raft mutations/sec)›)‹›
I210326 18:08:18.785298 11423467 kv/kvserver/replica_command.go:398 ⋮ [n10,split,s19,r155/3:‹/Table/53/2/{488277…-601478…}›] 1159  initiating a split of this range at key ‹/Table/53/2/5423887395245668889› [r780] (‹load at key /Table/53/2/5423887395245668889 (13188.91 splitQPS, 5903.44 batches/sec, 614.73 raft mutations/sec)›)‹›
I210326 18:08:19.111082 11435322 kv/kvserver/replica_command.go:398 ⋮ [n10,split,s19,r85/3:‹/Table/53/{1/91650…-2/-5890…}›] 1165  initiating a split of this range at key ‹/Table/53/2/-7642450099548430550› [r781] (‹load at key /Table/53/2/-7642450099548430550 (38065.34 splitQPS, 15501.95 batches/sec, 1982.71 raft mutations/sec)›)‹›
I210326 18:08:29.797078 11902284 kv/kvserver/replica_command.go:398 ⋮ [n10,split,s19,r440/3:‹/Table/53/2/{337892…-415784…}›] 1207  initiating a split of this range at key ‹/Table/53/2/3732974195458568389› [r782] (‹load at key /Table/53/2/3732974195458568389 (12008.18 splitQPS, 4861.11 batches/sec, 512.34 raft mutations/sec)›)‹›

I wonder whether there are any easy wins to be had here, such as not resetting a range's load-based splitting tracker during a load-based split:

// Reset the splitter now that the bounds of the range changed.
r.loadBasedSplitter.Reset()

@ajwerner (Contributor)

Not resetting but rather splitting out the samples makes a lot of sense.
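
A minimal sketch of that idea (hypothetical types, not the actual loadBasedSplitter/Finder API): on a load-based split, each side keeps the recorded samples that fall within its new bounds instead of starting over from scratch.

package split

import "bytes"

// sample is a recorded request key with an associated load measurement.
// Both types here are illustrative stand-ins, not CockroachDB's real ones.
type sample struct {
	key []byte
	qps float64
}

type splitFinder struct {
	samples []sample
}

// partitionAt divides the recorded samples at splitKey and returns a finder
// for each side of the split, rather than resetting both sides to empty.
func (f *splitFinder) partitionAt(splitKey []byte) (left, right *splitFinder) {
	left, right = &splitFinder{}, &splitFinder{}
	for _, s := range f.samples {
		if bytes.Compare(s.key, splitKey) < 0 {
			left.samples = append(left.samples, s)
		} else {
			right.samples = append(right.samples, s)
		}
	}
	return left, right
}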

@nvanbenschoten (Member, Author)

Determine whether there's something better we can do to avoid creating a single-range hotspot in the first place. Can we use table statistics to prepare a more appropriate keyspace to switch over to? If we could pre-split and pre-scatter the keyspace before hitting it with load, we'd be in a much better position.

I think we're going to need to do something like this if we want to make a TRUNCATE fully non-disruptive. As @lunevalex astutely pointed out, the ramp-up period after a TRUNCATE looks a lot like immediately after the load is started, though maybe a fixed multiple slower. Even if we could make this just as efficient as the initial ramp-up period, that likely still wouldn't be sufficient, as that still takes 2-3 minutes to reach peak throughput, and longer on larger clusters.

@petermattis (Collaborator)

Determine whether there's something better we can do to avoid creating a single-range hotspot in the first place. Can we use table statistics to prepare a more appropriate keyspace to switch over to? If we could pre-split and pre-scatter the keyspace before hitting it with load, we'd be in a much better position.

We could split the new range along the same boundaries as the pre-truncated table. Or some fraction of those boundaries (e.g. every Nth range boundary). I bet we could hack together something which did this using crdb_internal.ranges in order to verify that it did address the problem. This is presuming that the data inserted into the table post-truncate has the same distribution as the data pre-truncate, but we'd also need to make that assumption if we used table statistics.
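
Roughly, such a hack could look like the following (a sketch only; SHOW RANGES output format and key quoting vary by version, and the split values shown are just examples pulled from the logs above):

-- 1. Before the TRUNCATE, capture (every Nth of) the table's current split points.
SELECT start_key FROM [SHOW RANGES FROM TABLE kv.kv] WHERE start_key IS NOT NULL;
-- 2. After the TRUNCATE, re-split the new, empty table at those keys and
--    scatter the resulting ranges before load arrives.
ALTER TABLE kv.kv SPLIT AT VALUES (-5890102151327250091), (-1186540398063376634), (4882772393790623368);
ALTER TABLE kv.kv SCATTER;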

@nvanbenschoten (Member, Author)

I like this idea a lot. Using the existing range boundaries instead of trying to recreate them from table statistics seems easier and more precise, and it has the potential to piggyback on existing load-based splits in the case of an uneven (but not divergent) load distribution.

@aliher1911 do you have interest in doing some exploration here? We can walk through the relevant code during our pod meeting later today.

@andreimatei (Contributor)

If we pre-split, would we use the range descriptor's "sticky bit" with some arbitrary expiration to keep the splits around for a while?

@ajwerner (Contributor)

If we did add the sticky bit, it might make sense to jitter the expiration times.

@nvanbenschoten (Member, Author)

Yes, we'd set some split expiration, similar to what IMPORT does. I was thinking 10 minutes.

Jittering the expiration times is a good idea. We should do that in other places as well to avoid waves of simultaneous range merges.
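
For reference, the SQL-level equivalent of a pre-split with an expiring sticky bit plus jitter might look like this (the exact expiration expression is illustrative):

ALTER TABLE kv.kv SPLIT AT VALUES (-1186540398063376634)
    WITH EXPIRATION now() + INTERVAL '10 minutes' + random() * INTERVAL '2 minutes';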

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 2, 2021
@nvanbenschoten (Member, Author)

I hacked together something for this to test out whether pre-splitting based on existing range boundaries would help out, as suggested by @petermattis in #62672 (comment). I then realized @jordanlewis was already testing this out on his LARGE DATA BANK stream in a much cleaner way - sorry to step on your toes! I didn't realize you had actually taken the stream suggestion. I'll let you take over from here.

Even the hacky version demonstrated a major improvement. Here's a screenshot of a series of three TRUNCATE operations using the same setup as listed above, performed under three different configurations:

  • 9:24: pre-splitting enabled, range merging enabled
  • 9:29: pre-splitting enabled, range merging disabled to avoid a few load-based range merges after load halts on the truncated table, which we found to be disruptive in kv/sql: GC after TRUNCATE temporarily craters write throughput #62700 and will likely be fixed by improving Raft snapshots
  • 9:34: pre-splitting disabled

Screen Shot 2021-04-02 at 5 42 14 PM

As we see, pre-splitting the new keyspace into some number of ranges and scattering before switching over the load made a very large difference. We should proceed with that approach.

@jordanlewis (Member)

Haha, no worries @nvanbenschoten! I didn't tell you I was going to work on this on the stream. I also hadn't figured out how to test this as nicely as you have here, so hopefully we can merge our two approaches into something great.

craig bot pushed a commit that referenced this issue Apr 29, 2021
63043: sql: TRUNCATE preserves split points of indexes r=jordanlewis a=jordanlewis

Touches #62672

Previously, TRUNCATE unconditionally blew away all index split points
when creating the replacement empty indexes. This manifested in some bad
behavior when performing TRUNCATE on tables or indexes that had heavy
load and were heavily split already, because all of a sudden all of the
traffic that was nicely dispersed across all of the ranges would
redirect to the single new range that TRUNCATE created.

The bad performance would remediate over time as the database re-split
the new ranges, but preserving the splits across the index swap
boundaries is a faster way to get there.

Release note (sql change): TRUNCATE is now less disruptive on tables
with a lot of concurrent traffic.

64395: roachtest: improve varz regexp r=andreimatei a=andreimatei

It was recognizing:
follower_reads_success_count{store="1"} 27606
but not:
sql_select_count  1.652807e+06

Release note: None

Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@nvanbenschoten (Member, Author)

I ran through the original reproduction steps on master. The result...

Screen Shot 2021-05-05 at 2 58 12 PM

Things look significantly better. We do see a very minor latency blip during the pre-split-and-scatter and then later as the new table splits further due to load, but these are inconsequential compared to the latency spikes and corresponding throughput craters we were seeing before.

Nice job on this @jordanlewis!

@aayushshah15 (Contributor) commented May 17, 2021

I believe we're seeing a very analogous series of events play out due to an index backfill. The same customer that prompted this issue has seen similar impact to foreground traffic on a table the moment an index backfill is kicked off.

On 20.2.8, with @nvanbenschoten's repro steps from above, but kicking off an index creation (1:44) instead of a TRUNCATE, we get a very similar set of observable symptoms:
backfill_20 2 8

The rate of load-based range splits seems to be about the same as what @nvanbenschoten observed in his repro -- I'm seeing about 1 split per range every ~25 seconds.

The story is similar on master but we don't see a dip in Replicas per Node like we do on 20.2, presumably due to #64201.

From my (admittedly limited) understanding of this area, that does seem to check out. The index backfiller only creates size-based splits as it ingests SSTs, so the moment an index backfill is kicked off, all write traffic is bottlenecked on the (new) secondary index, which is consolidated onto a small set of ranges until load-based splitting catches up.

rp run ${USER}-backfill:1-16 -- grep -ER "/Table/53/3/.*initiating a split of this range" logs/ | ag -o "[0-9]+.[0-9]+ splitQPS"
>
...
3216.26 splitQPS
3216.26 splitQPS
3318.35 splitQPS
3318.35 splitQPS
25791.39 splitQPS
24843.94 splitQPS
11742.71 splitQPS
11575.02 splitQPS
...

I reckon a fix to this would be very similar to #63043 in theory. However, we might not want to do any downsampling in the case of an index backfill.

I'm re-opening this issue for now since it has a ton of valuable context.

cc @dt as you were looking into this.

@aayushshah15 aayushshah15 reopened this May 17, 2021
@aayushshah15 aayushshah15 changed the title kv/sql: TRUNCATE temporarily craters write throughput on large tables kv/sql: index backfill and truncate temporarily crater write throughput on large tables May 17, 2021
@nvanbenschoten (Member, Author)

Thanks for writing this up @aayushshah15!

The story is similar on master but we don't see a dip in Replicas per Node like we do on 20.2, presumably due to #64201.

Beyond the impact on Replicas per Node, did this have any effect on latency or throughput?

I reckon a fix to this would be very similar to #63043 in theory. However, we might not want to do any downsampling in the case of an index backfill.

It seems a little less clear what we should do in this case, because we don't trivially know the distribution of rows that will land in the secondary index. With TRUNCATE, things were simpler, because we had some reason to believe that the indexes would look similar to how they looked before the TRUNCATE.

Have you given any thought to this? Perhaps we could sample the existing PK and perform the secondary index translation to get a sense for what the secondary index will look like ahead of time?
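
A rough sketch of what that sampling could look like at the SQL level for the repro's table, assuming the new index is on v (purely illustrative, not a concrete design):

-- Sample ~1% of existing rows and take one value per bucket of the future
-- index column as an approximate split point for the new index.
SELECT min(v) AS split_point
  FROM (SELECT v, ntile(16) OVER (ORDER BY v) AS bucket
          FROM kv.kv WHERE random() < 0.01)
 GROUP BY bucket;
-- ...which could then feed a pre-split/scatter of the index being built, e.g.
-- ALTER INDEX kv.kv@test SPLIT AT VALUES (...); ALTER INDEX kv.kv@test SCATTER;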

Also, have we confirmed that this is still a pressing issue for the customer? Have we seen the behavior of an index backfill now that the Pebble write stall issues have been resolved? I have to imagine that AddSSTable requests were some of the most impacted operations when file creation was taking 60ms and leading to full engine stalls. So it seems very possible that even though there is work to do here, it no longer has the same urgency for the customer.

As much as we can, I think we should re-assess all of the items this customer is interested in to determine whether they were being exacerbated by the write stall issues.

@aayushshah15 (Contributor)

Beyond the impact on Replicas per Node, did this have any effect on latency or throughput?

On master:
backfill_master_may17

On 20.2.8:
backfill_20 2 8_fullview

So it seems like there's no real difference besides the less aggressive range merges.

Have we seen the behavior of an index backfill now that the Pebble write stall issues have been resolved?

Just to confirm, you're referring to cockroachdb/pebble#1125, right? The run on master included the corresponding vendor bump for this change, so it seems like it did not have much of an impact here. (That makes sense, right? If the consolidation of all the write traffic onto a handful of ranges[1] is the bottleneck, then we wouldn't expect those Pebble fixes to help, would we?)

because we don't trivially know the distribution of rows that will land in the secondary index

Have you given any thought to this? Perhaps we could sample the existing PK and perform the secondary index translation to get a sense for what the secondary index will look like ahead of time?

That's a good point. I'm having a hard time understanding your proposal, though. Is the idea that, since the system already has some idea of the mapping from a primary index's data distribution to its load distribution, we can somehow come up with a way to form a mapping between a secondary index's data distribution and its load distribution (and pre-split based on that)? Maybe I'm totally misunderstanding you and this is not what you mean at all.

[1]: Range counts on the new secondary index right after the backfill is kicked off
while true
do
        rp ssh ${USER}-backfill:1 -- './cockroach sql --insecure -e "SELECT count(*) FROM [ SHOW RANGES FROM INDEX kv.kv@test ]"';
        sleep 3
done
>
count
2
count
2
count
2
count
3
count
3
count
4
count
4
count
4
count
6
...

@dt (Member) commented May 17, 2021

If this hypothesis is correct, we should see this dip in SQL throughput just by replacing the index backfill itself with a sleep 300, and not reading/adding SSTables at all, right?

If so, then I wonder if it might help to instead ramp traffic to the index, e.g. adding new DELETE_ONLY_RAMP and DELETE_AND_WRITE_ONLY_RAMP states that we step through, during which SQL writes only probabilistically opt to update the new index, and then let load-based splitting take care of it based on real observed traffic. Then, after the ramp period, we step to the current DELETE_ONLY, then DELETE_AND_WRITE_ONLY, wait for one version, and do the regular backfill? During the ramp phase we could probabilistically choose to include the update based on time elapsed since the state change or something.
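
A toy illustration of that ramp (the function and states are hypothetical, not the schema changer's actual API):

package rampsketch

import (
	"math/rand"
	"time"
)

// shouldWriteToRampingIndex decides whether a given SQL write maintains the
// new index while it is in a hypothetical *_RAMP state. The probability grows
// linearly from 0 to 1 over rampDuration, so load-based splitting sees index
// traffic gradually rather than all at once.
func shouldWriteToRampingIndex(stateChangedAt time.Time, rampDuration time.Duration) bool {
	frac := float64(time.Since(stateChangedAt)) / float64(rampDuration)
	if frac >= 1 {
		return true // ramp over; behave like the regular write-only states
	}
	return rand.Float64() < frac
}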

@dt (Member) commented May 17, 2021

That said, when we looked at a customer's cluster that had seen throughput fall to single-digit percentage of what it was before the CREATE INDEX, it stayed that way for a long time -- much longer than one or two minutes. While it was in that state, we enabled diagnostics collection for one of the slow queries on the table and observed it waited nearly 1s for quota pool (it usually ran in single-digit millis). Given it stayed in that state for a long period, maybe just waiting for load-based splitting wouldn't help after all.

@aayushshah15 (Contributor) commented May 17, 2021

That said, when we looked at a customer's cluster that had seen throughput fall to single-digit percentage of what it was before the CREATE INDEX, it stayed that way for a long time -- much longer than one or two minutes.

That's more a function of the size of the table and the overall traffic to that table, right? Like, in this reproduction, the entire cluster had ~400 ranges before the backfill. It seems like it would just be a longer climb back to peak throughput if the table had more data. I can run a much larger repro than this one to confirm.

While it was in that state, we enabled diagnostics collection for one of the slow queries on the table and observed it waited nearly 1s for quota pool (it usually ran in single-digit millis).

Would you mind pointing me to the trace event that you saw?

It also seems like what we saw here is mostly expected -- given that the proposal quota pool has a size of 8mb (per range) with the default log truncation threshold. If all write traffic was consolidated to a single range at the beginning, we would expect the quota pool to step in and backpressure write traffic. This would've been exacerbated by the write-heavy nature of the customer's workload.

@dt (Member) commented May 17, 2021

Unfortunately the trace was collected and viewed during a live debugging session with the customer. I don't know if it was retained, but can follow-up with you offline for details.

@nvanbenschoten (Member, Author)

Just to confirm, you're referring to cockroachdb/pebble#1125, right? The run on master included the corresponding vendor bump for this change, so it seems like it did not have much of impact here. (That makes sense right? If the consolidation of all the write traffic to a handful of ranges[1] is the bottleneck then we wouldn't expect those pebble fixes to help, would we?)

I'm also referring to the encryption=plain config that was causing file creation and deletion to take 60+ ms (which is what allowed cockroachdb/pebble#1125 to become such an issue). It makes sense that it wouldn't have an impact on this test, but another important question is whether the immediate motivation for this issue was due to index backfills triggering and/or waiting on write stalls. In other words, is this issue (even including the TRUNCATE part) motivated by the 3 minutes of reduced write throughput, or were things significantly worse on the customer's cluster due to the write stall issue? David's comment in #62672 (comment) provides some evidence for that. And if so, are they still this bad? Now that both of those issues are fixed, things may look very different on their cluster.

I'm not saying we shouldn't continue exploring how to improve this, just that we should re-evaluate (in part, for the customer's sake) how bad things remain now that the other potentially related issues have been resolved.

@aayushshah15 (Contributor)

@lunevalex suggested an approach to mitigate the foreground impact that we seem to have previously considered for some other use cases as well.

The idea was that we would perform 2 backfills -- one that corresponds to when the index creation is kicked off, say with ts = 10. Until this first backfill completes, we would make it such that foreground write traffic on the table doesn't update this index. Say this first backfill completes at ts = 20. At this point, we start a second backfill that ingests mutations from timestamps [11, 20] while also having foreground write traffic update the index (i.e. what we currently do with the index when it is DELETE_AND_WRITE_ONLY). This second backfill should be able to use a time-bound iterator, and should be considerably faster because of that.

This should get us out of the main issue of load consolidation on a handful of ranges, since by the time ongoing writes have to update the new index, that index at least has a bunch of size-based splits. Notably, though, this approach wouldn't capture any load-based splits until write traffic has to start updating the index, so it probably warrants some prototyping to figure out whether it will be adequate.

cc @shermanCRL and @pbardea, we should discuss where the prototyping / implementation work items for this should be routed, as the work is mostly going to reside in areas owned by bulk.

@dt (Member) commented May 26, 2021

See also: #36850

@dt (Member) commented May 28, 2021

#36850 could also be a solution to a lots-of-SQL-traffic-hitting-un-split-index-span, but it is a pretty heavy-weight way to get some key span to split and would be a non-trivial undertaking. If all we need is to split the span, we might want to look at some shorter-term solutions which might offer quicker wins. Off the top of my head:

  1. sample 1% of rows from the table, generate index keys from those rows and split at them
  2. use the sampling already in table stats to pick some split points
  3. Instead of jumping straight to DELETE_ONLY/DELETE_AND_WRITE_ONLY, bleed SQL traffic onto the new span slowly: add a ramp stage of these states during which SQL has a % chance of emitting the delete/write, so load-based splitting can do its thing. Once we've been in the ramp phase for some time, we could move to the actual DELETE/WRITE only.
    3b. Or, even quicker than waiting for load-based splitting, during the ramp stage we could have SQL include an adminSplit with the batch, in addition to the delete/write, and add an If-QPS-Above=5 flag to that split so it is a noop if we're split enough.

We've evaluated #36850 a couple times over the years, but we've always said we want to do some experiments to validate its benefits before we sink too much effort into it because it is non-trivial. See #54955
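
For option 2 above, the raw material is already reachable from SQL: table statistics include histogram buckets (when collected on the relevant columns) whose upper bounds could serve as candidate split points.

-- Histogram bucket upper bounds in the JSON output could be mapped to keys
-- of the new index (sketch; field layout follows SHOW STATISTICS USING JSON).
SHOW STATISTICS USING JSON FOR TABLE kv.kv;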

@aayushshah15 (Contributor)

@dt suggested that we should confirm that the backfill itself doesn't have a role to play in this. I ran with a 5-minute sleep right before the index backfiller begins constructing and ingesting SSTs, and saw the following.
backfill_400kb

We're seeing one valley right when the index creation starts, and a slightly shorter dip during the backfill (albeit with a pretty severe impact on p99 latencies). The first crater seems to point decisively to load consolidation, as we were suspecting before. For instance, we see the CPU utilization on one of the nodes surge, along with a corresponding dip on all the other nodes.
backfill_cpu_consolidation

The cratering we were seeing when the backfill starts seemed to be caused by the disruptive impact of SST ingestion, as @nvanbenschoten had suspected before:
rocksdb.flushes
backfill_flushes

To validate this, I ran with kv.bulk_io_write.small_write_size set to 40mb (it defaults to 400kb). This pretty much eliminated the second throughput blip (along with the surge in p99 latencies) that we were seeing when the backfill began.
backfill_ssts_as_writes
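
For reference, the experiment above amounts to the following (this is the cluster setting named in this comment; the value syntax may vary slightly by version):

-- Apply AddSSTable requests below this size as ordinary writes instead of
-- ingesting them as SSTs (default 400 KiB).
SET CLUSTER SETTING kv.bulk_io_write.small_write_size = '40MiB';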

@lunevalex (Collaborator)

I am going to close this, as we've done everything we are going to do in KV here, and there are other issues tracking the proposed changes to the index backfiller.
