kv/sql: GC after TRUNCATE temporarily craters write throughput #62700

Closed
nvanbenschoten opened this issue Mar 28, 2021 · 15 comments · Fixed by #64201
Labels
A-kv-distribution Relating to rebalancing and leasing. A-kv-transactions Relating to MVCC and the transactional model. A-storage Relating to our storage engine (Pebble) on-disk storage. C-investigation Further steps needed to qualify. C-label will change.

Comments

@nvanbenschoten
Member

nvanbenschoten commented Mar 28, 2021

This issue describes the back half of #62672.

After waiting out the GC TTL of a TRUNCATE or DROP TABLE, a series of ClearRange requests is issued to clear out the space. This can cause a dramatic drop in write throughput on other tables that takes tens of minutes to recover from. In the words of the customer that helped us diagnose this: "it's a write availability time bomb."

Reproduction

The easiest way to reproduce this is to use the clearrange/checks=false roachtest, with a slight adaptation to run load during the cleanup:

diff --git a/pkg/cmd/roachtest/clearrange.go b/pkg/cmd/roachtest/clearrange.go
index e557cdd17b..25d8b4d10c 100644
--- a/pkg/cmd/roachtest/clearrange.go
+++ b/pkg/cmd/roachtest/clearrange.go
@@ -73,7 +73,7 @@ func runClearRange(ctx context.Context, t *test, c *cluster, aggressiveChecks bo

        if t.buildVersion.AtLeast(version.MustParse("v19.2.0")) {
                conn := c.Conn(ctx, 1)
-               if _, err := conn.ExecContext(ctx, `SET CLUSTER SETTING kv.bulk_io_write.concurrent_addsstable_requests = $1`, c.spec.NodeCount); err != nil {
+               if _, err := conn.ExecContext(ctx, `SET CLUSTER SETTING kv.bulk_io_write.concurrent_addsstable_requests = $1`, 8); err != nil {
                        t.Fatal(err)
                }
                conn.Close()
@@ -114,6 +114,11 @@ func runClearRange(ctx context.Context, t *test, c *cluster, aggressiveChecks bo
        }()

        m := newMonitor(ctx, c)
+       m.Go(func(ctx context.Context) error {
+               c.Run(ctx, c.Node(1), `./cockroach workload init kv`)
+               c.Run(ctx, c.All(), `./cockroach workload run kv --concurrency=32 --duration=1h30m`)
+               return nil
+       })
        m.Go(func(ctx context.Context) error {
                conn := c.Conn(ctx, 1)
                defer conn.Close()
@@ -132,7 +137,7 @@ func runClearRange(ctx context.Context, t *test, c *cluster, aggressiveChecks bo

                // Set a low TTL so that the ClearRange-based cleanup mechanism can kick in earlier.
                // This could also be done after dropping the table.
-               if _, err := conn.ExecContext(ctx, `ALTER TABLE bigbank.bank CONFIGURE ZONE USING gc.ttlseconds = 30`); err != nil {
+               if _, err := conn.ExecContext(ctx, `ALTER TABLE bigbank.bank CONFIGURE ZONE USING gc.ttlseconds = 1800`); err != nil {
                        return err
                }

This configures the test to import a 2 TB table, ramp write load up on a different table over 30 minutes, clear the 2 TB table, and then observe the impact on the foreground traffic. The effect is not pretty:

Screen Shot 2021-03-28 at 2 46 43 PM

Other notes

In attempting to reproduce, I also ran this with a configuration that looked more similar to the relevant customer's cluster - 20 nodes, 32 vCPU per node, 2 SSDs per node, and a 4 TB table. This looked very similar:

Note: this is a 10-minute time scale, not 30 minutes like above.

Screen Shot 2021-03-27 at 10 49 58 PM

Screen Shot 2021-03-27 at 10 52 18 PM

Epic: CRDB-2627

@nvanbenschoten nvanbenschoten added C-investigation Further steps needed to qualify. C-label will change. A-kv-distribution Relating to rebalancing and leasing. A-storage Relating to our storage engine (Pebble) on-disk storage. A-kv-transactions Relating to MVCC and the transactional model. labels Mar 28, 2021
@lunevalex
Collaborator

@nvanbenschoten was this run with the Pebble fix cockroachdb/pebble#1085?

@nvanbenschoten
Member Author

No, it was not. This was running v20.2.6.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 2, 2021
This was the original intention with the `SnapshotRequest_Strategy`
structure and was prototyped in cockroachdb#25134, but we never pushed it over
the finish line because we did not find cases where SST ingestion
was disruptive enough to warrant the extra complexity. cockroachdb#62700 tells
a different story.
@nvanbenschoten
Member Author

I did some exploration on master to see where things stand today, where things are going wrong, and what we can do about it. With the same setup as above, I saw roughly the same behavior. Things may have fared slightly better, as throughput never hit 0 and latency never spiked above 1s, but the GC was still quite disruptive.

Screen Shot 2021-04-01 at 5 17 09 PM

As expected, the GC is followed by a wave of range merges that are in some way correlated with the destabilization.

Screen Shot 2021-04-01 at 5 30 42 PM


I then ran a few experiments to get an idea of cause and effect, as well as to determine whether there are any short-term mitigations here. We had speculated that a wave of range merges was responsible for destabilizing the cluster due to the rebalancing load imposed by the merges themselves, along with the rebalancing load that fell out of the sudden change in cluster balance.

So first, I played with the kv.range_merge.queue_interval cluster setting, which dictates the rate at which a store will perform range merges. The rate defaults to 1 merge per second per store. The test was intentionally dropping this interval to 0, leading to no throttling at all. I tested with bumping the interval to 10s, leading to 1 range merge per 10 seconds per store. This demonstrated a large improvement in behavior.

Screen Shot 2021-04-01 at 5 16 29 PM

In this run, GC was followed by a steady rate of range merges and almost no impact to foreground traffic.

Screen Shot 2021-04-01 at 5 30 57 PM

This was a pretty clear indication that range merges are in some way responsible for the instability we are seeing here - either directly or indirectly. We actually already knew this to some extent, as we had seen earlier that with range merges disabled entirely, the GC was mostly non-disruptive. Then, when we enabled range merges, the instability immediately hit. So this test once again confirmed that fact. It also demonstrated that reducing the aggressiveness of range merges is a good short-term mitigation for this problem. To summarize, the following should make TRUNCATE much less disruptive, even in earlier versions of CRDB:

SET CLUSTER SETTING kv.range_merge.queue_interval = '10s';

I then started digging into why a wave of range merges on a keyspace disjoint from the load was so disruptive. In this setup, we have very few ranges outside of those in the table being GCed, so the idea that range merges were leading to other rebalancing activity didn't quite add up. And yet, we still saw serious disruption due to range merges. So it seemed likely that the range merges, and the empty-range rebalancing they lead to, were themselves causing the problems.

One theory I had was that we've been failing to account for a fixed externality of Raft snapshots - the need to flush the LSM's overlapping memtables when ingesting a Raft snapshot. This is a cost that a Raft snapshot currently (since #38932) has to pay, regardless of how small the snapshot is. It is also a cost that may be passed on to concurrent writers. As detailed in the documentation on (*pebble.DB).Ingest:

Note that if the mutable memtable overlaps with ingestion, a flush of the memtable is forced equivalent to DB.Flush. Additionally, subsequent mutations that get sequence numbers larger than the ingestion sequence number get queued up behind the ingestion waiting for it to complete. This can produce a noticeable hiccup in performance. See cockroachdb/pebble#25 for an idea for how to fix this hiccup.

I put together a prototype of a change we had originally (see #25134) intended to make when addressing #16954. Instead of always using external SST ingestion during Raft snapshot application, we would only do so for ranges above some size threshold. This allows us to use the LSM's standard write path when applying Raft snapshots below a size threshold, avoiding the memtable flush and "clean up" compactions of small SSTs. This prototype is hacked together in #63013.
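
To make the size-threshold decision concrete, here is a minimal sketch of the idea (not the actual #63013 code); the package, function, and helper names, and the 1 MiB cutoff, are all hypothetical:

```go
package snapshotapply

// applyFn stands in for either of the two apply paths; the names passed in
// below are hypothetical, not CockroachDB's real functions.
type applyFn func(ssts [][]byte) error

// smallSnapshotThreshold is an assumed cutoff; picking a principled value is
// an open question.
const smallSnapshotThreshold = 1 << 20 // 1 MiB

// chooseApplyPath picks the LSM's normal write path (via a WriteBatch) for
// small snapshots, avoiding the forced memtable flush that SST ingestion can
// trigger, and falls back to SST ingestion for large snapshots.
func chooseApplyPath(ssts [][]byte, viaWriteBatch, viaIngest applyFn) applyFn {
	var total int
	for _, sst := range ssts {
		total += len(sst)
	}
	if total <= smallSnapshotThreshold {
		return viaWriteBatch
	}
	return viaIngest
}
```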

Using this prototype, I ran the test, once again with range merging set to maximum aggressiveness (kv.range_merge.queue_interval = '0s'). The results are very promising.

Screen Shot 2021-04-01 at 11 58 03 PM

Screen Shot 2021-04-01 at 11 58 22 PM

In the test, we saw that there was a small blip in throughput when the GC went into effect due to p99 latency jumping from 30ms to about 60ms. However, throughput immediately recovered even as all 10k ranges were merged away in seconds. I'd like to re-run this experiment a few times to confirm that I'm not fooling myself, but this looks very good and hints that we may be on to something with the theory about SST ingestion during snapshot application being the cause of foreground latency.


One other thing I noticed when drafting this prototype is that we do not call PreIngestDelay during Raft snapshot ingestion like we do before ingesting an SST in an AddSSTable Raft proposal. Is this intentional? Could this help prevent rapid rebalancing from trashing L0? cc. @petermattis

@petermattis
Collaborator

Great sleuthing, @nvanbenschoten.

One other thing I noticed when drafting this prototype is that we do not call PreIngestDelay during Raft snapshot ingestion like we do before ingesting an SST in an AddSSTable Raft proposal. Is this intentional? Could this help prevent rapid rebalancing from trashing L0? cc. @petermattis

I don't recall if that was intentional, and I don't know if it would help. The snapshot pre-ingest delay stuff was added by the Bulk IO team and we may simply have never thought to add it to the Raft snapshot ingestion path. On the other hand, delaying Raft snapshot ingestion will block other operations on the replica. Could that be problematic? I'm too distant from this area of the code now to judge. Certainly seems like a possible easy thing to experiment with.

I put together a prototype of a change we had originally (see #25134) intended to make when addressing #16954. Instead of always using external SST ingestion during Raft snapshot application, we would only do so for ranges above some size threshold. This allows us to use the LSM's standard write path when applying Raft snapshots below a size threshold, avoiding the memtable flush and "clean up" compactions of small SSTs. This prototype is hacked together in #63013.

Not ingesting very small sstables sounds like a good idea, though I don't have a good idea for how to make a principled decision on the size threshold at which to do ingestion vs the standard write path. Perhaps @sumeerbhola has some ideas. We could also consider doing this at the Pebble layer. cockroachdb/pebble#25 is one possibility for making ingestions faster. Another is that a very small ingested sstable could be transformed into an addition to the memtable. The WAL entry would be a pointer to the ingested sstable and the "commit" would iterate over the sstable and add the entries to the memtable. The upside to this approach is that it feels like a storage-level concern. The downside is that either cockroachdb/pebble#25 or another approach would not be backward compatible, so we'd have to gate using that on a cluster version.

@jbowens
Collaborator

jbowens commented Apr 2, 2021

Nice @nvanbenschoten!

Another is that a very small ingested sstable could be transformed into an addition to the memtable. The WAL entry would be a pointer to the ingested sstable and the "commit" would iterate over the sstable and add the entries to the memtable. The upside to this approach is that it feels like a storage-level concern.

Pebble also has the context to determine whether or not the files overlap the memtable. Maybe we can incorporate that into the heuristic of whether or not to apply the ingest as a memtable addition.

@sumeerbhola
Collaborator

I don't have an opinion regarding what to do regarding a backportable fix. But going forward, given the number of places we ingest sstables, and how sensitive customers are likely to be wrt high percentile latency hiccups in an OLTP setting, cockroachdb/pebble#25 seems like the right approach (regardless of sstable size).

@nvanbenschoten
Member Author

We could also consider doing this at the Pebble layer.

I am generally very much in favor of doing this in Pebble, as SST ingestion is used in a number of places, as Sumeer is saying. I suspect that this is in some way related to the concerns we periodically hear about index creation being disruptive to foreground traffic.

That said, it is at least worth mentioning the I/O we could save by avoiding SST ingestion for small Raft snapshots. Today, each Raft snapshot results in 4+ SSTs, most of which are very small. In situations like we see here, all of them are very small. So we end up writing out 4+ small SSTs to the filesystem, then ingesting them, then quickly compacting them away. A WriteBatch avoids this wasted I/O, so it should be more efficient. How much more efficient? I don't know - so it would be a premature optimization to do anything here if more general solutions exist. And the downside is of course the added code complexity of needing to maintain an SST ingestion path and a WriteBatch path for Raft snapshots. But a WriteBatch path may be a more reasonable backport target.

@petermattis
Collaborator

That said, it is at least worth mentioning the I/O we could save by avoiding SST ingestion for small Raft snapshots. Today, each Raft snapshot results in 4+ SSTs, most of which are very small. In situations like we see here, all of them are very small. So we end up writing out 4+ small SSTs to the filesystem, then ingesting them, then quickly compacting them away. A WriteBatch avoids this wasted I/O, so it should be more efficient. How much more efficient? I don't know - so it would be a premature optimization to do anything here if more general solutions exist. And the downside is of course the added code complexity of needing to maintain an SST ingestion path and a WriteBatch path for Raft snapshots. But a WriteBatch path may be a more reasonable backport target.

This is a good point. We need to be mindful of what is a more reasonable backport target. The proposed improvement to Pebble's sstable ingestion will not be backportable. The creation of the 4 small sstables will also result in 4 fsyncs. The WriteBatch approach avoids that, doing a single fsync of the Pebble WAL.

@nvanbenschoten
Member Author

Unfortunately, I no longer believe that the analysis above is complete. About a week ago, I began working on a patch that would avoid the IngestExternalFiles calls when applying small Raft snapshots in a manner that I believed would be backportable. Unlike the first prototype, which built a WriteBatch instead of writing sst files as the Raft snapshot streamed in, this more targeted approach continued to write the sst files, but would then read them back and copy them into a WriteBatch, applying that if it determined that the WriteBatch would be reasonably sized. The patch is here.

I then ran a few more tests to confirm that the patch did address the issue, as I had seen before with the slightly different approach to avoiding the sst ingestion. Unfortunately, I was not able to demonstrate the improvement. I did see that we were no longer rapidly flushing memtables during the GC period, so the patch was doing what it intended to.

Screen Shot 2021-04-06 at 7 49 59 PM

without patch

Screen Shot 2021-04-06 at 11 32 13 PM

with patch

However, I once again saw throughput crater after a short period of time, eventually bottoming out at 0:

Screen Shot 2021-04-06 at 7 48 30 PM

So I went back to the original patch from above that I had seen demonstrate such an improvement. Sure enough, I had a hard time reproducing those results, as I was still seeing cases where throughput was cratering.

So it became clear that the previous diagnosis was not correct. There is more to the story.

I started exploring hardware metrics to try to correlate the degraded throughput with degradation of storage performance. There was certainly a correlation, as log commit latency often spiked during the GC period, as did the IOPS In Progress metric, but there wasn't much more to go on here.

Screen Shot 2021-04-06 at 7 49 46 PM

Screen Shot 2021-04-06 at 7 49 37 PM


I came back to the issue this week. This time, I focused less on storage-level performance degradation and more on distribution-level operations that may be causing stalls. The throughput hit began to look too severe to be explained by degraded storage performance, even if log commit latencies were spiking into the 800ms range. In each of the previous reproductions, there had been a period where performance degraded by about 60%, which can mostly be explained by a jump in storage latency. However, this had always been followed by a full outage period where qps hit 0, often repeatedly, which was even more concerning.

After a few tests, the following correlation caught my eye:

Screen Shot 2021-04-22 at 10 58 54 PM

Screen Shot 2021-04-22 at 10 58 58 PM

Notice that throughput is inversely correlated with the combined occurrences of NotLeaseHolderErrors and RangeNotFoundErrors. The first of these errors is thrown when a request is routed to a follower replica that the DistSender thought held the lease but did not. It is a sign of lease transfers. The second is thrown when a request is routed to a store that does not contain a replica for the destination range. It is a sign of rebalancing or range merging.

This was all a bit surprising, as the expectation was that merging and rebalancing were occurring on the dropped keyspace that was being GCed, but not on the keyspace of the active table which was seeing all the load. And yet, requests were hitting these errors.

I began playing with cluster settings to try to disable all lease transfers. The theory was that, due to the range merges on the inactive table, we were moving around the leases on the active table to balance qps and count. The thought was that storage slowed down in response to the many compactions caused by the range merges on the inactive table, and we then started moving leases around. Those lease transfers would be quite slow because storage was slow (especially in the tail), so we would see seconds-long blips of unavailability, because a range is unavailable between the time it begins a lease transfer and the time the transfer applies on the new leaseholder. In that way, a lease transfer is a lot like a range-wide lock. Since the load was randomly distributed across all ranges, the second a range became unavailable, everyone would pile up on it and qps across the cluster would go to 0.

To test this out, I made an attempt to prevent all lease transfers. I tested with the following cluster settings:

SET CLUSTER SETTING kv.allocator.min_lease_transfer_interval = '20m';
SET CLUSTER SETTING kv.allocator.load_based_lease_rebalancing.enabled = false;
SET CLUSTER SETTING kv.allocator.load_based_rebalancing = 'off';
SET CLUSTER SETTING kv.allocator.range_rebalance_threshold = 0.25;

Unfortunately, this also did not help.


Next, I started thinking about the RangeNotFoundErrors, because I couldn't quite explain the rebalancing on the active table. I pulled the system.rangelog on the active table over the time period of the degraded qps to get a sense for what was happening over the course of this instability:

> SELECT table_id FROM crdb_internal.tables WHERE name = 'kv';

  table_id
------------
        57


> SELECT
    "eventType", count(*)
FROM
    system.rangelog
WHERE
    crdb_internal.pretty_key(decode(info::JSONB->'UpdatedDesc'->>'start_key', 'base64'), 0) LIKE '/57%'
    AND "timestamp" BETWEEN '2021-04-23 02:45:00' AND '2021-04-23 03:00:00'
GROUP BY
    "eventType";

   eventType   | count
---------------+--------
  split        |   119
  add_voter    |   393
  remove_voter |   399
  merge        |   100

What jumped out wasn't the rebalancing, but the range merges. Why on earth were we merging ranges in the active table? The answer lies in why these ranges existed in the first place - they were load-based splits! So when qps dropped on the cluster, the mergeQueue decided that the split points were no longer needed, so it removed them by merging ranges. This led to more of a slowdown, and more merges. And the merge cascade continued as the load struggled to keep up while being bounced around between nodes and stalled on merges and lease transfers. Furthermore, as ranges merged, the load began to consolidate back onto a few nodes instead of being well distributed across the cluster.

There's some history here, as something similar was tracked in #41317. We closed that issue by introducing a new kv.range_split.by_load_merge_delay cluster setting, which dictates how long a range split created due to load will wait before being considered for merging. By default, a load-based split will not be merged away if it was created less than 5 minutes ago. This helps, but it's also not a complete solution, as is discussed in the issue. The problem is that the delay doesn't prevent a load-based split from immediately being merged away if the load is suddenly cut after 5 minutes. To implement something like that, we'd need to track the history of qps on a range, along the lines of the sketch below.
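
As a rough sketch of what per-range QPS history could look like (hypothetical names, not an actual implementation): keep the one-second samples over a retention window and let the mergeQueue consult the maximum over the window rather than just the latest sample.

```go
package qpshistory

import "time"

// sample is a one-second QPS measurement.
type sample struct {
	at  time.Time
	qps float64
}

// tracker retains samples for a window so that merge decisions can look at the
// maximum QPS over the window rather than the instantaneous value.
type tracker struct {
	retention time.Duration
	samples   []sample // oldest first
}

// record appends a sample and drops anything older than the retention window.
func (t *tracker) record(now time.Time, qps float64) {
	t.samples = append(t.samples, sample{at: now, qps: qps})
	cutoff := now.Add(-t.retention)
	for len(t.samples) > 0 && t.samples[0].at.Before(cutoff) {
		t.samples = t.samples[1:]
	}
}

// maxQPS returns the maximum QPS over the window and whether the window has
// been populated long enough to trust; if not, the caller should skip merging.
func (t *tracker) maxQPS(now time.Time) (maxSeen float64, ok bool) {
	ok = len(t.samples) > 0 && now.Sub(t.samples[0].at) >= t.retention-time.Second
	for _, s := range t.samples {
		if s.qps > maxSeen {
			maxSeen = s.qps
		}
	}
	return maxSeen, ok
}
```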

So to test this theory, I ran another test with kv.range_split.by_load_merge_delay set to 2 hours, so that load-based splits would never be merged away. The result:

Screen Shot 2021-04-23 at 12 51 52 AM

What we see is that qps still drops by about 60% during the range merges, as we had seen previously. However, there was no follow-up period of reverberating full throughput stalls, which were bad on this 16-node cluster and way worse on the 72-node cluster we saw before. Instead, qps dips while inactive ranges are merging and log commit latency grows, and then recovers immediately after log commit latency recovers. So it looks like the merge cascade of load-based splits can explain the extended unavailability we have seen in these clusters (this time, I triple checked 😃).

This also explains why I somehow got lucky enough to fool myself up above. This "merge cascade on performance dip" behavior is quite unstable. Once you hit it, the system melts down. But you also won't hit it if you never see an initial performance dip. So for one reason or another, maybe because the memtable stall theory wasn't totally off base, my experiments with that change never triggered the beginning of the cascade (never slowing qps down just enough to allow a single range in the active table to merge), so they never saw the performance collapse.


So there are a few takeaways from this:

First, we should revisit #41317. Load-based splitting is a great tool, but if performance depends on the splits that it creates and the splits disappear the moment load slows down, it creates a very unstable situation.

Second, we should determine what short-term suggestions we can make with this new understanding. My guess is that the following would serve the current user who is running into this quite well, as it would prevent load-based splits created between the TRUNCATE and GC from being merged away during any temporary instability during the GC:

SET CLUSTER SETTING kv.range_split.by_load_merge_delay = <2 x gc.ttlseconds>

Finally, we should continue investigating the degraded (but not stalled) performance while ranges on the inactive table are merging. Unlike with the more confounding stalls, we see a clear correlation between the reduced throughput and all of (disk read bytes, disk write bytes, disk read IOPS, log commit tail latency, and LSM compactions). So on the surface, this looks less bizarre. This may be a result of the test making range merges extra aggressive, but there is likely something to learn here.

@sumeerbhola
Collaborator

First, we should revisit #41317. Load-based splitting is a great tool, but if performance depends on the splits that it creates and the splits disappear the moment load slows down, it creates a very unstable situation.

(from a past life) In addition to that, do we take ongoing count and latency of merges into account when deciding whether to start a new merge? If we took resource overload signals into account before deciding to merge, we would potentially fare better.

So when qps dropped on the cluster, the mergeQueue decided that the split points were no longer needed, so it removed them by merging ranges. This led to more of a slowdown, and more merges.

It is unfortunate that we cannot differentiate between low load because the external load is actually low, versus low load because the closed-loop load generators are experiencing a slow system and have therefore reduced their rate. Is there an explicit queue somewhere that we could measure at the Replica level to realize that load may not have reduced? Or could we see if the raft proposal quota is decreasing?
The current admission control work is at the granularity of a node, so queueing there wouldn't tell us anything at the granularity of a range (though it could be used as a coarse off switch to stop merges involving this node).

@petermattis
Collaborator

It is unfortunate that we cannot differentiate between low load because the external load is actually low, versus low load because the closed-loop load generators are experiencing a slow system and have therefore reduced their rate. Is there an explicit queue somewhere that we could measure at the Replica level to realize that load may not have reduced? Or could we see if the raft proposal quota is decreasing?

I was thinking something similar when I read @nvanbenschoten's write-up. Splitting a range based on load to that range makes sense. For merging, simply looking at the load on the ranges to be merged seems insufficient. I think a coarse switch might be ok. Don't we do something similar to throttle SQL stats collection? I'm imagining something where we could look at CPU, disk IOPS, compaction debt, etc and temporarily disable range merges if they are high.

@nvanbenschoten
Member Author

In addition to that, do we take ongoing count and latency of merges into account when deciding whether to start a new merge? If we took resource overload signals into account before deciding to merge, we would potentially fare better.

We don't. The only thing we do is throttle merges based on the kv.range_merge.queue_interval cluster setting, which defaults to 1s per store but which this test drops to 0. I ran the test with the increased kv.range_split.by_load_merge_delay and the 1s delay between merges and things look pretty stable, with only minor performance dips that correlate with compactions (as we would expect from a write-heavy workload):

Screen Shot 2021-04-23 at 4 18 37 PM

Screen Shot 2021-04-23 at 4 21 24 PM

Don't we do something similar to throttle SQL stats collection?

For SQL stats collection, we throttle based on CPU utilization on the current node here:

```go
if s.maxFractionIdle > 0 {
    // Look at CRDB's average CPU usage in the last 10 seconds:
    // - if it is lower than cpuUsageMinThrottle, we do not throttle;
    // - if it is higher than cpuUsageMaxThrottle, we throttle all the way;
    // - in-between, we scale the idle time proportionally.
    usage := s.flowCtx.Cfg.RuntimeStats.GetCPUCombinedPercentNorm()
    if usage > cpuUsageMinThrottle {
        fractionIdle := s.maxFractionIdle
        if usage < cpuUsageMaxThrottle {
            fractionIdle *= (usage - cpuUsageMinThrottle) /
                (cpuUsageMaxThrottle - cpuUsageMinThrottle)
        }
        if log.V(1) {
            log.Infof(
                ctx, "throttling to fraction idle %.2f (based on usage %.2f)", fractionIdle, usage,
            )
        }

        elapsed := timeutil.Now().Sub(lastWakeupTime)
        // Throttle the processor according to fractionIdle.
        // Wait time is calculated as follows:
        //
        //   fraction_idle = t_wait / (t_run + t_wait)
        //   ==> t_wait = t_run * fraction_idle / (1 - fraction_idle)
        //
        wait := time.Duration(float64(elapsed) * fractionIdle / (1 - fractionIdle))
        if wait > maxIdleSleepTime {
            wait = maxIdleSleepTime
        }
        timer.Reset(wait)
        select {
        case <-timer.C:
            timer.Read = true
            break
        case <-s.flowCtx.Stopper().ShouldQuiesce():
            break
        }
    }
    lastWakeupTime = timeutil.Now()
}
```

The way this works: if CPU usage is below 25%, we don't throttle at all; if it is between 25% and 75%, we sleep for a fraction of maxFractionIdle proportional to the CPU usage; and if it is above 75%, we sleep for the full maxFractionIdle.

It is unfortunate that we cannot differentiate between low load because the external load is actually low, versus low load because the closed-loop load generators are experiencing a slow system and have therefore reduced their rate. Is there an explicit queue somewhere that we could measure at the Replica level to realize that load may not have reduced? Or could we see if the raft proposal quota is decreasing?

Splitting a range based on load to that range makes sense. For merging, simply looking at the load on the ranges to be merged seems insufficient. I think a coarse switch might be ok. ... I'm imagining something where we could look at CPU, disk IOPS, compaction debt, etc and temporarily disable range merges if they are high.

I'm still trying to understand how to think about these kinds of approaches, so if you two have references to prior art in this area, I'd find that very helpful. There's also discussion from @sumeerbhola about other areas where we could think along these lines in #57247 and #57248.

The challenges to this kind of throttling behavior seem to be that 1) we need cluster-wide information to make good decisions, otherwise we may not properly account for unbalanced load (one way or the other), 2) we need to determine which signals are appropriate to look at and which ones aren't (the classic admission control problem), and 3) it seems quite easy to accidentally design a throttling heuristic that results in range merges never running under certain workload characteristics. We may want some form of probabilistic throttling to avoid that kind of pathology.

An alternative approach I've been thinking of, along the lines of probabilistic throttling, is to keep track on the leaseholder of the first time a range was run through the mergeQueue and could have been merged because its load was low enough. The time would be reset whenever the mergeQueue processes a replica and the load is determined to be sufficiently high. The probability that we actually merge a range when it runs through the mergeQueue would start low and grow as this duration grows.
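
The following is only a sketch of that idea, with hypothetical names and a made-up ramp period:

```go
package mergeprob

import (
	"math/rand"
	"time"
)

// rampPeriod is an assumed ramp; the right value is an open question.
const rampPeriod = 10 * time.Minute

// eligibility lives on the leaseholder and remembers how long the range has
// continuously looked mergeable to the mergeQueue.
type eligibility struct {
	firstLowLoad time.Time // zero means load was high on the last pass
}

// shouldMerge is consulted each time the mergeQueue processes the replica.
func (e *eligibility) shouldMerge(now time.Time, loadIsLow bool) bool {
	if !loadIsLow {
		e.firstLowLoad = time.Time{} // high load resets the streak
		return false
	}
	if e.firstLowLoad.IsZero() {
		e.firstLowLoad = now
	}
	// The merge probability starts near zero and grows linearly to one as the
	// low-load streak approaches rampPeriod.
	p := float64(now.Sub(e.firstLowLoad)) / float64(rampPeriod)
	if p > 1 {
		p = 1
	}
	return rand.Float64() < p
}
```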

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 26, 2021
Informs cockroachdb#63009.
Informs cockroachdb#64056.

In cockroachdb#63009/cockroachdb#64056, we saw that this test could flake with a nil pointer panic.
I don't know quite what's going on here, but when working on a patch for cockroachdb#62700,
I managed to hit this panic reliably by accidentally breaking all range merges.

After a bit of debugging, it became clear that we were always hitting a panic in
the `reset` stage of `TestMergeQueue/sticky-bit` because the previous subtest,
`TestMergeQueue/non-collocated`, was moving the RHS range to a different node,
failing to merge the two ranges, and failing itself. This soft failure was being
drowned out by the hard failure in the next subtest.

This commit replaces the crash with a failure that looks something like the
following when range merges are completely disabled:
```
--- FAIL: TestMergeQueue (0.34s)
    test_log_scope.go:73: test logs captured to: /var/folders/8k/436yf8s97cl_27vlh270yb8c0000gp/T/logTestMergeQueue627909827
    test_log_scope.go:74: use -show-logs to present logs inline
    --- FAIL: TestMergeQueue/both-empty (0.00s)
        client_merge_test.go:4183: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/lhs-undersize (0.00s)
        client_merge_test.go:4192: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/combined-threshold (0.00s)
        client_merge_test.go:4214: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/non-collocated (0.03s)
        client_merge_test.go:4236: replica doesn't exist
    --- FAIL: TestMergeQueue/sticky-bit (0.00s)
        client_merge_test.go:4243: right-hand side range not found
    --- FAIL: TestMergeQueue/sticky-bit-expiration (0.00s)
        client_merge_test.go:4268: right-hand side range not found
```

I expect that under stress on master, we will see the
`TestMergeQueue/non-collocated` subtest fail.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 26, 2021
Closes cockroachdb#62700.
Re-addresses cockroachdb#41317.

This commit reworks how queries-per-second measurements are used when
determining whether to merge two ranges together. At a high-level, the change
moves from a scheme where the QPS over the last second on the LHS and RHS ranges
are combined and compared against a threshold (half the load-based split
threshold) to a scheme where the maximum QPS measured over the past 5 minutes
(configurable) on the LHS and RHS ranges are combined and compared against said
threshold.

The commit makes this change to avoid thrashing and to avoid overreacting to
temporary fluctuations in load. These overreactions lead to general instability
in clusters, as we saw in cockroachdb#41317. Worse, the overreactions compound and can lead
to cluster-wide meltdowns where a transient slowdown can trigger a wave of range
merges, which can slow the cluster down further, which can lead to more merges,
etc. This is what we saw in cockroachdb#62700. This behavior is bad on small clusters and
it is even worse on large ones, where range merges don't just interrupt traffic,
but also result in a centralization of load in a previously well-distributed
dataset, undoing all of the hard work of load-based splitting and rebalancing
and creating serious hotspots.

The commit improves this situation by introducing a form of memory into the
load-based split `Decider`. This is the object which was previously only
responsible for measuring queries-per-second on a range and triggering the
process of finding a load-based split point. The object is now given an
additional role of taking the second-long QPS samples that it measures and
aggregating them together to track the maximum historical QPS over a
configurable retention period. This maximum QPS measurement can be used to
prevent load-based splits from being merged away until the resulting ranges have
consistently remained below a certain QPS threshold for a sufficiently long
period of time.

The `mergeQueue` is taught how to use this new source of information. It is also
taught that it should be conservative about imprecision in this QPS tracking,
opting to skip a merge rather than perform one when the maximum QPS measurement
has not been tracked for long enough. This means that range merges will
typically no longer fire within 5 minutes of a lease transfer. This seems fine,
as there are almost never situations where a range merge is desperately needed
and we should risk making a bad decision in order to perform one.

I've measured this change on the `clearrange` roachtest that we made heavy use
of in cockroachdb#62700. As expected, it has the same effect as bumping up the
`kv.range_split.by_load_merge_delay` high enough such that ranges never merge on
the active table. Here's a screenshot of a recent run. We still see a period of
increased tail latency and reduced throughput, which has a strong correlation
with Pebble compactions. However, we no longer see the subsequent cluster outage
that used to follow, where ranges on the active table would begin to merge and
throughput would fall to 0 and struggle to recover, bottoming out repeatedly.

<todo insert images>

Release note (performance improvement): Range merges are no longer triggered if
a range has seen significant load over the previous 5 minutes, instead of only
considering the last second. This improves stability, as load-based splits will
no longer rapidly disappear during transient throughput dips.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 26, 2021
Informs cockroachdb#62700.

This commit bumps the default value for the `kv.range_merge.queue_interval`
cluster setting from 1s to 5s. This setting serves as a per-store rate limit on
the number of range merges that will be initiated. We've seen in a few issues
like cockroachdb#62700 that excessive range merge traffic can cause instability in a
cluster. There's very little reason to be aggressive about range merging, as
range merges are rarely needed with any urgency. However, there are good reasons
to be conservative about them.

This change can also be justified as a (late) reaction to the increased max
range size from 64MB to 512MB. A range merge may need to rebalance replicas in a
range, so its cost can be a function of the sizes of ranges. This means that if
range merges are now more expensive, we should be running them less frequently.
@sumeerbhola
Collaborator

The only thing we do is throttle merges based on the kv.range_merge.queue_interval cluster setting, which defaults to 1s per store but which this test drops to 0.

I want to clarify one thing: does the maxConcurrency of 1 in merge_queue.go mean that there is only 1 merge being run per node in the cluster and that a new merge won't start at that node before the previous one has completed (I am guessing the call to AdminMerge in mergeQueue.process is synchronous but I am not sure)?

Regarding the rate limiting that we do for SQL stats and your concerns about "need cluster-wide information to make good decisions", ending up with a design that accidentally doesn't merge, and probabilistic throttling: these are all good points, and I don't have any clear answers from prior art I have encountered:

  • I'm generally wary of rate based throttling schemes that have some lower bound on rate, since it is never clear to me how that lower bound was chosen (like the maxIdleSleepTime in sampler.go). These tend to work well in early stages of the system, but multiple different activities doing their own throttles can still collectively overload the system.
  • For the sql sampler stuff we can more easily subject it to admission control. I am wary of doing the same for KV admin commands -- I was expecting they would usually bypass admission control, and good judgement would be exercised at a higher level on whether to submit such BatchRequests. Thoughts?
  • Keeping a bounded number of merges active in the whole system, that take into account that the system is not overloaded, and the nodes involved in those merges are not overloaded, does work reasonably in a serverless setting where we are controlling the provisioning. For instance if we say we won't do merges when certain resource utilization signals are > 80% we can arrange things such that we take action to add resources when utilization goes higher than 80%. It seems more complicated in on-prem settings since it is one more thing that customers need to worry about.
  • Everything I have seen in this area does use cluster-wide information. This tends to come about naturally when there is a central master that is tracking load information, and doing splits/merges/rebalancing.

craig bot pushed a commit that referenced this issue Apr 27, 2021
64199: kv: avoid panic on TestMergeQueue/non-collocated failure r=nvanbenschoten a=nvanbenschoten

Informs #63009.
Informs #64056.

In #63009/#64056, we saw that this test could flake with a nil pointer panic.
I don't know quite what's going on here, but when working on a patch for #62700,
I managed to hit this panic reliably by accidentally breaking all range merges.

After a bit of debugging, it became clear that we were always hitting a panic in
the `reset` stage of `TestMergeQueue/sticky-bit` because the previous subtest,
`TestMergeQueue/non-collocated`, was moving the RHS range to a different node,
failing to merge the two ranges, and failing itself. This soft failure was being
drowned out by the hard failure in the next subtest.

This commit replaces the crash with a failure that looks something like the
following when range merges are completely disabled:
```
--- FAIL: TestMergeQueue (0.34s)
    test_log_scope.go:73: test logs captured to: /var/folders/8k/436yf8s97cl_27vlh270yb8c0000gp/T/logTestMergeQueue627909827
    test_log_scope.go:74: use -show-logs to present logs inline
    --- FAIL: TestMergeQueue/both-empty (0.00s)
        client_merge_test.go:4183: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/lhs-undersize (0.00s)
        client_merge_test.go:4192: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/combined-threshold (0.00s)
        client_merge_test.go:4214: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/non-collocated (0.03s)
        client_merge_test.go:4236: replica doesn't exist
    --- FAIL: TestMergeQueue/sticky-bit (0.00s)
        client_merge_test.go:4243: right-hand side range not found
    --- FAIL: TestMergeQueue/sticky-bit-expiration (0.00s)
        client_merge_test.go:4268: right-hand side range not found
```

I expect that under stress on master, we will see the `TestMergeQueue/non-collocated` subtest fail.

The fact that `TestMergeQueue/non-collocated` is the test failing means that we may want to have @aayushshah15 take over this investigation, since he's made changes in that area recently. What do you two think?

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
craig bot pushed a commit that referenced this issue Apr 30, 2021
64080: kvserver: improve error message for rare lease errors r=andreimatei a=andreimatei

In some rare cases, the status of a lease in relation to a
request/timestamp can't be determined. For the request's client this
results in a NotLeaseholderError. This patch improves the message of
that error.

In particular, this test failure[1] seems to show that a node couldn't
verify that an existing lease is expired because its liveness gossiped
info was stale. This sounds interesting and if the test fails again this
improved message should help.

[1] #57932 (comment)

Release note: None

64239: kv: bump kv.range_merge.queue_interval to 5s r=nvanbenschoten a=nvanbenschoten

Informs #62700.

This commit bumps the default value for the `kv.range_merge.queue_interval` cluster setting from 1s to 5s. This setting serves as a per-store rate limit on the frequency at which range merges will be initiated. We've seen in a few issues, including #62700, that excessive range merge traffic can cause instability in a cluster. There's very little reason to be aggressive about range merging, as range merges are rarely needed with any urgency. However, there are good reasons to be conservative about them.

This change can also be justified as a (late) reaction to the increased max range size from 64MB to 512MB. A range merge may need to rebalance replicas in a range, so its cost can be a function of the sizes of ranges. This means that if range merges are now more expensive, we should be running them less frequently.

64260: roachpb: remove EndTxn's DeprecatedCanCommitAtHigherTimestamp field r=nvanbenschoten a=nvanbenschoten

The field was replaced with the more general BatchRequest.CanForwardReadTimestamp
in 972915d. This commit completes the migration to remove the old flag.

We attempted this before, in 93d5eb9, but had to back that out in 4189938 because
we still needed compatibility with v20.1 nodes at the time. That is no longer the
case.

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@nvanbenschoten
Member Author

does the maxConcurrency of 1 in merge_queue.go mean that there is only 1 merge being run per node in the cluster and that a new merge won't start at that node before the previous one has completed (I am guessing the call to AdminMerge in mergeQueue.process is synchronous but I am not sure)?

This is roughly correct, though the limit is actually per store, not per node.

I am wary of doing the same for KV admin commands -- I was expecting they would usually bypass admission control, and good judgement would be exercised at a higher level on whether to submit such BatchRequests. Thoughts?

I agree with this. I'm not confident that we would be able to construct an accurate cost-model of each admin command that fully accounts for all of its immediate costs and all of its externalities. So I think throttling here needs to be performed by whoever is deciding to issue the admin command.

Keeping a bounded number of merges active in the whole system, that take into account that the system is not overloaded, and the nodes involved in those merges are not overloaded, does work reasonably in a serverless setting where we are controlling the provisioning. For instance if we say we won't do merges when certain resource utilization signals are > 80% we can arrange things such that we take action to add resources when utilization goes higher than 80%. It seems more complicated in on-prem settings since it is one more thing that customers need to worry about.

This is an interesting point. This kind of approach becomes possible when we closely monitor the health and control the resources in a system. But requiring an operator to be in the loop becomes much less appealing for on-prem deployments.

Everything I have seen in this area does use cluster-wide information. This tends to come about naturally when there is a central master that is tracking load information, and doing splits/merges/rebalancing.

Also interesting. Does the rate at which these centralized decisions get applied tend to scale linearly with available resources or are constant limits imposed?

craig bot pushed a commit that referenced this issue May 3, 2021
64201: kv: rationalize load-based range merging r=nvanbenschoten a=nvanbenschoten

Closes #62700.
Fully addresses #41317.

This commit reworks how queries-per-second measurements are used when determining whether to merge two ranges together. At a high-level, the change moves from a scheme where the QPS over the last second on the LHS and RHS ranges are combined and compared against a threshold (half the load-based split threshold) to a scheme where the maximum QPS measured over the past 5 minutes (configurable) on the LHS and RHS ranges are combined and compared against said threshold.

The commit makes this change to avoid thrashing and to avoid overreacting to temporary fluctuations in load. These overreactions lead to general instability in clusters, as we saw in #41317. Worse, the overreactions compound and can lead to cluster-wide meltdowns where a transient slowdown can trigger a wave of range merges, which can slow the cluster down further, which can lead to more merges, etc. This is what we saw in #62700. This behavior is bad on small clusters and it is even worse on large ones, where range merges don't just interrupt traffic, but also result in a centralization of load in a previously well-distributed dataset, undoing all of the hard work of load-based splitting and rebalancing and creating serious hotspots.

The commit improves this situation by introducing a form of memory into the load-based split `Decider`. This is the object which was previously only responsible for measuring queries-per-second on a range and triggering the process of finding a load-based split point. The object is now given an additional role of taking the second-long QPS samples that it measures and aggregating them together to track the maximum historical QPS over a configurable retention period. This maximum QPS measurement can be used to prevent load-based splits from being merged away until the resulting ranges have consistently remained below a certain QPS threshold for a sufficiently long period of time.

The `mergeQueue` is taught how to use this new source of information. It is also taught that it should be conservative about imprecision in this QPS tracking, opting to skip a merge rather than perform one when the maximum QPS measurement has not been tracked for long enough. This means that range merges will typically no longer fire within 5 minutes of a lease transfer. This seems fine, as there are almost never situations where a range merge is desperately needed and we should risk making a bad decision in order to perform one.

I've measured this change on the `clearrange` roachtest that we made heavy use of in #62700. As expected, it has the same effect as bumping up the `kv.range_split.by_load_merge_delay` high enough such that ranges never merge on the active table. Here's a screenshot of a recent run. We still see a period of increased tail latency and reduced throughput, which has a strong correlation with Pebble compactions. However, we no longer see the subsequent cluster outage that used to follow, where ranges on the active table would begin to merge and throughput would fall to 0 and struggle to recover, bottoming out repeatedly.

<img width="1323" alt="Screen Shot 2021-04-26 at 12 32 53 AM" src="https://user-images.githubusercontent.com/5438456/116037215-c8f66300-a635-11eb-8ff2-9e7db4baee8d.png">

<img width="986" alt="Screen Shot 2021-04-26 at 12 33 04 AM" src="https://user-images.githubusercontent.com/5438456/116037225-cc89ea00-a635-11eb-8f2c-a40b2b3e47a7.png">

Instead of what we originally saw, which looked like:

<img width="1305" alt="Screen Shot 2021-03-27 at 10 52 18 PM" src="https://user-images.githubusercontent.com/5438456/112763884-53668b00-8fd4-11eb-9ebc-61d9494eca10.png">

Release note (performance improvement): Range merges are no longer triggered if a range has seen significant load over the previous 5 minutes, instead of only considering the last second. This improves stability, as load-based splits will no longer rapidly disappear during transient throughput dips.

Release note (performance improvement): Range merges are no longer considered if a range has seen significant load over the previous 5 minutes, instead of being considered as long as a range has low load over the last second. This improves stability, as load-based splits will no longer rapidly disappear during transient throughput dips.

cc @cockroachdb/kv 

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in 03061a8 May 3, 2021
@sumeerbhola
Collaborator

Does the rate at which these centralized decisions get applied tend to scale linearly with available resources or are constant limits imposed?

For something like limiting the number of ongoing merges, I've seen simple heuristics like: no merges at mean cluster utilization > N%, and when < N%, no more than a mean of M ongoing merges per node. So yes, it does scale with deployed resources, but not necessarily in a sophisticated way. Looks like we already have an M, which is the number of stores per node.
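
For illustration only, here is a tiny sketch of that gating heuristic; N, M, and the names are made up, not anything in CockroachDB:

```go
package mergegate

const (
	maxMeanUtilization = 0.80 // "N%": above this, start no new merges
	mergesPerNode      = 1    // "M": mean cap on ongoing merges per node
)

// allowNewMerge applies the heuristic: no merges when mean cluster utilization
// exceeds the threshold, and otherwise no more than M ongoing merges per node
// on average across the cluster.
func allowNewMerge(meanUtilization float64, ongoingMerges, nodes int) bool {
	if meanUtilization > maxMeanUtilization {
		return false
	}
	return ongoingMerges < mergesPerNode*nodes
}
```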

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 17, 2021
Informs cockroachdb#62700.

This commit bumps the default value for the `kv.range_merge.queue_interval`
cluster setting from 1s to 5s. This setting serves as a per-store rate limit on
the number of range merges that will be initiated. We've seen in a few issues
like cockroachdb#62700 that excessive range merge traffic can cause instability in a
cluster. There's very little reason to be aggressive about range merging, as
range merges are rarely needed with any urgency. However, there are good reasons
to be conservative about them.

This change can also be justified as a (late) reaction to the increased max
range size from 64MB to 512MB. A range merge may need to rebalance replicas in a
range, so its cost can be a function of the sizes of ranges. This means that if
range merges are now more expensive, we should be running them less frequently.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 8, 2021
Informs cockroachdb#63009.
Informs cockroachdb#64056.

In cockroachdb#63009/cockroachdb#64056, we saw that this test could flake with a nil pointer panic.
I don't know quite what's going on here, but when working on a patch for cockroachdb#62700,
I managed to hit this panic reliably by accidentally breaking all range merges.

After a bit of debugging, it became clear that we were always hitting a panic in
the `reset` stage of `TestMergeQueue/sticky-bit` because the previous subtest,
`TestMergeQueue/non-collocated`, was moving the RHS range to a different node,
failing to merge the two ranges, and failing itself. This soft failure was being
drowned out by the hard failure in the next subtest.

This commit replaces the crash with a failure that looks something like the
following when range merges are completely disabled:
```
--- FAIL: TestMergeQueue (0.34s)
    test_log_scope.go:73: test logs captured to: /var/folders/8k/436yf8s97cl_27vlh270yb8c0000gp/T/logTestMergeQueue627909827
    test_log_scope.go:74: use -show-logs to present logs inline
    --- FAIL: TestMergeQueue/both-empty (0.00s)
        client_merge_test.go:4183: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/lhs-undersize (0.00s)
        client_merge_test.go:4192: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/combined-threshold (0.00s)
        client_merge_test.go:4214: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00"
    --- FAIL: TestMergeQueue/non-collocated (0.03s)
        client_merge_test.go:4236: replica doesn't exist
    --- FAIL: TestMergeQueue/sticky-bit (0.00s)
        client_merge_test.go:4243: right-hand side range not found
    --- FAIL: TestMergeQueue/sticky-bit-expiration (0.00s)
        client_merge_test.go:4268: right-hand side range not found
```

I expect that under stress on master, we will see the
`TestMergeQueue/non-collocated` subtest fail.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 8, 2021
Closes cockroachdb#62700.
Re-addresses cockroachdb#41317.

This commit reworks how queries-per-second measurements are used when
determining whether to merge two ranges together. At a high-level, the change
moves from a scheme where the QPS over the last second on the LHS and RHS ranges
are combined and compared against a threshold (half the load-based split
threshold) to a scheme where the maximum QPS measured over the past 5 minutes
(configurable) on the LHS and RHS ranges are combined and compared against said
threshold.

The commit makes this change to avoid thrashing and to avoid overreacting to
temporary fluctuations in load. These overreactions lead to general instability
in clusters, as we saw in cockroachdb#41317. Worse, the overreactions compound and can lead
to cluster-wide meltdowns where a transient slowdown can trigger a wave of range
merges, which can slow the cluster down further, which can lead to more merges,
etc. This is what we saw in cockroachdb#62700. This behavior is bad on small clusters and
it is even worse on large ones, where range merges don't just interrupt traffic,
but also result in a centralization of load in a previously well-distributed
dataset, undoing all of the hard work of load-based splitting and rebalancing
and creating serious hotspots.

The commit improves this situation by introducing a form of memory into the
load-based split `Decider`. This is the object which was previously only
responsible for measuring queries-per-second on a range and triggering the
process of finding a load-based split point. The object is now given an
additional role of taking the second-long QPS samples that it measures and
aggregating them together to track the maximum historical QPS over a
configurable retention period. This maximum QPS measurement can be used to
prevent load-based splits from being merged away until the resulting ranges have
consistently remained below a certain QPS threshold for a sufficiently long
period of time.
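
As a rough illustration of this bookkeeping (not the actual `Decider` code; the names and the ring-buffer layout are assumptions), per-second QPS samples can be folded into a maximum over a configurable retention period like so:
```go
package main

import (
	"fmt"
	"time"
)

// maxQPSTracker keeps the last retention/samplePeriod per-second QPS samples
// in a ring buffer and reports the maximum, along with whether the window has
// been populated for a full retention period. Names are illustrative.
type maxQPSTracker struct {
	samples []float64 // ring buffer of per-second QPS samples
	next    int       // index of the slot to overwrite next
	filled  bool      // true once the ring buffer has wrapped at least once
}

func newMaxQPSTracker(retention, samplePeriod time.Duration) *maxQPSTracker {
	n := int(retention / samplePeriod)
	return &maxQPSTracker{samples: make([]float64, n)}
}

// record adds one per-second QPS measurement to the window.
func (t *maxQPSTracker) record(qps float64) {
	t.samples[t.next] = qps
	t.next++
	if t.next == len(t.samples) {
		t.next = 0
		t.filled = true
	}
}

// max returns the maximum QPS observed over the retention period and whether
// the measurement has been tracked for long enough to be trusted.
func (t *maxQPSTracker) max() (maxQPS float64, ok bool) {
	if !t.filled {
		return 0, false
	}
	for _, s := range t.samples {
		if s > maxQPS {
			maxQPS = s
		}
	}
	return maxQPS, true
}

func main() {
	tr := newMaxQPSTracker(5*time.Minute, time.Second)
	for i := 0; i < 300; i++ {
		tr.record(float64(100 + i%50)) // simulated per-second samples
	}
	if m, ok := tr.max(); ok {
		fmt.Printf("max QPS over retention period: %.0f\n", m)
	}
}
```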

The `mergeQueue` is taught how to use this new source of information. It is also
taught that it should be conservative about imprecision in this QPS tracking,
opting to skip a merge rather than perform one when the maximum QPS measurement
has not been tracked for long enough. This means that range merges will
typically no longer fire within 5 minutes of a lease transfer. This seems fine,
as there are almost never situations where a range merge is desperately needed
and we should risk making a bad decision in order to perform one.
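
Sketched as code (the threshold value and function name are illustrative assumptions, not the real `mergeQueue` logic), the resulting decision looks roughly like this: skip the merge unless both sides have a fully populated measurement window, then require the combined historical maximum to stay below half the load-based split threshold:
```go
package main

import "fmt"

// shouldMergeByLoad sketches the conservative decision described above. Each
// side reports the maximum QPS it has seen over the retention period and
// whether that measurement has been tracked for the full period. If either
// measurement is not yet trustworthy, skip the merge rather than risk undoing
// a load-based split.
func shouldMergeByLoad(
	lhsMaxQPS, rhsMaxQPS float64, lhsOK, rhsOK bool, splitThreshold float64,
) bool {
	if !lhsOK || !rhsOK {
		return false // not enough history; be conservative and skip
	}
	return lhsMaxQPS+rhsMaxQPS < splitThreshold/2
}

func main() {
	// With an example split threshold of 2500 QPS, a combined historical
	// maximum of 900 QPS is low enough to merge; 1600 QPS is not.
	fmt.Println(shouldMergeByLoad(500, 400, true, true, 2500)) // true: 900 < 1250
	fmt.Println(shouldMergeByLoad(900, 700, true, true, 2500)) // false: 1600 >= 1250
	fmt.Println(shouldMergeByLoad(10, 10, true, false, 2500))  // false: RHS window not yet full
}
```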

I've measured this change on the `clearrange` roachtest that we made heavy use
of in cockroachdb#62700. As expected, it has the same effect as bumping up the
`kv.range_split.by_load_merge_delay` high enough such that ranges never merge on
the active table. Here's a screenshot of a recent run. We still see a period of
increased tail latency and reduced throughput, which has a strong correlation
with Pebble compactions. However, we no longer see the subsequent cluster outage
that used to follow, where ranges on the active table would begin to merge and
throughput would fall to 0 and struggle to recover, bottoming out repeatedly.

<todo insert images>

Release note (performance improvement): Range merges are no longer considered if
a range has seen significant load over the previous 5 minutes; previously, a merge
was considered as long as the range had low load over the last second. This improves
stability, as load-based splits will no longer rapidly disappear during transient
throughput dips.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 16, 2021