kv: rationalize load-based range merging #64201
Conversation
Force-pushed 029276b to 0abf30a
Even on a 24-node cluster, things look much better. We still only see a performance hit due to merges on the inactive table and corresponding compactions, and no longer any collapse due to bad rebalancing/merging decisions on the active table. And as a reminder, this test is merging the inactive table extra aggressively.
Range merges are no longer triggered if a range has seen significant load over the previous 5 minutes, instead of only considering the last second.
We'll want to reword the release note a bit, I think; I'm not sure it's saying what we want it to say.
Given we now have this QPS memory, do we still need split expirations?
cockroach/pkg/kv/kvserver/split_queue.go
Lines 231 to 250 in fa32402
```go
	// Add a small delay (default of 5m) to any subsequent attempt to merge
	// this range split away. While the merge queue does take into account
	// load to avoid merging ranges that would be immediately re-split due
	// to load-based splitting, it doesn't take into account historical
	// load. So this small delay is the only thing that prevents split
	// points created due to load from being immediately merged away after
	// load is stopped, which can be a problem for benchmarks where data is
	// first imported and then the workload begins after a small delay.
	var expTime hlc.Timestamp
	if expDelay := SplitByLoadMergeDelay.Get(&sq.store.cfg.Settings.SV); expDelay > 0 {
		expTime = sq.store.Clock().Now().Add(expDelay.Nanoseconds(), 0)
	}
	if _, pErr := r.adminSplitWithDescriptor(
		ctx,
		roachpb.AdminSplitRequest{
			RequestHeader: roachpb.RequestHeader{
				Key: splitByLoadKey,
			},
			SplitKey:       splitByLoadKey,
			ExpirationTime: expTime,
```
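As a minimal illustration of what that expiration buys, here is a self-contained sketch, using time.Time in place of hlc.Timestamp and made-up names, of the kind of check that keeps a fresh load-based split from being merged away the moment load stops:

```go
package main

import (
	"fmt"
	"time"
)

// splitExpiration records when a load-based split becomes eligible for
// merging again.
type splitExpiration struct {
	expiresAt time.Time
}

// canMerge reports whether a merge queue may consider the range: the split
// must have aged past its expiration before a merge is allowed.
func (e splitExpiration) canMerge(now time.Time) bool {
	return !now.Before(e.expiresAt)
}

func main() {
	// Mirror SplitByLoadMergeDelay's default of 5 minutes.
	const mergeDelay = 5 * time.Minute
	exp := splitExpiration{expiresAt: time.Now().Add(mergeDelay)}

	fmt.Println(exp.canMerge(time.Now()))                       // false: too soon to merge
	fmt.Println(exp.canMerge(time.Now().Add(10 * time.Minute))) // true: the delay has elapsed
}
```

Until the QPS memory added in this PR can fully replace it, this time-based guard is what protects split points through periods of zero load (e.g. the gap between an import and the start of a benchmark workload).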
Aside, to further reduce the impact of the merge queue, what would we advise customers to do? Given we're not doing any pacing ourselves, we'd just suggest bumping up kv.range_merge.queue_interval to something more acceptable for their given cluster size?
Reviewed 18 of 18 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/client_merge_test.go, line 4077 at r1 (raw file):
```go
	defer leaktest.AfterTest(t)()
	// TODO(nvanbenschoten): re-skip this test before merging.
	// skip.WithIssue(t, 64056, "flaky test")
```
Bump.
pkg/kv/kvserver/client_merge_test.go, line 4236 at r1 (raw file):
```go
	})
	t.Run("load-based-merging", func(t *testing.T) {
```
Nice test.
pkg/kv/kvserver/client_merge_test.go, line 4255 at r1 (raw file):
```go
	// measurement from both sides to be sufficiently stable and reliable,
	// meaning that it was a maximum measurement over some extended period of
	// time.
```
Some form of this comment would be well placed near the merge queue code, if not already there.
pkg/kv/kvserver/helpers_test.go, line 398 at r1 (raw file):
```go
// assist load-based split (and merge) decisions.
func (r *Replica) LoadBasedSplitter() *split.Decider {
	return &r.loadBasedSplitter
```
Should r.loadBasedSplitter simply be a pointer instead?
pkg/kv/kvserver/batcheval/cmd_range_stats.go, line 50 at r1 (raw file):
```go
	// TODO(nvanbenschoten): this needs to be migrated in. As is, we will confuse
	// old versions. We'll need to keep using GetLastSplitQPS in mixed-version
	// clusters. Do this before merging.
```
Bump.
Informs cockroachdb#63009. Informs cockroachdb#64056. In cockroachdb#64199, we found that the flake was likely due to the non-collocated subtest, so this commit un-skips the parent test and skips only this single subtest.
Force-pushed 0abf30a to 27c724c
Closes cockroachdb#62700. Re-addresses cockroachdb#41317.

This commit reworks how queries-per-second measurements are used when determining whether to merge two ranges together. At a high-level, the change moves from a scheme where the QPS over the last second on the LHS and RHS ranges are combined and compared against a threshold (half the load-based split threshold) to a scheme where the maximum QPS measured over the past 5 minutes (configurable) on the LHS and RHS ranges are combined and compared against said threshold.

The commit makes this change to avoid thrashing and to avoid overreacting to temporary fluctuations in load. These overreactions lead to general instability in clusters, as we saw in cockroachdb#41317. Worse, the overreactions compound and can lead to cluster-wide meltdowns where a transient slowdown can trigger a wave of range merges, which can slow the cluster down further, which can lead to more merges, etc. This is what we saw in cockroachdb#62700. This behavior is bad on small clusters and it is even worse on large ones, where range merges don't just interrupt traffic, but also result in a centralization of load in a previously well-distributed dataset, undoing all of the hard work of load-based splitting and rebalancing and creating serious hotspots.

The commit improves this situation by introducing a form of memory into the load-based split `Decider`. This is the object which was previously only responsible for measuring queries-per-second on a range and triggering the process of finding a load-based split point. The object is now given an additional role of taking the second-long QPS samples that it measures and aggregating them together to track the maximum historical QPS over a configurable retention period. This maximum QPS measurement can be used to prevent load-based splits from being merged away until the resulting ranges have consistently remained below a certain QPS threshold for a sufficiently long period of time.

The `mergeQueue` is taught how to use this new source of information. It is also taught that it should be conservative about imprecision in this QPS tracking, opting to skip a merge rather than perform one when the maximum QPS measurement has not been tracked for long enough. This means that range merges will typically no longer fire within 5 minutes of a lease transfer. This seems fine, as there are almost never situations where a range merge is desperately needed and we should risk making a bad decision in order to perform one.

I've measured this change on the `clearrange` roachtest that we made heavy use of in cockroachdb#62700. As expected, it has the same effect as bumping up the `kv.range_split.by_load_merge_delay` high enough such that ranges never merge on the active table. Here's a screenshot of a recent run. We still see a period of increased tail latency and reduced throughput, which has a strong correlation with Pebble compactions. However, we no longer see the subsequent cluster outage that used to follow, where ranges on the active table would begin to merge and throughput would fall to 0 and struggle to recover, bottoming out repeatedly.

<todo insert images>

Release note (performance improvement): Range merges are no longer considered if a range has seen significant load over the previous 5 minutes, instead of being considered as long as a range has low load over the last second. This improves stability, as load-based splits will no longer rapidly disappear during transient throughput dips.
Force-pushed 27c724c to 03061a8
Thanks for the review!
We'll want to reword the release note a bit, I think; I'm not sure it's saying what we want it to say.
Done.
Given we now have this QPS memory, do we still need split expirations?
Good point. I've reworded the comment for now, and left a TODO to remove entirely once we can rely entirely on this new mechanism.
Aside, to further reduce the impact of the merge queue, what would we advise customers to do? Given we're not doing any pacing ourselves, we'd just suggest bumping up kv.range_merge.queue_interval to something more acceptable for their given cluster size?
If customers are seeing this remain an issue, then bumping kv.range_merge.queue_interval would be the first suggestion. We bumped the default for this value up to 5s in #64239. The cluster size itself shouldn't matter, because the rate limit scales by store count.
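For a rough sense of what that scaling looks like, here's a back-of-the-envelope sketch. It assumes each store's merge queue processes at most one range per kv.range_merge.queue_interval, which is a simplification of the actual queue pacing rather than a precise description of it:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Default bumped to 5s in #64239.
	const queueInterval = 5 * time.Second

	// Under the simplifying assumption of one merge per store per interval,
	// the cluster-wide merge rate grows linearly with the number of stores.
	for _, stores := range []int{3, 24, 96} {
		mergesPerMinute := float64(stores) * float64(time.Minute/queueInterval)
		fmt.Printf("%3d stores: ~%.0f merges/min cluster-wide\n", stores, mergesPerMinute)
	}
}
```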
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)
pkg/kv/kvserver/client_merge_test.go, line 4077 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Bump.
Done. I added a precursor commit on here which un-skips most of TestMergeQueue and only skips the non-collocated subtest, which we found to be the likely problem in #64199.
pkg/kv/kvserver/client_merge_test.go, line 4255 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Some form of this comment would be well placed near the merge queue code, if not already there.
Good point. Done.
pkg/kv/kvserver/helpers_test.go, line 398 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Should r.loadBasedSplitter simply be a pointer instead?
We're pretty deliberate about allowing the split.Decider to be stored by value, so I think this is intentional:
cockroach/pkg/kv/kvserver/split/decider.go
Lines 54 to 57 in 684e753
```go
// Init initializes a Decider (which is assumed to be zero). The signature allows
// embedding the Decider into a larger struct outside of the scope of this package
// without incurring a pointer reference. This is relevant since many Deciders
// may exist in the system at any given point in time.
```
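To illustrate the pattern being referenced, here is a minimal sketch of embedding a decider by value and initializing it in place. The types and function below are made up for this example and are not the actual split package API:

```go
package main

import "fmt"

// decider is a stand-in for split.Decider: stateful, embedded by value, and
// initialized in place via a pointer to the embedded field.
type decider struct {
	qpsThreshold float64
}

// initDecider mirrors the "initialize a zero value in place" pattern: it
// mutates the embedded value rather than allocating a new object.
func initDecider(d *decider, threshold float64) {
	d.qpsThreshold = threshold
}

// replica embeds the decider by value, so many replicas don't each carry an
// extra pointer and heap allocation just for their splitter state.
type replica struct {
	loadBasedSplitter decider
}

func main() {
	var r replica
	initDecider(&r.loadBasedSplitter, 250 /* QPS */)
	fmt.Println(r.loadBasedSplitter.qpsThreshold) // 250
}
```

Returning &r.loadBasedSplitter from a test helper, as the quoted accessor does, is consistent with that choice: the value lives inline in the replica, and callers take its address when they need to mutate it.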
pkg/kv/kvserver/batcheval/cmd_range_stats.go, line 50 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Bump.
Done. I avoided the cluster version and I did this in a way that should allow us to backport this change. I'd appreciate a second pair of eyes to check that I have the migration lined up correctly.
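For a rough picture of the shape of that compatibility story, here is a hedged sketch; the struct and field names below are hypothetical, not the actual RangeStats proto. The idea is simply to keep populating the value that old versions read alongside the new one, so mixed-version clusters never misinterpret a missing field:

```go
package main

import "fmt"

// rangeStatsResponse is a stand-in for the real stats response; the field
// names here are illustrative only.
type rangeStatsResponse struct {
	LastQueriesPerSecond float64 // field that pre-upgrade nodes know how to read
	MaxQueriesPerSecond  float64 // new field preferred by upgraded nodes
}

// fillQPS populates both the old and the new field during the migration
// window, so a response is usable regardless of which version consumes it.
func fillQPS(lastQPS, maxQPS float64) rangeStatsResponse {
	return rangeStatsResponse{
		LastQueriesPerSecond: lastQPS,
		MaxQueriesPerSecond:  maxQPS,
	}
}

func main() {
	fmt.Printf("%+v\n", fillQPS(120, 950))
}
```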
Reviewed 1 of 18 files at r2, 9 of 18 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif and @nvanbenschoten)
pkg/kv/kvserver/batcheval/cmd_range_stats.go, line 50 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. I avoided the cluster version and I did this in a way that should allow us to backport this change. I'd appreciate a second pair of eyes to check that I have the migration lined up correctly.
We'd backport this to 21.1.0? Or are we aiming for a patch release? The former would let us remove the extra fields in 21.2. The migration itself LGTM.
bors r+
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)
pkg/kv/kvserver/batcheval/cmd_range_stats.go, line 50 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
We'd backport this to 21.1.0? Or are we aiming for a patch release? The former would let us remove the extra fields in 21.2. The migration itself LGTM.
Good question. Unfortunately, I don't think we'll be able to get this into 21.1.0, which means that everything will be pushed back a release. It's too bad, but not worth the risk of rushing this in.
Build succeeded.
Closes #62700.
Fully addresses #41317.
This commit reworks how queries-per-second measurements are used when determining whether to merge two ranges together. At a high-level, the change moves from a scheme where the QPS over the last second on the LHS and RHS ranges are combined and compared against a threshold (half the load-based split threshold) to a scheme where the maximum QPS measured over the past 5 minutes (configurable) on the LHS and RHS ranges are combined and compared against said threshold.
The commit makes this change to avoid thrashing and to avoid overreacting to temporary fluctuations in load. These overreactions lead to general instability in clusters, as we saw in #41317. Worse, the overreactions compound and can lead to cluster-wide meltdowns where a transient slowdown can trigger a wave of range merges, which can slow the cluster down further, which can lead to more merges, etc. This is what we saw in #62700. This behavior is bad on small clusters and it is even worse on large ones, where range merges don't just interrupt traffic, but also result in a centralization of load in a previously well-distributed dataset, undoing all of the hard work of load-based splitting and rebalancing and creating serious hotspots.
The commit improves this situation by introducing a form of memory into the load-based split Decider. This is the object which was previously only responsible for measuring queries-per-second on a range and triggering the process of finding a load-based split point. The object is now given an additional role of taking the second-long QPS samples that it measures and aggregating them together to track the maximum historical QPS over a configurable retention period. This maximum QPS measurement can be used to prevent load-based splits from being merged away until the resulting ranges have consistently remained below a certain QPS threshold for a sufficiently long period of time.

The mergeQueue is taught how to use this new source of information. It is also taught that it should be conservative about imprecision in this QPS tracking, opting to skip a merge rather than perform one when the maximum QPS measurement has not been tracked for long enough. This means that range merges will typically no longer fire within 5 minutes of a lease transfer. This seems fine, as there are almost never situations where a range merge is desperately needed and we should risk making a bad decision in order to perform one.

I've measured this change on the clearrange roachtest that we made heavy use of in #62700. As expected, it has the same effect as bumping up the kv.range_split.by_load_merge_delay high enough such that ranges never merge on the active table. Here's a screenshot of a recent run. We still see a period of increased tail latency and reduced throughput, which has a strong correlation with Pebble compactions. However, we no longer see the subsequent cluster outage that used to follow, where ranges on the active table would begin to merge and throughput would fall to 0 and struggle to recover, bottoming out repeatedly.

Instead of what we originally saw, which looked like:
Release note (performance improvement): Range merges are no longer considered if a range has seen significant load over the previous 5 minutes, instead of being considered as long as a range has low load over the last second. This improves stability, as load-based splits will no longer rapidly disappear during transient throughput dips.
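To make the described mechanism concrete, the sketch below models the idea in a self-contained way: a tracker that remembers the maximum per-second QPS over a retention window, and a merge check that combines the LHS and RHS maxima, refuses to act while either measurement is still too young, and otherwise compares the sum against half of a load-based split threshold. The types, names, and numbers are illustrative only; this is not the actual Decider or mergeQueue code.

```go
package main

import (
	"fmt"
	"time"
)

// qpsSample is one second-long QPS measurement.
type qpsSample struct {
	at  time.Time
	qps float64
}

// maxQPSTracker is a simplified stand-in for the Decider's new memory: it
// retains per-second samples and reports the maximum over a retention window,
// along with whether it has been tracking long enough to be trusted.
type maxQPSTracker struct {
	retention time.Duration
	started   time.Time
	samples   []qpsSample
}

func newMaxQPSTracker(now time.Time, retention time.Duration) *maxQPSTracker {
	return &maxQPSTracker{retention: retention, started: now}
}

// record adds a sample and drops samples that have aged out of the window.
func (tr *maxQPSTracker) record(now time.Time, qps float64) {
	tr.samples = append(tr.samples, qpsSample{at: now, qps: qps})
	cutoff := now.Add(-tr.retention)
	for len(tr.samples) > 0 && tr.samples[0].at.Before(cutoff) {
		tr.samples = tr.samples[1:]
	}
}

// maxQPS returns the maximum QPS seen in the window and whether a full
// retention period has elapsed since tracking began (i.e. the measurement
// is old enough to be trustworthy).
func (tr *maxQPSTracker) maxQPS(now time.Time) (float64, bool) {
	m := 0.0
	for _, s := range tr.samples {
		if s.qps > m {
			m = s.qps
		}
	}
	return m, now.Sub(tr.started) >= tr.retention
}

// shouldMerge mirrors the conservative policy described above: skip the merge
// if either side's measurement is too young, and otherwise require the
// combined historical maximum to stay below half the load-based split
// threshold.
func shouldMerge(now time.Time, lhs, rhs *maxQPSTracker, splitThreshold float64) bool {
	lhsMax, lhsOK := lhs.maxQPS(now)
	rhsMax, rhsOK := rhs.maxQPS(now)
	if !lhsOK || !rhsOK {
		return false // be conservative about imprecise measurements
	}
	return lhsMax+rhsMax < splitThreshold/2
}

func main() {
	start := time.Now()
	lhs := newMaxQPSTracker(start, 5*time.Minute)
	rhs := newMaxQPSTracker(start, 5*time.Minute)

	// Simulate six minutes of per-second samples, with a transient spike on
	// the LHS during the first 30 seconds.
	for i := 0; i < 360; i++ {
		now := start.Add(time.Duration(i) * time.Second)
		qps := 10.0
		if i < 30 {
			qps = 2000
		}
		lhs.record(now, qps)
		rhs.record(now, 10)
	}

	// After the spike ages out of the 5-minute window, the merge is allowed.
	fmt.Println(shouldMerge(start.Add(6*time.Minute), lhs, rhs, 2500))
}
```

In the real system the retention period and thresholds come from cluster settings, and the "too young" case is exactly why merges typically won't fire within 5 minutes of a lease transfer: the new leaseholder simply hasn't accumulated enough history yet.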
cc @cockroachdb/kv