[prototype] storage: Make rebalance decisions at store-level #26608

a-robinson · 2018-06-11T21:15:15Z

As outlined in recent comments on #26059, we need to bring back some
form of stats-based rebalancing in order to perform well on TPC-C
without manual partitioning and replica placement.

This commit contains a prototype that demonstrates the effectiveness of
changing our approach to making rebalancing decisions from making them
in the replicate queue, which operates on arbitrarily ordered replicas
of the ranges on a store, to making them at a higher-level. This
prototype makes them at a cluster level by running the logic on only one
node, but my real proposal is to make them at the store level.

This change in abstraction reflects what a human would do if asked to
even out the load on a cluster given perfect information about
everything happening in the cluster:

First, determine which stores have the most load on them (or overfull
-- but for the prototype I only considered the one dimension that
affects TPC-C the most)
Decide whether the most loaded stores are so overloaded that action
needs to be taken.
Examine the hottest replicas on the store (maybe not the absolute
hottest in practice, since moving that one could disrupt user traffic,
but in the prototype this seems to work fine) and attempt to move them
to under-utilized stores. If this can be done simply by transferring
leases to under-utilized stores, then do so. If moving leases isn't
enough, then also rebalance replicas from the hottest store to
under-utilized stores.
Repeat periodically to handle changes in load or cluster membership.

In a real version of this code, the plan is roughly:

Each store will independently run their own control loop like this
that is only responsible for moving leases/replicas off itself, not off
other stores. This avoids needing a centralized coordinator, and will
avoid the need to use the raft debug endpoint as long as we start
gossiping QPS per store info, since the store already has details about
the replicas on itself.
The existing replicate queue will stop making decisions motivated by
balance. It will switch to only making decisions based on
constraints/diversity/lease preferences, which is still needed since
the new store-level logic will only check for store-level balance,
not that all replicas' constraints are properly met.
The new code will have to avoid violating constraints/diversity/lease
preferences.
The new code should consider range count, disk fullness, and maybe
writes per second as well.
In order to avoid making decisions based on bad data, I'd like to
extend lease transfers to pass along QPS data to the new leaseholder
and preemptive snapshots to pass along WPS data to the new replica.
This may not be strictly necessary, as shown by the success of this
prototype, but should make for more reliable decision making.

I tested this out on TPC-C 5k on 15 nodes and am able to consistently
get 94% efficiency, which is the max I've seen using a build of the
workload generator that erroneously includes the ramp-up period in its
final stats. The first run with this code only got 85% because it took a
couple minutes to make all the lease transfers it wanted, but then all
subsequent runs got the peak efficiency while making negligibly few
lease transfers.

Note that I didn't even have to implement replica rebalancing to get
these results, which oddly contradicts my previous claims. However, I
believe that's because I did the initial split/scatter using a binary
containing #26438, so the replicas were already better scattered than by
default. I ran TPC-C on that build without these changes a couple times,
though, and didn't get better than 65% efficiency, so the scatter wasn't
the cause of the good results here.

Touches #26059, #17979

Release note: None

per-node exec.success metric over a few runs with this change, showing the per-node KV QPS converging over time:

per-node goroutine counts over the same few runs, showing the overloaded outlier node getting healthier and converging to the other nodes:

I can write up a more formal RFC if folks think the write-up above (plus the success of this prototype) is insufficient as a proposal, but it's a good summary of my current plans.

@BramGruneir @nvanbenschoten @petermattis

cockroach-teamcity · 2018-06-11T21:15:24Z

This change is

a-robinson · 2018-06-11T21:21:05Z

Underlying the reasoning to change from the per-replica replicate queue to a new store-level construct is that the replicate queue doesn't have a great sense of how the replica it's looking at compares to the other replicas on the store. We have the 10th/25th/50th/75th/90th percentiles, but in practice those don't appear to be nearly as useful as being able to grab the top N replicas on the store and pick which of them should be moved.

And even if the replicate queue could determine that the replica it was looking at was the hottest replica, or the Nth hottest replica, it's going to be important for it to also know whether the other hot replicas are able to be moved. If the top 17 replicas are movable, maybe we don't want to move the 18th hottest replicas. But if the top 17 replicas can't be moved due to zone constraints and leaseholder preferences, then we probably do want to move the 18th hottest replica. It's going to be a lot easier (and will require breaking way fewer abstractions) to do that from something operating at the store-level than at the replica-level.

nvanbenschoten · 2018-06-11T23:19:54Z

Those results look really nice! Did you try performing the experiment on a 30 node cluster as well? I'm sure this will be even more helpful there.

One detail that pops out to me is that we're basing everything on qps without making a distinction between read and write qps. This is interesting because I'd expect transferring a lease for a range with high read-qps to be more impactful than transferring the lease for a range with high write-qps, which probably requires a full replica rebalance to alleviate an overloaded node. Do you think it makes sense to track these separately?

I can write up a more formal RFC if folks think the write-up above (plus the success of this prototype) is insufficient as a proposal, but it's a good summary of my current plans.

I don't necessarily think a full RFC is necessary for any changes here (especially as you're prototyping them), but it would probably be a good idea to eventually go back and update https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20161026_leaseholder_rebalancing.md#future-directions with your new findings and plan.

Review status: complete! 0 of 0 LGTMs obtained

pkg/server/prototype_allocator.go, line 29 at r1 (raw file):

)

var testQPSThreshold = settings.RegisterNonNegativeFloatSetting(

The prototype doesn't actually use this, right?

Comments from Reviewable

As outlined in recent comments on cockroachdb#26059, we need to bring back some form of stats-based rebalancing in order to perform well on TPC-C without manual partitioning and replica placement. This commit contains a prototype that demonstrates the effectiveness of changing our approach to making rebalancing decisions from making them in the replicate queue, which operates on arbitrarily ordered replicas of the ranges on a store, to making them at a higher-level. This prototype makes them at a cluster level by running the logic on only one node, but my real proposal is to make them at the store level. This change in abstraction reflects what a human would do if asked to even out the load on a cluster given perfect information about everything happening in the cluster: 1. First, determine which stores have the most load on them (or overfull -- but for the prototype I only considered the one dimension that affects TPC-C the most) 2. Decide whether the most loaded stores are so overloaded that action needs to be taken. 3. Examine the hottest replicas on the store (maybe not the absolute hottest in practice, since moving that one could disrupt user traffic, but in the prototype this seems to work fine) and attempt to move them to under-utilized stores. If this can be done simply by transferring leases to under-utilized stores, then do so. If moving leases isn't enough, then also rebalance replicas from the hottest store to under-utilized stores. 4. Repeat periodically to handle changes in load or cluster membership. In a real versino of this code, the plan is roughly: 1. Each store will independently run their own control loop like this that is only responsible for moving leases/replicas off itself, not off other stores. This avoids needing a centralized coordinator, and will avoid the need to use the raft debug endpoint as long as we start gossiping QPS per store info, since the store already has details about the replicas on itself. 2. The existing replicate queue will stop making decisions motivated by balance. It will switch to only making decisions based on constraints/diversity/lease preferences, which is still needed since the new store-level logic will only check for store-level balance, not that all replicas' constraints are properly met. 3. The new code will have to avoid violating constraints/diversity/lease preferences. 4. The new code should consider range count, disk fullness, and maybe writes per second as well. 5. In order to avoid making decisions based on bad data, I'd like to extend lease transfers to pass along QPS data to the new leaseholder and preemptive snapshots to pass along WPS data to the new replica. This may not be strictly necessary, as shown by the success of this prototype, but should make for more reliable decision making. I tested this out on TPC-C 5k on 15 nodes and am able to consistently get 94% efficiency, which is the max I've seen using a build of the workload generator that erroneously includes the ramp-up period in its final stats. The first run with this code only got 85% because it took a couple minutes to make all the lease transfers it wanted, but then all subsequent runs got the peak efficiency while making negligibly few lease transfers. Note that I didn't even have to implement replica rebalancing to get these results, which oddly contradicts my previous claims. However, I believe that's because I did the initial split/scatter using a binary containing cockroachdb#26438, so the replicas were already better scattered than by default. I ran TPC-C on that build without these changes a couple times, though, and didn't get better than 65% efficiency, so the scatter wasn't the cause of the good results here. Touches cockroachdb#26059, cockroachdb#17979 Release note: None

With this update, TPC-C 10k on 30 went from overloaded to running at peak efficiency over the course of about 4 hours (the manual partitioning approach takes many hours to move all the replicas as well, for a point of comparison). This is without having to run the replica scatter from cockroachdb#26438. Doing a 5 minute run to get a result that doesn't include all the rebalancing time shows: _elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms) 290.9s 124799.1 97.0% 548.6 486.5 872.4 1140.9 2281.7 10200.5 I think it may have a small bug in it still, since at one point early on one of the replicas from the warehouse table on the node doing the relocating thought that it had 16-17k QPS, which wasn't true by any other metric in the system. Restarting the node fixed it though. I'm not too concerned about the bug, since I assume I just made a code mistake, not that anything about the approach fundamentally leads to a random SQL table replica gets 10s of thousands of QPS. Range 1 is also back to getting a ton of QPS (~3k) even though I raised the range cache size from 1M to 50M. Looking at slow query traces shows a lot of range lookups, way more than I'd expect given that ranges weren't moving around at the time of the traces. Release note: None

a-robinson · 2018-06-14T16:01:52Z

This appears to do the trick on tpc-c 10k, as explained in my latest commit message:

nvanbenschoten · 2018-06-14T20:37:54Z

Nice!

over the course of about 4 hours (the manual
partitioning approach takes many hours to move all the replicas as well,
for a point of comparison)

Was the rate of rebalancing steady throughout this entire 4 hour period?

Range 1 is also back to getting a ton of QPS (~3k) even though I raised
the range cache size from 1M to 50M. Looking at slow query traces shows
a lot of range lookups, way more than I'd expect given that ranges
weren't moving around at the time of the traces.

Did this eventually settle down? If not, I think it's worth exploring at some point. We're probably running into the issue mentioned in #18988, which will result in about 2x more range lookups than strictly necessary during heavy rebalancing scenerios. There might also be other issues lurking as well.

Review status: complete! 0 of 0 LGTMs obtained

Comments from Reviewable

a-robinson · 2018-06-14T20:58:37Z

Was the rate of rebalancing steady throughout this entire 4 hour period?

Not the entire time. About 2 hours were actually spent primarily doing lease transfers, which makes sense given the code but probably made the rebalancing period take longer than needed. Judging by the logs, that was because of the bug I mentioned where a replica on n1 had a ridiculously high reported QPS. Once that resolved itself, though, the rest of the time was a pretty steady stream of replica adds/removes.

Did this eventually settle down? If not, I think it's worth exploring at some point. We're probably running into the issue mentioned in #18988, which will result in about 2x more range lookups than strictly necessary during heavy rebalancing scenerios. There might also be other issues lurking as well.

Not that I ever saw. Agreed that it's worth getting to the bottom of.

a-robinson · 2018-06-15T20:34:51Z

If not, I think it's worth exploring at some point. We're probably running into the issue mentioned in #18988, which will result in about 2x more range lookups than strictly necessary during heavy rebalancing scenerios.

That's certainly part of it. But what isn't clear is why the non-meta descriptors are being evicted. For example, check out this partial excerpt of events on an arbitrarily chosen range on one of the nodes. We keep evicting it and re-adding it to the range cache even though it isn't changing at all:

I180615 19:25:44.241605 903330514 kv/range_cache.go:457  [s1,r55617/4:/Table/56/1/24{30-40},n1] evict cached descriptor: key=/Table/53/1/2438/10/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:25:49.341302 903397366 kv/range_cache.go:578  [s1,n1] adding descriptor: key=/Meta2/Table/53/1/2440/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:26:07.868229 903462027 kv/range_cache.go:457  [s1,r55617/4:/Table/56/1/24{30-40},n1] evict cached descriptor: key=/Table/53/1/2431/9/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:26:13.973296 903495353 kv/range_cache.go:578  [s1,r55617/4:/Table/56/1/24{30-40},n1] adding descriptor: key=/Meta2/Table/53/1/2440/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:29:21.724366 904226264 kv/range_cache.go:457  [s1,r55617/4:/Table/56/1/24{30-40},n1] evict cached descriptor: key=/Table/53/1/2439/8/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:29:21.882479 904246100 kv/range_cache.go:578  [client=10.142.0.54:60202,user=root,txn=4f910e13,n1] adding descriptor: key=/Meta2/Table/53/1/2440/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:32:04.364989 904812433 kv/range_cache.go:457  [s1,r55617/4:/Table/56/1/24{30-40},n1] evict cached descriptor: key=/Table/53/1/2438/1/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:32:04.624679 904920683 kv/range_cache.go:578  [client=10.142.0.54:39730,user=root,txn=a79b8f61,n1] adding descriptor: key=/Meta2/Table/53/1/2440/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:32:51.906363 905107678 kv/range_cache.go:457  [s1,r55617/4:/Table/56/1/24{30-40},n1] evict cached descriptor: key=/Table/53/1/2434/1/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:32:52.032818 905120871 kv/range_cache.go:578  [client=10.142.0.54:58400,user=root,txn=7f61f484,n1] adding descriptor: key=/Meta2/Table/53/1/2440/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:33:14.223086 905210365 kv/range_cache.go:457  [s1,r55617/4:/Table/56/1/24{30-40},n1] evict cached descriptor: key=/Table/53/1/2435/2/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:33:14.299220 905210708 kv/range_cache.go:578  [s1,r55617/4:/Table/56/1/24{30-40},n1] adding descriptor: key=/Meta2/Table/53/1/2440/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:33:46.655785 905316841 kv/range_cache.go:457  [s1,r55617/4:/Table/56/1/24{30-40},n1] evict cached descriptor: key=/Table/53/1/2436/1/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:33:46.662854 905349469 kv/range_cache.go:578  [client=10.142.0.54:37896,user=root,txn=6366a404,n1] adding descriptor: key=/Meta2/Table/53/1/2440/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:34:50.580809 905608042 kv/range_cache.go:457  [s1,r55617/4:/Table/56/1/24{30-40},n1] evict cached descriptor: key=/Table/53/1/2436/10/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:34:50.622198 905608720 kv/range_cache.go:578  [s1,r55617/4:/Table/56/1/24{30-40},n1] adding descriptor: key=/Meta2/Table/53/1/2440/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:38:55.529085 906576259 kv/range_cache.go:457  [s1,r55617/4:/Table/56/1/24{30-40},n1] evict cached descriptor: key=/Table/53/1/2438/1/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]
I180615 19:38:58.620422 906589469 kv/range_cache.go:578  [client=10.142.0.54:38534,user=root,txn=978a2b88,n1] adding descriptor: key=/Meta2/Table/53/1/2440/0 desc=r55673:/Table/53/1/24{30/0-40/0} [(n18,s19):5, (n17,s18):2, (n24,s25):4, next=6]

a-robinson · 2018-06-15T22:21:25Z

Almost all of the evictions in my current tpc-c 10k cluster are due to context cancellations for QueryTxn and HeartbeatTxn requests. We evict range cache entries whenever we get a SendError, which includes errors due to context cancellation. A cancelled context seems like a poor reason to invalidate a range cache entry, and a spectacularly bad reason to invalidate a meta2 range cache entry.

a-robinson · 2018-06-15T22:21:53Z

Sorry, I forgot to include the relevant code in my last comment:

cockroach/pkg/kv/dist_sender.go

Lines 1033 to 1039 in 718f6bd

    
           case *roachpb.SendError, *roachpb.RangeNotFoundError: 
        
           	// We've tried all the replicas without success. Either 
        
           	// they're all down, or we're using an out-of-date range 
        
           	// descriptor. Invalidate the cache and try again with the new 
        
           	// metadata. 
        
           	log.Event(ctx, "evicting range descriptor on send error and backoff for re-lookup") 
        
           	if err := evictToken.Evict(ctx); err != nil {

a-robinson · 2018-06-15T23:24:57Z

And now that I've added in a patch for that, I'm seeing some other questionable behavior causing a lot of range cache evictions. It appears as though whenever we get a RangeNotFound error from one of the replicas, we don't even bother trying the other known replicas. Is there a reason for that?

Although it looks like the biggest issue is still our sending of HeartbeatTxn requests to r1, which we were supposed to have fixed in #26238, but is still happening even though my build includes that change. cc @spencerkimball

Example logs from logspy grepping for r1:

I180615 23:19:44.457027 3577906 kv/leaseholder_cache.go:71  [client=10.142.0.54:37694,user=root,txn=7226f5ad,n26] r1: lookup leaseholder: 7
I180615 23:19:44.457041 3577906 kv/dist_sender.go:1229  [client=10.142.0.54:37694,user=root,txn=7226f5ad,n26] r1: sending batch 1 HeartbeatTxn to (n7,s7):3
I180615 23:19:44.465616 3405758 kv/range_iter.go:174  [client=10.142.0.54:39656,user=root,txn=52767306,n26] key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8] err: <nil>
I180615 23:19:44.465639 3405758 kv/range_iter.go:207  [client=10.142.0.54:39656,user=root,txn=52767306,n26] returning; key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8]
I180615 23:19:44.465661 3405758 kv/leaseholder_cache.go:71  [client=10.142.0.54:39656,user=root,txn=52767306,n26] r1: lookup leaseholder: 7
I180615 23:19:44.465691 3405758 kv/dist_sender.go:1229  [client=10.142.0.54:39656,user=root,txn=52767306,n26] r1: sending batch 1 HeartbeatTxn to (n7,s7):3
I180615 23:19:44.475011 3578003 kv/range_iter.go:174  [client=10.142.0.54:49640,user=root,txn=5734eab5,n26] key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8] err: <nil>
I180615 23:19:44.475038 3578003 kv/range_iter.go:207  [client=10.142.0.54:49640,user=root,txn=5734eab5,n26] returning; key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8]
I180615 23:19:44.475061 3578003 kv/leaseholder_cache.go:71  [client=10.142.0.54:49640,user=root,txn=5734eab5,n26] r1: lookup leaseholder: 7
I180615 23:19:44.475096 3578003 kv/dist_sender.go:1229  [client=10.142.0.54:49640,user=root,txn=5734eab5,n26] r1: sending batch 1 HeartbeatTxn to (n7,s7):3
I180615 23:19:44.477202 3526183 kv/range_iter.go:174  [client=10.142.0.54:49814,user=root,txn=ed9e9e23,ts=1529104765.652544134,0,n26] key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8] err: <nil>
I180615 23:19:44.477226 3526183 kv/range_iter.go:207  [client=10.142.0.54:49814,user=root,txn=ed9e9e23,ts=1529104765.652544134,0,n26] returning; key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8]
I180615 23:19:44.477249 3526183 kv/leaseholder_cache.go:71  [client=10.142.0.54:49814,user=root,txn=ed9e9e23,ts=1529104765.652544134,0,n26] r1: lookup leaseholder: 7
I180615 23:19:44.477274 3526183 kv/dist_sender.go:1229  [client=10.142.0.54:49814,user=root,txn=ed9e9e23,ts=1529104765.652544134,0,n26] r1: sending batch 1 HeartbeatTxn to (n7,s7):3
I180615 23:19:44.484863 3577985 kv/range_iter.go:174  [client=10.142.0.54:59634,user=root,txn=0abb9680,n26] key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8] err: <nil>
I180615 23:19:44.484875 3577985 kv/range_iter.go:207  [client=10.142.0.54:59634,user=root,txn=0abb9680,n26] returning; key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8]
I180615 23:19:44.484892 3577985 kv/leaseholder_cache.go:71  [client=10.142.0.54:59634,user=root,txn=0abb9680,n26] r1: lookup leaseholder: 7
I180615 23:19:44.484913 3577985 kv/dist_sender.go:1229  [client=10.142.0.54:59634,user=root,txn=0abb9680,n26] r1: sending batch 1 HeartbeatTxn to (n7,s7):3
I180615 23:19:44.485149 3578189 kv/range_iter.go:174  [client=10.142.0.54:57154,user=root,txn=8fe9129e,n26] key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8] err: <nil>
I180615 23:19:44.485166 3578277 kv/range_iter.go:174  [client=10.142.0.54:60726,user=root,txn=e2717e42,n26] key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8] err: <nil>
I180615 23:19:44.485173 3578189 kv/range_iter.go:207  [client=10.142.0.54:57154,user=root,txn=8fe9129e,n26] returning; key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8]
I180615 23:19:44.485195 3578189 kv/leaseholder_cache.go:71  [client=10.142.0.54:57154,user=root,txn=8fe9129e,n26] r1: lookup leaseholder: 7
I180615 23:19:44.485205 3578277 kv/range_iter.go:207  [client=10.142.0.54:60726,user=root,txn=e2717e42,n26] returning; key: /Min, desc: r1:/{Min-System/} [(n7,s7):3, (n3,s3):4, (n30,s30):7, next=8]

When running TPC-C 10k on a 30 node cluster without partitioning, range 1 was receiving thousands of qps while all other ranges were receiving no more than low hundreds of qps (more details in cockroachdb#26608. Part of it was context cancellations causing range descriptors to be evicted from the range cache (cockroachdb#26764), but an even bigger part of it was HeartbeatTxns being sent for transactions with no anchor key, accounting for thousands of QPS even after cockroachdb#26764 was fixed. This causes the same outcome as the old code without the load, because without this change we'd just send the request and get back a REASON_TXN_NOT_FOUND error, which would cause the function to return true. It's possible that we should instead avoid the heartbeat loop at all for transactions without a key, or that we should put in more effort to prevent such requests from even counting as transactions (a la cockroachdb#26741, which perhaps makes this change unnecessary?). Advice would be great. Release note: None

a-robinson · 2018-06-16T05:00:40Z

Although it looks like the biggest issue is still our sending of HeartbeatTxn requests to r1, which we were supposed to have fixed in #26238, but is still happening even though my build includes that change. cc @spencerkimball

Sorry, I was mixing up HeartbeatTxns and QueryTxns. This was a separate problem.

I've sent out #26764 for the context cancellation problem and #26765 for all the HeartbeatTxn requests to r1. I'm not doing anything about the cache evictions caused by RangeNotFound errors on a single replica, since I can see some sense in them, even though doing so is now the primary cause of requests to r1, which is still seeing a few dozen qps (#18988 isn't helping).

SendErrors cause range descriptors to be evicted from the range cache, but happen innocuously all the time as requests like HeartbeatTxns are cancelled because they're no longer needed. This was part of the cause of range 1 getting thousands of qps on 30 node TPC-C testing, as initially called out in the comments on cockroachdb#26608. Release note: None

bdarnell · 2018-06-18T07:52:14Z

It appears as though whenever we get a RangeNotFound error from one of the replicas, we don't even bother trying the other known replicas. Is there a reason for that?

That sounds like a mistake. As the comment at dist_sender.go:1034 says, these errors are only supposed to reach this level if we've exhausted all the replicas.

The whole notion of range descriptor invalidation could use some reworking. Let's discuss that on #18988 instead of adding to this issue.

SendErrors cause range descriptors to be evicted from the range cache, but happen innocuously all the time as requests like HeartbeatTxns are cancelled because they're no longer needed. This was part of the cause of range 1 getting thousands of qps on 30 node TPC-C testing, as initially called out in the comments on cockroachdb#26608. Release note: None

26764: kv: Don't treat context cancellation as a SendError r=a-robinson a=a-robinson SendErrors cause range descriptors to be evicted from the range cache, but happen innocuously all the time as requests like HeartbeatTxns are cancelled because they're no longer needed. This was part of the cause of range 1 getting thousands of qps on 30 node TPC-C testing, as initially called out in the comments on #26608. Release note: None --------------------------- `kv` is probably the package I'm least familiar with all the details of, so this may be a bad way of accomplishing this. If so, please let me know. If it's reasonable, I'll add a test that context cancellations don't affect the contents of the range cache. Co-authored-by: Alex Robinson <alexdwanerobinson@gmail.com>

a-robinson · 2018-08-20T19:10:03Z

No need to keep this open any longer. #28340 and #28852 implement this for real.

a-robinson requested a review from a team June 11, 2018 21:15

a-robinson added 2 commits June 14, 2018 10:35

a-robinson force-pushed the store-level-allocator-prototype branch from 61805d0 to 853ac61 Compare June 14, 2018 16:00

a-robinson mentioned this pull request Jun 16, 2018

kv: Don't treat context cancellation as a SendError #26764

Merged

a-robinson mentioned this pull request Jun 16, 2018

kv: Don't heartbeat transactions that are lacking an anchor key #26765

Closed

bdarnell mentioned this pull request Jun 18, 2018

kv: reconsider RangeDescriptorCache eviction policy #18988

Closed

a-robinson closed this Aug 20, 2018

a-robinson mentioned this pull request Aug 21, 2018

storage: diagnose performance issues with tpc-c when not using partitioning #26059

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[prototype] storage: Make rebalance decisions at store-level #26608

[prototype] storage: Make rebalance decisions at store-level #26608

a-robinson commented Jun 11, 2018 •

edited

Loading

cockroach-teamcity commented Jun 11, 2018

a-robinson commented Jun 11, 2018

nvanbenschoten commented Jun 11, 2018

a-robinson commented Jun 14, 2018

nvanbenschoten commented Jun 14, 2018

a-robinson commented Jun 14, 2018 •

edited

Loading

a-robinson commented Jun 15, 2018

a-robinson commented Jun 15, 2018

a-robinson commented Jun 15, 2018

a-robinson commented Jun 15, 2018

a-robinson commented Jun 16, 2018

bdarnell commented Jun 18, 2018

a-robinson commented Aug 20, 2018

[prototype] storage: Make rebalance decisions at store-level #26608

[prototype] storage: Make rebalance decisions at store-level #26608

Conversation

a-robinson commented Jun 11, 2018 • edited Loading

cockroach-teamcity commented Jun 11, 2018

a-robinson commented Jun 11, 2018

nvanbenschoten commented Jun 11, 2018

a-robinson commented Jun 14, 2018

nvanbenschoten commented Jun 14, 2018

a-robinson commented Jun 14, 2018 • edited Loading

a-robinson commented Jun 15, 2018

a-robinson commented Jun 15, 2018

a-robinson commented Jun 15, 2018

a-robinson commented Jun 15, 2018

a-robinson commented Jun 16, 2018

bdarnell commented Jun 18, 2018

a-robinson commented Aug 20, 2018

a-robinson commented Jun 11, 2018 •

edited

Loading

a-robinson commented Jun 14, 2018 •

edited

Loading