
admission,server: scope store overload admission control to write operations on the overloaded store #67717

Merged
merged 1 commit into cockroachdb:master from the store_overload branch on Aug 9, 2021

Conversation

sumeerbhola
Collaborator

admission,server: scope store overload admission control to write operations on the overloaded store

Previously KVWork was all admitted through a single WorkQueue, and this
WorkQueue was paired with a GrantCoordinator that took into account
both CPU overload and storage overload. This meant a single overloaded
store (in a multi-store setting) would slow down all KV operations,
including reads and operations on stores that were not overloaded.

We now have multiple WorkQueues, one per store, that KV writes go
through, and then get to the shared WorkQueue that all operations go
through. The per-store WorkQueues are associated with their own
GrantCoordinators that listen to health state for their respective
stores. The per-store queues do not care about CPU and therefore do not
do grant chaining. Since admission through these per-store queues
happens before the shared queue, a store bottleneck will not cause KV
slots in the shared queue to be taken by requests that are still waiting
for admission elsewhere. The reverse can happen, and is considered
acceptable -- per-store tokens, when a store is overloaded, can be used
by requests that are now waiting for admission in the shared WorkQueue
because of a CPU bottleneck.

The code is significantly refactored for the above:
NewGrantCoordinators returns a container called StoreGrantCoordinators
which lazily initializes the relevant per-store GrantCoordinators when
it first fetches Pebble metrics, in addition to the shared
GrantCoordinator. The server code now integrates with both types and
the code in Node.Batch will sometimes subject a request to two
WorkQueues. The PebbleMetricsProvider now includes StoreIDs, and the
periodic ticking that does 1s ticks and fetches these metrics at 1min
intervals is moved to StoreGrantCoordinators. This simplifies the
ioLoadListener, which no longer does the periodic ticking, and eliminates
some testing-only abstractions. The per-store WorkQueues share the
same metrics, which represent an aggregate across these queues.

Informs #65957

Release note: None
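
A minimal sketch of the two-stage admission ordering described above, assuming simplified names and signatures (workQueue, coordinators, and admitKV are illustrations, not the actual CockroachDB API):

package admissionsketch

import "context"

// workQueue stands in for an admission WorkQueue: Admit blocks until the work
// is admitted or the context is canceled.
type workQueue interface {
	Admit(ctx context.Context) error
}

type coordinators struct {
	perStore map[int32]workQueue // per-store queues; only KV writes consult these
	shared   workQueue           // shared queue that all KV work goes through
}

// admitKV mirrors the ordering above: a write is admitted by its store's queue
// before the shared queue, so a store bottleneck never occupies a slot in the
// shared (CPU-oriented) queue while still waiting elsewhere.
func (c *coordinators) admitKV(ctx context.Context, storeID int32, isWrite bool) error {
	if isWrite {
		if q, ok := c.perStore[storeID]; ok {
			if err := q.Admit(ctx); err != nil {
				return err
			}
		}
	}
	return c.shared.Admit(ctx)
}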

@cockroach-teamcity
Member

This change is Reviewable

Collaborator Author

@sumeerbhola sumeerbhola left a comment


I owe new unit tests, which I'll do after some initial feedback.

We could also use a multi-store roachtest for testing this -- do we have any existing ones that I could use as code examples?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)

@sumeerbhola
Collaborator Author

@RaduBerinde @ajwerner any thoughts on this PR?

Member

@RaduBerinde RaduBerinde left a comment


Sorry, this fell through the cracks.

A bit worried that we keep adding per-request overhead. How hard would it be to make it so that we still use a single queue when we have a single store? (which I believe is the case in most deployments)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)

Collaborator Author

@sumeerbhola sumeerbhola left a comment


A bit worried that we keep adding per-request overhead. How hard would it be to make it so that we still use a single queue when we have a single store? (which I believe is the case in most deployments)

In the current setup, even with a single store, only KV writes (not reads) go through both queues.

I think we'll need to solve any overhead issues with more code optimizations rather than fusing together queues. The latter would make the code more complicated and less composable. I plan to start running tpcc and kv roachtests with admission control enabled, which should tell us more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)

Member

@RaduBerinde RaduBerinde left a comment


Sounds good, thanks.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @sumeerbhola)


pkg/server/node.go, line 958 at r2 (raw file):

				return nil, err
			}
			if !continueToSecondQ {

[nit] Maybe add a comment explaining that in this case, admission control for KVWork is disabled so there is no point in using the second queue.
Also maybe rename continueToSecondQ to admissionEnabled.
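
A rough sketch of the shape this suggests, with placeholder names (admitter, storeQ, sharedQ) rather than the actual node.go identifiers:

package nodesketch

import "context"

// admitter stands in for a WorkQueue whose Admit reports whether admission
// control is enabled at all.
type admitter interface {
	Admit(ctx context.Context) (enabled bool, err error)
}

func admitWrite(ctx context.Context, storeQ, sharedQ admitter) error {
	admissionEnabled, err := storeQ.Admit(ctx)
	if err != nil {
		return err
	}
	// Admission control for KVWork is disabled, so there is no point in also
	// waiting on the second (shared) queue.
	if !admissionEnabled {
		return nil
	}
	_, err = sharedQ.Admit(ctx)
	return err
}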

Collaborator Author

@sumeerbhola sumeerbhola left a comment


We could also use a multi-store roachtest for testing this -- do we have any existing ones that I could use as code examples?

@ajwerner I could use some suggestions on this

  • I see only one roachtest with multiple stores tpce.go. Confusingly, it specifies the store count both to Cluster.Start (--store-count) and to TestSpec.Cluster. It's not clear to me why it is not sufficient to do the former.
  • I'd like to create a kv-like roachtest (single node is fine) with 2 stores, pin tables to the stores to prevent rebalancing, overload one store using a high-concurrency kv0 workload, and have the other store see a kv95 workload that doesn't overload it. Currently workload/kv/kv.go hardcodes the table name, which should be easy to override using a parameter. Also, I suppose it is ok to do two workload run calls (I can use two nodes for the workloads). But I don't know how to pin the tables and prevent rebalancing.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @sumeerbhola)

Contributor

@ajwerner ajwerner left a comment


I see only one roachtest with multiple stores tpce.go. Confusingly, it specifies the store count both to Cluster.Start (--store-count) and to TestSpec.Cluster. It's not clear to me why it is not sufficient to do the former.

The former creates VMs with 2 disks. The latter provides flags to cockroach start to use those disks.

I can use two nodes for the workloads

nit: I don't know that you need this. One workload node to run both workloads seems likely to be sufficient

Currently workload/kv/kv.go hardcodes the table name

While this is true, you can use different databases using the db flag for workload to configure them separately.

But I don't know how to pin the tables and prevent rebalancing.

You'll need to set up zone configs.

I'd like to create a kv-like roachtest (single node is fine) with 2 stores and pin tables to the stores to prevent rebalancing, and overload one store using a high concurrency kv0 workload and have the other store see a kv95 workload that doesn't overload it.

What we're really missing here is that roachprod is not putting attributes on the stores. We'd need to put something like attrs=s%d for the store index or something around here:

var storeDirs []string
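
As a hypothetical sketch of that change (storeDirs and the exact flag layout are assumptions for illustration, not the actual roachprod code):

package roachprodsketch

import "fmt"

// storeStartArgs attaches a per-store attribute (s1, s2, ...) to each --store
// flag so that zone config constraints such as '[+s1]' can pin ranges to a
// specific store.
func storeStartArgs(storeDirs []string) []string {
	var args []string
	for i, dir := range storeDirs {
		args = append(args, fmt.Sprintf("--store=path=%s,attrs=s%d", dir, i+1))
	}
	return args
}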

Then we could do something like:

CREATE DATABASE db1;
ALTER DATABASE db1 CONFIGURE ZONE USING constraints = '[+s1]';
CREATE DATABASE db2;
ALTER DATABASE db2 CONFIGURE ZONE USING constraints = '[+s2]';

And then in the two workloads, specify the --db flag.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola)

@sumeerbhola sumeerbhola force-pushed the store_overload branch 2 times, most recently from d3db1fe to b2b1858 on August 6, 2021 19:08
Collaborator Author

@sumeerbhola sumeerbhola left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @sumeerbhola)


pkg/util/admission/granter_test.go, line 219 at r3 (raw file):

// TestStoreCoordinators tests only the setup of GrantCoordinators per store.
// Testing of IO load functionality happens in TestIOLoadListener.

added a test

Collaborator Author

@sumeerbhola sumeerbhola left a comment


@ajwerner Thanks, that is very useful! I have a simple roachtest with 2 stores, but I can't figure out why the to-be-overloaded db (db2) is not being placed on store2. I see the following in crdb_internal.zones.txt

52  0  DATABASE db1
  configured zone (yaml):
    range_min_bytes: null
    range_max_bytes: null
    gc: null
    global_reads: null
    num_replicas: null
    num_voters: null
    constraints: [+store1]
    voter_constraints: []
    lease_preferences: []
  configured zone (SQL):
    ALTER DATABASE db1 CONFIGURE ZONE USING
      constraints = '[+store1]'
  full zone (yaml):
    range_min_bytes: 134217728
    range_max_bytes: 536870912
    gc:
      ttlseconds: 90000
    global_reads: null
    num_replicas: 3
    num_voters: null
    constraints: [+store1]
    voter_constraints: []
    lease_preferences: []
  full zone (SQL):
    ALTER DATABASE db1 CONFIGURE ZONE USING
      range_min_bytes = 134217728,
      range_max_bytes = 536870912,
      gc.ttlseconds = 90000,
      num_replicas = 3,
      constraints = '[+store1]',
      lease_preferences = '[]'

53  0  DATABASE db2
  configured zone (yaml):
    range_min_bytes: null
    range_max_bytes: null
    gc: null
    global_reads: null
    num_replicas: null
    num_voters: null
    constraints: [+store2]
    voter_constraints: []
    lease_preferences: []
  configured zone (SQL):
    ALTER DATABASE db2 CONFIGURE ZONE USING
      constraints = '[+store2]'
  full zone (yaml):
    range_min_bytes: 134217728
    range_max_bytes: 536870912
    gc:
      ttlseconds: 90000
    global_reads: null
    num_replicas: 3
    num_voters: null
    constraints: [+store2]
    voter_constraints: []
    lease_preferences: []
  full zone (SQL):
    ALTER DATABASE db2 CONFIGURE ZONE USING
      range_min_bytes = 134217728,
      range_max_bytes = 536870912,
      gc.ttlseconds = 90000,
      num_replicas = 3,
      constraints = '[+store2]',
      lease_preferences = '[]'

And the node was started with the correct store attributes

I210806 22:30:36.721467 9327 util/log/file_sync_buffer.go:238 ⋮ [config]   arguments: [‹./cockroach› ‹start› ‹--insecure› ‹--store› ‹path=/mnt/data1/cockroach,attrs=store1› ‹--store› ‹path=/mnt/data2/cockroach,attrs=store2› ‹--log› ‹file-defaults: {dir: 'logs', exit-on-error: false}› ‹--cache=25%› ‹--max-sql-memory=25%› ‹--port=26257› ‹--http-port=26258› ‹--locality=cloud=gce,region=us-east1,zone=us-east1-b› ‹--join=34.73.187.44:26257› ‹--advertise-host› ‹10.142.0.10›]

I didn't see anything in the logs that suggests it is trying to move ranges to conform to these configs. And since these configs were created before the tables, I am assuming that the initial placement of the ranges will be correct.

I couldn't find anything relevant in the db console either. Any suggestions on where I should look?

The code is in
sumeerbhola@4aeb26d

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @sumeerbhola)

@sumeerbhola sumeerbhola force-pushed the store_overload branch 2 times, most recently from 94c07b0 to 1d5cae8 on August 9, 2021 12:19
@ajwerner
Contributor

ajwerner commented Aug 9, 2021

I couldn't find anything relevant in the db console either. Any suggestions on where I should look?

Ack. Will look in the next couple of hours.

@sumeerbhola
Collaborator Author

Will look in the next couple of hours.
Thanks!

I'm going to merge this PR and do the roachtest in a followup, since @RaduBerinde approved this and he is out this week.

@sumeerbhola
Collaborator Author

bors r+

@craig
Contributor

craig bot commented Aug 9, 2021

Build succeeded.

@craig craig bot merged commit 548fa91 into cockroachdb:master Aug 9, 2021
@sumeerbhola
Collaborator Author

I couldn't find anything relevant in the db console either. Any suggestions on where I should look?

Ack. Will look in the next couple of hours.

@ajwerner any pointers would be much appreciated

@ajwerner
Contributor

I couldn't find anything relevant in the db console either. Any suggestions on where I should look?

Ack. Will look in the next couple of hours.

@ajwerner any pointers would be much appreciated

I think that the problem is the inscrutable difference between constraints and voter_constraints. I won't say I understand why it's needed, but I think it is.

alter database db1 configure zone using constraints = '[+store1]', voter_constraints = '[+store1]', num_voters = 1;
alter database db2 configure zone using constraints = '[+store2]', voter_constraints = '[+store2]', num_voters = 1;

I just ran the experiment on ajwerner-test if you want to take a look. It seems like the replicas are being properly distributed.

root@localhost:26257/defaultdb> SELECT database_name, replicas FROM crdb_internal.ranges WHERE database_name LIKE 'db%';
  database_name | replicas
----------------+-----------
  db1           | {1}
  db1           | {1}
  db1           | {1}
  db1           | {1}
  db2           | {2}
  db2           | {2}

@sumeerbhola sumeerbhola deleted the store_overload branch September 14, 2021 21:39