
admission,server: scope store overload admission control to write operations on the overloaded store #67717

Merged
merged 1 commit into cockroachdb:master from the store_overload branch on Aug 9, 2021

Conversation

sumeerbhola
Collaborator

admission,server: scope store overload admission control to write operations on the overloaded store

Previously KVWork was all admitted through a single WorkQueue, and this
WorkQueue was paired with a GrantCoordinator that took into account
both CPU overload and storage overload. This meant a single overloaded
store (in a multi-store setting) would slow down all KV operations,
including reads and operations on stores that were not overloaded.

We now have multiple WorkQueues, one per store, that KV writes go
through, and then get to the shared WorkQueue that all operations go
through. The per-store WorkQueues are associated with their own
GrantCoordinators that listen to health state for their respective
stores. The per-store queues do not care about CPU and therefore do not
do grant chaining. Since admission through these per-store queues
happens before the shared queue, a store bottleneck will not cause KV
slots in the shared queue to be taken by requests that are still waiting
for admission elsewhere. The reverse can happen, and is considered
acceptable -- per-store tokens, when a store is overloaded, can be used
by requests that are now waiting for admission in the shared WorkQueue
because of a CPU bottleneck.

The code is significantly refactored for the above:
NewGrantCoordinators returns a container called StoreGrantCoordinators
which lazily initializes the relevant per-store GrantCoordinators when
it first fetches Pebble metrics, in addition to the shared
GrantCoordinator. The server code now integrates with both types and
the code in Node.Batch will sometimes subject a request to two
WorkQueues. The PebbleMetricsProvider now includes StoreIDs, and the
periodic ticking that does 1s ticks and fetches these metrics at 1min
intervals is moved to StoreGrantCoordinators. This simplifies the
ioLoadListener, which no longer does the periodic ticking, and eliminates
some testing-only abstractions. The per-store WorkQueues share the
same metrics, which represent an aggregate across these queues.

Informs #65957

Release note: None
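
A minimal sketch of the two-stage admission ordering described above, assuming simplified names and signatures (workQueue, coordinators, and admitKV are illustrations, not the actual CockroachDB API):

package admissionsketch

import "context"

// workQueue stands in for an admission WorkQueue: Admit blocks until the work
// is admitted or the context is canceled.
type workQueue interface {
	Admit(ctx context.Context) error
}

type coordinators struct {
	perStore map[int32]workQueue // per-store queues; only KV writes consult these
	shared   workQueue           // shared queue that all KV work goes through
}

// admitKV mirrors the ordering above: a write is admitted by its store's queue
// before the shared queue, so a store bottleneck never occupies a slot in the
// shared (CPU-oriented) queue while still waiting elsewhere.
func (c *coordinators) admitKV(ctx context.Context, storeID int32, isWrite bool) error {
	if isWrite {
		if q, ok := c.perStore[storeID]; ok {
			if err := q.Admit(ctx); err != nil {
				return err
			}
		}
	}
	return c.shared.Admit(ctx)
}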

@cockroach-teamcity
Member

This change is Reviewable

Collaborator Author

@sumeerbhola sumeerbhola left a comment


I owe new unit tests, which I'll do after some initial feedback.

We could also use a multi-store roachtest for testing this -- do we have any existing ones that I could use as code examples?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)

@sumeerbhola
Collaborator Author

@RaduBerinde @ajwerner any thoughts on this PR?

Member

@RaduBerinde RaduBerinde left a comment


Sorry, this fell through the cracks.

A bit worried that we keep adding per-request overhead. How hard would it be to make it so that we still use a single queue when we have a single store? (which I believe is the case in most deployments)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)

Collaborator Author

@sumeerbhola sumeerbhola left a comment


A bit worried that we keep adding per-request overhead. How hard would it be to make it so that we still use a single queue when we have a single store? (which I believe is the case in most deployments)

In the current setup, even with a single store, only KV writes (not reads) go through both queues.

I think we'll need to solve any overhead issues with more code optimizations rather than fusing together queues. The latter would make the code more complicated and less composable. I plan to start running tpcc and kv roachtests with admission control enabled, which should tell us more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)

Member

@RaduBerinde RaduBerinde left a comment


Sounds good, thanks.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @sumeerbhola)


pkg/server/node.go, line 958 at r2 (raw file):

				return nil, err
			}
			if !continueToSecondQ {

[nit] Maybe add a comment explaining that in this case, admission control for KVWork is disabled so there is no point in using the second queue.
Also maybe rename continueToSecondQ to admissionEnabled.
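
A rough sketch of the shape this suggests, with placeholder names (admitter, storeQ, sharedQ) rather than the actual node.go identifiers:

package nodesketch

import "context"

// admitter stands in for a WorkQueue whose Admit reports whether admission
// control is enabled at all.
type admitter interface {
	Admit(ctx context.Context) (enabled bool, err error)
}

func admitWrite(ctx context.Context, storeQ, sharedQ admitter) error {
	admissionEnabled, err := storeQ.Admit(ctx)
	if err != nil {
		return err
	}
	// Admission control for KVWork is disabled, so there is no point in also
	// waiting on the second (shared) queue.
	if !admissionEnabled {
		return nil
	}
	_, err = sharedQ.Admit(ctx)
	return err
}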

Collaborator Author

@sumeerbhola sumeerbhola left a comment


We could also use a multi-store roachtest for testing this -- do we have any existing ones that I could use as code examples?

@ajwerner I could use some suggestions on this

  • I see only one roachtest with multiple stores tpce.go. Confusingly, it specifies the store count both to Cluster.Start (--store-count) and to TestSpec.Cluster. It's not clear to me why it is not sufficient to do the former.
  • I'd like to create a kv-like roachtest (single node is fine) with 2 stores, pin tables to the stores to prevent rebalancing, overload one store using a high-concurrency kv0 workload, and have the other store see a kv95 workload that doesn't overload it. Currently workload/kv/kv.go hardcodes the table name, which should be easy to override using a parameter. Also, I suppose it is ok to do two workload run calls (I can use two nodes for the workloads). But I don't know how to pin the tables and prevent rebalancing.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @sumeerbhola)

Contributor

@ajwerner ajwerner left a comment


I see only one roachtest with multiple stores tpce.go. Confusingly, it specifies the store count both to Cluster.Start (--store-count) and to TestSpec.Cluster. It's not clear to me why it is not sufficient to do the former.

The former creates VMs with 2 disks. The latter provides flags to cockroach start to use those disks.

I can use two nodes for the workloads

nit: I don't know that you need this. One workload node to run both workloads seems likely to be sufficient

Currently workload/kv/kv.go hardcodes the table name

While this is true, you can use different databases using the db flag for workload to configure them separately.

But I don't know how to pin the tables and prevent rebalancing.

You'll need to set up zone configs.

I'd like to create a kv-like roachtest (single node is fine) with 2 stores and pin tables to the stores to prevent rebalancing, and overload one store using a high concurrency kv0 workload and have the other store see a kv95 workload that doesn't overload it.

What we're really missing here is that roachprod is not putting attributes on the stores. We'd need to put something like attrs=s%d for the store index or something around here:

var storeDirs []string
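
As a hypothetical sketch of that change (storeDirs and the exact flag layout are assumptions for illustration, not the actual roachprod code):

package roachprodsketch

import "fmt"

// storeStartArgs attaches a per-store attribute (s1, s2, ...) to each --store
// flag so that zone config constraints such as '[+s1]' can pin ranges to a
// specific store.
func storeStartArgs(storeDirs []string) []string {
	var args []string
	for i, dir := range storeDirs {
		args = append(args, fmt.Sprintf("--store=path=%s,attrs=s%d", dir, i+1))
	}
	return args
}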

Then we could do something like:

CREATE DATABASE db1;
ALTER DATABASE db1 CONFIGURE ZONE USING constraints = '[+s1]';
CREATE DATABASE db2;
ALTER DATABASE db2 CONFIGURE ZONE USING constraints = '[+s2]';

And then in the two workloads, specify the --db flag.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola)

@sumeerbhola sumeerbhola force-pushed the store_overload branch 2 times, most recently from d3db1fe to b2b1858 on August 6, 2021 19:08
Collaborator Author

@sumeerbhola sumeerbhola left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @sumeerbhola)


pkg/util/admission/granter_test.go, line 219 at r3 (raw file):

// TestStoreCoordinators tests only the setup of GrantCoordinators per store.
// Testing of IO load functionality happens in TestIOLoadListener.

added a test

Collaborator Author

@sumeerbhola sumeerbhola left a comment


@ajwerner Thanks, that is very useful! I have a simple roachtest with 2 stores, but I can't figure out why the to-be-overloaded db (db2) is not being placed on store2. I see the following in crdb_internal.zones.txt

52  0  DATABASE db1
  configured zone (yaml):
    range_min_bytes: null
    range_max_bytes: null
    gc: null
    global_reads: null
    num_replicas: null
    num_voters: null
    constraints: [+store1]
    voter_constraints: []
    lease_preferences: []
  configured zone (SQL):
    ALTER DATABASE db1 CONFIGURE ZONE USING
      constraints = '[+store1]'
  full zone (yaml):
    range_min_bytes: 134217728
    range_max_bytes: 536870912
    gc:
      ttlseconds: 90000
    global_reads: null
    num_replicas: 3
    num_voters: null
    constraints: [+store1]
    voter_constraints: []
    lease_preferences: []
  full zone (SQL):
    ALTER DATABASE db1 CONFIGURE ZONE USING
      range_min_bytes = 134217728,
      range_max_bytes = 536870912,
      gc.ttlseconds = 90000,
      num_replicas = 3,
      constraints = '[+store1]',
      lease_preferences = '[]'

53  0  DATABASE db2
  configured zone (yaml):
    range_min_bytes: null
    range_max_bytes: null
    gc: null
    global_reads: null
    num_replicas: null
    num_voters: null
    constraints: [+store2]
    voter_constraints: []
    lease_preferences: []
  configured zone (SQL):
    ALTER DATABASE db2 CONFIGURE ZONE USING
      constraints = '[+store2]'
  full zone (yaml):
    range_min_bytes: 134217728
    range_max_bytes: 536870912
    gc:
      ttlseconds: 90000
    global_reads: null
    num_replicas: 3
    num_voters: null
    constraints: [+store2]
    voter_constraints: []
    lease_preferences: []
  full zone (SQL):
    ALTER DATABASE db2 CONFIGURE ZONE USING
      range_min_bytes = 134217728,
      range_max_bytes = 536870912,
      gc.ttlseconds = 90000,
      num_replicas = 3,
      constraints = '[+store2]',
      lease_preferences = '[]'

And the node was started with the correct store attributes

I210806 22:30:36.721467 9327 util/log/file_sync_buffer.go:238 ⋮ [config]   arguments: [‹./cockroach› ‹start› ‹--insecure› ‹--store› ‹path=/mnt/data1/cockroach,attrs=store1› ‹--store› ‹path=/mnt/data2/cockroach,attrs=store2› ‹--log› ‹file-defaults: {dir: 'logs', exit-on-error: false}› ‹--cache=25%› ‹--max-sql-memory=25%› ‹--port=26257› ‹--http-port=26258› ‹--locality=cloud=gce,region=us-east1,zone=us-east1-b› ‹--join=34.73.187.44:26257› ‹--advertise-host› ‹10.142.0.10›]

I didn't see anything in the logs that suggests it is trying to move ranges to conform to these configs. And since these configs were created before the tables, I am assuming that the initial placement of the ranges will be correct.

I couldn't find anything relevant in the db console either. Any suggestions on where I should look?

The code is in
sumeerbhola@4aeb26d

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @sumeerbhola)

@sumeerbhola sumeerbhola force-pushed the store_overload branch 2 times, most recently from 94c07b0 to 1d5cae8 on August 9, 2021 12:19
@ajwerner
Contributor

ajwerner commented Aug 9, 2021

I couldn't find anything relevant in the db console either. Any suggestions on where I should look?

Ack. Will look in the next couple of hours.

@sumeerbhola
Collaborator Author

Will look in the next couple of hours.
Thanks!

I'm going to merge this PR and do the roachtest in a followup, since @RaduBerinde approved this and he is out this week.

@sumeerbhola
Collaborator Author

bors r+

@craig
Contributor

craig bot commented Aug 9, 2021

Build succeeded.

@craig craig bot merged commit 548fa91 into cockroachdb:master Aug 9, 2021
@sumeerbhola
Collaborator Author

I couldn't find anything relevant in the db console either. Any suggestions on where I should look?

Ack. Will look in the next couple of hours.

@ajwerner any pointers would be much appreciated

@ajwerner
Contributor

I couldn't find anything relevant in the db console either. Any suggestions on where I should look?

Ack. Will look in the next couple of hours.

@ajwerner any pointers would be much appreciated

I think that the problem is the inscrutable difference between constraints and voter_constraints. I won't say I understand why it's needed, but I think it is.

alter database db1 configure zone using constraints = '[+store1]', voter_constraints = '[+store1]', num_voters = 1;
alter database db2 configure zone using constraints = '[+store2]', voter_constraints = '[+store2]', num_voters = 1;

I just ran the experiment on ajwerner-test if you want to take a look. It seems like the replicas are being properly distributed.

root@localhost:26257/defaultdb> SELECT database_name, replicas FROM crdb_internal.ranges WHERE database_name LIKE 'db%';
  database_name | replicas
----------------+-----------
  db1           | {1}
  db1           | {1}
  db1           | {1}
  db1           | {1}
  db2           | {2}
  db2           | {2}

@sumeerbhola sumeerbhola deleted the store_overload branch September 14, 2021 21:39