
roachtest: introduce admission-control/elastic-cdc #89656

Merged: 1 commit into cockroachdb:master on Oct 17, 2022

Conversation

@irfansharif (Contributor) commented Oct 10, 2022

Part of #89208. This test sets up a 3-node CRDB cluster on 8vCPU machines running 1000-warehouse TPC-C, and kicks off a few changefeed backfills concurrently. We've observed latency spikes during backfills because of their CPU/scan-heavy nature -- they can elevate CPU scheduling latencies, which in turn translates into an increase in foreground latency.

Also in this commit: route the std{err,out} from the prometheus/grafana setup that roachtests do to the in-scope logger.

Release note: None

@irfansharif requested a review from a team October 10, 2022 14:26
@irfansharif requested a review from a team as a code owner October 10, 2022 14:26
@irfansharif requested review from smg260 and removed request for a team October 10, 2022 14:26
@irfansharif removed the request for review from smg260 October 10, 2022 14:26
@irfansharif (Author) commented:

05:02:26 util.go:55: up-replication complete
05:02:26 tpcc.go:175: test status: loading fixture (<10m0s)
05:08:53 tpcc.go:191: test status: finished tpc-c setup
05:08:53 tpcc.go:255: test worker status: running tpcc worker=0 warehouses=1000 ramp=5m0s duration=1h30m0s on {pgurl:1-3} (<1m0s)
05:08:53 admission_control_elastic_cdc.go:84: test status: during: cluster settings (<30s)
05:08:53 admission_control_elastic_cdc.go:94: test status: during: setting performance baseline (<5m0s)
05:13:53 admission_control_elastic_cdc.go:129: test status: during: waiting for workload to finish (<1h30m0s)
05:13:53 admission_control_elastic_cdc.go:113: test status: during: creating changefeed 0 (<1m0s)
05:18:53 admission_control_elastic_cdc.go:113: test status: during: creating changefeed 1 (<1m0s)
05:23:53 admission_control_elastic_cdc.go:113: test status: during: creating changefeed 2 (<1m0s)
05:28:53 admission_control_elastic_cdc.go:113: test status: during: creating changefeed 3 (<1m0s)
05:33:53 admission_control_elastic_cdc.go:113: test status: during: creating changefeed 4 (<1m0s)
06:44:02 test_runner.go:942: tearing down after success; see teardown.log

Here's a prometheus dump of an experiment run. To view it, run the contained ./prometheus-docker-run.sh and point your grafana instance at localhost:9090 (brew install grafana; brew services start grafana, then open localhost:3000, if you're on a Mac). http://go.crdb.dev/p/changefeed-admission-control-grafana is the set of dashboards I've been using.

elastic-cdc.zip

[screenshots: Grafana dashboards from the experiment run]

@irfansharif force-pushed the 221007.rangefeed-ac-test branch 2 times, most recently from a3f7ea1 to 36d5dd8 on October 11, 2022 00:21
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Oct 11, 2022
Part of cockroachdb#65957.

Changefeed backfills, given their scan-heavy nature, can be fairly
CPU-intensive. In cockroachdb#89656 we introduced a roachtest demonstrating the
latency impact backfills can have on a moderately CPU-saturated cluster.
Similar to what we saw for backups, this CPU-heavy nature can elevate Go
scheduling latencies which in turn translates to foreground latency
impact. This commit integrates rangefeed catchup scan with the elastic
CPU limiter we introduced in cockroachdb#86638; this is one of two optional halves
of changefeed backfills. The second half is the initial scan -- scan
requests issued over some keyspan as of some timestamp. For that we
simply rely on the existing slots mechanism but now setting a lower
priority bit (BulkNormalPri). Experimentally we observed that during
initial scans the encoding routines in changefeedccl are the most
impactful CPU-wise, something cockroachdb#89589 can help with. We leave admission
integration of parallel worker goroutines to future work.

Unlike export requests, rangefeed catchup scans are non-preemptible. The
rangefeed RPC is a streaming one, and the catchup scan is done during
stream setup. So we don't have resumption tokens to propagate up to the
caller like we did for backups. We still want CPU-bound work going
through admission control to only use 100ms of CPU time, to exercise
better control over scheduling latencies. To that end, we introduce the
following component used within the rangefeed catchup iterator.

    // Pacer is used in tight loops (CPU-bound) for non-preemptible
    // elastic work. Callers are expected to invoke Pace() every loop
    // iteration and Close() once done. Internally this type integrates
    // with elastic CPU work queue, acquiring tokens for the CPU work
    // being done, and blocking if tokens are unavailable. This allows
    // for a form of cooperative scheduling with elastic CPU granters.
    type Pacer struct
    func (p *Pacer) Pace(ctx context.Context) error { ... }
    func (p *Pacer) Close() { ... }

Release note: None
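For illustration, here is a minimal, self-contained sketch of how a caller might drive such a Pacer from a CPU-bound catch-up loop. The pacer and loop below are stand-ins written for this write-up, not the actual CockroachDB types or integration: the real Pacer blocks on the elastic CPU work queue for tokens, whereas this toy version only simulates the ~100ms slice bookkeeping.

    package main

    import (
    	"context"
    	"fmt"
    	"time"
    )

    // pacer is a toy stand-in for the Pacer described above. Pace() is meant
    // to be called on every loop iteration and may block until CPU tokens are
    // available; Close() releases whatever is still held.
    type pacer struct {
    	sliceStart time.Time
    	slice      time.Duration
    }

    func (p *pacer) Pace(ctx context.Context) error {
    	if err := ctx.Err(); err != nil {
    		return err // bail out if the caller's context was cancelled
    	}
    	if time.Since(p.sliceStart) < p.slice {
    		return nil // still within the current slice of CPU time
    	}
    	// The real implementation would block here on the elastic CPU work
    	// queue until more tokens are granted. We just start a new slice.
    	p.sliceStart = time.Now()
    	return nil
    }

    func (p *pacer) Close() {
    	// The real implementation returns any unused tokens to the granter.
    }

    func main() {
    	ctx := context.Background()
    	p := &pacer{sliceStart: time.Now(), slice: 100 * time.Millisecond}
    	defer p.Close()

    	// Stand-in for the rangefeed catch-up iterator's tight loop: pace every
    	// iteration so the goroutine yields when its CPU allotment runs out.
    	for i := 0; i < 1_000_000; i++ {
    		if err := p.Pace(ctx); err != nil {
    			fmt.Println("pacing interrupted:", err)
    			return
    		}
    		// ... emit the next catch-up event here ...
    	}
    	fmt.Println("catch-up loop done")
    }

The point is the calling pattern rather than the bookkeeping: Pace() on every iteration so the goroutine yields at slice boundaries, and Close() exactly once when the loop exits.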
@sumeerbhola (Collaborator) left a comment

nice! minor comments.
:lgtm:

Reviewed 5 of 5 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)


pkg/cmd/roachtest/tests/admission_control_elastic_cdc.go line 106 at r1 (raw file):

					time.Sleep(padDuration)

					stopFeeds(db)

Is this stopping feeds before starting the experiment below since there may be some leftover from before?


pkg/cmd/roachtest/tests/admission_control_elastic_cdc.go line 117 at r1 (raw file):

							t.Status(fmt.Sprintf("during: round %d: stopping extant changefeeds (<%s)", i, stopFeedsDuration))
							stopFeeds(db)
							time.Sleep(stopFeedsDuration)

The logic here could use an overview comment explaining what it is trying to do. And explain stopFeedsDuration and padDuration.
Is the number of changefeeds > 1 so that they will all do their independent catchup scan? How does this cause the "Changefeeds running" graph to show an increasing step function?


pkg/cmd/roachtest/tests/admission_control_elastic_cdc.go line 124 at r1 (raw file):

									CREATE CHANGEFEED FOR tpcc.order_line, tpcc.stock, tpcc.customer
									INTO 'null://' WITH cursor = '-%ds'
								`, int64(float64(i+1)*padDuration.Seconds()))

Are we increasing how far back we need to scan for the catchup scan?

@irfansharif (Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @shermanCRL and @sumeerbhola)


pkg/cmd/roachtest/tests/admission_control_elastic_cdc.go line 106 at r1 (raw file):

Previously, sumeerbhola wrote…

Is this stopping feeds before starting the experiment below since there may be some leftover from before?

Yes, it was handy when constantly re-running against the same cluster. Added a comment.


pkg/cmd/roachtest/tests/admission_control_elastic_cdc.go line 117 at r1 (raw file):
Added comments.

Is the number of changefeeds > 1 so that they will all do their independent catchup scan?

Yes, and just to really swamp the cluster. Just this week @shermanCRL pointed me to a case internally where a customer was trying to run 50+ changefeeds all at once and was worried about the CPU effects. See this slack thread.

How does this cause the "Changefeeds running" graph to show an increasing step function?

The screenshot above was from a different version of the test. The most recent version/data dumps, which don't show these step functions, are here: #89709 (comment).


pkg/cmd/roachtest/tests/admission_control_elastic_cdc.go line 124 at r1 (raw file):

Previously, sumeerbhola wrote…

Are we increasing how far back we need to scan for the catchup scan?

Yes, we're trying to scan as far back as possible (roughly to when the workload started). This test doesn't care whether the work actually gets done (feeds are cancelled by the next iteration), just what the effect is on CPU/scheduling latencies. Added a comment.
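To make the arithmetic concrete, here's a small standalone sketch of the cursor values that the test's Sprintf produces round by round. The padDuration value below is assumed for illustration only; the real value is defined in the roachtest.

    package main

    import (
    	"fmt"
    	"time"
    )

    func main() {
    	// Assumed for illustration; the actual padDuration lives in the test.
    	padDuration := 5 * time.Minute

    	// Mirrors the fmt.Sprintf in admission_control_elastic_cdc.go: each
    	// round's changefeed starts its catch-up scan further in the past, so
    	// later rounds re-scan (roughly) everything the workload has written
    	// since it started.
    	for i := 0; i < 5; i++ {
    		cursor := int64(float64(i+1) * padDuration.Seconds())
    		fmt.Printf("round %d: WITH cursor = '-%ds'\n", i, cursor)
    		// round 0: '-300s', round 1: '-600s', round 2: '-900s', ...
    	}
    }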

@irfansharif (Author) commented:

bors r+

@craig (bot) commented Oct 17, 2022

Build succeeded:

@craig (bot) merged commit 55aca65 into cockroachdb:master Oct 17, 2022
@irfansharif deleted the 221007.rangefeed-ac-test branch October 17, 2022 21:28
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Oct 24, 2022
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Oct 28, 2022
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Nov 2, 2022
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Nov 3, 2022
craig bot pushed a commit that referenced this pull request Nov 4, 2022
89709: kv,rangefeed: integrate catchup scans with elastic cpu r=irfansharif a=irfansharif

Part of #65957. First commit is from #89656.

Changefeed backfills, given their scan-heavy nature, can be fairly
CPU-intensive. In #89656 we introduced a roachtest demonstrating the
latency impact backfills can have on a moderately CPU-saturated cluster.
Similar to what we saw for backups, this CPU-heavy nature can elevate Go
scheduling latencies which in turn translates to foreground latency
impact. This commit integrates rangefeed catchup scan with the elastic
CPU limiter we introduced in #86638; this is one of two optional halves
of changefeed backfills. The second half is the initial scan -- scan
requests issued over some keyspan as of some timestamp. For that we
simply rely on the existing slots mechanism but now setting a lower
priority bit (BulkNormalPri). Experimentally we observed that during
initial scans the encoding routines in changefeedccl are the most
impactful CPU-wise, something #89589 can help with. We leave admission
integration of parallel worker goroutines to future work.

Unlike export requests, rangefeed catchup scans are non-preemptible. The
rangefeed RPC is a streaming one, and the catchup scan is done during
stream setup. So we don't have resumption tokens to propagate up to the
caller like we did for backups. We still want CPU-bound work going
through admission control to use at most 100ms of CPU time (it makes
for an ineffective controller otherwise), and to that end we introduce
the following component used within the rangefeed catchup iterator:

    // Pacer is used in tight loops (CPU-bound) for non-preemptible
    // elastic work. Callers are expected to invoke Pace() every loop
    // iteration and Close() once done. Internally this type integrates
    // with elastic CPU work queue, acquiring tokens for the CPU work
    // being done, and blocking if tokens are unavailable. This allows
    // for a form of cooperative scheduling with elastic CPU granters.
    type Pacer struct
    func (p *Pacer) Pace(ctx context.Context) error { ... }
    func (p *Pacer) Close() { ... }

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Nov 24, 2022