release-22.2: sql: output RU estimate for EXPLAIN ANALYZE on tenants #93179

DrewKimball · 2022-12-07T02:22:30Z

Backport:

1/1 commits from "sql: output RU estimate for EXPLAIN ANALYZE on tenants" (sql: output RU estimate for EXPLAIN ANALYZE on tenants #89256)
1/1 commits from "sql: add cluster setting to disable RU estimation" (sql: add cluster setting to disable RU estimation #92968)

Please see individual PRs for details.

/cc @cockroachdb/release

Release justification: low-risk, high-benefit change to existing functionality

blathers-crl · 2022-12-07T02:22:33Z

cockroach-teamcity · 2022-12-07T02:22:42Z

This change is

DrewKimball · 2022-12-07T02:27:35Z

I've left the cluster setting as default-on - should it be off by default for the backport?

Also, while the backport was mostly clean, I had to make a few changes to planNodeToRowSource vs the original PR since there have been changes to context-passing and span collection for that processor after 22.2. Instead of using execstats.ShouldCollectStats when deciding whether to set the ExecStatsForTrace callback, we only check flowCtx.CollectStats, and make the same check when deciding whether to record a span. @yuzefovich does this seem safe to you?

rytaft

(although would be good to get sign off from @yuzefovich regarding your question above)

I think it's ok to leave the setting defaulted to on for backporting to 22.2. I think we'll get this PR into 22.2.1, and presumably serverless clusters will start on 22.2.1 or later. If we want to backport to 22.1 it should probably be defaulted to off.

Reviewed 37 of 37 files at r1, 9 of 9 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @yuzefovich)

andy-kimball · 2022-12-07T17:04:33Z

We don't need to backport this to 22.1.

yuzefovich

Reviewed 37 of 37 files at r1, 9 of 9 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @DrewKimball)

pkg/sql/plan_node_to_row_source.go line 126 at r1 (raw file):

	// We can't use execstats.ShouldCollectStats here because the context isn't
	// passed to InitWithOutput.
	if flowCtx.CollectStats {

This is safe, but we should use execstats.ShouldCollectStats(flowCtx.EvalCtx.Ctx(), flowCtx.CollectStats) for consistency with other places on 22.2 branch.

pkg/sql/plan_node_to_row_source.go line 163 at r1 (raw file):

func (p *planNodeToRowSource) Start(ctx context.Context) {
	if p.FlowCtx.CollectStats {
		ctx = p.StartInternal(ctx, nodeName(p.node))

I think we should just always call StartInternal here (we did this in 7291e4d on master). The only thing I would check is whether there are any regressions in BenchmarkSQL/Insert\$\$ in pkg/bench, but I don't expect them to be there.

DrewKimball

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @yuzefovich)

pkg/sql/plan_node_to_row_source.go line 126 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

This is safe, but we should use execstats.ShouldCollectStats(flowCtx.EvalCtx.Ctx(), flowCtx.CollectStats) for consistency with other places on 22.2 branch.

In the tests I did, flowCtx.EvalCtx.Ctx() didn't necessarily have a span set even when I added the StartInternal call to Start - I guess maybe the difference is because there are more StartInternal calls on master now? Maybe this check should go in a different place to ensure the span is set?

yuzefovich

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @DrewKimball)

pkg/sql/plan_node_to_row_source.go line 126 at r1 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

In the tests I did, flowCtx.EvalCtx.Ctx() didn't necessarily have a span set even when I added the StartInternal call to Start - I guess maybe the difference is because there are more StartInternal calls on master now? Maybe this check should go in a different place to ensure the span is set?

Hm, this is surprising to me - what was the reproduction that you ran into this with?

DrewKimball

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @yuzefovich)

pkg/sql/plan_node_to_row_source.go line 126 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Hm, this is surprising to me - what was the reproduction that you ran into this with?

It wasn't anything special - I just did an insert on a table like this:

CREATE TABLE xy (x INT, y INT);
EXPLAIN ANALYZE INSERT INTO xy (SELECT t, t FROM generate_series(1, 10000) g(t));

yuzefovich

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @DrewKimball)

pkg/sql/plan_node_to_row_source.go line 126 at r1 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

It wasn't anything special - I just did an insert on a table like this:
CREATE TABLE xy (x INT, y INT);
EXPLAIN ANALYZE INSERT INTO xy (SELECT t, t FROM generate_series(1, 10000) g(t));

Hm, I just tried these two queries in demo --multitenant=true with cockroach binary built on this branch plus

diff --git a/pkg/sql/plan_node_to_row_source.go b/pkg/sql/plan_node_to_row_source.go
index 18401059a5..c7789a88d3 100644
--- a/pkg/sql/plan_node_to_row_source.go
+++ b/pkg/sql/plan_node_to_row_source.go
@@ -123,7 +123,7 @@ func (p *planNodeToRowSource) InitWithOutput(
        }
        // We can't use execstats.ShouldCollectStats here because the context isn't
        // passed to InitWithOutput.
-       if flowCtx.CollectStats {
+       if execstats.ShouldCollectStats(flowCtx.EvalCtx.Ctx(), flowCtx.CollectStats) {
                p.ExecStatsForTrace = p.execStatsForTrace
        }
        return nil

and it worked

root@127.0.0.1:26257/defaultdb> EXPLAIN ANALYZE INSERT INTO xy (SELECT t, t FROM generate_series(1, 10000) g(t));
                 info
--------------------------------------
  planning time: 171µs
  execution time: 325ms
  distribution: local
  vectorized: true
  maximum memory usage: 70 KiB
  network usage: 0 B (0 messages)
  estimated RUs consumed: 10,291
...

yuzefovich

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @DrewKimball)

pkg/sql/plan_node_to_row_source.go line 126 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Hm, I just tried these two queries in demo --multitenant=true with cockroach binary built on this branch plus

diff --git a/pkg/sql/plan_node_to_row_source.go b/pkg/sql/plan_node_to_row_source.go
index 18401059a5..c7789a88d3 100644
--- a/pkg/sql/plan_node_to_row_source.go
+++ b/pkg/sql/plan_node_to_row_source.go
@@ -123,7 +123,7 @@ func (p *planNodeToRowSource) InitWithOutput(
        }
        // We can't use execstats.ShouldCollectStats here because the context isn't
        // passed to InitWithOutput.
-       if flowCtx.CollectStats {
+       if execstats.ShouldCollectStats(flowCtx.EvalCtx.Ctx(), flowCtx.CollectStats) {
                p.ExecStatsForTrace = p.execStatsForTrace
        }
        return nil

and it worked

root@127.0.0.1:26257/defaultdb> EXPLAIN ANALYZE INSERT INTO xy (SELECT t, t FROM generate_series(1, 10000) g(t));
                 info
--------------------------------------
  planning time: 171µs
  execution time: 325ms
  distribution: local
  vectorized: true
  maximum memory usage: 70 KiB
  network usage: 0 B (0 messages)
  estimated RUs consumed: 10,291
...

Or was the problem that RUs estimation was wrong?

DrewKimball

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @yuzefovich)

pkg/sql/plan_node_to_row_source.go line 126 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Or was the problem that RUs estimation was wrong?

Hm, maybe I tested that before changing to use StartInternal or something. Reverted to use ShouldCollectStats. Done.

pkg/sql/plan_node_to_row_source.go line 163 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I think we should just always call StartInternal here (we did this in 7291e4d on master). The only thing I would check is whether there are any regressions in BenchmarkSQL/Insert\$\$ in pkg/bench, but I don't expect them to be there.

Done.

yuzefovich

Reviewed 10 of 10 files at r3, 9 of 9 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andy-kimball)

DrewKimball · 2022-12-08T17:23:35Z

TFTRs!

DrewKimball · 2022-12-09T20:44:04Z

@yuzefovich I had to modify the call to StartInternal for planNodeToRowSource to make TestAlreadyRunningJobsAreHandledProperly test pass.

yuzefovich

LGTM, although it probably would have been cleaner to backport the whole 7291e4d without modifying the commit for RU estimation. I'm ok with merging either way.

Reviewed 18 of 18 files at r5, 9 of 9 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andy-kimball and @DrewKimball)

pkg/sql/opt/exec/execbuilder/testdata/select line 24 at r5 (raw file):

query ITT
SELECT
  span, split_part(regexp_replace(message, 'pos:[0-9]*', 'pos:?'), E'\n', 1), operation

Why did we need to modify this? Looks like 7291e4d didn't do it.

DrewKimball

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andy-kimball and @yuzefovich)

pkg/sql/opt/exec/execbuilder/testdata/select line 24 at r5 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Why did we need to modify this? Looks like 7291e4d didn't do it.

The SPAN START: values lines started including the flow ID, which made the test non-deterministic... it's not clear to me what's changed since 22.2 to cause this

yuzefovich

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andy-kimball and @DrewKimball)

pkg/sql/opt/exec/execbuilder/testdata/select line 24 at r5 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

The SPAN START: values lines started including the flow ID, which made the test non-deterministic... it's not clear to me what's changed since 22.2 to cause this

Those were removed on master in ae24dd9. I think we should just backport all three commits from #87317, and then just keep the original two commits here. Will do that now.

This commit adds a top-level field to the output of `EXPLAIN ANALYZE` that shows the estimated number of RUs that would be consumed due to network egress to the client. The estimate is obtained by buffering each value from the query result in text format and then measuring the size of the buffer before resetting it. The result is used to get the RU consumption with the tenant cost config's `PGWireEgressCost` method. **sql: surface query request units consumed due to cpu usage** This commit adds the ability for clients to estimate the number of RUs consumed by a query due to CPU usage. This is accomplished by keeping a moving average of the CPU usage for the entire tenant process, then using that to obtain an estimate for what the CPU usage *would* be if the query wasn't running. This is then compared against the actual measured CPU usage during the query's execution to get the estimate. For local flows this is done at the `connExecutor` level; for remote flows this is handled by the last outbox on the node (which gathers and sends the flow's metadata). The resulting RU estimate is added to the existing estimate from network egress and displayed in the output of `EXPLAIN ANALYZE`. **sql: surface query request units consumed by IO** This commit adds tracking for request units consumed by IO operations for all execution operators that perform KV operations. The corresponding RU count is recorded in the span and later aggregated with the RU consumption due to network egress and CPU usage. The resulting query RU consumption estimate is visible in the output of `EXPLAIN ANALYZE`. **multitenantccl: add sanity testing for ru estimation** This commit adds a sanity test for the RU estimates produced by running queries with `EXPLAIN ANALYZE` on a tenant. The test runs each test query several times with `EXPLAIN ANALYZE`, then runs all test queries without `EXPLAIN ANALYZE` and compares the resulting actual RU measurement to the aggregated estimates. For now, this test is disabled during builds because it is flaky in the presence of background activity. For this reason it should only be used as a manual sanity test. Informs cockroachdb#74441 Release note (sql change): Added an estimate for the number of request units consumed by a query to the output of `EXPLAIN ANALYZE` for tenant sessions.

This patch adds a cluster setting, `sql.tenant_ru_estimation.enabled`, which is used to determine whether tenants collect an RU estimate for queries run with `EXPLAIN ANALYZE`. This is an escape hatch so that the RU estimation logic can be more safely backported. Informs cockroachdb#74441 Release note: None

yuzefovich

Reviewed 21 of 22 files at r7, 16 of 16 files at r8, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andy-kimball)

DrewKimball · 2022-12-12T21:45:16Z

TFTRs!

DrewKimball requested a review from a team as a code owner December 7, 2022 02:22

DrewKimball requested a review from a team December 7, 2022 02:22

DrewKimball requested review from a team as code owners December 7, 2022 02:22

DrewKimball requested review from a team December 7, 2022 02:22

DrewKimball requested a review from a team as a code owner December 7, 2022 02:22

DrewKimball requested review from andy-kimball, yuzefovich and rytaft and removed request for a team December 7, 2022 02:22

rytaft approved these changes Dec 7, 2022

View reviewed changes

yuzefovich reviewed Dec 7, 2022

View reviewed changes

DrewKimball commented Dec 7, 2022

View reviewed changes

yuzefovich reviewed Dec 8, 2022

View reviewed changes

DrewKimball commented Dec 8, 2022

View reviewed changes

yuzefovich reviewed Dec 8, 2022

View reviewed changes

DrewKimball force-pushed the backport22.2-89256-92968 branch from fe6aa7c to 0c7a184 Compare December 8, 2022 17:10

DrewKimball commented Dec 8, 2022

View reviewed changes

yuzefovich approved these changes Dec 8, 2022

View reviewed changes

DrewKimball force-pushed the backport22.2-89256-92968 branch 2 times, most recently from 05d3314 to 07fef76 Compare December 9, 2022 20:43

DrewKimball requested a review from a team December 9, 2022 20:43

yuzefovich approved these changes Dec 9, 2022

View reviewed changes

DrewKimball commented Dec 9, 2022

View reviewed changes

yuzefovich reviewed Dec 9, 2022

View reviewed changes

DrewKimball added 2 commits December 12, 2022 12:20

DrewKimball force-pushed the backport22.2-89256-92968 branch from 07fef76 to 8d8941e Compare December 12, 2022 20:41

yuzefovich approved these changes Dec 12, 2022

View reviewed changes

DrewKimball merged commit 2a04626 into cockroachdb:release-22.2 Dec 12, 2022

DrewKimball deleted the backport22.2-89256-92968 branch December 12, 2022 21:46

cockroach-teamcity mentioned this pull request Dec 13, 2022

PR #93179 - **sql: surface query request units consumed by network egress** cockroachdb/docs#15804

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

release-22.2: sql: output RU estimate for EXPLAIN ANALYZE on tenants #93179

release-22.2: sql: output RU estimate for EXPLAIN ANALYZE on tenants #93179

DrewKimball commented Dec 7, 2022

blathers-crl bot commented Dec 7, 2022 •

edited by DrewKimball

Loading

cockroach-teamcity commented Dec 7, 2022

DrewKimball commented Dec 7, 2022

rytaft left a comment

andy-kimball commented Dec 7, 2022

yuzefovich left a comment

DrewKimball left a comment

yuzefovich left a comment

DrewKimball left a comment

yuzefovich left a comment

yuzefovich left a comment

DrewKimball left a comment

yuzefovich left a comment

DrewKimball commented Dec 8, 2022

DrewKimball commented Dec 9, 2022

yuzefovich left a comment

DrewKimball left a comment

yuzefovich left a comment

yuzefovich left a comment

DrewKimball commented Dec 12, 2022

release-22.2: sql: output RU estimate for EXPLAIN ANALYZE on tenants #93179

release-22.2: sql: output RU estimate for EXPLAIN ANALYZE on tenants #93179

Conversation

DrewKimball commented Dec 7, 2022

blathers-crl bot commented Dec 7, 2022 • edited by DrewKimball Loading

cockroach-teamcity commented Dec 7, 2022

DrewKimball commented Dec 7, 2022

rytaft left a comment

Choose a reason for hiding this comment

andy-kimball commented Dec 7, 2022

yuzefovich left a comment

Choose a reason for hiding this comment

DrewKimball left a comment

Choose a reason for hiding this comment

yuzefovich left a comment

Choose a reason for hiding this comment

DrewKimball left a comment

Choose a reason for hiding this comment

yuzefovich left a comment

Choose a reason for hiding this comment

yuzefovich left a comment

Choose a reason for hiding this comment

DrewKimball left a comment

Choose a reason for hiding this comment

yuzefovich left a comment

Choose a reason for hiding this comment

DrewKimball commented Dec 8, 2022

DrewKimball commented Dec 9, 2022

yuzefovich left a comment

Choose a reason for hiding this comment

DrewKimball left a comment

Choose a reason for hiding this comment

yuzefovich left a comment

Choose a reason for hiding this comment

yuzefovich left a comment

Choose a reason for hiding this comment

DrewKimball commented Dec 12, 2022

blathers-crl bot commented Dec 7, 2022 •

edited by DrewKimball

Loading