release-22.2: sql: output RU estimate for EXPLAIN ANALYZE on tenants #93179
Conversation
Thanks for opening a backport. Please check the backport criteria before merging:
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
Add a brief release justification to the body of your PR to justify this backport. Some other things to consider:
I've left the cluster setting as default-on; should it be off by default for the backport? Also, while the backport was mostly clean, I had to make a few changes to
(although it would be good to get sign-off from @yuzefovich regarding your question above)
I think it's ok to leave the setting defaulted to on for backporting to 22.2. I think we'll get this PR into 22.2.1, and presumably serverless clusters will start on 22.2.1 or later. If we want to backport to 22.1 it should probably be defaulted to off.
Reviewed 37 of 37 files at r1, 9 of 9 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @yuzefovich)
We don't need to backport this to 22.1.
Reviewed 37 of 37 files at r1, 9 of 9 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @DrewKimball)
pkg/sql/plan_node_to_row_source.go
line 126 at r1 (raw file):
// We can't use execstats.ShouldCollectStats here because the context isn't
// passed to InitWithOutput.
if flowCtx.CollectStats {
This is safe, but we should use execstats.ShouldCollectStats(flowCtx.EvalCtx.Ctx(), flowCtx.CollectStats) for consistency with other places on the 22.2 branch.
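For context, a rough sketch of what that guard checks (an approximation written for this discussion, not the actual helper in pkg/sql/execstats): collection requires both the session flag and a tracing span in the context, which is why the later question about whether the context has a span set matters.

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/tracing"
)

// shouldCollectStats approximates the guard under discussion: stats are only
// collected when the flag is set and the context carries a tracing span to
// attach them to.
func shouldCollectStats(ctx context.Context, collectStats bool) bool {
	return collectStats && tracing.SpanFromContext(ctx) != nil
}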
pkg/sql/plan_node_to_row_source.go
line 163 at r1 (raw file):
func (p *planNodeToRowSource) Start(ctx context.Context) {
	if p.FlowCtx.CollectStats {
		ctx = p.StartInternal(ctx, nodeName(p.node))
I think we should just always call StartInternal here (we did this in 7291e4d on master). The only thing I would check is whether there are any regressions in BenchmarkSQL/Insert\$\$ in pkg/bench, but I don't expect them to be there.
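Concretely, the suggestion amounts to something like the following (a sketch only, assuming the conditional shown above is simply dropped; the rest of Start on the 22.2 branch is unchanged and omitted here):

func (p *planNodeToRowSource) Start(ctx context.Context) {
	// Call StartInternal unconditionally so the processor span is always
	// set up, instead of gating it on FlowCtx.CollectStats.
	ctx = p.StartInternal(ctx, nodeName(p.node))
	// ... remainder of Start unchanged.
}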
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @yuzefovich)
pkg/sql/plan_node_to_row_source.go
line 126 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
This is safe, but we should use execstats.ShouldCollectStats(flowCtx.EvalCtx.Ctx(), flowCtx.CollectStats) for consistency with other places on the 22.2 branch.
In the tests I did, flowCtx.EvalCtx.Ctx() didn't necessarily have a span set even when I added the StartInternal call to Start. I guess maybe the difference is because there are more StartInternal calls on master now? Maybe this check should go in a different place to ensure the span is set?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @DrewKimball)
pkg/sql/plan_node_to_row_source.go
line 126 at r1 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
In the tests I did, flowCtx.EvalCtx.Ctx() didn't necessarily have a span set even when I added the StartInternal call to Start. I guess maybe the difference is because there are more StartInternal calls on master now? Maybe this check should go in a different place to ensure the span is set?
Hm, this is surprising to me - what was the reproduction that you ran into this with?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @yuzefovich)
pkg/sql/plan_node_to_row_source.go
line 126 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Hm, this is surprising to me - what was the reproduction that you ran into this with?
It wasn't anything special - I just did an insert on a table like this:
CREATE TABLE xy (x INT, y INT);
EXPLAIN ANALYZE INSERT INTO xy (SELECT t, t FROM generate_series(1, 10000) g(t));
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @DrewKimball)
pkg/sql/plan_node_to_row_source.go
line 126 at r1 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
It wasn't anything special - I just did an insert on a table like this:
CREATE TABLE xy (x INT, y INT);
EXPLAIN ANALYZE INSERT INTO xy (SELECT t, t FROM generate_series(1, 10000) g(t));
Hm, I just tried these two queries in demo --multitenant=true with a cockroach binary built on this branch plus
diff --git a/pkg/sql/plan_node_to_row_source.go b/pkg/sql/plan_node_to_row_source.go
index 18401059a5..c7789a88d3 100644
--- a/pkg/sql/plan_node_to_row_source.go
+++ b/pkg/sql/plan_node_to_row_source.go
@@ -123,7 +123,7 @@ func (p *planNodeToRowSource) InitWithOutput(
}
// We can't use execstats.ShouldCollectStats here because the context isn't
// passed to InitWithOutput.
- if flowCtx.CollectStats {
+ if execstats.ShouldCollectStats(flowCtx.EvalCtx.Ctx(), flowCtx.CollectStats) {
p.ExecStatsForTrace = p.execStatsForTrace
}
return nil
and it worked
root@127.0.0.1:26257/defaultdb> EXPLAIN ANALYZE INSERT INTO xy (SELECT t, t FROM generate_series(1, 10000) g(t));
info
--------------------------------------
planning time: 171µs
execution time: 325ms
distribution: local
vectorized: true
maximum memory usage: 70 KiB
network usage: 0 B (0 messages)
estimated RUs consumed: 10,291
...
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @DrewKimball)
pkg/sql/plan_node_to_row_source.go
line 126 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Hm, I just tried these two queries in demo --multitenant=true with a cockroach binary built on this branch plus [the diff quoted in the previous comment], and it worked [EXPLAIN ANALYZE output omitted; identical to the previous comment].
Or was the problem that RUs estimation was wrong?
Force-pushed from fe6aa7c to 0c7a184.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @yuzefovich)
pkg/sql/plan_node_to_row_source.go
line 126 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Or was the problem that RUs estimation was wrong?
Hm, maybe I tested that before changing to use StartInternal or something. Reverted to use ShouldCollectStats. Done.
pkg/sql/plan_node_to_row_source.go
line 163 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I think we should just always call StartInternal here (we did this in 7291e4d on master). The only thing I would check is whether there are any regressions in BenchmarkSQL/Insert\$\$ in pkg/bench, but I don't expect them to be there.
Done.
Reviewed 10 of 10 files at r3, 9 of 9 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andy-kimball)
TFTRs!
Force-pushed from 05d3314 to 07fef76.
@yuzefovich I had to modify the call to
LGTM, although it probably would have been cleaner to backport the whole 7291e4d without modifying the commit for RU estimation. I'm ok with merging either way.
Reviewed 18 of 18 files at r5, 9 of 9 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andy-kimball and @DrewKimball)
pkg/sql/opt/exec/execbuilder/testdata/select
line 24 at r5 (raw file):
query ITT
SELECT span, split_part(regexp_replace(message, 'pos:[0-9]*', 'pos:?'), E'\n', 1), operation
Why did we need to modify this? Looks like 7291e4d didn't do it.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andy-kimball and @yuzefovich)
pkg/sql/opt/exec/execbuilder/testdata/select
line 24 at r5 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Why did we need to modify this? Looks like 7291e4d didn't do it.
The SPAN START: values lines started including the flow ID, which made the test non-deterministic... it's not clear to me what's changed since 22.2 to cause this.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andy-kimball and @DrewKimball)
pkg/sql/opt/exec/execbuilder/testdata/select
line 24 at r5 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
The SPAN START: values lines started including the flow ID, which made the test non-deterministic... it's not clear to me what's changed since 22.2 to cause this.
Those were removed on master in ae24dd9. I think we should just backport all three commits from #87317, and then just keep the original two commits here. Will do that now.
**sql: output RU estimate for EXPLAIN ANALYZE on tenants**

This commit adds a top-level field to the output of `EXPLAIN ANALYZE` that shows the estimated number of RUs that would be consumed due to network egress to the client. The estimate is obtained by buffering each value from the query result in text format and then measuring the size of the buffer before resetting it. The result is used to get the RU consumption with the tenant cost config's `PGWireEgressCost` method.

**sql: surface query request units consumed due to cpu usage**

This commit adds the ability for clients to estimate the number of RUs consumed by a query due to CPU usage. This is accomplished by keeping a moving average of the CPU usage for the entire tenant process, then using that to obtain an estimate for what the CPU usage *would* be if the query wasn't running. This is then compared against the actual measured CPU usage during the query's execution to get the estimate. For local flows this is done at the `connExecutor` level; for remote flows this is handled by the last outbox on the node (which gathers and sends the flow's metadata). The resulting RU estimate is added to the existing estimate from network egress and displayed in the output of `EXPLAIN ANALYZE`.

**sql: surface query request units consumed by IO**

This commit adds tracking for request units consumed by IO operations for all execution operators that perform KV operations. The corresponding RU count is recorded in the span and later aggregated with the RU consumption due to network egress and CPU usage. The resulting query RU consumption estimate is visible in the output of `EXPLAIN ANALYZE`.

**multitenantccl: add sanity testing for ru estimation**

This commit adds a sanity test for the RU estimates produced by running queries with `EXPLAIN ANALYZE` on a tenant. The test runs each test query several times with `EXPLAIN ANALYZE`, then runs all test queries without `EXPLAIN ANALYZE` and compares the resulting actual RU measurement to the aggregated estimates. For now, this test is disabled during builds because it is flaky in the presence of background activity. For this reason it should only be used as a manual sanity test.

Informs cockroachdb#74441

Release note (sql change): Added an estimate for the number of request units consumed by a query to the output of `EXPLAIN ANALYZE` for tenant sessions.
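As a rough illustration of the network-egress part of this description, here is a minimal, self-contained sketch. The function, the constant, and the string-based row representation are hypothetical stand-ins for this example only, not the actual CockroachDB code; the real implementation converts the measured byte count via the tenant cost config's `PGWireEgressCost`.

package main

import "fmt"

// Hypothetical per-byte egress cost, standing in for the tenant cost model's
// PGWireEgressCost; the real conversion is not reproduced here.
const hypotheticalRUPerEgressByte = 0.0005

// estimateEgressRUs sketches the idea from the commit message: measure the
// text-format size of each result value that would be sent to the client and
// convert the total byte count into an RU estimate.
func estimateEgressRUs(rows [][]string) float64 {
	var totalBytes int
	for _, row := range rows {
		for _, val := range row {
			totalBytes += len(val) // size of the value in its text encoding
		}
	}
	return float64(totalBytes) * hypotheticalRUPerEgressByte
}

func main() {
	rows := [][]string{{"1", "1"}, {"2", "2"}, {"3", "3"}}
	fmt.Printf("estimated network egress RUs: %.4f\n", estimateEgressRUs(rows))
}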
This patch adds a cluster setting, `sql.tenant_ru_estimation.enabled`, which is used to determine whether tenants collect an RU estimate for queries run with `EXPLAIN ANALYZE`. This is an escape hatch so that the RU estimation logic can be more safely backported.

Informs cockroachdb#74441

Release note: None
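For example, disabling the estimation on a backported cluster would presumably use the standard cluster-setting toggle (the setting name comes from the commit message above; whether it should be off by default is the question raised at the top of this thread):

SET CLUSTER SETTING sql.tenant_ru_estimation.enabled = false;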
Force-pushed from 07fef76 to 8d8941e.
Reviewed 21 of 22 files at r7, 16 of 16 files at r8, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andy-kimball)
TFTRs!
Backport:
Please see individual PRs for details.
/cc @cockroachdb/release
Release justification: low-risk, high-benefit change to existing functionality