Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: investigate performance regressions on some queries #47058

Closed
yuzefovich opened this issue Apr 5, 2020 · 8 comments
Closed

sql: investigate performance regressions on some queries #47058

yuzefovich opened this issue Apr 5, 2020 · 8 comments
Assignees
Labels
C-performance Perf of queries or internals. Solution not expected to change functional behavior.

Comments

@yuzefovich
Copy link
Member

yuzefovich commented Apr 5, 2020

Our friends from Apollo team at Georgia Tech have found multiple queries on which our 19.1 version performs better than 19.2 which indicates some performance regressions. I reran all reported queries on a single node setup of 19.1.8, 19.2.5, and 20.1.0.beta-4-dirty and observed the regressions on most of the queries (all times are in seconds).

Case # 19.1.8 19.2.5 20.1.0
0 0.875 1.922 1.815
1 1.110 2.896 3.118
3 0.759 1.856 1.810
5 0.057 0.299 0.304
6 0.196 0.301 0.314
9 0.283 0.665 0.630
10 1.201 3.279 3.598
11 1.097 3.326 3.391
12 0.310 0.686 0.627
14 0.747 3.181 2.997
16 0.965 1.957 1.966
18 2.037 2.346 2.323
19 0.758 3.190 2.959
22 1.019 1.923 2.018
24 2.031 2.303 2.375

Note that query 6 runs in about the same time in 20.1.0 as in 19.2.5 if we set distsql=on. Further investigation is needed for why the behavior of distsql=auto has changed between 19.2.5 and 20.1. This has been addressed by #47365, and I updated the run times.

I have not looked into any other regressions.

The repro instructions can be found in our Apollo slack channel.

@yuzefovich yuzefovich added the C-performance Perf of queries or internals. Solution not expected to change functional behavior. label Apr 5, 2020
@awoods187
Copy link
Contributor

cc @RaduBerinde

craig bot pushed a commit that referenced this issue Apr 11, 2020
47365: sql: ignore soft limits on scan nodes for distsql planning r=yuzefovich a=yuzefovich

We have added propagation of soft limits in 20.1 release, and this
causes some of the queries that used to run via DistSQL with
`distsql=auto` to get a "should not distribute" recommendation during
distsql physical planning. However, this can cause an egregious
performance regression on some queries from 19.2 version. In order to
keep the decision whether to distribute scans or not the same, we will
be ignoring the soft limits on scan nodes for now.

Addresses: #47058.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
@asubiotto asubiotto self-assigned this Apr 14, 2020
@asubiotto
Copy link
Contributor

asubiotto commented Apr 16, 2020

Wrote a script to check for plan differences in different versions. Here is the output for these queries. In each query, the top plan is 19.1.8 and the bottom is 20.1.0:
diffoutput.txt

Some differences are due to the output changing between versions. I haven't taken a close look at them yet but wanted to upload them in case anyone saw anything interesting. In query 1, for example, there seems to be an extra ORDER BY in the window function as well as an extra render stage.

@asubiotto
Copy link
Contributor

And here is the explain analyze output:
diffoutput_analyze.txt

@asubiotto
Copy link
Contributor

Looking at the explain analyze output of some queries that have window functions it seems that a lot of these regressions can be explained by the fact that we previously did not respect the work limit in the windower processor. We seem to be properly spilling to disk in 20.1 so I would classify these as expected regressions. Queries that include window functions here are: 1, 10, 11, 14, 18, 19, 24 so we can eliminate those.

One thing for the optimizer team cc @RaduBerinde: Query 5 seems to use a merge join instead of a lookup join.

Still needs further investigation: 0, 3, 6, 9, 12, 16, and 22. The EXPLAIN (VERBOSE) output doesn't seem different for these and the EXPLAIN ANALYZE doesn't seem to display properly and for some reason the encoded links all have a bunch of repeated "A"s (example: Query 6 explain analyze plan)

@awoods187
Copy link
Contributor

thanks for taking a look at these @asubiotto! I'm curious about query 5 @rytaft or @RaduBerinde

@rytaft
Copy link
Collaborator

rytaft commented Apr 29, 2020

I've looked into this, and I'm pretty sure that query 5 changed because we've made lookup joins more expensive since 19.1. In particular, #40248 and #43003 probably had an impact. I haven't actually done a git bisect, but I'm pretty sure that's the cause.

Part of the problem here is that our row count estimate of the distinct table after filtering is incorrect since we can't determine that the filter on district is actually false. We could do that pretty easily by creating a rule that converts exists(<subquery known to return 0 rows>) to false. I'll open an issue to create such a rule.

rytaft added a commit to rytaft/cockroach that referenced this issue Apr 29, 2020
This commit adds a new rule, EliminateExistsZeroRows, which
converts an Exists subquery to False when it's known
that the input produces zero rows.

Informs cockroachdb#47058

Release note (performance improvement): The optimizer can now
detect when an Exists subquery can be eliminated because the input
has zero rows. This leads to better plans in some cases.
@rytaft
Copy link
Collaborator

rytaft commented Apr 29, 2020

Actually -- just went ahead and submitted a PR since it was so trivial: #48162. I confirmed that the resulting plan for query 5 is much better.

craig bot pushed a commit that referenced this issue Apr 29, 2020
46992: sql: Add Logical Column ID field to ColumnDescriptor r=rohany a=RichardJCai

The LogicalColumnID field mimics the ColumnID field however LogicalColumnID may be swapped
between two columns whereas ColumnID cannot. LogicalColumnID is referenced for virtual tables
(pg_catalog, information_schema) and most notably affects column ordering for SHOW COLUMNS.

This LogicalColumnID field support swapping the order of two columns - currently only used for
ALTER COLUMN TYPE when a shadow column is created and swapped with it's original column.

Does not affect existing behaviour.

Release note: None

47449: cli: add --cert-principal-map to client commands r=petermattis a=petermattis

Add support for the `--cert-principal-map` flag to the certs and client
commands. Anywhere we were accepting the `--certs-dir` flag, we now also
accept the `--cert-principal-map` flag.

Fixes #47300

Release note (cli change): Support the `--cert-principal-map` flag in
the `cert *` and "client" commands such as `sql`.

48138: keys: support splitting Ranges on tenant-id prefixed keys r=nvanbenschoten a=nvanbenschoten

Fixes #48122.
Relates to #47903.
Relates to #48123.

This PR contains a series of small commits that work towards the introduction of tenant-id prefixed keyspaces and begin the removal of some `keys.TODOSQLCodec` instances. This should be the only time we need to touch C++ throughout this work.

48160: storage,libroach: Check for MaxKeys when reading from intent history r=itsbilal a=itsbilal

We weren't checking for MaxKeys (or TargetBytes) being reached
in the case where we read from intent history in the MVCC scanner.
All other cases go through addAndAdvance(), which had these checks.

Almost certainly fixes #46652. Would be very surprised if it was
something else.

Release note (bug fix): Fixes a bug where a read operation in a transaction
would error out for exceeding the maximum count of results returned.

48162: opt: add rule to eliminate Exists when input has zero rows r=rytaft a=rytaft

This commit adds a new rule, `EliminateExistsZeroRows`, which
converts an `Exists` subquery to False when it's known
that the input produces zero rows.

Informs #47058

Release note (performance improvement): The optimizer can now
detect when an Exists subquery can be eliminated because the input
has zero rows. This leads to better plans in some cases.

Co-authored-by: richardjcai <caioftherichard@gmail.com>
Co-authored-by: Peter Mattis <petermattis@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
@yuzefovich
Copy link
Member Author

I think we addressed the two main problems identified by this issue (#47365, #48162), and most of the remaining regressions were expected given the fix to the window functions, so I'm closing this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-performance Perf of queries or internals. Solution not expected to change functional behavior.
Projects
None yet
Development

No branches or pull requests

4 participants