
colexec: remove vectorize_row_count_threshold session variable #53893

Closed
asubiotto opened this issue Sep 3, 2020 · 14 comments · Fixed by #62164

Comments

@asubiotto
Contributor

By default, the vectorized execution engine is only used to execute a query if the query's row count estimate is larger than `vectorize_row_count_threshold`. The threshold existed because the engine originally carried a non-negligible allocation overhead.

Since dynamic batches were introduced, that allocation overhead has been minimized, and the vectorized execution engine demonstrates a 10% speedup on a point-lookup workload (kv95), which is a worst-case scenario for the engine.

We should:

  1. Remove `vectorize_row_count_threshold`, since it is no longer useful.
  2. Update the optimizer cost model to account for the fact that vectorized execution will now be used unconditionally whenever it is enabled (which will be the default). This is much simpler than our previous plan, which was to make the optimizer decide whether vectorized execution would be beneficial.
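For concreteness, here is a minimal sketch of how the session variable is exercised; the statements are illustrative, using standard session-variable syntax:

```sql
-- With the old default of 1000, a query whose row count estimate is
-- below the threshold runs on the row-by-row engine instead.
SET vectorize_row_count_threshold = 1000;
SHOW vectorize_row_count_threshold;

-- A threshold of 0 makes the vectorized engine unconditional for all
-- supported queries (unless vectorize=off is set).
SET vectorize_row_count_threshold = 0;
```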
@asubiotto asubiotto added A-sql-execution Relating to SQL execution. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) labels Sep 3, 2020
@asubiotto
Contributor Author

Point 1 is likely out of scope for the immediate 21.1 release.

I think the main work item here is benchmarking other execution operations with a small number of rows (e.g. hash joins, aggregations) to offer more data points on the performance difference in the worst case for the vectorized engine.

@yuzefovich
Member

On a related note, do we want to remove the `vectorize=201auto` setting as well? As a reminder, it means that only "streaming" queries run via the vectorized engine, and only when the threshold is met. We're removing the threshold condition, and I think the "streaming" concept is no longer a useful differentiator either, because we now have disk spilling in all cases where the row engine has it. Thoughts? cc @jordanlewis
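For reference, a hypothetical illustration of the two settings under discussion (plain SET statements; the value names are the ones given above):

```sql
-- 20.2-era compromise: only "streaming" queries use the vectorized
-- engine, and only once the row-count threshold is met.
SET vectorize = '201auto';

-- The proposed default: always use the vectorized engine where supported.
SET vectorize = 'on';
```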

@asubiotto
Contributor Author

Yes, I think 201auto only makes sense in 20.2.

craig bot pushed a commit that referenced this issue Nov 5, 2020
56206: sql: make SupportsVectorized check light-weight r=yuzefovich a=yuzefovich

**colexec: pool allocations of some objects in the read path**

This commit pools the allocations of some objects that are created on
the simplest read path in the vectorized engine - when we have
a ColBatchScan connected with a Materializer.

Release note: None

**colbuilder: fix casting behavior for actual types mismatch**

We have recently merged a change that enforces that the colbuilder
produces an operator chain that outputs batches with the desired type
schema. This is enforced by planning casts when there is a mismatch.
Previously, we would only try planning a vectorized cast, because the
assumption was that only integers of different widths would ever need
to be cast; but as it turns out, types of the string family might also
need to be cast (e.g. `string` and `"char"` aren't identical). This is
now fixed by falling back to row-execution casting when a vectorized
cast isn't supported.

Release note: None

**sql: make SupportsVectorized check light-weight**

Previously, in order to determine whether we supported the vectorization
of a flow, we would run a fairly expensive SupportsVectorized check that
performed a "fake flow setup" by actually creating all of the components
without running them. This had a non-negligible performance impact on
KV-like workloads, so it has been optimized away in favor of a more
light-weight check that simply inspects the processor specs to determine
whether each processor core can be vectorized (either natively or by
wrapping a row-execution processor). All processor cores have been
audited to separate out those that we currently cannot wrap (usually
because they don't implement the RowSource interface). Note that if a new
processor core is introduced and the `canWrap` check is not updated, we
defensively assume that it cannot be wrapped and emit an assertion
failure error that should, hopefully, surface the fact that we need to
update the check.

Addresses: #53893.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
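As an aside on the casting commit above: the `string` vs `"char"` mismatch it describes can be observed directly from SQL. A minimal illustration (the exact output rendering is an assumption):

```sql
-- string and "char" are distinct types even though both hold text, so
-- a cast must be planned whenever the two meet in one type schema.
SELECT pg_typeof('a'::string), pg_typeof('a'::"char");
-- expected along the lines of: text | "char"
```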
craig bot pushed a commit that referenced this issue Nov 11, 2020
56540: colexec: propagate the set of needed columns in table reader spec r=yuzefovich a=yuzefovich

This commit adds the propagation of the set of needed columns via the
table reader spec, and that information is now used when setting up
ColBatchScans. The row-by-row engine is not affected, since it still
needs to set up the ProcOutputHelpers, but that work is no longer needed
in the vectorized engine, which gives us a couple percent improvement on
the KV microbenchmarks.

Addresses: #53893

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Dec 3, 2020
55713: sql: decrease vectorize_row_count_threshold to 0 r=yuzefovich a=yuzefovich

**colexec: add context to errors from builtin functions**

This commit wraps the errors that occur during builtin functions
evaluations to provide more context.

Release note: None

**sql: decrease vectorize_row_count_threshold to 0**

This commit decreases the default value of the
`vectorize_row_count_threshold` setting to 0, which means that we will
be using the vectorized engine for all supported queries. We intend to
remove the setting entirely in the 21.1 release, but for now we choose
the option of effectively disabling it, just in case.

The benchmarks have shown the following:
- -1.5% on KV95
- similar performance on TPCC
- -3% on movr
- -10% on miscellaneous operations (joins, aggregations) on small tables.

We think that such a gap is small enough to merge this change, and we
intend to optimize the vectorized engine further before making the final
call on the default value for the 21.1 release.

Additionally, this commit collects the trace metadata on the outboxes.

Informs: #53893.

Release note (sql change): The default value of the
`vectorize_row_count_threshold` setting has been decreased from 1000 to
0, meaning that from now on the vectorized engine will be used for all
supported queries regardless of the row estimate (unless
`vectorize=off` is set).

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
@yuzefovich
Member

Posting the last results we observed in #55713 for convenience:


KV95

  • 1 min ramp, 5 min duration
    • GCE: +3.68%
    • AWS: -3.84%
  • 2 min ramp, 10 min duration
    • GCE: -2.84%
    • AWS: -1.10%
  • 3 min ramp, 15 min duration
    • GCE: +3.58%
    • AWS: -3.70%
  • 5 min ramp, 30 min duration
    • GCE: -2.51%
    • AWS: -0.66%
  • 10 min ramp, 60 min duration
    • GCE: +1.37%
    • AWS: -1.87%

MOVR (with these parameters: --num-histories 10000 --num-promo-codes 10000 --num-rides 5000 --num-users 500 --num-vehicles 150; the raw "old"/"new" rows below are workload summary lines: elapsed, errors, total ops, ops/sec, then avg/p50/p95/p99/max latency in ms)

  • 1 min ramp, 5 min duration
    • GCE:
      • old: 300.0s 0 88306 294.4 3.4 1.2 12.1 44.0 96.5
      • new: 300.0s 0 86214 287.4 3.5 1.4 11.5 41.9 60.8
      • -2.37% in throughput
    • AWS:
      • old: 300.0s 0 153160 510.5 2.0 0.9 6.8 22.0 226.5
      • new: 300.0s 0 147820 492.7 2.0 0.9 6.8 22.0 226.5
      • -3.49% in throughput
  • 3 min ramp, 15 min duration
    • GCE:
      • old: 900.0s 0 260376 289.3 3.5 1.4 11.5 41.9 67.1
      • new: 900.0s 0 254317 282.6 3.5 1.4 13.1 41.9 60.8
      • -2.33% in throughput
    • AWS:
      • old: 900.0s 0 435143 483.5 2.1 1.0 6.6 22.0 226.5
      • new: 900.0s 0 417249 463.6 2.2 1.0 7.1 22.0 469.8
      • -4.11% in throughput
  • 10 min ramp, 60 min duration
    • GCE:
      • old: 3600.0s 0 1077971 299.4 3.3 1.6 10.0 37.7 209.7
      • new: 3600.0s 0 1055800 293.3 3.4 1.7 10.0 37.7 58.7
      • -2.06% in throughput
    • AWS:
      • old: 3600.0s 0 1488654 413.5 2.4 1.4 6.8 22.0 436.2
      • new: 3600.0s 0 1418218 393.9 2.5 1.4 7.6 23.1 436.2
      • -4.73% in throughput

I'll rerun KV95 and MOVR on 4989436. I'll also quickly confirm that there is still no regression on TPCC.

@yuzefovich
Member

yuzefovich commented Mar 5, 2021

New numbers (on 4989436).

KV95

  • 1 min ramp, 5 min duration
    • GCE: -2.1%
    • AWS: -2.3%
  • 2 min ramp, 10 min duration
    • GCE: -2.8%
    • AWS: -2.9%
  • 3 min ramp, 15 min duration
    • GCE: -2.2%
    • AWS: -3.0%

MOVR (with these parameters --num-histories 10000 --num-promo-codes 10000 --num-rides 5000 --num-users 500 --num-vehicles 150)

  • 1 min ramp, 5 min duration
    • GCE: -3.7%
    • AWS: -4.8%
  • 2 min ramp, 10 min duration
    • GCE: -0.0%
    • AWS: -4.4%
  • 3 min ramp, 15 min duration
    • GCE: -5.2%
    • AWS: -6.5%

TPCC still shows no significant difference.

@asubiotto
Contributor Author

Action items: get the difference with 20.2, and confirm that the roachperf graph didn't show a noticeable drop when we previously set this value to 0.

@yuzefovich
Copy link
Member

Comparison of 20.2.5 against 21.1.0.alpha3 with all default settings (`vectorize_row_count_threshold=1000` for the former and 0 for the latter). All tests were run once on a 3-node roachprod cluster with the load coming from a fourth node.


KV95

  • 1 min ramp, 5 min duration
    • GCE: +1.5%
    • AWS: +1.3%
  • 2 min ramp, 10 min duration
    • GCE: +2.1%
    • AWS: -0.4%
  • 3 min ramp, 15 min duration
    • GCE: +2.1%
    • AWS: -2.7%

TPCC 100 warehouses

  • 1 min ramp, 5 min duration
    • GCE: +2.9%
    • AWS: -0.2%
  • 2 min ramp, 10 min duration
    • GCE: -0.2%
    • AWS: -0.1%
  • 3 min ramp, 15 min duration
    • GCE: +0.0%
    • AWS: -0.3%

MOVR (with these parameters --num-histories 10000 --num-promo-codes 10000 --num-rides 5000 --num-users 500 --num-vehicles 150)

  • 1 min ramp, 5 min duration
    • GCE: -0.0%
    • AWS: -0.7%
  • 2 min ramp, 10 min duration
    • GCE: -3.2%
    • AWS: -4.2%
  • 3 min ramp, 15 min duration
    • GCE: -4.4%
    • AWS: -1.0%

@asubiotto
Contributor Author

OK, these look good to me, cc @jordanlewis @awoods187. Let's socialize it in your email thread?

@awoods187
Contributor

Why is there such a large variance across clouds? Is this a small-sample-size problem? Could we take one test, say the hypothesized worst case, and run it 10 times to reduce randomness?

@yuzefovich
Member

Yes, I'm pretty sure the variance is due to having only a single run. I'll kick off 10 runs of KV95 (3 min ramp, 15 min duration) to reduce the noise.

@yuzefovich
Member

Here is the info from 10 runs of KV95:

  • 3 min ramp, 15 min duration
    • GCE:
      • 20.2.5: avg throughput 9685.81
      • 21.1.0.alpha3 threshold=1000: avg throughput 10017.62
      • 21.1.0.alpha3 threshold=0: avg throughput 9746.22
      • 20.2.5 vs 21.1.0.alpha3 threshold=0: +0.6%
      • 21.1.0.alpha3 (threshold 1000) vs 21.1.0.alpha3 (threshold 0): -2.7%
    • AWS:
      • 20.2.5: avg throughput 10260.21
      • 21.1.0.alpha3 threshold=1000: avg throughput 10589.77
      • 21.1.0.alpha3 threshold=0: avg throughput 10349.99
      • 20.2.5 vs 21.1.0.alpha3 threshold=0: +0.9%
      • 21.1.0.alpha3 (threshold 1000) vs 21.1.0.alpha3 (threshold 0): -2.3%
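(These are simple percent changes over the average throughputs; e.g. 20.2.5 vs threshold=0 on GCE is (9746.22 − 9685.81) / 9685.81 ≈ +0.6%.)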

@yuzefovich
Member

yuzefovich commented Mar 17, 2021

It looks like there wasn't any pushback on our decision, so I think the issue of "choosing 0 as the default" can be closed.

Now the question is: do we want to remove the setting entirely in the 21.1 release? I think we should probably lean toward the safer side, keeping the setting and removing it in the 21.2 release. This would give us a more fine-grained escape hatch than `vectorize=off`. Thoughts?

Another question is whether we want to force all clusters to use the new default once they upgrade to 21.1. For some context: if the user has previously set the cluster setting explicitly to any value (including the previous default), that value will be kept after an upgrade.
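To make the upgrade concern concrete, assuming the session default is driven by a cluster setting named `sql.defaults.vectorize_row_count_threshold` (the name follows the usual `sql.defaults.*` pattern and is an assumption here):

```sql
-- Hypothetical: a cluster that explicitly pinned the old default keeps
-- that value across the upgrade to 21.1...
SET CLUSTER SETTING sql.defaults.vectorize_row_count_threshold = 1000;

-- ...so picking up the new default of 0 requires an explicit reset.
RESET CLUSTER SETTING sql.defaults.vectorize_row_count_threshold;
```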

@asubiotto
Contributor Author

Yeah, let's be safe and remove it in the 21.2 release. I think we shouldn't force clusters to use the new vectorized setting if they set it explicitly.

@yuzefovich
Member

Cool, that was my thinking as well. I'll open up a PR to remove the setting on master and will not backport it to 21.1.
