
sql: decrease vectorize_row_count_threshold to 0 #55713

Merged: 2 commits, Dec 3, 2020

Conversation

yuzefovich
Member

@yuzefovich yuzefovich commented Oct 19, 2020

colexec: add context to errors from builtin functions

This commit wraps the errors that occur during builtin function
evaluations to provide more context.

Release note: None

sql: decrease vectorize_row_count_threshold to 0

This commit decreases the default value for the
vectorize_row_count_threshold setting to 0, which means that we will be
using the vectorized engine for all supported queries. We intend to
remove the setting entirely in the 21.1 release, but for now we choose
the option of effectively disabling it, just in case.

The benchmarks have shown the following:

  • -1.5% on KV95
  • similar performance on TPCC
  • -3% on movr
  • -10% on miscellaneous operations (joins, aggregations) on small tables.

We think that such a gap is small enough to merge this change, and we
intend to optimize the vectorized engine further before making the
final call on the default value for the 21.1 release.

Additionally, this commit collects the trace metadata on the outboxes.

Informs: #53893.

Release note (sql change): The default value for the
vectorize_row_count_threshold setting has been decreased from 1000 to
0, meaning that from now on the vectorized engine will always be used
for all supported queries regardless of the row estimate (unless
vectorize=off is set).
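
For illustration, here is a minimal sketch of what the threshold gating amounts to conceptually; the function and names below are hypothetical stand-ins, not the actual planner code:

```go
package main

import "fmt"

// chooseEngine sketches the gating the setting controls: with a threshold
// of 0 the row-count estimate can never fall below it, so every supported
// query runs through the vectorized engine. Illustrative names only.
func chooseEngine(supported bool, rowCountEstimate, threshold uint64) string {
	if !supported {
		return "row-by-row"
	}
	if rowCountEstimate < threshold {
		// Small scans used to fall back to the row-by-row engine.
		return "row-by-row"
	}
	return "vectorized"
}

func main() {
	fmt.Println(chooseEngine(true, 10, 1000)) // old default: row-by-row
	fmt.Println(chooseEngine(true, 10, 0))    // new default: vectorized
}
```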

@yuzefovich yuzefovich added the do-not-merge label (bors won't merge a PR with this label) Oct 19, 2020
@yuzefovich yuzefovich requested review from a team and dt and removed request for a team October 19, 2020 21:06
@cockroach-teamcity
Member

This change is Reviewable

@yuzefovich yuzefovich removed the request for review from dt October 19, 2020 21:07
@yuzefovich yuzefovich requested a review from a team as a code owner October 19, 2020 23:48
@yuzefovich yuzefovich removed the request for review from a team October 19, 2020 23:48
@yuzefovich yuzefovich changed the title from "WIP sql: remove vectorize_row_count_threshold" to "sql: remove vectorize_row_count_threshold" Oct 21, 2020
@yuzefovich
Member Author

yuzefovich commented Oct 22, 2020

The benchmarks.


3 node roachprod cluster

  1. TPCC, 100 warehouses, 1 min ramp, 5 min duration: -0.748% in tpmC
  • old
_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  300.0s     1203.2  93.6%     28.3     26.2     37.7     46.1     71.3    151.0
  • new
_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  300.0s     1194.2  92.9%     29.9     28.3     39.8     48.2     65.0    117.4
  2. KV95, 1 min ramp, 5 min duration: -6.45% in throughput
  • old
Highest sequence written: 126430. Can be passed as --write-seq=R126430 to the next run.

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  300.0s        0        2142885         7143.0      1.0      1.0      1.8      3.0    159.4  read

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  300.0s        0         113060          376.9      2.5      2.4      4.1      7.3     30.4  write

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  300.0s        0        2255945         7519.8      1.1      1.0      2.4      3.5    159.4  
  • new
Highest sequence written: 118557. Can be passed as --write-seq=R118557 to the next run.

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  300.0s        0        2004183         6680.6      1.1      1.0      2.0      3.5    151.0  read

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  300.0s        0         106178          353.9      2.6      2.5      4.1      6.6     23.1  write

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  300.0s        0        2110361         7034.5      1.1      1.0      2.5      3.9    151.0  
  3. KV95, 5 min ramp, 30 min duration: -8.87% in throughput
  • old
Highest sequence written: 843826. Can be passed as --write-seq=R843826 to the next run.

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 1800.0s        0       14648469         8138.0      0.9      0.9      1.6      2.4     50.3  read

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 1800.0s        0         771129          428.4      2.1      2.0      3.1      4.7     56.6  write

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
 1800.0s        0       15419598         8566.4      0.9      0.9      1.9      2.8     56.6 
  • new
Highest sequence written: 774506. Can be passed as --write-seq=R774506 to the next run.

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 1800.0s        0       13347899         7415.5      1.0      0.9      1.7      2.9    142.6  read

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 1800.0s        0         703820          391.0      2.3      2.2      3.7      5.8     31.5  write

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
 1800.0s        0       14051719         7806.5      1.0      1.0      2.1      3.3    142.6 

I trust these numbers a lot more than the ones I got here because benchmarking on a laptop can have significant variance in performance. My current guess is that this difference is mostly caused by the fact that we're pooling tableReader objects as well as reusing some of the slices in tableInfo objects in the row-by-row engine. I looked briefly into how we could do the same in the vectorized engine, and in particular I wanted to remove a lot of duplication between cFetcher and row.fetcher (as well as cTableInfo and tableInfo), but it's not that easy. cc @cockroachdb/sql-execution
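
For reference, a minimal sketch of the sync.Pool-style object reuse described above; the tableReader type and helpers here are hypothetical stand-ins, not the actual row-engine code:

```go
package main

import (
	"fmt"
	"sync"
)

// tableReader stands in for the kind of per-query object discussed above.
type tableReader struct {
	rows []int64 // a reusable scratch slice
}

// readerPool illustrates the reuse pattern: instead of allocating a fresh
// reader (and its slices) for every query, finished readers are reset and
// returned to the pool.
var readerPool = sync.Pool{
	New: func() interface{} { return &tableReader{} },
}

func getReader() *tableReader {
	return readerPool.Get().(*tableReader)
}

func putReader(r *tableReader) {
	r.rows = r.rows[:0] // keep the backing array, drop the contents
	readerPool.Put(r)
}

func main() {
	r := getReader()
	r.rows = append(r.rows, 1, 2, 3)
	fmt.Println(len(r.rows)) // 3
	putReader(r) // the allocation can be reused by the next query
}
```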

@jordanlewis
Member

Thanks for running these benchmarks. What setup did you use? I guess we're not quite ready to merge this yet.

@yuzefovich
Member Author

It was a 3 node roachprod cluster with the default hardware options in GCE, and the load was coming from node 1. Is this what you mean by "setup"?

I took a look at profiles, and nothing really stood out, and the pooling behavior is the main difference I can see right now. I'll take a stab tomorrow at cleaning up the fetchers a bit.

@jordanlewis
Member

Yes, I meant hardware setup. Thanks!

@awoods187
Contributor

awoods187 commented Oct 22, 2020 via email

@yuzefovich
Member Author

yuzefovich commented Oct 23, 2020

One pretty obvious thing that I missed, and that is likely to explain the majority of the difference, is that without vectorize_row_count_threshold we are now always running the SupportsVectorized check (previously, we would do so only if the threshold was met). That check especially shows up on a KV-like workload because we're essentially instantiating the whole flow twice while only running it the second time (the flow runtime is comparable to the flow setup time, so doing the setup twice is too expensive), and in particular that means that we're creating a ColBatchScan and initializing a cFetcher twice too - cFetcher.Init is pretty heavy.

So I think that apart from the pooling of objects we'll need to do something about this check. I feel like the most viable option is to make it a lot more lightweight: instead of creating the whole fake flow with all of the components, we could probably just inspect the processor specs and see whether there are any that we refuse to wrap; if all of the cores are either supported "natively" or can be wrapped, then we'll assume that we can successfully set up the whole flow. I think when the SupportsVectorized check was introduced, we had a lot more edge cases that needed to be handled, so we went with the fake flow setup, but now that might no longer be necessary. It does feel a bit scary to remove that check though.
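
Roughly, a spec-inspection-only check could look something like the sketch below; the core names and the two sets are hypothetical, for illustration only, not the actual SupportsVectorized code:

```go
package main

import "fmt"

// coreKind stands in for a processor core in a flow spec.
type coreKind string

// Cores the vectorized engine handles natively vs. ones it can only run
// by wrapping the row-by-row processor (illustrative sets).
var (
	nativeCores    = map[coreKind]bool{"tableReader": true, "hashJoiner": true, "aggregator": true}
	wrappableCores = map[coreKind]bool{"sampler": true}
)

// supportsVectorized walks the processor specs and refuses only if some
// core can neither run natively nor be wrapped, instead of setting up a
// throwaway flow.
func supportsVectorized(specs []coreKind) bool {
	for _, core := range specs {
		if nativeCores[core] || wrappableCores[core] {
			continue
		}
		return false
	}
	return true
}

func main() {
	fmt.Println(supportsVectorized([]coreKind{"tableReader", "aggregator"})) // true
	fmt.Println(supportsVectorized([]coreKind{"tableReader", "backfiller"})) // false
}
```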

@yuzefovich
Member Author

yuzefovich commented Oct 23, 2020

Hm, to my surprise a prototype with some pooling and a more lightweight support check (#55883) shows a similar big drop in performance between vectorize_row_count_threshold=1000 and vectorize_row_count_threshold=0. I guess we'll have to abandon this PR for now :/

Contributor

@knz knz left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)


pkg/cli/cli_test.go, line 531 at r1 (raw file):

	c.RunWithArgs([]string{`sql`, `--set`, `unknownoption`, `-e`, `select 123 as "123"`})
	// Check that partial results + error get reported together.
	c.RunWithArgs([]string{`sql`, `-e`, `select 1/(@1-2) from generate_series(1,3)`})

Out of curiosity, why did you need to change this?

Member Author

@yuzefovich yuzefovich left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/cli/cli_test.go, line 531 at r1 (raw file):

Previously, knz (kena) wrote…

Out of curiosity, why did you need to change this?

Currently the test will fail if run via the vectorized engine because a division by zero error will occur sooner (it occurs on the third row, but we have batches of growing capacity starting at 1, so the batch sizes will be 1, 2, 4, ..., 1024, 1024, ..., and the error occurs when processing the second batch at which point only 1 row has been returned to the client whereas the test expects 2 rows).
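
For illustration, a small sketch of how the growing batch capacities interact with an error; the helper and errRow parameter below are hypothetical, just to show the arithmetic:

```go
package main

import "fmt"

// rowsReturnedBeforeError sketches the effect described above: batches
// grow as 1, 2, 4, ... up to 1024 rows, and a batch is emitted only after
// every row in it has been evaluated, so an error inside a batch hides all
// rows of that batch. errRow is 1-based.
func rowsReturnedBeforeError(errRow int) int {
	returned := 0
	batchSize := 1
	row := 1
	for {
		// If the failing row falls inside the current batch, nothing from
		// this batch reaches the client.
		if errRow < row+batchSize {
			return returned
		}
		returned += batchSize
		row += batchSize
		if batchSize < 1024 {
			batchSize *= 2
		}
	}
}

func main() {
	// With the error on the third row, row-at-a-time execution returns 2
	// rows first, but batched execution returns only the 1-row first batch
	// (the error hits while evaluating the 2-row second batch).
	fmt.Println(rowsReturnedBeforeError(3)) // 1
}
```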

Contributor

@knz knz left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)


pkg/cli/cli_test.go, line 531 at r1 (raw file):

Previously, yuzefovich wrote…

Currently the test will fail if run via the vectorized engine because a division by zero error will occur sooner (it occurs on the third row, but we have batches of growing capacity starting at 1, so the batch sizes will be 1, 2, 4, ..., 1024, 1024, ..., and the error occurs when processing the second batch at which point only 1 row has been returned to the client whereas the test expects 2 rows).

I think it's worth explaining this in a comment here.

@asubiotto
Contributor

Let's discuss this and how to move forward during the next meeting.

@yuzefovich
Member Author

A newer prototype (that is almost production ready) shows an improvement: the performance hit on KV95 if we remove the SupportsVectorized check in favor of a more lightweight one decreases from 6-8% to 3-4%.

@yuzefovich yuzefovich changed the title from "sql: remove vectorize_row_count_threshold" to "sql: decrease vectorize_row_count_threshold to 0" Oct 28, 2020
@yuzefovich
Member Author

More KV95 numbers:

  • 1 min ramp, 5 min duration
    • GCE: +3.68%
    • AWS: -3.84%
  • 2 min ramp, 10 min duration
    • GCE: -2.84%
    • AWS: -1.10%
  • 3 min ramp, 15 min duration
    • GCE: +3.58%
    • AWS: -3.70%
  • 5 min ramp, 30 min duration
    • GCE: -2.51%
    • AWS: -0.66%
  • 10 min ramp, 60 min duration
    • GCE: +1.37%
    • AWS: -1.87%

@jordanlewis
Member

LGTM! Ship it!

@jordanlewis
Member

Err, I guess we decided to also see TPCC numbers, right? I'm getting eager.

@yuzefovich
Member Author

Yes, I'll be rerunning TPCC today as well as MovR to see what's up, and if those look good, we'll merge tomorrow I think.

@yuzefovich
Member Author

yuzefovich commented Dec 2, 2020

TPCC100 numbers again show no significant difference (columns are elapsed, tpmC, efc, then avg / p50 / p90 / p95 / p99 / pMax latencies in ms):

  • 1 min ramp, 5 min duration
    • GCE:
      • old: 300.0s 1193.2 92.8% 24.1 23.1 30.4 37.7 54.5 92.3
      • new: 300.0s 1194.8 92.9% 23.9 23.1 30.4 37.7 52.4 96.5
    • AWS:
      • old: 300.0s 1204.8 93.7% 16.4 15.7 19.9 26.2 39.8 104.9
      • new: 300.0s 1198.6 93.2% 17.3 16.3 22.0 27.3 41.9 226.5
  • 3 min ramp, 15 min duration
    • GCE:
      • old: 900.0s 1233.3 95.9% 21.9 22.0 26.2 30.4 46.1 121.6
      • new: 900.0s 1228.7 95.5% 21.9 21.0 27.3 31.5 48.2 88.1
    • AWS:
      • old: 900.0s 1237.5 96.2% 15.9 15.2 18.9 22.0 35.7 243.3
      • new: 900.0s 1235.9 96.1% 15.1 14.7 17.8 21.0 33.6 100.7
  • 10 min ramp, 60 min duration
    • GCE:
      • old: 3600.0s 1251.7 97.3% 23.1 23.1 27.3 31.5 48.2 130.0
      • new: 3600.0s 1252.0 97.4% 22.5 22.0 26.2 30.4 48.2 151.0
    • AWS:
      • old: 3600.0s 1255.1 97.6% 15.3 14.7 17.8 19.9 33.6 436.2
      • new: 3600.0s 1253.5 97.5% 15.7 15.2 17.8 19.9 33.6 226.5

The only interesting detail is that during the 60 min run with the old config on AWS one node apparently died and "new order" transactions were returning an error. I didn't notice it before wiping the cluster, so I'm not sure what happened.


And here are the numbers for the movr workload (columns are elapsed, errors, ops(total), ops/sec(cum), then avg / p50 / p95 / p99 / pMax latencies in ms; run with these parameters: --num-histories 10000 --num-promo-codes 10000 --num-rides 5000 --num-users 500 --num-vehicles 150):

  • 1 min ramp, 5 min duration
    • GCE:
      • old: 300.0s 0 88306 294.4 3.4 1.2 12.1 44.0 96.5
      • new: 300.0s 0 86214 287.4 3.5 1.4 11.5 41.9 60.8
      • -2.37% in throughput
    • AWS:
      • old: 300.0s 0 153160 510.5 2.0 0.9 6.8 22.0 226.5
      • new: 300.0s 0 147820 492.7 2.0 0.9 6.8 22.0 226.5
      • -3.49% in throughput
  • 3 min ramp, 15 min duration
    • GCE:
      • old: 900.0s 0 260376 289.3 3.5 1.4 11.5 41.9 67.1
      • new: 900.0s 0 254317 282.6 3.5 1.4 13.1 41.9 60.8
      • -2.33% in throughput
    • AWS:
      • old: 900.0s 0 435143 483.5 2.1 1.0 6.6 22.0 226.5
      • new: 900.0s 0 417249 463.6 2.2 1.0 7.1 22.0 469.8
      • -4.11% in throughput
  • 10 min ramp, 60 min duration
    • GCE:
      • old: 3600.0s 0 1077971 299.4 3.3 1.6 10.0 37.7 209.7
      • new: 3600.0s 0 1055800 293.3 3.4 1.7 10.0 37.7 58.7
      • -2.06% in throughput
    • AWS:
      • old: 3600.0s 0 1488654 413.5 2.4 1.4 6.8 22.0 436.2
      • new: 3600.0s 0 1418218 393.9 2.5 1.4 7.6 23.1 436.2
      • -4.73% in throughput

@yuzefovich yuzefovich removed the do-not-merge label (bors won't merge a PR with this label) Dec 2, 2020
@yuzefovich
Member Author

Updated all the benchmark numbers and the commit message. PTAL, @asubiotto @jordanlewis

@yuzefovich yuzefovich force-pushed the vec-threshold branch 2 times, most recently from 220b010 to a138cea on December 2, 2020 21:25
Contributor

@asubiotto asubiotto left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz)

@yuzefovich
Member Author

TFTR!

bors r+

@craig
Contributor

craig bot commented Dec 3, 2020

Build failed:

@yuzefovich
Member Author

The test target timed out, and the testrace target failure is a flake that will be fixed shortly. I'll amend the commit and force-push in order to run CI again before merging, to see whether the timeout is related or not.

@yuzefovich
Member Author

Ok, test target passed, testrace target should be fixed.

bors r+

@craig
Contributor

craig bot commented Dec 3, 2020

Build succeeded:
