executor: using ToHashKey to check equality of decimal when count(distinct) #9901

Merged 4 commits on Mar 27, 2019

Contributor

@XuHuaiyu XuHuaiyu commented Mar 26, 2019

What problem does this PR solve?

fix #9900

What is changed and how it works?

Use MyDecimal.ToHashKey to check the equality of decimal values when evaluating count(distinct).

Before this commit, we wrap a cast to decimal(52, 31) around both children of the union.
After the cast, the value from the int column becomes {9 3 3 false [1 0 0 0 0 0 0 0 0]},
while the value from the decimal column becomes {3 3 3 false [1 0 0 0 0 0 0 0 0]}.
The two structs differ in their first field (9 versus 3), so they are considered unequal when compared directly.
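
To illustrate why a direct comparison fails while a normalized key works, here is a minimal Go sketch. The `dec` type and the `hashKey` function are hypothetical stand-ins written for this example; they are not TiDB's MyDecimal or its ToHashKey, and they only mimic the idea that the same number can carry different digit-count metadata.

```go
package main

import (
	"fmt"
	"strings"
)

// dec is a hypothetical, simplified fixed-point decimal (NOT TiDB's MyDecimal).
// The same number can be stored with different declared digit counts.
type dec struct {
	digitsInt  int    // declared number of integer digits
	digitsFrac int    // declared number of fractional digits
	negative   bool   // sign flag
	digits     string // integer digits followed by fractional digits, zero-padded
}

// hashKey is a hypothetical canonical form: it drops leading zeros from the
// integer part and trailing zeros from the fractional part, so equal numbers
// map to the same key regardless of their declared widths.
func hashKey(d dec) string {
	intPart := strings.TrimLeft(d.digits[:d.digitsInt], "0")
	fracPart := strings.TrimRight(d.digits[d.digitsInt:], "0")
	if intPart == "" {
		intPart = "0"
	}
	sign := ""
	if d.negative && (intPart != "0" || fracPart != "") {
		sign = "-" // never emit "-0"
	}
	if fracPart == "" {
		return sign + intPart
	}
	return sign + intPart + "." + fracPart
}

func main() {
	// The value 1, once with 9 declared integer digits and once with 3.
	fromInt := dec{digitsInt: 9, digitsFrac: 3, digits: "000000001000"}
	fromDec := dec{digitsInt: 3, digitsFrac: 3, digits: "001000"}

	fmt.Println(fromInt == fromDec)                   // false: the metadata differs
	fmt.Println(hashKey(fromInt) == hashKey(fromDec)) // true: same canonical key
}
```

The struct comparison looks at every field, including the declared widths, whereas the key depends only on the numeric value.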

Check List

Tests

  • Integration test

Code changes

  • Has exported function/method change

Side effects

  • Possible performance regression

Before this commit, we used the value itself as the hash key, without any further evaluation.
This commit may cause some performance regression, since we now invoke MyDecimal.ToHashKey for every decimal value.

I tested it using the TPC-H 10G dataset.
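
The trade-off can be sketched with a distinct-count over a set. The example below is an illustration under stated assumptions, not TiDB's executor code: the function names are hypothetical, and Go's standard `math/big.Rat` stands in for the decimal type, with `RatString` playing the role of the canonical key. The canonical version does extra work for every row, which is where a regression of this kind would come from.

```go
package main

import (
	"fmt"
	"math/big"
)

// countDistinctRaw uses each value's stored form directly as the set key.
// It is cheap, but "1" and "1.000" count as two different values.
func countDistinctRaw(values []string) int {
	seen := make(map[string]struct{}, len(values))
	for _, v := range values {
		seen[v] = struct{}{}
	}
	return len(seen)
}

// countDistinctCanonical converts every value to a canonical key first.
// The per-row parsing and formatting is the extra cost paid for correctness.
func countDistinctCanonical(values []string) (int, error) {
	seen := make(map[string]struct{}, len(values))
	for _, v := range values {
		r, ok := new(big.Rat).SetString(v)
		if !ok {
			return 0, fmt.Errorf("invalid decimal %q", v)
		}
		// RatString yields "a" for integers and "a/b" otherwise, in lowest terms.
		seen[r.RatString()] = struct{}{}
	}
	return len(seen), nil
}

func main() {
	values := []string{"1", "1.000", "1.0", "2.50"}
	fmt.Println(countDistinctRaw(values)) // 4
	n, err := countDistinctCanonical(values)
	fmt.Println(n, err) // 2 <nil>
}
```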

tidb [10.0.1.5]> desc lineitem;
+-----------------+---------------+------+------+---------+-------+
| Field           | Type          | Null | Key  | Default | Extra |
+-----------------+---------------+------+------+---------+-------+
...
| L_QUANTITY      | decimal(15,2) | NO   |      | NULL    |       |
...
+-----------------+---------------+------+------+---------+-------+

tidb [10.0.1.5]> select count(L_QUANTITY) from lineitem;
+-------------------+
| count(L_QUANTITY) |
+-------------------+
|          59986052 |
+-------------------+

tidb [10.0.1.5]> select count(distinct L_QUANTITY) from lineitem;
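| Before this commit (agg phase / total phase) | After this commit (agg phase / total phase) | Performance regression (agg phase / total phase) |
|----------------------------------------------|---------------------------------------------|--------------------------------------------------|
| 12.9s / 24.5s                                | 16.5s / 28.1s                               | 14.69% / 25.9%                                   |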

Related changes

  • Need to cherry-pick to the release branch

@XuHuaiyu XuHuaiyu added the type/bugfix label Mar 26, 2019

codecov bot commented Mar 26, 2019

Codecov Report

Merging #9901 into master will increase coverage by 0.0138%.
The diff coverage is 50%.

@@               Coverage Diff                @@
##             master      #9901        +/-   ##
================================================
+ Coverage   77.2034%   77.2172%   +0.0138%     
================================================
  Files           405        405                
  Lines         81635      81667        +32     
================================================
+ Hits          63025      63061        +36     
+ Misses        13939      13929        -10     
- Partials       4671       4677         +6

@XuHuaiyu
Contributor Author

/run-all-tests tidb-test=pr/581

Member

@winoros winoros left a comment

lgtm

What's the reason that we don't call EncodeKey here? For performance?

@XuHuaiyu
Contributor Author

They are almost the same; calling ToHashKey directly saves the cost of an extra function invocation. @winoros

Contributor

@eurekaka eurekaka left a comment

LGTM

Labels
  • sig/execution (SIG execution)
  • status/LGT2 (Indicates that a PR has LGTM 2.)
  • type/bugfix (This PR fixes a bug.)
Development

Successfully merging this pull request may close these issues.

wrong result when select count/sum(distinct ) from int_col union all decimal_col
3 participants