[CH-186] support RangePartitioning #189

Merged (12 commits) on Nov 22, 2022

@lgbo-ustc lgbo-ustc commented Nov 8, 2022

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
see issue #186

This PR implements RangePartitionNativeSplitter, which supports running range partitioning in the native backend.
It is mainly intended to improve the performance of ORDER BY clauses.
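
For readers new to the technique, here is a minimal sketch of how range partitioning assigns rows to partitions. It is not the code in this PR: `rangePartitionOf` and `splitByRange` are hypothetical helpers, keys are plain 64-bit integers, and the range bounds are assumed to be already sampled and sorted. Keys equal to a bound go to the lower partition, which is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not the PR's RangePartitionNativeSplitter.
// `bounds` holds the sorted upper bounds of the first N-1 partitions.
std::size_t rangePartitionOf(std::int64_t key, const std::vector<std::int64_t> & bounds)
{
    assert(std::is_sorted(bounds.begin(), bounds.end()));
    // Index of the first bound >= key; keys above every bound land in the last partition.
    auto it = std::lower_bound(bounds.begin(), bounds.end(), key);
    return static_cast<std::size_t>(it - bounds.begin());
}

// Splits keys into bounds.size() + 1 buckets. Sorting each bucket locally and
// concatenating the buckets in order then yields a globally sorted result.
std::vector<std::vector<std::int64_t>> splitByRange(
    const std::vector<std::int64_t> & keys, const std::vector<std::int64_t> & bounds)
{
    std::vector<std::vector<std::int64_t>> partitions(bounds.size() + 1);
    for (std::int64_t key : keys)
        partitions[rangePartitionOf(key, bounds)].push_back(key);
    return partitions;
}
```

Because every key in partition i is no greater than every key in partition i + 1, a global ORDER BY only needs a local sort per partition after the shuffle.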

Benchmark results from running ORDER BY queries:

| Data type | Rows | Elapsed time with Gluten (s) | Elapsed time with Spark (s) |
|-----------|-----------|------|------|
| long | 69498916 | 6.9 | 8.5 |
| long | 208496748 | 13.5 | 22.8 |
| string | 69498916 | 10.5 | 26.5 |
| string | 208496748 | 23.9 | 71.3 |

Information about CI checks: https://clickhouse.tech/docs/en/development/continuous-integration/
close #186

@kyligence-git
Collaborator

Can one of the admins verify this patch?

@lgbo-ustc
Author

This PR depends on apache/incubator-gluten#524.

@lgbo-ustc
Author

test this please with 524

Four outdated, resolved review threads on utils/local-engine/Shuffle/NativeSplitter.cpp
@lgbo-ustc lgbo-ustc marked this pull request as ready for review November 10, 2022 01:52
@zhanglistar

As mentioned in the review comments, there is one more optimization to do:

  1. The ClickHouse backend consumes data from the JVM but does not sort each block immediately. Instead, it only reads the data and defers all work until the memory limit is reached or all data has been consumed.
  2. Sort the buffered data. If the memory limit was reached first, flush the sorted data to disk and go back to step 1; otherwise, return the output data stream.
  3. Merge-sort the flushed data, then return the output data stream.
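
The three steps above can be sketched roughly as follows. This is only an illustration of the proposal, not code from the backend: `DeferredSorter` is a hypothetical class, the memory limit is approximated by a row count, and each in-memory "run" stands in for a sorted file flushed to disk.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <tuple>
#include <utility>
#include <vector>

using Block = std::vector<std::int64_t>;

// Hypothetical sketch: a "run" stands in for a sorted file flushed to disk.
class DeferredSorter
{
public:
    explicit DeferredSorter(std::size_t max_buffered_rows) : limit(max_buffered_rows) {}

    // Step 1: only buffer incoming blocks; sort and spill once the limit is hit.
    void consume(const Block & block)
    {
        buffer.insert(buffer.end(), block.begin(), block.end());
        if (buffer.size() >= limit)
            flushRun();
    }

    // Steps 2-3: sort what is buffered; merge the runs only if something was spilled.
    Block finish()
    {
        if (runs.empty())
        {
            std::sort(buffer.begin(), buffer.end());
            return std::move(buffer);
        }
        if (!buffer.empty())
            flushRun();
        return mergeRuns();
    }

    std::size_t spilledRuns() const { return runs.size(); }

private:
    void flushRun()
    {
        std::sort(buffer.begin(), buffer.end());
        runs.push_back(std::move(buffer));
        buffer.clear();
    }

    Block mergeRuns() const
    {
        // Min-heap of (value, run index, offset within run) for a k-way merge.
        using Entry = std::tuple<std::int64_t, std::size_t, std::size_t>;
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
        for (std::size_t r = 0; r < runs.size(); ++r)
            if (!runs[r].empty())
                heap.emplace(runs[r][0], r, 0);
        Block out;
        while (!heap.empty())
        {
            auto [value, r, i] = heap.top();
            heap.pop();
            out.push_back(value);
            if (i + 1 < runs[r].size())
                heap.emplace(runs[r][i + 1], r, i + 1);
        }
        return out;
    }

    std::size_t limit;
    Block buffer;
    std::vector<Block> runs;
};
```

With a small limit the sorter spills several runs and merges them in `finish()`; with a limit larger than the input it performs a single in-memory sort and never spills, which is the case this optimization aims to make cheap.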

@liuneng1994
Collaborator

test this please with 524

Collaborator

@liuneng1994 liuneng1994 left a comment

LGTM

@liuneng1994 liuneng1994 merged commit e5b7449 into Kyligence:clickhouse_backend Nov 22, 2022
Development

Successfully merging this pull request may close these issues.

Support RangePartitioning
4 participants