[hash] Wrap/Rewrite ClickHouse hash method for Datafuse concat_row_to_one_key #754

Closed
BohuTANG opened this issue Jun 7, 2021 · 8 comments
Labels
C-performance Category: Performance

Comments

@BohuTANG
Member

BohuTANG commented Jun 7, 2021

ClickHouse's hash methods are fast enough that it would be interesting to wrap them for Datafuse, or to rewrite them in Rust.
In Datafuse, the main performance killer for group by is concat_row_to_one_key in datablocks:
https://github.com/datafuselabs/datafuse/blob/04f0b38f172e5aeb9580095c66124011c08ad7e0/common/datablocks/src/kernels/data_block_groupby.rs#L72-L74

concat_row_to_one_key concatenates the bytes of all group keys in a row into a single key.
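
To make the cost concrete, here is a minimal sketch of byte-wise key concatenation. It is not the actual Datafuse code: the function names `concat_row_key` and `group_rows` are hypothetical, and each group column is simplified to a `Vec<u64>`. The point is that every row allocates and fills its own byte buffer before it can be hashed.

```rust
use std::collections::HashMap;

/// Concatenate the bytes of every group column at `row` into one key.
/// Hypothetical stand-in: each column is simplified to a `Vec<u64>`.
fn concat_row_key(columns: &[Vec<u64>], row: usize) -> Vec<u8> {
    let mut key = Vec::with_capacity(columns.len() * std::mem::size_of::<u64>());
    for column in columns {
        // Little-endian bytes of the value, appended column by column.
        key.extend_from_slice(&column[row].to_le_bytes());
    }
    key
}

/// Map each distinct concatenated key to the row indices that share it.
fn group_rows(columns: &[Vec<u64>]) -> HashMap<Vec<u8>, Vec<usize>> {
    let rows = columns.first().map_or(0, |c| c.len());
    let mut groups: HashMap<Vec<u8>, Vec<usize>> = HashMap::new();
    for row in 0..rows {
        // One heap allocation per row: this is the part that gets expensive.
        groups
            .entry(concat_row_key(columns, row))
            .or_default()
            .push(row);
    }
    groups
}
```

With two columns `[0, 1, 0, 1]` and `[5, 5, 5, 6]`, rows 0 and 2 end up under the same 16-byte key and the map holds three groups.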

ClickHouse hash methods:
https://github.com/ClickHouse/ClickHouse/blob/27ddf78ba572b893cb5351541f566d1080d8a9c6/src/Interpreters/Aggregator.h#L68-L103

@BohuTANG BohuTANG added the C-performance Category: Performance label Jun 7, 2021
@BohuTANG BohuTANG changed the title [hash] Wrap ClickHouse hash method for Datafuse concat_row_to_one_key [hash] Wrap/Rewrite ClickHouse hash method for Datafuse concat_row_to_one_key Jun 7, 2021
@BohuTANG BohuTANG mentioned this issue Jun 7, 2021
2 tasks
@sundy-li
Member

sundy-li commented Jun 7, 2021

Related #520

@PsiACE
Member

PsiACE commented Jun 7, 2021

The hash functions seem to be at: https://github.com/ClickHouse/ClickHouse/blob/master/src/Common/HashTable/Hash.h

FYI, there are some related Rust implementations:

MurmurHash:

CRC32:

@BohuTANG
Member Author

BohuTANG commented Jun 7, 2021

@PsiACE
Thanks for the reference.
But this is not only about the hash implementation; Rust hashes are fast too.
The issue is how to concatenate the keys into one when we group by many columns, for example:

select max(number) from numbers(100000000) group by number%3, number%4, number%5;

The main point is how to hash the group key (number%3, number%4, number%5) quickly.
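
One possible direction, sketched here under assumptions, is to pack small fixed-width keys into a single integer instead of building a byte buffer per row. This is an illustration only, not Datafuse's or ClickHouse's code: `pack_key` and `max_by_group` are hypothetical names, and the packing relies on each key value fitting in a `u8`, which holds for number%3, number%4 and number%5.

```rust
use std::collections::HashMap;

/// Pack three small key values into one u32, one byte per column.
/// Assumes every value fits in a u8 (true for number%3, number%4, number%5).
fn pack_key(a: u8, b: u8, c: u8) -> u32 {
    (a as u32) | ((b as u32) << 8) | ((c as u32) << 16)
}

/// Compute max(number) for number in 0..n,
/// grouped by (number%3, number%4, number%5).
fn max_by_group(n: u64) -> HashMap<u32, u64> {
    let mut maxima: HashMap<u32, u64> = HashMap::new();
    for number in 0..n {
        // The key is a plain integer, so no per-row allocation is needed.
        let key = pack_key((number % 3) as u8, (number % 4) as u8, (number % 5) as u8);
        let slot = maxima.entry(key).or_insert(number);
        *slot = (*slot).max(number);
    }
    maxima
}
```

Because 3, 4 and 5 are pairwise coprime, the query above has 60 distinct groups, so `max_by_group(120)` returns 60 entries and the group for key (0, 0, 0) has maximum 60.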

@PsiACE
Member

PsiACE commented Jun 7, 2021

The issue is how to concatenate the keys into one when we group by many columns

Thanks for your further explanation.

Like this: apache/arrow#10290?

@zhang2014
Member

zhang2014 commented Jun 8, 2021

I have an example here, which may be helpful.

@lideen999

lideen999 commented Jun 8, 2021

Suggested reference

https://github.com/influxdata/influxdb_iox/blob/faec98eab90b0708cacfce47047abd596ec105ef/read_buffer/src/row_group.rs#L1150-L1154

fn pack_u32_in_u128(packed_value: u128, encoded_id: u32, pos: usize) -> u128 {
    packed_value | (encoded_id as u128) << (32 * pos)
}
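
For context, a short sketch of how such a helper could be used to build one key from several encoded ids. The helper is the one quoted above; the wrapper `pack_ids` is a hypothetical addition for illustration and is not taken from influxdb_iox.

```rust
/// The helper quoted above: place a u32 id into the `pos`-th 32-bit slot.
fn pack_u32_in_u128(packed_value: u128, encoded_id: u32, pos: usize) -> u128 {
    packed_value | (encoded_id as u128) << (32 * pos)
}

/// Fold up to four u32 ids into one u128 key (hypothetical wrapper).
fn pack_ids(ids: &[u32]) -> u128 {
    assert!(ids.len() <= 4, "a u128 holds at most four u32 slots");
    ids.iter()
        .enumerate()
        .fold(0u128, |key, (pos, &id)| pack_u32_in_u128(key, id, pos))
}
```

The resulting `u128` can be used directly as a hash map key. The approach only works when there are at most four columns and each value is already encoded as a `u32`, so it does not cover arbitrary group keys.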

@BohuTANG
Member Author

BohuTANG commented Jun 8, 2021

Suggested reference

https://github.com/influxdata/influxdb_iox/blob/faec98eab90b0708cacfce47047abd596ec105ef/read_buffer/src/row_group.rs#L1150-L1154

fn pack_u32_in_u128(packed_value: u128, encoded_id: u32, pos: usize) -> u128 {
    packed_value | (encoded_id as u128) << (32 * pos)
}

Thanks for the reference, but it doesn't look helpful here.

@sundy-li
Member

It's already done.


5 participants