Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Evaluate vectorized hash table for group aggregation #7095

Open
sunchao opened this issue Jul 26, 2023 · 8 comments
Open

Evaluate vectorized hash table for group aggregation #7095

sunchao opened this issue Jul 26, 2023 · 8 comments
Labels
enhancement New feature or request

Comments

@sunchao
Copy link
Member

sunchao commented Jul 26, 2023

Is your feature request related to a problem or challenge?

Currently DF uses a RawTable from hashbrown as the hash table implementation in group aggregations. This requires first converting the input batches into a row format, and then process the converted rows one by one, does hash probing, equality check, as well as creating new entries accordingly.

A different approach, as discussed in the Photon paper (and is also used by DuckDB), is to adopt a new vectorized approach in the hash table design, so that each of the above steps can be vectorized. In addition this allows us to skip the row conversion and directly operates on the input batches.

Internally we have a draft implementation for this and it has shown considerable improvements (even without SIMD, although with a lot of unsafes 😂 ) on top of the current hash aggregation approach, so we'd like to contribute to DF and see if it can help to improve its aggregation performance even further.

Describe the solution you'd like

Design & implement a separate vectorized hash table. It can either replace the existing RawTable inside GroupValuesRows, or we can have a separate GroupValues implementation.

Describe alternatives you've considered

Not to implement this.

Additional context

No response

@Dandandan
Copy link
Contributor

@sunchao that sounds very exciting 🚀

@doki23
Copy link
Contributor

doki23 commented Mar 13, 2024

Hi @sunchao, are you still moving it forward?

@sunchao
Copy link
Member Author

sunchao commented Mar 13, 2024

@doki23 yes I'm still planning to. I have a POC branch for this work: https://github.com/sunchao/arrow-datafusion/commits/vectorized-hash-table/ but I haven't got time to go back to it yet. Will try to do it soon.

@Lordworms
Copy link
Contributor

@doki23 yes I'm still planning to. I have a POC branch for this work: https://github.com/sunchao/arrow-datafusion/commits/vectorized-hash-table/ but I haven't got time to go back to it yet. Will try to do it soon.

Would like to explore more on this ticket. should I start after this branch?

@alamb
Copy link
Contributor

alamb commented Aug 22, 2024

@viirya / @sunchao / @andygrove / @kazuyukitanimura , this idea came up in the context of #11943 -- I wonder if you have any updates or code we might be able to look at

@sunchao
Copy link
Member Author

sunchao commented Aug 22, 2024

@Lordworms @alamb the code is still in my branch: https://github.com/sunchao/arrow-datafusion/commits/vectorized-hash-table/ and hasn't been updated for a while, so not sure how easy it can rebase.

One missing piece for the work is to eliminate the convert_columns and convert_rows cost. Last time I measured it took ~30% of the total time. For this, I think we need to extend the Array builder to add a get API too, so that while we are building the group values (which is represented via Arrow vectors) we can still look into the elements in various operations such as equality check.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Aug 23, 2024

I had tried the code in the branch but doesn't see the performance improvement (and slightly slower), I think RowConverter is what matters.

https://github.com/apache/datafusion/pull/12124/files

@sunchao
Copy link
Member Author

sunchao commented Aug 23, 2024

Yes, that matches with my last measurement too. In my last company we had an internal implementation which bypasses the row-column conversion, and it showed good improvements over the current implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants