ARROW-10722: [Rust][DataFusion] Reduce overhead of some data types in aggregations / joins, improve benchmarks #8765
Conversation
Some further tweaks, uses …
Genuinely curious: does the key size have such a large impact? Or are there any memory constraints that you are looking at?
@jorgecarleitao Not really on performance, as the current benchmarks / queries show; I'm mostly looking at ways to improve the aggregate / join performance. The main thing I wanted to investigate is whether the aggregates / joins themselves can be made faster. I think one part would be to create a key that can be hashed faster: currently the hashing algorithm hashes each individual GroupByValue instead of working on a byte array, and the latter is in principle faster. Some specialized code could also be written for hashing on a single column only. The current changes do have a larger impact on memory usage, though, if you are hashing / aggregating something with high cardinality, as each key takes extra bytes: 16 bytes for each GroupByValue, 8 bytes for using …
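To make the hashing point concrete, here is a minimal, self-contained Rust sketch (not DataFusion code): a `GroupValue` enum stands in for `GroupByScalar`, and `DefaultHasher` stands in for whatever hasher the hashmap uses. Hashing a slice of enum values feeds the hasher many small writes, one per value, while a pre-packed byte key is a single contiguous write; the packing layout below is an ad-hoc choice for the example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for a dynamic per-column group value.
#[allow(dead_code)]
#[derive(Hash)]
enum GroupValue {
    Int32(i32),
    UInt64(u64),
    Utf8(String),
}

// One `hash` call per enum value: discriminant + payload for each element.
fn hash_per_value(key: &[GroupValue]) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

// A single contiguous byte slice: the hasher sees one large write.
fn hash_byte_key(key: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

fn main() {
    let dynamic_key = vec![GroupValue::Int32(1), GroupValue::Utf8("nl".to_string())];

    // The same logical key packed into one byte buffer (layout chosen ad hoc).
    let mut packed: Vec<u8> = 1i32.to_le_bytes().to_vec();
    packed.extend_from_slice(b"nl");

    println!("per-value hash: {}", hash_per_value(&dynamic_key));
    println!("byte-key hash:  {}", hash_byte_key(&packed));
}
```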
LGTM. Thanks a lot, @Dandandan!
Can you rebase this, so that I can merge it?
@jorgecarleitao thanks, updated against master!
I also did some tests against master; it looks like this PR also has some small runtime savings on join queries, I think mostly because of …
I also have a follow-up on this PR which converts the key for the values to `Vec<u8>`:
This PR is a follow-up of #8765. Here, the hashmap values for the key are converted to `Vec<u8>` to use as key instead. This is a bit faster, as both hashing and cloning will be faster. It will also use less additional memory than the earlier usage of the dynamic `GroupByScalar` values (for hash join).

[This PR]
```
Query 12 iteration 0 took 1315 ms
Query 12 iteration 1 took 1324 ms
Query 12 iteration 2 took 1329 ms
Query 12 iteration 3 took 1334 ms
Query 12 iteration 4 took 1335 ms
Query 12 iteration 5 took 1338 ms
Query 12 iteration 6 took 1337 ms
Query 12 iteration 7 took 1349 ms
Query 12 iteration 8 took 1348 ms
Query 12 iteration 9 took 1358 ms
```

[Master]
```
Query 12 iteration 0 took 1379 ms
Query 12 iteration 1 took 1383 ms
Query 12 iteration 2 took 1401 ms
Query 12 iteration 3 took 1406 ms
Query 12 iteration 4 took 1420 ms
Query 12 iteration 5 took 1435 ms
Query 12 iteration 6 took 1401 ms
Query 12 iteration 7 took 1404 ms
Query 12 iteration 8 took 1418 ms
Query 12 iteration 9 took 1416 ms
```

[This PR]
```
Query 1 iteration 0 took 871 ms
Query 1 iteration 1 took 866 ms
Query 1 iteration 2 took 869 ms
Query 1 iteration 3 took 869 ms
Query 1 iteration 4 took 867 ms
Query 1 iteration 5 took 874 ms
Query 1 iteration 6 took 870 ms
Query 1 iteration 7 took 875 ms
Query 1 iteration 8 took 871 ms
Query 1 iteration 9 took 869 ms
```

[Master]
```
Query 1 iteration 0 took 1189 ms
Query 1 iteration 1 took 1192 ms
Query 1 iteration 2 took 1189 ms
Query 1 iteration 3 took 1185 ms
Query 1 iteration 4 took 1193 ms
Query 1 iteration 5 took 1202 ms
Query 1 iteration 6 took 1547 ms
Query 1 iteration 7 took 1242 ms
Query 1 iteration 8 took 1202 ms
Query 1 iteration 9 took 1197 ms
```

FWIW, micro benchmark results for aggregate queries:
```
aggregate_query_no_group_by 15 12
                        time:   [538.54 us 541.48 us 544.74 us]
                        change: [+5.4384% +6.6260% +8.2034%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

aggregate_query_no_group_by_count_distinct_wide 15 12
                        time:   [4.8418 ms 4.8744 ms 4.9076 ms]
                        change: [-13.890% -12.580% -11.260%] (p = 0.00 < 0.05)
                        Performance has improved.

aggregate_query_no_group_by_count_distinct_narrow 15 12
                        time:   [2.1910 ms 2.2100 ms 2.2291 ms]
                        change: [-30.490% -28.886% -27.271%] (p = 0.00 < 0.05)
                        Performance has improved.

Benchmarking aggregate_query_group_by 15 12: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.1s, enable flat sampling, or reduce sample count to 50.
aggregate_query_group_by 15 12
                        time:   [1.5905 ms 1.5977 ms 1.6054 ms]
                        change: [-18.271% -16.780% -15.396%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe

aggregate_query_group_by_with_filter 15 12
                        time:   [788.26 us 792.05 us 795.74 us]
                        change: [-9.8088% -8.5606% -7.4141%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

Benchmarking aggregate_query_group_by_u64 15 12: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.3s, enable flat sampling, or reduce sample count to 50.
aggregate_query_group_by_u64 15 12
                        time:   [1.8502 ms 1.8565 ms 1.8630 ms]
                        change: [+8.6203% +9.8872% +10.973%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

aggregate_query_group_by_with_filter_u64 15 12
                        time:   [777.83 us 782.75 us 788.15 us]
                        change: [-7.5157% -6.6393% -5.6558%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
```

FYI @jorgecarleitao

Closes #8863 from Dandandan/key_byte_vec

Lead-authored-by: Heres, Daniel <danielheres@gmail.com>
Co-authored-by: Daniël Heres <danielheres@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
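A rough sketch of the `Vec<u8>` key idea described above, with hypothetical column data: the group-by values of each row are packed into one byte buffer that serves as the hashmap key, so hashing and cloning touch a single allocation. The length-prefix encoding is an assumption made for the example, not necessarily what the PR does.

```rust
use std::collections::HashMap;

fn main() {
    // Hypothetical two-column group-by key.
    let col_a: Vec<i32> = vec![1, 2, 1, 2, 1];
    let col_b: Vec<&str> = vec!["x", "y", "x", "y", "z"];

    // Packed key bytes -> row indices sharing that key.
    let mut groups: HashMap<Vec<u8>, Vec<usize>> = HashMap::new();

    for row in 0..col_a.len() {
        // Pack both key columns for this row into one buffer. Variable-length
        // values get a length prefix so different keys cannot collide on the
        // same byte sequence.
        let mut key = Vec::with_capacity(4 + 4 + col_b[row].len());
        key.extend_from_slice(&col_a[row].to_le_bytes());
        key.extend_from_slice(&(col_b[row].len() as u32).to_le_bytes());
        key.extend_from_slice(col_b[row].as_bytes());

        groups.entry(key).or_default().push(row);
    }

    for (key, rows) in &groups {
        println!("{:?} -> rows {:?}", key, rows);
    }
}
```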
… aggregations / joins, improve benchmarks

This PR reduces the size of `GroupByScalar` from 32 bytes to 16 bytes by using `Box<String>`. This will reduce the size of a `Vec<GroupByScalar>` and thus the key of the hashmaps used for aggregates / joins.

Also, it changes the type of the key to `Box<[GroupByScalar]>` to reduce memory usage further by the 8 bytes per key needed to hold the capacity of the `Vec`. Finally, we can remove a `Box` around the `Vec` holding the indices.

The difference in speed seems to be minimal, at least in the current state.

I think in the future it could be nice to see if the data could be packed efficiently into one `Box<[T]>` (where `T` is a primitive value) when there are no dynamically sized types, by using the schema instead of creating "dynamic" values. That should also make the hashing faster. Currently, when grouping on multiple i32 values, we need 32 bytes per value (next to 24 bytes for the `Vec` holding the values) instead of just 4! Using const generics (https://rust-lang.github.io/rfcs/2000-const-generics.html) could provide a further improvement, by not having to store the length of the slice.

This PR also tries to improve reproducibility in the benchmarks a bit by seeding the random number generator (results are still quite noisy on my machine though).

Closes apache#8765 from Dandandan/reduce_key_size

Lead-authored-by: Heres, Daniel <danielheres@gmail.com>
Co-authored-by: Daniël Heres <danielheres@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
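As a rough illustration of where those numbers come from (hypothetical enums, not the real `GroupByScalar` definition; sizes assume a 64-bit target): a `String` payload keeps the enum at the size the PR reports (32 bytes), a `Box<String>` payload shrinks it to 16, and swapping `Vec` for a boxed slice drops the 8-byte capacity field. The wide variant's size is printed rather than asserted because the exact figure can depend on the compiler's layout choices.

```rust
use std::mem::size_of;

// A String payload is 24 bytes, plus a discriminant rounded up to 8-byte
// alignment; the PR reports 32 bytes for the String-based GroupByScalar.
#[allow(dead_code)]
enum WideScalar {
    UInt64(u64),
    Utf8(String),
}

// A Box<String> payload is a single 8-byte pointer, so payload + tag = 16.
#[allow(dead_code)]
enum NarrowScalar {
    UInt64(u64),
    Utf8(Box<String>),
}

fn main() {
    println!("WideScalar: {} bytes", size_of::<WideScalar>());
    assert_eq!(size_of::<NarrowScalar>(), 16);

    // Vec stores pointer + length + capacity (24 bytes on 64-bit targets);
    // a boxed slice stores only pointer + length, saving 8 bytes per key.
    assert_eq!(size_of::<Vec<NarrowScalar>>(), 24);
    assert_eq!(size_of::<Box<[NarrowScalar]>>(), 16);

    println!("size checks passed");
}
```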
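On the reproducibility point, a minimal sketch of what seeding the benchmark's random data generation looks like, using the `rand` crate (`StdRng::seed_from_u64`); the generator function and constants are made up for the example, not taken from the actual benchmark code.

```rust
use rand::rngs::StdRng;
use rand::{RngCore, SeedableRng};

// Hypothetical benchmark data generator: a fixed seed makes every run draw
// the same values, so run-to-run differences come from the code under test
// rather than from the generated input.
fn generate_values(n: usize, seed: u64) -> Vec<u64> {
    let mut rng = StdRng::seed_from_u64(seed);
    (0..n).map(|_| rng.next_u64() % 1_000).collect()
}

fn main() {
    let a = generate_values(5, 42);
    let b = generate_values(5, 42);
    assert_eq!(a, b); // same seed, same data on every run
    println!("{:?}", a);
}
```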