Simplify GroupByHash implementation (to prepare for more work) #4972

Merged
merged 3 commits into apache:master from alamb/simplify_group_by on Jan 21, 2023

Conversation

Contributor

@alamb commented Jan 18, 2023

Draft as it builds on #4924

Which issue does this PR close?

re #4973

Rationale for this change

Follow-on to the #4924 work from @mustafasrepo and @ozankabak.

As we prepare to improve group by performance even more, we will be working on this code going forward.

There are several TODOs in the group by hash code, as well as some out-of-date comments, that make it harder to work with. Given the plans to improve this code further, it is important that it remains relatively easy to work with.

Since I had all the code paged in anyway while reviewing #4924, I figured I would add my comments here.

What changes are included in this PR?

  1. Remove extra level of unwrapping in GroupedHashAggregateStreamInner
  2. Make group_aggregate_batch and create_batch_from_map member functions rather than free functions (and remove clippy warnings; see the sketch below)
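
A rough sketch of change 2 (hypothetical types and signatures, not the actual DataFusion code): a free function that has to take every piece of stream state as a parameter becomes a method that reads that state from self, which is also what makes the clippy allowance for long argument lists unnecessary.

struct GroupState {
    counts: Vec<u64>,
}

struct AggStream {
    group_state: GroupState,
    batch_size: usize,
}

// Before: a free function with a growing parameter list.
fn group_aggregate_batch_free(group_state: &mut GroupState, batch_size: usize, values: &[u64]) {
    for (i, v) in values.iter().enumerate() {
        group_state.counts[i % batch_size] += *v;
    }
}

impl AggStream {
    // After: the same logic as a member function; callers only pass the batch.
    fn group_aggregate_batch(&mut self, values: &[u64]) {
        for (i, v) in values.iter().enumerate() {
            self.group_state.counts[i % self.batch_size] += *v;
        }
    }
}

fn main() {
    let mut standalone = GroupState { counts: vec![0; 2] };
    group_aggregate_batch_free(&mut standalone, 2, &[1, 2, 3]);

    let mut stream = AggStream { group_state: GroupState { counts: vec![0; 2] }, batch_size: 2 };
    stream.group_aggregate_batch(&[1, 2, 3]);

    assert_eq!(standalone.counts, stream.group_state.counts);
}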

Are these changes tested?

Existing tests cover these cases (this is a refactor)

Are there any user-facing changes?

No

Benchmark results

git checkout 96cf046be57bf09548d51f50d0bc964904bcec7d
cargo bench -p datafusion --bench aggregate_query_sql -- --save-baseline pr4972-pre
git checkout alamb/simplify_group_by
cargo bench -p datafusion --bench aggregate_query_sql -- --baseline pr4972-pre

I think the benchmarks show no significant changes (other than noise)

aggregate_query_no_group_by 15 12
                        time:   [2.3153 ms 2.3279 ms 2.3414 ms]
                        change: [-2.4095% -1.5536% -0.7448%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

aggregate_query_no_group_by_min_max_f64
                        time:   [2.1700 ms 2.1826 ms 2.1958 ms]
                        change: [-1.7443% -0.9155% -0.1242%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild

aggregate_query_no_group_by_count_distinct_wide
                        time:   [5.8379 ms 5.9051 ms 5.9713 ms]
                        change: [-1.2545% +0.2912% +1.9169%] (p = 0.72 > 0.05)
                        No change in performance detected.

aggregate_query_no_group_by_count_distinct_narrow
                        time:   [3.6279 ms 3.6631 ms 3.6990 ms]
                        change: [-2.6099% -1.2926% +0.0237%] (p = 0.06 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild

aggregate_query_group_by
                        time:   [5.4279 ms 5.4945 ms 5.5616 ms]
                        change: [-0.8897% +0.6369% +2.2354%] (p = 0.43 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

aggregate_query_group_by_with_filter
                        time:   [3.5274 ms 3.5516 ms 3.5761 ms]
                        change: [-4.4837% -3.6534% -2.8178%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

aggregate_query_group_by_u64 15 12
                        time:   [5.1773 ms 5.2419 ms 5.3089 ms]
                        change: [-4.5574% -2.8527% -1.1438%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

aggregate_query_group_by_with_filter_u64 15 12
                        time:   [3.5820 ms 3.6025 ms 3.6236 ms]
                        change: [+2.4312% +3.2799% +4.1676%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

aggregate_query_group_by_u64_multiple_keys
                        time:   [35.172 ms 36.120 ms 37.089 ms]
                        change: [-0.2381% +3.6619% +7.7225%] (p = 0.06 > 0.05)
                        No change in performance detected.

aggregate_query_approx_percentile_cont_on_u64
                        time:   [10.832 ms 10.992 ms 11.152 ms]
                        change: [-2.4045% -0.3099% +1.7212%] (p = 0.77 > 0.05)
                        No change in performance detected.

aggregate_query_approx_percentile_cont_on_f32
                        time:   [9.7958 ms 9.9346 ms 10.076 ms]
                        change: [-3.4670% -1.4739% +0.5056%] (p = 0.15 > 0.05)
                        No change in performance detected.

github-actions bot added the core (Core DataFusion crate) and physical-expr (Physical Expressions) labels on Jan 18, 2023
@alamb changed the title from "Alamb/simplify group by" to "Simplify GroupByHash implementation (to prepare for more work)" on Jan 18, 2023
github-actions bot removed the physical-expr (Physical Expressions) label on Jan 19, 2023
Contributor Author

@alamb left a comment

Reviewing this PR with a whitespace-blind diff makes it easier to see what changed: https://github.com/apache/arrow-datafusion/pull/4972/files?w=1


/// Actual implementation of [`GroupedHashAggregateStream`].
///
/// This is wrapped into yet another struct because we need to interact with the async memory management subsystem
Contributor Author

This comment about another struct for memory management is out of date, so I folded GroupedHashAggregateStreamInner directly into GroupedHashAggregateStream.
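
A minimal before/after sketch of that fold (hypothetical names, not the real DataFusion types): the wrapper added a level of indirection that every access had to unwrap.

// Before: the public stream only delegated to an inner struct.
struct InnerState {
    rows_seen: usize,
}

struct StreamWithInner {
    inner: InnerState,
}

impl StreamWithInner {
    fn record(&mut self, n: usize) {
        self.inner.rows_seen += n; // extra `.inner` hop on every access
    }
}

// After: the same fields live directly on the stream type.
struct FlatStream {
    rows_seen: usize,
}

impl FlatStream {
    fn record(&mut self, n: usize) {
        self.rows_seen += n;
    }
}

fn main() {
    let mut before = StreamWithInner { inner: InnerState { rows_seen: 0 } };
    let mut after = FlatStream { rows_seen: 0 };
    before.record(5);
    after.record(5);
    assert_eq!(before.inner.rows_seen, after.rows_seen);
}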

@@ -115,6 +106,14 @@ struct GroupedHashAggregateStreamInner {
indices: [Vec<Range<usize>>; 2],
}

#[derive(Debug)]
/// tracks what phase the aggregation is in
enum ExecutionState {
Contributor Author

This used to be tracked using several multi-level match statements and a fused inner stream. Now it is represented explicitly in this stream.
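
As a minimal sketch of the pattern (the variant names and poll logic here are illustrative, not the actual DataFusion code), an explicit state enum lets the poll loop dispatch on the current phase, and the terminal state gives the same "always None after the end" guarantee that fuse() used to provide.

use std::task::Poll;

// Hypothetical phases; the real ExecutionState in the PR may differ.
enum Phase {
    ReadingInput,
    ProducingOutput,
    Done,
}

struct Agg {
    phase: Phase,
    batches_left: usize,
}

impl Agg {
    // Stand-in for Stream::poll_next: a single match on the phase replaces
    // nested matches over a fused inner stream.
    fn poll_next(&mut self) -> Poll<Option<u64>> {
        loop {
            match self.phase {
                Phase::ReadingInput => {
                    if self.batches_left == 0 {
                        self.phase = Phase::ProducingOutput;
                    } else {
                        self.batches_left -= 1; // "aggregate" one input batch
                    }
                }
                Phase::ProducingOutput => {
                    self.phase = Phase::Done;
                    return Poll::Ready(Some(42)); // emit the (fake) output
                }
                // Once Done, every further poll yields None, which is what
                // fuse() used to guarantee implicitly.
                Phase::Done => return Poll::Ready(None),
            }
        }
    }
}

fn main() {
    let mut agg = Agg { phase: Phase::ReadingInput, batches_left: 2 };
    while let Poll::Ready(Some(v)) = agg.poll_next() {
        println!("output batch: {v}");
    }
    assert!(matches!(agg.poll_next(), Poll::Ready(None)));
}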

///
/// If successfull, this returns the additional number of bytes that were allocated during this process.
///
/// TODO: Make this a member function of [`GroupedHashAggregateStream`]
Contributor Author

DONE!

@@ -576,138 +568,131 @@ impl std::fmt::Debug for RowAggregationState {
}
}

/// Create a RecordBatch with all group keys and accumulator' states or values.
#[allow(clippy::too_many_arguments)]
Contributor Author

likewise here, moved from a free function to a member function on GroupedHashAggregateStream


// seems like some consumers call this stream even after it returned `None`, so let's fuse the stream.
let stream = stream.fuse();
Contributor Author

We used to fuse the stream implicitly -- but it is now handled via ExecutionState::Done

@alamb marked this pull request as ready for review on January 19, 2023 at 12:55
Contributor

@ozankabak left a comment

LGTM, only one minor inline comment. This part of the code is really getting tidied up 🙂

Comment on lines 250 to 260
             match result.and_then(|allocated| {
-                this.row_aggr_state.reservation.try_grow(allocated)
+                self.row_aggr_state.reservation.try_grow(allocated)
             }) {
-                Ok(_) => continue,
-                Err(e) => Err(ArrowError::ExternalError(Box::new(e))),
+                Ok(_) => {}
+                Err(e) => {
+                    return Poll::Ready(Some(Err(
+                        ArrowError::ExternalError(Box::new(e)),
+                    )))
+                }
             }
         }
Contributor

Since the Ok case is a no-op, an if let Err(e) = ... seems to be more idiomatic here
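
A small illustration of the suggested shape (illustrative names and error type, not the actual DataFusion code): when the Ok arm is a no-op, the single interesting arm can be written with if let.

fn try_grow(allocated: usize) -> Result<(), String> {
    if allocated > 1024 {
        Err(format!("cannot reserve {allocated} bytes"))
    } else {
        Ok(())
    }
}

fn handle(allocated: usize) -> Option<String> {
    // Instead of a two-arm match whose Ok arm does nothing:
    //   match try_grow(allocated) {
    //       Ok(_) => {}
    //       Err(e) => return Some(e),
    //   }
    // write only the arm that matters:
    if let Err(e) = try_grow(allocated) {
        return Some(e);
    }
    None
}

fn main() {
    assert_eq!(handle(10), None);
    assert!(handle(4096).is_some());
}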

Contributor Author

@alamb Jan 19, 2023

I agree -- changed in 06847b5. Thank you for the suggestion

Contributor Author

alamb commented Jan 20, 2023

I plan to merge this tomorrow unless anyone would like more time to review or comment

cc @tustvold @Dandandan @crepererum

@alamb merged commit 350cb47 into apache:master on Jan 21, 2023
@alamb deleted the alamb/simplify_group_by branch on January 21, 2023 at 10:48

ursabot commented Jan 21, 2023

Benchmark runs are scheduled for baseline = f5439c8 and contender = 350cb47. 350cb47 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Labels
core (Core DataFusion crate)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants