Skipping partial aggregation when it is not helping for high cardinality aggregates #11627

korowa · 2024-07-23T20:10:49Z

Which issue does this PR close?

~~Related to #6937.~~
Closes #6937

Rationale for this change

Currently DF plans (almost always) two aggregation operators -- Partial and Final, executing one after another with Partial output being input for Final. In case aggregate input is almost/close to unique, Partial aggregation doesn't group data well (output row count +- same as input rowcount), and DF ends up with doing the same work twice.

Suggestion is to start skipping partial aggregation after some fixed amount of input rows, in case at that moment accumulated unique groups / input rows exceeds some fixed threshold value (which by default is somewhere between 0.5 and 1, but closer to 1), and produce batched "as-is", replacing aggregate accumulators inputs with corresponding intermediate aggregate states (in order not to break record batch schema for downstream operators -- specifically, for CoalesceBatches)

What changes are included in this PR?

Execution configuration options skip_partial_aggregation_probe_rows_threshold and skip_partial_aggregation_probe_ratio_threshold -- the first is responsible for input rows to aggregate before checking aggregation ratio, the second -- for rate threshold
GroupedHashAggregateStream.skip_aggregation_probe and related methods for updating state / obtaining information if further input aggregation may be skipped
GroupsAccumulator.convert_to_state, and its implementations for PrimitiveGroupsAccumulator (sum / min / max) and Count accumulators -- method responsible for converting RecordBatch to intermediate aggregate state, without grouping input data, and ``GroupsAccumulator.convert_to_state_supported`, which indicates that accumulator is able to perform conversion described above.

Are these changes tested?

Added tests for switching to SkippingAggregation state for aggregate stream, and sqllogictests to validate correctness of accumulators in skipping aggregation mode.

Are there any user-facing changes?

Partial aggregation results may now contain records with duplicating values of GROUP BY expressions

alamb · 2024-07-23T20:53:44Z

Thank you @korowa -- I think this is the right approach. The challenge when I tried it before was that it slowed down some queries. We should run some benchmarks (I can help maybe tomorrow)

Dandandan · 2024-07-24T08:31:59Z

datafusion/physical-expr-common/src/aggregate/groups_accumulator/prim_op.rs

+
+        match opt_filter {
+            Some(filter) => {
+                values


Can use filter kernel here instead of zipping?

Not sure about filter, but kernels sound like a good idea, I'll try to switch to using them.

I think @Dandandan is suggesting using https://docs.rs/arrow/latest/arrow/compute/kernels/filter/fn.filter.html

However, that would likely require a second copy of the values (apply/collect filter result and then apply prim_fn)

Yes, but the example shows that it filters out values from the source array, and conversion to state must produce the same number of elements, just placing ~~nulls~~ zeros instead of filtered values, so I'm planning to look for smth like "apply null mask".

I've started with some benchmarks (criterion based ones) and they show that current code for nullable columns (at least for count) is significantly slower that for non nullable ones (~15 times 😞 ), probably some part of this time can be recovered.

I see, we can't use filter here as we need to produce the values as is.

I think we should be able to build the values based on the values buffer and handle nulls separately:

no filter: just pass null mask of values

filter present: bitwise_and both null masks

this should also be beneficial for the non-null case, as it avoids the iterator/builder

For "no filter" -- casting values.logical_nulls() to i64 helps a bit. Regarding bitwise_and -- I'll try (the problem with all logical functions is that filter may also contain nulls)

alamb · 2024-07-25T09:06:43Z

I am starting to run clickbench and tpch benchmarks on this PR. Will report results shortly.

It is a really neat idea to have the threshold's configurable

alamb · 2024-07-25T11:42:32Z

Here are my benchmark results - they look quite good. Other than ClickBench Q32 and TPCH Q17 they all looks faster 😍

Details

--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ skip-partial-aggregation ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.84ms │                   0.86ms │     no change │
│ QQuery 1     │    95.11ms │                  96.03ms │     no change │
│ QQuery 2     │   200.81ms │                 209.94ms │     no change │
│ QQuery 3     │   208.67ms │                 207.07ms │     no change │
│ QQuery 4     │  2233.18ms │                2095.56ms │ +1.07x faster │
│ QQuery 5     │  2059.66ms │                2015.05ms │     no change │
│ QQuery 6     │    83.99ms │                  86.96ms │     no change │
│ QQuery 7     │    99.67ms │                 101.10ms │     no change │
│ QQuery 8     │  3235.66ms │                3017.30ms │ +1.07x faster │
│ QQuery 9     │  2419.16ms │                2350.53ms │     no change │
│ QQuery 10    │   848.81ms │                 857.76ms │     no change │
│ QQuery 11    │   926.94ms │                 933.87ms │     no change │
│ QQuery 12    │  2176.13ms │                2087.42ms │     no change │
│ QQuery 13    │  4677.48ms │                3770.29ms │ +1.24x faster │
│ QQuery 14    │  2938.45ms │                2845.23ms │     no change │
│ QQuery 15    │  2504.24ms │                2371.75ms │ +1.06x faster │
│ QQuery 16    │  6069.34ms │                5811.38ms │     no change │
│ QQuery 17    │  5991.68ms │                5856.53ms │     no change │
│ QQuery 18    │ 12199.74ms │               11468.73ms │ +1.06x faster │
│ QQuery 19    │   171.89ms │                 171.08ms │     no change │
│ QQuery 20    │  2693.33ms │                2795.76ms │     no change │
│ QQuery 21    │  3491.08ms │                3566.37ms │     no change │
│ QQuery 22    │  9438.41ms │                9598.53ms │     no change │
│ QQuery 23    │ 22160.51ms │               22473.59ms │     no change │
│ QQuery 24    │  1344.81ms │                1409.66ms │     no change │
│ QQuery 25    │  1167.37ms │                1182.06ms │     no change │
│ QQuery 26    │  1482.09ms │                1518.54ms │     no change │
│ QQuery 27    │  4044.47ms │                4035.97ms │     no change │
│ QQuery 28    │ 29023.37ms │               30566.78ms │  1.05x slower │
│ QQuery 29    │  1064.52ms │                1076.49ms │     no change │
│ QQuery 30    │  2553.83ms │                2598.63ms │     no change │
│ QQuery 31    │  3274.52ms │                3309.47ms │     no change │
│ QQuery 32    │ 17306.62ms │               18361.28ms │  1.06x slower │
│ QQuery 33    │  9624.79ms │                9860.60ms │     no change │
│ QQuery 34    │  9610.64ms │                9676.40ms │     no change │
│ QQuery 35    │  3800.23ms │                3819.15ms │     no change │
│ QQuery 36    │   352.06ms │                 351.07ms │     no change │
│ QQuery 37    │   238.56ms │                 238.28ms │     no change │
│ QQuery 38    │   196.30ms │                 204.93ms │     no change │
│ QQuery 39    │  1122.84ms │                1152.25ms │     no change │
│ QQuery 40    │   101.24ms │                  96.72ms │     no change │
│ QQuery 41    │    85.59ms │                  84.80ms │     no change │
│ QQuery 42    │   104.55ms │                 104.32ms │     no change │
└──────────────┴────────────┴──────────────────────────┴───────────────┘
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (main_base)                  │ 173423.19ms │
│ Total Time (skip-partial-aggregation)   │ 174436.08ms │
│ Average Time (main_base)                │   4033.10ms │
│ Average Time (skip-partial-aggregation) │   4056.65ms │
│ Queries Faster                          │           5 │
│ Queries Slower                          │           2 │
│ Queries with No Change                  │          36 │
└─────────────────────────────────────────┴─────────────┘
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ skip-partial-aggregation ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │ 3850.08ms │                3858.37ms │     no change │
│ QQuery 1     │ 1558.99ms │                1493.93ms │     no change │
│ QQuery 2     │ 3150.28ms │                2935.05ms │ +1.07x faster │
└──────────────┴───────────┴──────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main_base)                  │ 8559.35ms │
│ Total Time (skip-partial-aggregation)   │ 8287.35ms │
│ Average Time (main_base)                │ 2853.12ms │
│ Average Time (skip-partial-aggregation) │ 2762.45ms │
│ Queries Faster                          │         1 │
│ Queries Slower                          │         0 │
│ Queries with No Change                  │         2 │
└─────────────────────────────────────────┴───────────┘

--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ skip-partial-aggregation ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  203.55ms │                 194.22ms │     no change │
│ QQuery 2     │   38.30ms │                  35.41ms │ +1.08x faster │
│ QQuery 3     │   61.24ms │                  59.49ms │     no change │
│ QQuery 4     │   65.67ms │                  60.42ms │ +1.09x faster │
│ QQuery 5     │  102.54ms │                  92.60ms │ +1.11x faster │
│ QQuery 6     │   15.35ms │                  13.56ms │ +1.13x faster │
│ QQuery 7     │  210.69ms │                 202.60ms │     no change │
│ QQuery 8     │   39.94ms │                  39.42ms │     no change │
│ QQuery 9     │  115.73ms │                 107.84ms │ +1.07x faster │
│ QQuery 10    │  103.60ms │                 101.10ms │     no change │
│ QQuery 11    │   73.46ms │                  71.29ms │     no change │
│ QQuery 12    │   47.43ms │                  44.81ms │ +1.06x faster │
│ QQuery 13    │   80.77ms │                  74.77ms │ +1.08x faster │
│ QQuery 14    │   18.05ms │                  18.70ms │     no change │
│ QQuery 15    │   32.48ms │                  29.60ms │ +1.10x faster │
│ QQuery 16    │   42.73ms │                  37.71ms │ +1.13x faster │
│ QQuery 17    │  160.06ms │                 159.77ms │     no change │
│ QQuery 18    │  463.05ms │                 428.43ms │ +1.08x faster │
│ QQuery 19    │   48.21ms │                  46.87ms │     no change │
│ QQuery 20    │  102.42ms │                  80.20ms │ +1.28x faster │
│ QQuery 21    │  295.09ms │                 266.05ms │ +1.11x faster │
│ QQuery 22    │   23.35ms │                  21.86ms │ +1.07x faster │
└──────────────┴───────────┴──────────────────────────┴───────────────┘

I am going to rerun the numbers to make sure they are reproducable and then give this PR a closer look

alamb · 2024-07-25T12:39:13Z

I am going to rerun the numbers to make sure they are reproducable and then give this PR a closer look

The subsequent runs look good (I don't think there is any slowdown in TPCH Q17, but there is still a slowdown in ClickBench Q32)

Details


--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ skip-partial-aggregation ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.83ms │                   0.88ms │  1.06x slower │
│ QQuery 1     │    96.66ms │                  97.23ms │     no change │
│ QQuery 2     │   193.35ms │                 197.50ms │     no change │
│ QQuery 3     │   207.49ms │                 207.81ms │     no change │
│ QQuery 4     │  2230.56ms │                2259.35ms │     no change │
│ QQuery 5     │  2045.20ms │                2127.06ms │     no change │
│ QQuery 6     │    85.28ms │                  88.51ms │     no change │
│ QQuery 7     │   102.05ms │                  99.53ms │     no change │
│ QQuery 8     │  3233.59ms │                3276.54ms │     no change │
│ QQuery 9     │  2411.04ms │                2454.00ms │     no change │
│ QQuery 10    │   857.96ms │                 869.92ms │     no change │
│ QQuery 11    │   941.82ms │                 938.58ms │     no change │
│ QQuery 12    │  2162.34ms │                2202.18ms │     no change │
│ QQuery 13    │  4619.49ms │                3945.16ms │ +1.17x faster │
│ QQuery 14    │  2925.89ms │                2965.62ms │     no change │
│ QQuery 15    │  2504.40ms │                2503.25ms │     no change │
│ QQuery 16    │  6050.13ms │                6101.81ms │     no change │
│ QQuery 17    │  6006.98ms │                5982.81ms │     no change │
│ QQuery 18    │ 12183.46ms │               11770.51ms │     no change │
│ QQuery 19    │   176.35ms │                 178.67ms │     no change │
│ QQuery 20    │  2748.47ms │                2728.24ms │     no change │
│ QQuery 21    │  3540.80ms │                3529.89ms │     no change │
│ QQuery 22    │  9516.53ms │                9674.69ms │     no change │
│ QQuery 23    │ 22398.86ms │               22611.21ms │     no change │
│ QQuery 24    │  1363.24ms │                1404.88ms │     no change │
│ QQuery 25    │  1171.56ms │                1210.13ms │     no change │
│ QQuery 26    │  1505.58ms │                1535.15ms │     no change │
│ QQuery 27    │  4077.71ms │                4075.32ms │     no change │
│ QQuery 28    │ 28976.66ms │               30911.26ms │  1.07x slower │
│ QQuery 29    │  1022.97ms │                1047.79ms │     no change │
│ QQuery 30    │  2589.79ms │                2533.81ms │     no change │
│ QQuery 31    │  3310.10ms │                3238.71ms │     no change │
│ QQuery 32    │ 17074.56ms │               17987.52ms │  1.05x slower │
│ QQuery 33    │  9640.00ms │                9704.17ms │     no change │
│ QQuery 34    │  9720.05ms │                9635.63ms │     no change │
│ QQuery 35    │  3796.26ms │                3825.34ms │     no change │
│ QQuery 36    │   344.07ms │                 357.52ms │     no change │
│ QQuery 37    │   237.58ms │                 238.73ms │     no change │
│ QQuery 38    │   201.65ms │                 205.97ms │     no change │
│ QQuery 39    │  1150.94ms │                1203.07ms │     no change │
│ QQuery 40    │    94.45ms │                 100.77ms │  1.07x slower │
│ QQuery 41    │    87.01ms │                  84.64ms │     no change │
│ QQuery 42    │   104.06ms │                 104.01ms │     no change │
└──────────────┴────────────┴──────────────────────────┴───────────────┘
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (main_base)                  │ 173707.77ms │
│ Total Time (skip-partial-aggregation)   │ 176215.37ms │
│ Average Time (main_base)                │   4039.72ms │
│ Average Time (skip-partial-aggregation) │   4098.03ms │
│ Queries Faster                          │           1 │
│ Queries Slower                          │           4 │
│ Queries with No Change                  │          38 │
└─────────────────────────────────────────┴─────────────┘
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ skip-partial-aggregation ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │ 3837.97ms │                3863.80ms │     no change │
│ QQuery 1     │ 1551.46ms │                1475.34ms │     no change │
│ QQuery 2     │ 3138.85ms │                2967.22ms │ +1.06x faster │
└──────────────┴───────────┴──────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main_base)                  │ 8528.27ms │
│ Total Time (skip-partial-aggregation)   │ 8306.35ms │
│ Average Time (main_base)                │ 2842.76ms │
│ Average Time (skip-partial-aggregation) │ 2768.78ms │
│ Queries Faster                          │         1 │
│ Queries Slower                          │         0 │
│ Queries with No Change                  │         2 │
└─────────────────────────────────────────┴───────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ skip-partial-aggregation ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  197.75ms │                 194.26ms │     no change │
│ QQuery 2     │   36.69ms │                  36.57ms │     no change │
│ QQuery 3     │   63.32ms │                  59.03ms │ +1.07x faster │
│ QQuery 4     │   71.51ms │                  65.53ms │ +1.09x faster │
│ QQuery 5     │   99.09ms │                  95.95ms │     no change │
│ QQuery 6     │   14.75ms │                  14.78ms │     no change │
│ QQuery 7     │  211.05ms │                 216.63ms │     no change │
│ QQuery 8     │   40.90ms │                  40.35ms │     no change │
│ QQuery 9     │  106.66ms │                 109.15ms │     no change │
│ QQuery 10    │  106.03ms │                 101.38ms │     no change │
│ QQuery 11    │   73.27ms │                  71.56ms │     no change │
│ QQuery 12    │   48.35ms │                  45.22ms │ +1.07x faster │
│ QQuery 13    │   81.21ms │                  80.28ms │     no change │
│ QQuery 14    │   19.21ms │                  18.23ms │ +1.05x faster │
│ QQuery 15    │   30.57ms │                  32.26ms │  1.06x slower │
│ QQuery 16    │   41.27ms │                  38.89ms │ +1.06x faster │
│ QQuery 17    │  153.12ms │                 159.38ms │     no change │
│ QQuery 18    │  451.90ms │                 447.09ms │     no change │
│ QQuery 19    │   47.87ms │                  46.85ms │     no change │
│ QQuery 20    │  107.44ms │                  87.75ms │ +1.22x faster │
│ QQuery 21    │  293.86ms │                 283.98ms │     no change │
│ QQuery 22    │   22.71ms │                  21.86ms │     no change │
└──────────────┴───────────┴──────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main_base)                  │ 2318.52ms │
│ Total Time (skip-partial-aggregation)   │ 2266.97ms │
│ Average Time (main_base)                │  105.39ms │
│ Average Time (skip-partial-aggregation) │  103.04ms │
│ Queries Faster                          │         6 │
│ Queries Slower                          │         1 │
│ Queries with No Change                  │        15 │
└─────────────────────────────────────────┴───────────┘

alamb

This is really cool @korowa . Thank you so much

Not only is it cool that it improves performance in many cases, it is cool that it has in incremental approach (can implemente convert_to_state for GroupsAccumulators over time)

I have two concerns:

That this approach may overfit the problem (aka that it isn't generalizeable outside the context of the benchmark runs)
That this approach might preclude making some larger changes (like simply turning off the intermediate generation)

and produce batched "as-is", replacing aggregate accumulators inputs with corresponding intermediate aggregate states (in order not to break record batch schema for downstream operators -- specifically, for CoalesceBatches)

I wonder if you have thought about some way to disable aggregation entirely in the partial aggregation phase (as in avoid having to convert it into the state)? The challenge as you have pointed out is that the state types may be different than the input, so it would likely be a larger/more involved change 🤔

I want to think about this PR some more, but I think it is really nice and I am inclined to say we should proceed with this approach

I think to merge it I would like to see:

Some more background comments on why this approach (the existing code in this PR is already very well commented about what it does 🥇 ) -- I plan to help with this
Look into why the clickbench queries got slower (I am worried there is some tuning now required which will be hard to get totally optimal)

alamb · 2024-07-25T12:17:42Z

datafusion/physical-plan/src/aggregates/row_hash.rs

+    }
+
+    // Transforms input batch to intermediate aggregate state, without grouping it
+    fn transform_to_states(&self, batch: RecordBatch) -> Result<RecordBatch> {


This is quite clever

datafusion/functions-aggregate/src/count.rs

alamb · 2024-07-25T12:55:43Z

datafusion/functions-aggregate/src/count.rs

+                });
+                builder.finish()
+            }
+            (None, None) => Int64Array::from_value(1, values.len()),


it is unfortunate that we need to create this over and over again 🤔

korowa · 2024-07-25T17:23:42Z

@alamb thank you for sharing benchmark results -- I'll check out if any of them benefited from this feature (I suppose it shouldn't be triggered in many of them) and will look for the possible reasons of q32 (and other queries) slowdown (actually this one -- q32, should benefit most, when producing state will be implemented for avg) + additionally check if generating state can be done faster via kernels instead of loops.

Regarding you comments:

That this approach may overfit the problem (aka that it isn't generalizeable outside the context of the benchmark runs)

Probably, but I supposed this idea to be opposite to overfitting, since it relies more on the input data, rather then fixed settings (I may be wrong here however).

That this approach might preclude making some larger changes (like simply turning off the intermediate generation)
...
I wonder if you have thought about some way to disable aggregation entirely in the partial aggregation phase

Initially I've been considering to make partial aggregation just propagate input batches as is, and adding some internal flag into their schema metadata (pointing that final aggregation to use update_batch instead of merge_batch), but decided that it may be too "pipeline breaking" due to different batch schemas (as you've pointed out) -- I suppose it'll require an additional logic in CoalesceBatchesExec (it wont be able to concat batches with different schemas, coming from on/off partial aggregation partitions), it also could be a blocker for any (unplanned yet) optimizations of RepartitionExec (in case there will be some buffers with embedded batch concatenation), and it also may produce some burden for DF-based projects doing data shuffling across the nodes and having their own shuffle operators. Overall -- I've decided that current approach is more safe (at least at this moment), as it doesn't affect anything besides aggregation operator.

Dandandan · 2024-07-25T18:17:11Z

datafusion/functions-aggregate/src/count.rs

+                let mut builder = Int64Builder::with_capacity(values.len());
+                nulls
+                    .into_iter()
+                    .for_each(|is_valid| builder.append_value(is_valid as i64));


I believe .collect() should be slightly faster and less verbose than a builder here.

even better, we should be able to cast the null array to int64

FWIW: into_iter().map().collect::<Int64Array>() seems to be slower than appending values to the builder 🤔

alamb · 2024-07-27T10:57:58Z

@alamb thank you for sharing benchmark results -- I'll check out if any of them benefited from this feature (I suppose it shouldn't be triggered in many of them) and will look for the possible reasons of q32 (and other queries) slowdown (actually this one -- q32, should benefit most, when producing state will be implemented for avg) + additionally check if generating state can be done faster via kernels instead of loops.

Awesome -- I am planning to look into them as well

That this approach may overfit the problem (aka that it isn't generalizeable outside the context of the benchmark runs)

Probably, but I supposed this idea to be opposite to overfitting, since it relies more on the input data, rather then fixed settings (I may be wrong here however).

The more I think about it, the more I agree with you. While there are tuning knobs (e.g. the fraction of tuples aggregates) I do think they are general.

That this approach might preclude making some larger changes (like simply turning off the intermediate generation)
...
I wonder if you have thought about some way to disable aggregation entirely in the partial aggregation phase

Initially I've been considering to make partial aggregation just propagate input batches as is, and adding some internal flag into their schema metadata (pointing that final aggregation to use update_batch instead of merge_batch), but decided that it may be too "pipeline breaking" due to different batch schemas (as you've pointed out) -- I suppose it'll require an additional logic in CoalesceBatchesExec (it wont be able to concat batches with different schemas, coming from on/off partial aggregation partitions), it also could be a blocker for any (unplanned yet) optimizations of RepartitionExec (in case there will be some buffers with embedded batch concatenation), and it also may produce some burden for DF-based projects doing data shuffling across the nodes and having their own shuffle operators. Overall -- I've decided that current approach is more safe (at least at this moment), as it doesn't affect anything besides aggregation operator.

I think this makes sense and I agree with your conclusion

alamb · 2024-07-27T12:41:39Z

My plan here is to spend time tomorrow morning doing some additional investigation / testing on the branch and unless I find any blockers I think we should proceed with it.

What I am thinking is that between this PR and the StringView PR #11667 we are going to be in pretty sweet shape.

The improvements with this change are so compelling in my opinion that I think we can document any potential performance regressions that this PR causes, and then work on them as a follow on before the release.

korowa · 2024-07-28T09:45:49Z

FWIW: regarding benchmarks -- running with target_partitions=4 shows that this feature is enabling while clickbench Q13 (count distinct is rewritten to double group by) and tpch Q20 (one of the filters contains correlated subquery with aggregation). Also partial aggregation is skipped on 1/4 partitions in clickbench Q18 and tpch Q16. As a result -- I'd expect any performance improvements only in clickbench Q13 and tcph Q20 (don't think 1/4 partitions in other two queries is able to make any effect), and I suppose that improvements shown by any other queries to be just a matter of luck and fluctuations -- I wasn't able to find any stable regressions during local benchmark runs.

Regarding Q32 -- I've run it separately and got equal runtimes for both branches (due to AVG it's not able to skip partial aggregation yet)

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃     master ┃ skip-partial-aggregation ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │ 19399.38ms │               19424.11ms │ no change │
└──────────────┴────────────┴──────────────────────────┴───────────┘

alamb · 2024-07-28T12:21:03Z

I spent some time this morning playing around with ClickBench query 32 locally and I agree any slowdown does not look significant or a blocker.

Q32

SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits GROUP BY "WatchID", "ClientIP" ORDER BY c DESC LIMIT 10;

Running from datafusion-cli:

./datafusion-cli-skip-partial -c "SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\"), AVG(\"ResolutionWidth\") FROM 'hits.parquet' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"

datafusion-cli -c "SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\"), AVG(\"ResolutionWidth\") FROM 'hits.parquet' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"

Here are the timings I got:

skip-partial
4.570
4.280
4.528

main
4.564
4.444
4.441

alamb · 2024-07-28T13:41:45Z

I also tried out Q32 (that has AVG so can't use this optimization yet) but removed the AVG and set target partitions to something silly. I see this PR making a substantial difference (6s vs 7s)

1000 partitions, this PR

andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ ./datafusion-cli-skip-partial -c "set datafusion.execution.target_partitions = 1000; SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\") FROM 'hits.parquet' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"

Elapsed 0.001 seconds.

+---------------------+-------------+---+-----------------------------+
| WatchID             | ClientIP    | c | sum(hits.parquet.IsRefresh) |
+---------------------+-------------+---+-----------------------------+
| 7904046282518428963 | 1509330109  | 2 | 0                           |
| 8566928176839891583 | -1402644643 | 2 | 0                           |
| 6655575552203051303 | 1611957945  | 2 | 0                           |
| 7224410078130478461 | -776509581  | 2 | 0                           |
| 9102894172721185728 | 1489622498  | 1 | 1                           |
| 8964981845434484863 | 1822336830  | 1 | 0                           |
| 6991883311913569583 | -745122562  | 1 | 0                           |
| 6787783378461221127 | -506600142  | 1 | 0                           |
| 6042898921955304644 | 2054220936  | 1 | 0                           |
| 5581365862985039198 | 104944290   | 1 | 0                           |
+---------------------+-------------+---+-----------------------------+
10 row(s) fetched.
Elapsed 6.378 seconds.

1000 partitions, main

andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ datafusion-cli -c "set datafusion.execution.target_partitions = 1000; SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\") FROM 'hits.parquet' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"
DataFusion CLI v40.0.0
0 row(s) fetched.
Elapsed 0.002 seconds.

+---------------------+-------------+---+-----------------------------+
| WatchID             | ClientIP    | c | sum(hits.parquet.IsRefresh) |
+---------------------+-------------+---+-----------------------------+
| 7904046282518428963 | 1509330109  | 2 | 0                           |
| 8566928176839891583 | -1402644643 | 2 | 0                           |
| 6655575552203051303 | 1611957945  | 2 | 0                           |
| 7224410078130478461 | -776509581  | 2 | 0                           |
| 6780795588237729988 | 1894276368  | 1 | 1                           |
| 6158430646513894356 | -1557291761 | 1 | 0                           |
| 8433113762047612962 | 1214823432  | 1 | 0                           |
| 8783130976633619349 | 1072197582  | 1 | 0                           |
| 4959259883895284379 | 2023656393  | 1 | 0                           |
| 6328586531975293675 | 1549952556  | 1 | 1                           |
+---------------------+-------------+---+-----------------------------+
10 row(s) fetched.
Elapsed 7.771 seconds.

alamb

I spent a bunch more time reviewing this PR today and I think it is good and could be merged as is. Thank you so much @korowa and @Dandandan )

Before merging this PR I think we need

Run the benchmarks one more time
Give it a few more days to gather any more review comments

Here are the follow up items I suggest (and I can file tickets):

More documentation (I started here Improve aggregation documentation for multi-phase aggregation #11695)
Add a metric to record when group by switches to skip partial aggregate mode, so we can see when this happens in EXPLAIN ANALYZE plans
File tickets to support convert_to_state for other GroupsAccumulators (like AVG for example) -- I think this could be done by the larger community easier after the additional documentaiton (and they can follow the test pattern you have in this PR)

FYI @kazuyukitanimura -- I wonder if you have time to review this change in the context of hash aggregate spilling as you originally contributed #7400

Context:

Describes this issue: Improve Memory usage + performance with large numbers of groups / High Cardinality Aggregates #6937
Additional background documentation: Improve aggregation documentation for multi-phase aggregation #11695

alamb · 2024-07-28T13:50:38Z

datafusion/physical-plan/src/aggregates/row_hash.rs

@@ -90,6 +94,69 @@ struct SpillState {
    merging_group_by: PhysicalGroupBy,
 }

+struct SkipAggregationProbe {


FYI @kazuyukitanimura -- I wonder if you have time to review this change to hash aggregate spilling as you originally contributed #7400

Context:

Describes this issue: Improve Memory usage + performance with large numbers of groups / High Cardinality Aggregates #6937

Additional background documentation: Improve aggregation documentation for multi-phase aggregation #11695

alamb · 2024-07-28T13:55:09Z

datafusion/physical-expr-common/src/aggregate/groups_accumulator/prim_op.rs

+
+        match opt_filter {
+            Some(filter) => {
+                values


I think @Dandandan is suggesting using https://docs.rs/arrow/latest/arrow/compute/kernels/filter/fn.filter.html

However, that would likely require a second copy of the values (apply/collect filter result and then apply prim_fn)

alamb · 2024-07-28T14:01:40Z

datafusion/physical-plan/src/aggregates/row_hash.rs

@@ -484,6 +612,12 @@ impl Stream for GroupedHashAggregateStream {
                        (
                            if self.input_done {
                                ExecutionState::Done
+                            } else if self


nit is that putting this into a function (like self.should_skip_aggregation()) would make this logic easier to follow

Filed #11821 with a proposal for this change

korowa · 2024-07-28T14:30:41Z

1000 partitions

@alamb this is also a bit unexpected, since default value of rows to fire check after is 100_000 and its applied per partition (each partition is going to process at least 100k rows normally, without skipping aggregation), and the total number of rows in the file ~100kk (if I'm not mistaken). So this optimization should not benefit in this case, as in case of 1000 partitions each partition will read ~100_000 rows anyway 🤔

ozankabak · 2024-07-29T07:19:58Z

We will also take a look today or tomorrow

Dandandan · 2024-07-29T08:28:01Z

datafusion/functions-aggregate/src/count.rs

+                let mut builder = Int64Builder::with_capacity(values.len());
+                nulls.into_iter().zip(filter.iter()).for_each(
+                    |(is_valid, filter_value)| {
+                        builder.append_value(


bitwise_and + cast?

Maybe we can use the nullif kernel here

Something like

let nulls = and(nulls, not(filter)); let output = nullif(values);

Update: or maybe we could just and the nulls from the input and the filter (as nulls is the validity mask` 🤔

I came up wtih this in #11734:

/// Converts a `BooleanBuffer` representing a filter to a `NullBuffer` /// where the NullBuffer is true for all values that were true /// in the filter and `null` for any values that were false or null fn filter_to_nulls(filter: &BooleanArray) -> Option<NullBuffer> { let (filter_bools, filter_nulls) = filter.clone().into_parts(); // Only keep values where the filter was true // convert all false to null let filter_bools = NullBuffer::from(filter_bools); NullBuffer::union(Some(&filter_bools), filter_nulls.as_ref()) } /// Compute the final null mask for an array /// /// The output null mask : /// * is true (non null) for all values that were true in the filter and non null in the input /// * is false (null) for all values that were false in the filter or null in the input fn filtered_null_mask( opt_filter: Option<&BooleanArray>, input: &dyn Array, ) -> Option<NullBuffer> { let opt_filter = opt_filter.and_then(filter_to_nulls); NullBuffer::union(opt_filter.as_ref(), input.nulls()) }

And then you compute the final null mask without messing with the input:

let nulls = filtered_null_mask(opt_filter, sums); let sums = PrimitiveArray::<T>::new(sums.values().clone(), nulls) .with_data_type(self.sum_data_type.clone());

Nice, using NullBuffer::union is much better for the readability

That was the missing link for me (thank you!) -- we can operate directly on underlying buffers.

I've rewritten state conversion for count on bitand on buffers + cast to Int64 in the end, and according to benchmarks from the commit it got 20-25% faster.

Just a suggestion -- won't it be better to use BooleanBuffer + & (bitand operator) instead of NullBuffer + union? NullBuffer is a bit confusing, so I've "pulled" the logic from union right into state conversion function.

Additionally, I plan to prepare benches and minimize ArrayBuilder usage for min / max / sum during tomorrow.

I've rewritten state conversion for count on bitand on buffers + cast to Int64 in the end, and according to benchmarks from the commit it got 20-25% faster.

🎉

Just a suggestion -- won't it be better to use BooleanBuffer + & (bitand operator) instead of NullBuffer + union? NullBuffer is a bit confusing, so I've "pulled" the logic from union right into state conversion function.

I think they are equivalent: NullBuffer just wraps BooleanBuffer and NullBuffer::union just calls & underneath: https://docs.rs/arrow-buffer/52.2.0/src/arrow_buffer/buffer/null.rs.html#76 (after replicating the match(nulls, filter) logic)

I don't have a strong opinion about which is more/less confusing

What I suggest we do is pull the logic to compute the output null mask basd on the optional input nullmask and the optional filter into a function (like fn filtered_null_mask) as it will be used in basically all of the convert_to_state implementations. As long as it is well documented, I think either implementation will work well

Additionally, I plan to prepare benches and minimize ArrayBuilder usage for min / max / sum during tomorrow.

Sounds good -- would you like to keep updating this PR or shall we merge this PR and continue improvements with additional PRs on main?

I'd like to make these few changes in this PR (along with merging docs update and review suggestions) -- don't think it'll take long enough to accumulate any significant conflicts.

Sounds good. We will wait for you to let us know when it is ready to merge

Dandandan · 2024-07-29T08:28:24Z

datafusion/functions-aggregate/src/count.rs

+                filter.into_iter().for_each(|filter_value| {
+                    builder.append_value(filter_value.is_some_and(|val| val) as i64)
+                });
+                builder.finish()


Dandandan · 2024-07-29T08:31:10Z

I added a couple of suggestions for performance

alamb · 2024-07-29T12:53:15Z

So this optimization should not benefit in this case, as in case of 1000 partitions each partition will read ~100_000 rows anyway 🤔

@korowa If we added a metric that tracks when this mode switched in, I think it would be easier to diagnose what is going on. I will make a PR to do so.

alamb · 2024-07-29T12:54:21Z

We will also take a look today or tomorrow

@ozankabak if I may toot my own horn a bit, I would personally suggest checking out the docs I wrote korowa#172 (and #11695) before the code of this PR as I tried to explain more at a high level what it is doing.

korowa · 2024-08-03T17:00:21Z

So if you are ok with that @korowa let's get this green and merge.

@alamb I'm totally fine with that -- taking into account, that there are already some followups/improvements for this feature, it's not worth blocking them (since making state conversion for COUNT faster, will probably take some time for me).

Please let me know if there are any changes/fixes that have to be done in order to make this PR ready for merging.

alamb

I took another look at this PR and I think it is looking very nice. Thank you again @korowa and all reviewers

I will plan to merge it tomorrow (Monday) and file follow on tickets to track additional work.

alamb · 2024-08-05T10:55:15Z

🚀

alamb · 2024-08-05T11:19:18Z

Thank you again everyone for all your work.

I am hoping this is the first step towards some significantly improved TPCH / ClickBench performance

I filed the following follow on tickets / PRs:

datafusion/physical-plan/src/aggregates/row_hash.rs

github-actions bot added documentation Improvements or additions to documentation logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) labels Jul 23, 2024

Dandandan reviewed Jul 24, 2024

View reviewed changes

alamb mentioned this pull request Jul 24, 2024

Blog post with DataFusion July - Sep 2024 #11631

Open

alamb reviewed Jul 25, 2024

View reviewed changes

Dandandan reviewed Jul 25, 2024

View reviewed changes

korowa force-pushed the skip-partial-aggregation branch from 5a33f11 to aeb6c84 Compare July 28, 2024 12:14

korowa force-pushed the skip-partial-aggregation branch from aeb6c84 to 9b7e4c8 Compare July 28, 2024 12:29

alamb mentioned this pull request Jul 28, 2024

Improve aggregation documentation for multi-phase aggregation #11695

Closed

alamb approved these changes Jul 28, 2024

View reviewed changes

Dandandan reviewed Jul 29, 2024

View reviewed changes

This was referenced Jul 29, 2024

Improve aggregatation documentation korowa/arrow-datafusion#172

Closed

Add skipped_aggregation_rows metric to aggregate operator #11706

Merged

benchmarks for convert_to_state

f7db8e3

korowa force-pushed the skip-partial-aggregation branch from a99c081 to bd0a2bd Compare August 3, 2024 16:33

korowa and others added 2 commits August 3, 2024 19:49

speeding up conversion to state

8549c60

Fix MSRV error on 1.76.0

8913fcf

korowa force-pushed the skip-partial-aggregation branch from bd0a2bd to 5b183d6 Compare August 3, 2024 16:49

Improve aggregatation documentation

b3c033f

korowa force-pushed the skip-partial-aggregation branch from 5b183d6 to b3c033f Compare August 3, 2024 16:51

alamb approved these changes Aug 4, 2024

View reviewed changes

alamb merged commit c340b6a into apache:main Aug 5, 2024
27 checks passed

2010YOUY01 mentioned this pull request Aug 6, 2024

Support partial aggregation skip for boolean functions #11847

Merged

2 tasks

This was referenced Aug 6, 2024

chore: bump DataFusion to rev f4e519f apache/datafusion-comet#783

Merged

Possible data corruption in "Skipping partial aggregation" change #11850

Closed

andygrove added the api change Changes the API exposed to users of the crate label Aug 7, 2024

alamb mentioned this pull request Aug 22, 2024

Aggregation fuzz testing #12114

Open

1 task

lewiszlw reviewed Sep 10, 2024

View reviewed changes

datafusion/physical-plan/src/aggregates/row_hash.rs Show resolved Hide resolved

alamb mentioned this pull request Sep 25, 2024

[Epic] High cardinality aggregation performance wishlist #11679

Open

4 tasks

alamb mentioned this pull request Oct 8, 2024

[DISCUSSION] Make DataFusion the fastest engine for querying parquet data in ClickBench #12821

Closed

alamb mentioned this pull request Nov 25, 2024

[EPIC] Improved performance in H20.ai benchmarks #13548

Open

4 tasks

Skipping partial aggregation when it is not helping for high cardinality aggregates #11627

Skipping partial aggregation when it is not helping for high cardinality aggregates #11627

Conversation

korowa commented Jul 23, 2024 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb commented Jul 23, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

korowa Jul 28, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb commented Jul 25, 2024

alamb commented Jul 25, 2024

alamb commented Jul 25, 2024 • edited Loading

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

korowa commented Jul 25, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb commented Jul 27, 2024

alamb commented Jul 27, 2024

korowa commented Jul 28, 2024

alamb commented Jul 28, 2024

alamb commented Jul 28, 2024

1000 partitions, this PR

1000 partitions, main

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

korowa commented Jul 28, 2024

ozankabak commented Jul 29, 2024

Choose a reason for hiding this comment

alamb Jul 30, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb Aug 1, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Dandandan commented Jul 29, 2024

alamb commented Jul 29, 2024

alamb commented Jul 29, 2024 • edited Loading

korowa commented Aug 3, 2024

alamb left a comment

Choose a reason for hiding this comment

alamb commented Aug 5, 2024

alamb commented Aug 5, 2024

korowa commented Jul 23, 2024 •

edited

Loading

korowa Jul 28, 2024 •

edited

Loading

alamb commented Jul 25, 2024 •

edited

Loading

alamb Jul 30, 2024 •

edited

Loading

alamb Aug 1, 2024 •

edited

Loading

alamb commented Jul 29, 2024 •

edited

Loading