perf: improve performance of `vectorized_equal_to` for `PrimitiveGroupValueBuilder` in multi group by aggregation #17977

rluvaton · 2025-10-08T18:34:02Z

Which issue does this PR close?

N/A

Rationale for this change

Make multi-column aggregation even faster.

What changes are included in this PR?

In PrimitiveGroupValueBuilder.vectorized_equal_to, always evaluate the comparison and use unchecked indexing, as both of these changes are what make the code compile to SIMD.
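A minimal sketch of the two loop shapes, to illustrate the idea rather than reproduce the PR's code. The function names, the fixed `i64` element type, and the up-front index validation are assumptions made for this example; the real builder is generic over the primitive type and compares with `is_eq`. Whether the second loop is actually vectorized depends on the compiler version, target features, and optimization level.

```rust
/// Branchy shape: skip the comparison when an earlier column already mismatched.
/// The data-dependent `continue` and the bounds checks make the loop body
/// harder for the optimizer to turn into straight-line SIMD code.
fn equal_to_branchy(
    group_values: &[i64],
    array_values: &[i64],
    lhs_rows: &[usize],
    rhs_rows: &[usize],
    equal_to_results: &mut [bool],
) {
    let rows = lhs_rows.iter().zip(rhs_rows).zip(equal_to_results.iter_mut());
    for ((&lhs, &rhs), result) in rows {
        if !*result {
            continue;
        }
        *result = group_values[lhs] == array_values[rhs];
    }
}

/// Always-evaluate shape: compare every row and AND the outcome into the
/// previous result, so the loop body has no data-dependent branch.
fn equal_to_always_evaluate(
    group_values: &[i64],
    array_values: &[i64],
    lhs_rows: &[usize],
    rhs_rows: &[usize],
    equal_to_results: &mut [bool],
) {
    // Assumption for this sketch: validate the indices once up front so the
    // unchecked reads below are sound without relying on the caller.
    assert!(lhs_rows.iter().all(|&i| i < group_values.len()));
    assert!(rhs_rows.iter().all(|&i| i < array_values.len()));

    let rows = lhs_rows.iter().zip(rhs_rows).zip(equal_to_results.iter_mut());
    for ((&lhs, &rhs), result) in rows {
        // SAFETY: every index was checked against the slice lengths above.
        let (l, r) = unsafe {
            (
                *group_values.get_unchecked(lhs),
                *array_values.get_unchecked(rhs),
            )
        };
        // A row that was already `false` stays `false`.
        *result &= l == r;
    }
}
```

Both functions produce the same `equal_to_results`; the second just does a cheap comparison for rows that were already known to differ instead of branching around it.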

Are these changes tested?

Existing tests

Are there any user-facing changes?

Nope

I tried a LOT of variations (GodBolt), from splitting into fixed-size chunks and trying to get auto-vectorization to use gather and create a bitmask, to even testing portable SIMD (just to see what it would generate).

This version only optimizes the non-null path for the moment, as it is the easiest.

once and if we change from &mut [bool] to mutable packed bits we could:

evaluate in chunks of 64 items (I tried different variations to see what is best; you can tweak the godbolt above with different types and sizes to check for yourself). 64 is not necessarily the best, but I think it will be the fastest for doing AND with the equal_to_results boolean buffer
add an optimization for nullable as well by just doing bitwise operations on 64 items at a time, avoiding the cost of getting each bit manually
skip 64 items right away if the equal_to_results is equal to 0x00 (i.e. all false)
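The three points above could look roughly like the sketch below. This is not code from the PR: the function names are hypothetical, results are modelled as a plain `&mut [u64]` with one bit per row, the element type is fixed to `u32`, and the inputs are compared position by position instead of being gathered through `lhs_rows` / `rhs_rows`. The null semantics used here (two nulls are equal, a null never equals a value) are an assumption for illustration.

```rust
/// Compare up to 64 value pairs and pack the outcomes into a `u64`
/// (bit `i` is set when `lhs[i] == rhs[i]`).
fn eq_mask_64(lhs: &[u32], rhs: &[u32]) -> u64 {
    debug_assert!(lhs.len() == rhs.len() && lhs.len() <= 64);
    let mut mask = 0u64;
    for (i, (l, r)) in lhs.iter().zip(rhs).enumerate() {
        mask |= ((l == r) as u64) << i;
    }
    mask
}

/// Fold null handling into a 64-row equality mask using only bitwise ops.
/// `lhs_valid` / `rhs_valid` have bit `i` set when row `i` is non-null.
fn combine_nulls(eq: u64, lhs_valid: u64, rhs_valid: u64) -> u64 {
    let both_null = !lhs_valid & !rhs_valid;
    let both_valid = lhs_valid & rhs_valid;
    both_null | (both_valid & eq)
}

/// AND the per-chunk equality mask into the packed results,
/// skipping any 64-row chunk whose result word is already all false.
fn and_equal_packed(lhs: &[u32], rhs: &[u32], results: &mut [u64]) {
    assert_eq!(lhs.len(), rhs.len());
    assert!(results.len() * 64 >= lhs.len());
    let chunks = lhs.chunks(64).zip(rhs.chunks(64)).zip(results.iter_mut());
    for ((l, r), word) in chunks {
        if *word == 0 {
            // Every row in this chunk already mismatched in an earlier column.
            continue;
        }
        *word &= eq_mask_64(l, r);
    }
}
```

For a nullable column, the inner step would become `*word &= combine_nulls(eq_mask_64(l, r), lhs_valid_word, rhs_valid_word)`, which replaces 64 individual null-bit lookups with a few word-wide operations.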

rluvaton · 2025-10-08T18:35:28Z

datafusion/physical-plan/src/aggregates/group_values/multi_group_by/primitive.rs

+        let iter = izip!(
+            lhs_rows.iter(),
+            rhs_rows.iter(),
+            equal_to_results.iter_mut(),
+        );
+
+        for (&lhs_row, &rhs_row, equal_to_result) in iter {
+            // Has found not equal to in previous column, don't need to check
+            if !*equal_to_result {
+                continue;
+            }
+
+            // Perf: skip null check (by short circuit) if input is not nullable
+            let exist_null = self.nulls.is_null(lhs_row);
+            let input_null = array.is_null(rhs_row);
+            if let Some(result) = nulls_equal_to(exist_null, input_null) {
+                *equal_to_result = result;
+                continue;
+            }
+
+            // Otherwise, we need to check their values
+            *equal_to_result = self.group_values[lhs_row].is_eq(array.value(rhs_row));
+        }
+    }


Moved the code from vectorized_equal_to and removed the if NULLABLE check, as we will always get here if nullable.

rluvaton · 2025-10-08T18:59:44Z

@alamb can you please run aggregate_vectorized benchmark with these changes?

datafusion/datafusion/physical-plan/benches/aggregate_vectorized.rs

Line 39 in 07a7eb2

fn bench_vectorized_append(c: &mut Criterion) {

Dandandan · 2025-10-09T07:44:24Z

datafusion/physical-plan/src/aggregates/group_values/multi_group_by/primitive.rs

+                    self.group_values[lhs_row]
+                } else {
+                    // SAFETY: indices are guaranteed to be in bounds
+                    unsafe { *self.group_values.get_unchecked(lhs_row) }


As lhs_row is not checked here to be in bounds, this method would need to be marked unsafe as well.

what do you mean?
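One way to read the soundness concern above: a safe function that calls `get_unchecked` with caller-supplied indices can cause undefined behavior from safe code, so either the function carries the obligation in its signature (`unsafe fn` with a documented contract) or it validates the indices itself. The sketch below shows both options on a hypothetical gather helper; the names and the `i64` type are assumptions for illustration, not the PR's API.

```rust
/// Option A: push the bounds obligation onto the caller.
///
/// # Safety
/// Every index in `rows` must be `< values.len()`.
unsafe fn gather_unchecked(values: &[i64], rows: &[usize], out: &mut Vec<i64>) {
    // SAFETY: in bounds per this function's documented contract.
    out.extend(rows.iter().map(|&row| unsafe { *values.get_unchecked(row) }));
}

/// Option B: stay safe by validating all indices once, outside the hot loop,
/// and only then taking the unchecked path.
fn gather_checked_once(values: &[i64], rows: &[usize], out: &mut Vec<i64>) {
    assert!(
        rows.iter().all(|&row| row < values.len()),
        "row index out of bounds"
    );
    // SAFETY: every index was validated by the assertion above.
    unsafe { gather_unchecked(values, rows, out) }
}
```

Option B keeps the per-element loop free of bounds checks while still panicking, rather than reading out of bounds, on a bad index.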

ctsk · 2025-10-14T06:43:51Z

I could run the benchmarks locally so that we can see the performance gains. Can you provide two git hashes at which to run the criterion benchmark and compare?

alamb · 2025-11-07T22:52:05Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize-primitive-multi-group-by-to-use-simd (0a6f6d3) to 969fc13 diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

alamb · 2025-11-07T23:50:00Z

🤖: Benchmark completed

Details

Comparing HEAD and optimize-primitive-multi-group-by-to-use-simd
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ optimize-primitive-multi-group-by-to-use-simd ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  2846.59 ms │                                    2788.14 ms │     no change │
│ QQuery 1     │  1279.77 ms │                                    1307.16 ms │     no change │
│ QQuery 2     │  2564.09 ms │                                    2565.42 ms │     no change │
│ QQuery 3     │  1189.15 ms │                                    1102.74 ms │ +1.08x faster │
│ QQuery 4     │  2357.47 ms │                                    2331.30 ms │     no change │
│ QQuery 5     │ 28146.52 ms │                                   27825.38 ms │     no change │
│ QQuery 6     │  4194.17 ms │                                    4225.22 ms │     no change │
│ QQuery 7     │  3892.58 ms │                                    3763.97 ms │     no change │
└──────────────┴─────────────┴───────────────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                            ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                            │ 46470.35ms │
│ Total Time (optimize-primitive-multi-group-by-to-use-simd)   │ 45909.32ms │
│ Average Time (HEAD)                                          │  5808.79ms │
│ Average Time (optimize-primitive-multi-group-by-to-use-simd) │  5738.67ms │
│ Queries Faster                                               │          1 │
│ Queries Slower                                               │          0 │
│ Queries with No Change                                       │          7 │
│ Queries with Failure                                         │          0 │
└──────────────────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ optimize-primitive-multi-group-by-to-use-simd ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.23 ms │                                       2.69 ms │  1.20x slower │
│ QQuery 1     │    49.36 ms │                                      49.67 ms │     no change │
│ QQuery 2     │   135.52 ms │                                     138.02 ms │     no change │
│ QQuery 3     │   165.74 ms │                                     163.69 ms │     no change │
│ QQuery 4     │  1156.77 ms │                                    1119.71 ms │     no change │
│ QQuery 5     │  1569.66 ms │                                    1545.00 ms │     no change │
│ QQuery 6     │     2.21 ms │                                       2.24 ms │     no change │
│ QQuery 7     │    56.97 ms │                                      53.62 ms │ +1.06x faster │
│ QQuery 8     │  1535.53 ms │                                    1443.48 ms │ +1.06x faster │
│ QQuery 9     │  1913.64 ms │                                    1820.77 ms │     no change │
│ QQuery 10    │   382.50 ms │                                     381.31 ms │     no change │
│ QQuery 11    │   432.43 ms │                                     423.29 ms │     no change │
│ QQuery 12    │  1459.06 ms │                                    1369.77 ms │ +1.07x faster │
│ QQuery 13    │  2198.64 ms │                                    2179.86 ms │     no change │
│ QQuery 14    │  1330.94 ms │                                    1278.89 ms │     no change │
│ QQuery 15    │  1300.73 ms │                                    1248.36 ms │     no change │
│ QQuery 16    │  2772.70 ms │                                    2743.92 ms │     no change │
│ QQuery 17    │  2753.94 ms │                                    2714.33 ms │     no change │
│ QQuery 18    │  5370.36 ms │                                    5007.91 ms │ +1.07x faster │
│ QQuery 19    │   126.02 ms │                                     127.77 ms │     no change │
│ QQuery 20    │  2055.82 ms │                                    2029.83 ms │     no change │
│ QQuery 21    │  2358.66 ms │                                    2332.38 ms │     no change │
│ QQuery 22    │  4013.44 ms │                                    3986.75 ms │     no change │
│ QQuery 23    │ 16492.87 ms │                                   12877.44 ms │ +1.28x faster │
│ QQuery 24    │   225.36 ms │                                     206.53 ms │ +1.09x faster │
│ QQuery 25    │   491.65 ms │                                     475.23 ms │     no change │
│ QQuery 26    │   221.32 ms │                                     217.25 ms │     no change │
│ QQuery 27    │  2934.68 ms │                                    2845.20 ms │     no change │
│ QQuery 28    │ 23576.15 ms │                                   23468.83 ms │     no change │
│ QQuery 29    │  1010.95 ms │                                    1002.77 ms │     no change │
│ QQuery 30    │  1402.75 ms │                                    1328.95 ms │ +1.06x faster │
│ QQuery 31    │  1421.24 ms │                                    1368.73 ms │     no change │
│ QQuery 32    │  5254.62 ms │                                    4971.25 ms │ +1.06x faster │
│ QQuery 33    │  6029.16 ms │                                    5822.23 ms │     no change │
│ QQuery 34    │  6185.48 ms │                                    6002.99 ms │     no change │
│ QQuery 35    │  2160.68 ms │                                    1880.87 ms │ +1.15x faster │
│ QQuery 36    │   121.91 ms │                                     122.82 ms │     no change │
│ QQuery 37    │    51.34 ms │                                      51.40 ms │     no change │
│ QQuery 38    │   122.22 ms │                                     121.74 ms │     no change │
│ QQuery 39    │   197.67 ms │                                     201.67 ms │     no change │
│ QQuery 40    │    44.98 ms │                                      42.53 ms │ +1.06x faster │
│ QQuery 41    │    41.26 ms │                                      39.36 ms │     no change │
│ QQuery 42    │    32.20 ms │                                      32.82 ms │     no change │
└──────────────┴─────────────┴───────────────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                                            ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (HEAD)                                            │ 101161.37ms │
│ Total Time (optimize-primitive-multi-group-by-to-use-simd)   │  95243.85ms │
│ Average Time (HEAD)                                          │   2352.59ms │
│ Average Time (optimize-primitive-multi-group-by-to-use-simd) │   2214.97ms │
│ Queries Faster                                               │          10 │
│ Queries Slower                                               │           1 │
│ Queries with No Change                                       │          32 │
│ Queries with Failure                                         │           0 │
└──────────────────────────────────────────────────────────────┴─────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ optimize-primitive-multi-group-by-to-use-simd ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 169.89 ms │                                     132.86 ms │ +1.28x faster │
│ QQuery 2     │  32.08 ms │                                      28.56 ms │ +1.12x faster │
│ QQuery 3     │  42.45 ms │                                      38.83 ms │ +1.09x faster │
│ QQuery 4     │  29.70 ms │                                      28.38 ms │     no change │
│ QQuery 5     │  88.06 ms │                                      87.88 ms │     no change │
│ QQuery 6     │  19.61 ms │                                      19.69 ms │     no change │
│ QQuery 7     │ 238.46 ms │                                     233.19 ms │     no change │
│ QQuery 8     │  33.11 ms │                                      34.86 ms │  1.05x slower │
│ QQuery 9     │ 105.27 ms │                                     104.52 ms │     no change │
│ QQuery 10    │  63.72 ms │                                      62.52 ms │     no change │
│ QQuery 11    │  17.69 ms │                                      18.82 ms │  1.06x slower │
│ QQuery 12    │  53.57 ms │                                      51.21 ms │     no change │
│ QQuery 13    │  46.92 ms │                                      47.43 ms │     no change │
│ QQuery 14    │  14.24 ms │                                      14.07 ms │     no change │
│ QQuery 15    │  25.61 ms │                                      24.89 ms │     no change │
│ QQuery 16    │  25.51 ms │                                      25.09 ms │     no change │
│ QQuery 17    │ 153.98 ms │                                     152.68 ms │     no change │
│ QQuery 18    │ 277.15 ms │                                     274.47 ms │     no change │
│ QQuery 19    │  37.60 ms │                                      36.98 ms │     no change │
│ QQuery 20    │  49.29 ms │                                      50.02 ms │     no change │
│ QQuery 21    │ 329.70 ms │                                     324.28 ms │     no change │
│ QQuery 22    │  21.06 ms │                                      21.81 ms │     no change │
└──────────────┴───────────┴───────────────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                                            ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                                            │ 1874.67ms │
│ Total Time (optimize-primitive-multi-group-by-to-use-simd)   │ 1813.04ms │
│ Average Time (HEAD)                                          │   85.21ms │
│ Average Time (optimize-primitive-multi-group-by-to-use-simd) │   82.41ms │
│ Queries Faster                                               │         3 │
│ Queries Slower                                               │         2 │
│ Queries with No Change                                       │        17 │
│ Queries with Failure                                         │         0 │
└──────────────────────────────────────────────────────────────┴───────────┘

alamb · 2025-11-11T21:24:33Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize-primitive-multi-group-by-to-use-simd (0a6f6d3) to 969fc13 diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

alamb · 2025-11-11T22:22:44Z

🤖: Benchmark completed

Details

Comparing HEAD and optimize-primitive-multi-group-by-to-use-simd
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ optimize-primitive-multi-group-by-to-use-simd ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  2816.14 ms │                                    2710.67 ms │     no change │
│ QQuery 1     │  1285.37 ms │                                    1316.13 ms │     no change │
│ QQuery 2     │  2513.05 ms │                                    2517.89 ms │     no change │
│ QQuery 3     │  1170.25 ms │                                    1096.87 ms │ +1.07x faster │
│ QQuery 4     │  2324.76 ms │                                    2318.39 ms │     no change │
│ QQuery 5     │ 28876.14 ms │                                   28063.05 ms │     no change │
│ QQuery 6     │  4228.86 ms │                                    4202.73 ms │     no change │
│ QQuery 7     │  3703.70 ms │                                    3717.83 ms │     no change │
└──────────────┴─────────────┴───────────────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                            ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                            │ 46918.28ms │
│ Total Time (optimize-primitive-multi-group-by-to-use-simd)   │ 45943.57ms │
│ Average Time (HEAD)                                          │  5864.79ms │
│ Average Time (optimize-primitive-multi-group-by-to-use-simd) │  5742.95ms │
│ Queries Faster                                               │          1 │
│ Queries Slower                                               │          0 │
│ Queries with No Change                                       │          7 │
│ Queries with Failure                                         │          0 │
└──────────────────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ optimize-primitive-multi-group-by-to-use-simd ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.33 ms │                                       2.52 ms │  1.08x slower │
│ QQuery 1     │    49.63 ms │                                      48.39 ms │     no change │
│ QQuery 2     │   138.52 ms │                                     135.79 ms │     no change │
│ QQuery 3     │   168.63 ms │                                     164.06 ms │     no change │
│ QQuery 4     │  1108.77 ms │                                    1123.36 ms │     no change │
│ QQuery 5     │  1548.18 ms │                                    1520.34 ms │     no change │
│ QQuery 6     │     2.26 ms │                                       2.25 ms │     no change │
│ QQuery 7     │    54.48 ms │                                      55.77 ms │     no change │
│ QQuery 8     │  1508.88 ms │                                    1453.15 ms │     no change │
│ QQuery 9     │  1902.02 ms │                                    1835.81 ms │     no change │
│ QQuery 10    │   377.77 ms │                                     375.84 ms │     no change │
│ QQuery 11    │   429.78 ms │                                     422.50 ms │     no change │
│ QQuery 12    │  1393.06 ms │                                    1372.61 ms │     no change │
│ QQuery 13    │  2131.18 ms │                                    2133.79 ms │     no change │
│ QQuery 14    │  1273.84 ms │                                    1282.05 ms │     no change │
│ QQuery 15    │  1269.81 ms │                                    1279.07 ms │     no change │
│ QQuery 16    │  2721.52 ms │                                    2709.96 ms │     no change │
│ QQuery 17    │  2711.19 ms │                                    2713.23 ms │     no change │
│ QQuery 18    │  5212.97 ms │                                    4981.45 ms │     no change │
│ QQuery 19    │   128.99 ms │                                     125.04 ms │     no change │
│ QQuery 20    │  2007.47 ms │                                    2010.37 ms │     no change │
│ QQuery 21    │  2338.01 ms │                                    2337.84 ms │     no change │
│ QQuery 22    │  3953.59 ms │                                    4004.56 ms │     no change │
│ QQuery 23    │ 14868.86 ms │                                   12866.83 ms │ +1.16x faster │
│ QQuery 24    │   221.96 ms │                                     210.94 ms │     no change │
│ QQuery 25    │   476.66 ms │                                     481.59 ms │     no change │
│ QQuery 26    │   218.59 ms │                                     213.26 ms │     no change │
│ QQuery 27    │  2887.50 ms │                                    2806.91 ms │     no change │
│ QQuery 28    │ 23494.53 ms │                                   23284.37 ms │     no change │
│ QQuery 29    │  1012.55 ms │                                     978.53 ms │     no change │
│ QQuery 30    │  1359.82 ms │                                    1321.19 ms │     no change │
│ QQuery 31    │  1421.23 ms │                                    1402.98 ms │     no change │
│ QQuery 32    │  4954.29 ms │                                    4925.85 ms │     no change │
│ QQuery 33    │  6076.59 ms │                                    5951.85 ms │     no change │
│ QQuery 34    │  5977.99 ms │                                    5847.68 ms │     no change │
│ QQuery 35    │  2088.70 ms │                                    1866.69 ms │ +1.12x faster │
│ QQuery 36    │   121.32 ms │                                     122.92 ms │     no change │
│ QQuery 37    │    53.50 ms │                                      51.50 ms │     no change │
│ QQuery 38    │   122.51 ms │                                     121.11 ms │     no change │
│ QQuery 39    │   199.60 ms │                                     201.22 ms │     no change │
│ QQuery 40    │    43.88 ms │                                      40.28 ms │ +1.09x faster │
│ QQuery 41    │    40.04 ms │                                      38.48 ms │     no change │
│ QQuery 42    │    33.42 ms │                                      33.44 ms │     no change │
└──────────────┴─────────────┴───────────────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                            ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                            │ 98106.37ms │
│ Total Time (optimize-primitive-multi-group-by-to-use-simd)   │ 94857.40ms │
│ Average Time (HEAD)                                          │  2281.54ms │
│ Average Time (optimize-primitive-multi-group-by-to-use-simd) │  2205.99ms │
│ Queries Faster                                               │          3 │
│ Queries Slower                                               │          1 │
│ Queries with No Change                                       │         39 │
│ Queries with Failure                                         │          0 │
└──────────────────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ optimize-primitive-multi-group-by-to-use-simd ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 133.00 ms │                                     131.61 ms │     no change │
│ QQuery 2     │  29.48 ms │                                      29.36 ms │     no change │
│ QQuery 3     │  40.08 ms │                                      34.39 ms │ +1.17x faster │
│ QQuery 4     │  29.80 ms │                                      29.12 ms │     no change │
│ QQuery 5     │  87.19 ms │                                      87.66 ms │     no change │
│ QQuery 6     │  19.69 ms │                                      19.89 ms │     no change │
│ QQuery 7     │ 236.56 ms │                                     227.85 ms │     no change │
│ QQuery 8     │  34.52 ms │                                      35.40 ms │     no change │
│ QQuery 9     │ 110.61 ms │                                     109.57 ms │     no change │
│ QQuery 10    │  66.17 ms │                                      64.84 ms │     no change │
│ QQuery 11    │  18.37 ms │                                      17.86 ms │     no change │
│ QQuery 12    │  53.11 ms │                                      52.06 ms │     no change │
│ QQuery 13    │  47.37 ms │                                      47.61 ms │     no change │
│ QQuery 14    │  16.25 ms │                                      14.59 ms │ +1.11x faster │
│ QQuery 15    │  25.27 ms │                                      25.47 ms │     no change │
│ QQuery 16    │  25.29 ms │                                      25.89 ms │     no change │
│ QQuery 17    │ 150.98 ms │                                     153.85 ms │     no change │
│ QQuery 18    │ 275.68 ms │                                     284.36 ms │     no change │
│ QQuery 19    │  37.50 ms │                                      38.27 ms │     no change │
│ QQuery 20    │  49.30 ms │                                      50.80 ms │     no change │
│ QQuery 21    │ 348.45 ms │                                     333.59 ms │     no change │
│ QQuery 22    │  21.78 ms │                                      22.01 ms │     no change │
└──────────────┴───────────┴───────────────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                                            ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                                            │ 1856.45ms │
│ Total Time (optimize-primitive-multi-group-by-to-use-simd)   │ 1836.05ms │
│ Average Time (HEAD)                                          │   84.38ms │
│ Average Time (optimize-primitive-multi-group-by-to-use-simd) │   83.46ms │
│ Queries Faster                                               │         2 │
│ Queries Slower                                               │         0 │
│ Queries with No Change                                       │        20 │
│ Queries with Failure                                         │         0 │
└──────────────────────────────────────────────────────────────┴───────────┘

alamb · 2025-11-12T14:38:16Z

The results look promising and consistent -- thank you @rluvaton -- I plan to review this PR over the next few days

rluvaton · 2025-11-12T14:39:31Z

@alamb can you please run aggregate_vectorized benchmark with these changes?

datafusion/datafusion/physical-plan/benches/aggregate_vectorized.rs

Line 39 in 07a7eb2

fn bench_vectorized_append(c: &mut Criterion) {

@alamb this is the more relevant bench

alamb · 2025-11-13T17:32:49Z

🤖 ./gh_compare_branch_bench.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing optimize-primitive-multi-group-by-to-use-simd (0a6f6d3) to 969fc13 diff
BENCH_NAME=aggregate_vectorized
BENCH_COMMAND=cargo bench --bench aggregate_vectorized
BENCH_FILTER=
BENCH_BRANCH_NAME=optimize-primitive-multi-group-by-to-use-simd
Results will be posted here when complete

alamb

Thank you @rluvaton -- this PR looks good to me

alamb · 2025-11-13T17:44:21Z

datafusion/physical-plan/src/aggregates/group_values/multi_group_by/primitive.rs

+                    unsafe { *array_values.get_unchecked(rhs_row) }
+                };
+
+                // Always evaluate, to allow for auto-vectorization


this makes sense for primitive values -- namely that the cost of checking whether we should compare dominates the cost of just always comparing

alamb · 2025-11-13T17:52:25Z

once and if we change from &mut [bool] to mutable packed bits we could:

This is a good idea. I filed a ticket to track it

Potential Improved multiple column aggregation performance by using bitmasks rather than Vec<bool> #18676

alamb · 2025-11-13T19:02:08Z

🤖: Benchmark completed

Details

group                                                                                                             main                                    optimize-primitive-multi-group-by-to-use-simd
-----                                                                                                             ----                                    ---------------------------------------------
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.0_size_1000/append_val                                  1.00      9.1±0.02µs        ? ?/sec     1.08      9.9±0.06µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.0_size_1000/vectorized_append                           1.00      5.4±0.04µs        ? ?/sec     1.01      5.5±0.03µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.0_size_1000/vectorized_equal_to_0.25 true               1.03      2.6±0.00µs        ? ?/sec     1.00      2.5±0.00µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.0_size_1000/vectorized_equal_to_0.5 true                1.03      4.9±0.00µs        ? ?/sec     1.00      4.8±0.02µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.0_size_1000/vectorized_equal_to_0.75 true               1.03      7.4±0.01µs        ? ?/sec     1.00      7.2±0.02µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.0_size_1000/vectorized_equal_to_all_true                1.03      9.6±0.06µs        ? ?/sec     1.00      9.3±0.01µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.0_size_10000/append_val                                 1.00     89.2±0.30µs        ? ?/sec     1.08     96.1±1.23µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.0_size_10000/vectorized_append                          1.00     51.4±0.28µs        ? ?/sec     1.01     51.7±0.27µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.0_size_10000/vectorized_equal_to_0.25 true              1.01     33.7±0.09µs        ? ?/sec     1.00     33.4±0.11µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.0_size_10000/vectorized_equal_to_0.5 true               1.04     61.1±0.16µs        ? ?/sec     1.00     58.9±0.13µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.0_size_10000/vectorized_equal_to_0.75 true              1.02     73.7±0.11µs        ? ?/sec     1.00     71.9±1.30µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.0_size_10000/vectorized_equal_to_all_true               1.03     96.2±0.16µs        ? ?/sec     1.00     93.5±0.15µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.0_size_100000/append_val                                1.00    965.6±2.59µs        ? ?/sec     1.07   1028.6±3.86µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.0_size_100000/vectorized_append                         1.00    593.8±1.97µs        ? ?/sec     1.00    591.9±3.00µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.0_size_100000/vectorized_equal_to_0.25 true             1.01    395.6±3.10µs        ? ?/sec     1.00    392.6±2.30µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.0_size_100000/vectorized_equal_to_0.5 true              1.03    658.9±4.16µs        ? ?/sec     1.00    641.4±2.66µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.0_size_100000/vectorized_equal_to_0.75 true             1.03    752.5±3.72µs        ? ?/sec     1.00    730.1±2.68µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.0_size_100000/vectorized_equal_to_all_true              1.03    971.0±1.66µs        ? ?/sec     1.00    943.3±1.95µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.1_size_1000/append_val                                  1.00     12.6±0.04µs        ? ?/sec     1.03     13.0±0.10µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.1_size_1000/vectorized_append                           1.02     14.4±0.04µs        ? ?/sec     1.00     14.1±0.07µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.1_size_1000/vectorized_equal_to_0.25 true               1.03      2.8±0.00µs        ? ?/sec     1.00      2.7±0.00µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.1_size_1000/vectorized_equal_to_0.5 true                1.03      5.4±0.01µs        ? ?/sec     1.00      5.3±0.02µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.1_size_1000/vectorized_equal_to_0.75 true               1.03      8.2±0.01µs        ? ?/sec     1.00      8.0±0.02µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.1_size_1000/vectorized_equal_to_all_true                1.02     10.8±0.01µs        ? ?/sec     1.00     10.5±0.02µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.1_size_10000/append_val                                 1.00    124.1±0.45µs        ? ?/sec     1.01    125.2±0.33µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.1_size_10000/vectorized_append                          1.02    139.5±1.29µs        ? ?/sec     1.00    136.9±0.33µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.1_size_10000/vectorized_equal_to_0.25 true              1.00     37.4±0.18µs        ? ?/sec     1.01     37.7±0.22µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.1_size_10000/vectorized_equal_to_0.5 true               1.00     72.4±0.28µs        ? ?/sec     1.00     72.5±0.31µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.1_size_10000/vectorized_equal_to_0.75 true              1.01     85.3±0.35µs        ? ?/sec     1.00     84.0±0.24µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.1_size_10000/vectorized_equal_to_all_true               1.02    107.1±0.19µs        ? ?/sec     1.00    104.9±0.12µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.1_size_100000/append_val                                1.00   1328.9±3.32µs        ? ?/sec     1.00   1327.1±6.03µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.1_size_100000/vectorized_append                         1.02   1483.3±4.80µs        ? ?/sec     1.00   1452.7±4.44µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.1_size_100000/vectorized_equal_to_0.25 true             1.00    445.1±3.29µs        ? ?/sec     1.04    464.6±2.55µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.1_size_100000/vectorized_equal_to_0.5 true              1.01    800.7±9.22µs        ? ?/sec     1.00    794.5±3.15µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.1_size_100000/vectorized_equal_to_0.75 true             1.02    894.1±3.11µs        ? ?/sec     1.00    873.1±3.41µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.1_size_100000/vectorized_equal_to_all_true              1.03   1085.3±3.56µs        ? ?/sec     1.00   1057.7±2.58µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.5_size_1000/append_val                                  1.00     11.2±0.07µs        ? ?/sec     1.05     11.7±0.05µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.5_size_1000/vectorized_append                           1.00     13.2±0.06µs        ? ?/sec     1.03     13.5±0.09µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.5_size_1000/vectorized_equal_to_0.25 true               1.03   1940.8±8.05ns        ? ?/sec     1.00   1884.7±3.58ns        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.5_size_1000/vectorized_equal_to_0.5 true                1.03      3.7±0.01µs        ? ?/sec     1.00      3.6±0.01µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.5_size_1000/vectorized_equal_to_0.75 true               1.03      5.5±0.02µs        ? ?/sec     1.00      5.4±0.03µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.5_size_1000/vectorized_equal_to_all_true                1.00      7.4±0.03µs        ? ?/sec     1.25      9.3±0.04µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.5_size_10000/append_val                                 1.00    134.8±0.95µs        ? ?/sec     1.03    138.4±0.32µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.5_size_10000/vectorized_append                          1.00    153.8±0.38µs        ? ?/sec     1.01    155.2±0.92µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.5_size_10000/vectorized_equal_to_0.25 true              1.00     41.2±0.29µs        ? ?/sec     1.02     41.9±0.28µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.5_size_10000/vectorized_equal_to_0.5 true               1.00     82.6±0.43µs        ? ?/sec     1.00     82.5±0.41µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.5_size_10000/vectorized_equal_to_0.75 true              1.00     90.8±1.73µs        ? ?/sec     1.00     90.8±1.91µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.5_size_10000/vectorized_equal_to_all_true               1.00     98.3±0.39µs        ? ?/sec     1.02    100.2±0.21µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.5_size_100000/append_val                                1.00   1473.3±4.52µs        ? ?/sec     1.01   1481.6±4.90µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.5_size_100000/vectorized_append                         1.00   1649.7±4.46µs        ? ?/sec     1.01   1661.8±8.69µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.5_size_100000/vectorized_equal_to_0.25 true             1.00    504.4±4.57µs        ? ?/sec     1.03    518.1±2.81µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.5_size_100000/vectorized_equal_to_0.5 true              1.01    913.5±3.81µs        ? ?/sec     1.00   906.2±15.61µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.5_size_100000/vectorized_equal_to_0.75 true             1.00    984.8±3.55µs        ? ?/sec     1.00    981.3±2.43µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/inline_null_0.5_size_100000/vectorized_equal_to_all_true              1.01   1034.7±9.17µs        ? ?/sec     1.00   1025.2±4.76µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.0_size_1000/append_val                                  1.00     20.1±0.19µs        ? ?/sec     1.16     23.3±0.38µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.0_size_1000/vectorized_append                           1.00     16.4±0.23µs        ? ?/sec     1.20     19.7±0.60µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.0_size_1000/vectorized_equal_to_0.25 true               1.00      3.7±0.04µs        ? ?/sec     1.00      3.7±0.01µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.0_size_1000/vectorized_equal_to_0.5 true                1.00      6.4±0.03µs        ? ?/sec     1.01      6.4±0.06µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.0_size_1000/vectorized_equal_to_0.75 true               1.00      9.3±0.06µs        ? ?/sec     1.02      9.5±0.07µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.0_size_1000/vectorized_equal_to_all_true                1.01     13.1±0.05µs        ? ?/sec     1.00     12.9±0.07µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.0_size_10000/append_val                                 1.02    377.0±7.67µs        ? ?/sec     1.00    369.6±4.34µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.0_size_10000/vectorized_append                          1.00    339.3±2.25µs        ? ?/sec     1.03    350.4±6.46µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.0_size_10000/vectorized_equal_to_0.25 true              1.19    87.5±13.48µs        ? ?/sec     1.00     73.7±2.37µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.0_size_10000/vectorized_equal_to_0.5 true               1.13   144.5±11.03µs        ? ?/sec     1.00    127.6±8.54µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.0_size_10000/vectorized_equal_to_0.75 true              1.12    171.8±7.29µs        ? ?/sec     1.00    153.4±9.30µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.0_size_10000/vectorized_equal_to_all_true               1.06    184.5±7.28µs        ? ?/sec     1.00    173.7±6.81µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.0_size_100000/append_val                                1.06     14.8±0.30ms        ? ?/sec     1.00     13.9±0.22ms        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.0_size_100000/vectorized_append                         1.07     14.4±0.24ms        ? ?/sec     1.00     13.4±0.21ms        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.0_size_100000/vectorized_equal_to_0.25 true             1.08  1710.8±241.37µs        ? ?/sec    1.00  1581.8±105.31µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.0_size_100000/vectorized_equal_to_0.5 true              1.00      2.8±0.07ms        ? ?/sec     1.04      2.9±0.07ms        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.0_size_100000/vectorized_equal_to_0.75 true             1.01      3.2±0.06ms        ? ?/sec     1.00      3.2±0.06ms        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.0_size_100000/vectorized_equal_to_all_true              1.00      3.6±0.06ms        ? ?/sec     1.00      3.6±0.07ms        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.1_size_1000/append_val                                  1.00     20.3±0.18µs        ? ?/sec     1.06     21.6±0.20µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.1_size_1000/vectorized_append                           1.00     22.3±0.17µs        ? ?/sec     1.08     24.1±0.27µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.1_size_1000/vectorized_equal_to_0.25 true               1.00      3.7±0.02µs        ? ?/sec     1.00      3.7±0.01µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.1_size_1000/vectorized_equal_to_0.5 true                1.00      6.5±0.03µs        ? ?/sec     1.00      6.5±0.05µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.1_size_1000/vectorized_equal_to_0.75 true               1.00      9.5±0.04µs        ? ?/sec     1.02      9.7±0.15µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.1_size_1000/vectorized_equal_to_all_true                1.00     12.4±0.07µs        ? ?/sec     1.04     12.9±0.10µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.1_size_10000/append_val                                 1.00    318.8±2.36µs        ? ?/sec     1.23    393.2±7.31µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.1_size_10000/vectorized_append                          1.00    342.6±2.07µs        ? ?/sec     1.16    399.1±6.51µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.1_size_10000/vectorized_equal_to_0.25 true              1.00     76.8±8.97µs        ? ?/sec     1.07    81.8±10.63µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.1_size_10000/vectorized_equal_to_0.5 true               1.04    138.7±9.10µs        ? ?/sec     1.00    133.6±1.78µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.1_size_10000/vectorized_equal_to_0.75 true              1.05    174.3±6.37µs        ? ?/sec     1.00    166.8±7.42µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.1_size_10000/vectorized_equal_to_all_true               1.03    193.3±3.93µs        ? ?/sec     1.00    187.4±3.91µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.1_size_100000/append_val                                1.08     13.8±0.25ms        ? ?/sec     1.00     12.8±0.21ms        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.1_size_100000/vectorized_append                         1.01     13.1±0.23ms        ? ?/sec     1.00     13.1±0.22ms        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.1_size_100000/vectorized_equal_to_0.25 true             1.00  1443.4±123.40µs        ? ?/sec    1.00  1441.2±121.84µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.1_size_100000/vectorized_equal_to_0.5 true              1.05      2.8±0.06ms        ? ?/sec     1.00      2.7±0.07ms        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.1_size_100000/vectorized_equal_to_0.75 true             1.01      3.1±0.06ms        ? ?/sec     1.00      3.1±0.07ms        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.1_size_100000/vectorized_equal_to_all_true              1.01      3.5±0.07ms        ? ?/sec     1.00      3.5±0.06ms        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.5_size_1000/append_val                                  1.08     17.6±0.33µs        ? ?/sec     1.00     16.3±0.21µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.5_size_1000/vectorized_append                           1.05     20.6±0.34µs        ? ?/sec     1.00     19.6±0.24µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.5_size_1000/vectorized_equal_to_0.25 true               1.00      2.7±0.02µs        ? ?/sec     1.01      2.8±0.02µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.5_size_1000/vectorized_equal_to_0.5 true                1.00      4.7±0.04µs        ? ?/sec     1.02      4.8±0.05µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.5_size_1000/vectorized_equal_to_0.75 true               1.00      6.7±0.06µs        ? ?/sec     1.03      6.9±0.09µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.5_size_1000/vectorized_equal_to_all_true                1.00      8.5±0.08µs        ? ?/sec     1.04      8.8±0.13µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.5_size_10000/append_val                                 1.03    246.6±3.57µs        ? ?/sec     1.00    239.6±1.47µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.5_size_10000/vectorized_append                          1.03    265.1±1.89µs        ? ?/sec     1.00    257.5±1.59µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.5_size_10000/vectorized_equal_to_0.25 true              1.00     59.2±1.98µs        ? ?/sec     1.03     61.0±1.40µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.5_size_10000/vectorized_equal_to_0.5 true               1.00    114.6±3.07µs        ? ?/sec     1.05    120.0±3.09µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.5_size_10000/vectorized_equal_to_0.75 true              1.00    137.0±1.09µs        ? ?/sec     1.06    145.5±1.50µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.5_size_10000/vectorized_equal_to_all_true               1.00    153.1±2.40µs        ? ?/sec     1.05    161.5±1.78µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.5_size_100000/append_val                                1.09      8.4±0.15ms        ? ?/sec     1.00      7.7±0.12ms        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.5_size_100000/vectorized_append                         1.00      8.4±0.19ms        ? ?/sec     1.00      8.4±0.20ms        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.5_size_100000/vectorized_equal_to_0.25 true             1.00   881.0±55.10µs        ? ?/sec     1.10   965.2±49.58µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.5_size_100000/vectorized_equal_to_0.5 true              1.00  1612.5±55.07µs        ? ?/sec     1.08  1741.2±85.55µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.5_size_100000/vectorized_equal_to_0.75 true             1.00  1876.9±47.41µs        ? ?/sec     1.06  1992.9±71.02µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/random_null_0.5_size_100000/vectorized_equal_to_all_true              1.00      2.2±0.04ms        ? ?/sec     1.02      2.2±0.05ms        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.0_size_1000/append_val                                1.00     17.4±0.16µs        ? ?/sec     1.06     18.4±0.07µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.0_size_1000/vectorized_append                         1.00     13.1±0.02µs        ? ?/sec     1.05     13.7±0.08µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.0_size_1000/vectorized_equal_to_0.25 true             1.01      2.4±0.02µs        ? ?/sec     1.00      2.4±0.02µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.0_size_1000/vectorized_equal_to_0.5 true              1.00      4.3±0.04µs        ? ?/sec     1.00      4.2±0.02µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.0_size_1000/vectorized_equal_to_0.75 true             1.00      6.2±0.05µs        ? ?/sec     1.01      6.2±0.04µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.0_size_1000/vectorized_equal_to_all_true              1.04      8.8±0.14µs        ? ?/sec     1.00      8.5±0.05µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.0_size_10000/append_val                               1.00    178.3±0.79µs        ? ?/sec     1.16    207.0±0.68µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.0_size_10000/vectorized_append                        1.00    149.3±0.50µs        ? ?/sec     1.02    152.8±0.66µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.0_size_10000/vectorized_equal_to_0.25 true            1.00     35.4±0.24µs        ? ?/sec     1.11     39.4±0.22µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.0_size_10000/vectorized_equal_to_0.5 true             1.00     69.8±0.40µs        ? ?/sec     1.02     71.3±0.33µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.0_size_10000/vectorized_equal_to_0.75 true            1.01     81.2±0.33µs        ? ?/sec     1.00     80.3±0.34µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.0_size_10000/vectorized_equal_to_all_true             1.02     95.3±0.25µs        ? ?/sec     1.00     93.0±0.45µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.0_size_100000/append_val                              1.01      4.0±0.08ms        ? ?/sec     1.00      3.9±0.09ms        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.0_size_100000/vectorized_append                       1.03      3.7±0.06ms        ? ?/sec     1.00      3.6±0.06ms        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.0_size_100000/vectorized_equal_to_0.25 true           1.02    513.5±4.99µs        ? ?/sec     1.00    505.6±5.00µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.0_size_100000/vectorized_equal_to_0.5 true            1.03    836.2±2.25µs        ? ?/sec     1.00    815.0±3.85µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.0_size_100000/vectorized_equal_to_0.75 true           1.03    908.8±3.01µs        ? ?/sec     1.00    883.3±8.20µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.0_size_100000/vectorized_equal_to_all_true            1.03   1029.9±4.30µs        ? ?/sec     1.00   1001.8±3.71µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.1_size_1000/append_val                                1.00     18.8±0.05µs        ? ?/sec     1.00     18.8±0.05µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.1_size_1000/vectorized_append                         1.00     20.5±0.04µs        ? ?/sec     1.02     20.9±0.07µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.1_size_1000/vectorized_equal_to_0.25 true             1.01      2.8±0.01µs        ? ?/sec     1.00      2.7±0.01µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.1_size_1000/vectorized_equal_to_0.5 true              1.01      5.0±0.02µs        ? ?/sec     1.00      5.0±0.02µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.1_size_1000/vectorized_equal_to_0.75 true             1.00      7.2±0.02µs        ? ?/sec     1.01      7.3±0.03µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.1_size_1000/vectorized_equal_to_all_true              1.00      9.3±0.03µs        ? ?/sec     1.02      9.5±0.03µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.1_size_10000/append_val                               1.00    211.5±0.81µs        ? ?/sec     1.01    212.7±1.91µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.1_size_10000/vectorized_append                        1.00    225.2±0.68µs        ? ?/sec     1.02    228.6±0.74µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.1_size_10000/vectorized_equal_to_0.25 true            1.00     39.3±0.44µs        ? ?/sec     1.10     43.3±0.45µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.1_size_10000/vectorized_equal_to_0.5 true             1.00     78.2±0.71µs        ? ?/sec     1.07     83.3±0.42µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.1_size_10000/vectorized_equal_to_0.75 true            1.00     94.7±0.50µs        ? ?/sec     1.03     97.2±1.14µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.1_size_10000/vectorized_equal_to_all_true             1.00    107.5±0.56µs        ? ?/sec     1.02    109.8±0.23µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.1_size_100000/append_val                              1.02      2.3±0.01ms        ? ?/sec     1.00      2.2±0.01ms        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.1_size_100000/vectorized_append                       1.01      2.4±0.02ms        ? ?/sec     1.00      2.4±0.01ms        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.1_size_100000/vectorized_equal_to_0.25 true           1.00    561.4±4.86µs        ? ?/sec     1.03    578.6±2.59µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.1_size_100000/vectorized_equal_to_0.5 true            1.00    959.3±3.12µs        ? ?/sec     1.00    959.2±4.47µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.1_size_100000/vectorized_equal_to_0.75 true           1.00   1063.5±3.92µs        ? ?/sec     1.00   1065.8±3.44µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.1_size_100000/vectorized_equal_to_all_true            1.00   1181.4±3.74µs        ? ?/sec     1.00   1186.5±4.40µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.5_size_1000/append_val                                1.00     13.5±0.11µs        ? ?/sec     1.01     13.6±0.09µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.5_size_1000/vectorized_append                         1.04     17.0±0.15µs        ? ?/sec     1.00     16.3±0.08µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.5_size_1000/vectorized_equal_to_0.25 true             1.02   1959.9±5.89ns        ? ?/sec     1.00   1921.0±6.49ns        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.5_size_1000/vectorized_equal_to_0.5 true              1.01      3.5±0.01µs        ? ?/sec     1.00      3.5±0.01µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.5_size_1000/vectorized_equal_to_0.75 true             1.01      5.2±0.01µs        ? ?/sec     1.00      5.1±0.01µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.5_size_1000/vectorized_equal_to_all_true              1.00      6.6±0.03µs        ? ?/sec     1.07      7.1±0.03µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.5_size_10000/append_val                               1.00    186.6±4.68µs        ? ?/sec     1.00    187.0±0.75µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.5_size_10000/vectorized_append                        1.00    202.7±0.46µs        ? ?/sec     1.01    205.2±0.68µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.5_size_10000/vectorized_equal_to_0.25 true            1.00     43.1±0.38µs        ? ?/sec     1.08     46.5±0.45µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.5_size_10000/vectorized_equal_to_0.5 true             1.00     85.7±0.75µs        ? ?/sec     1.05     89.8±0.73µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.5_size_10000/vectorized_equal_to_0.75 true            1.00     97.2±0.54µs        ? ?/sec     1.06    103.1±1.38µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.5_size_10000/vectorized_equal_to_all_true             1.00    103.6±0.40µs        ? ?/sec     1.06    110.0±0.45µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.5_size_100000/append_val                              1.01   1937.6±7.25µs        ? ?/sec     1.00   1926.4±7.92µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.5_size_100000/vectorized_append                       1.00      2.1±0.01ms        ? ?/sec     1.01      2.1±0.01ms        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.5_size_100000/vectorized_equal_to_0.25 true           1.00    567.5±3.48µs        ? ?/sec     1.03    586.0±2.51µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.5_size_100000/vectorized_equal_to_0.5 true            1.00   1002.3±3.82µs        ? ?/sec     1.00   1002.2±4.93µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.5_size_100000/vectorized_equal_to_0.75 true           1.00  1103.7±18.07µs        ? ?/sec     1.01   1119.3±3.31µs        ? ?/sec
ByteViewGroupValueBuilder_vectorized_append/scenario_null_0.5_size_100000/vectorized_equal_to_all_true            1.00   1163.2±5.85µs        ? ?/sec     1.03   1194.0±3.84µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_false_size_1000/append_val                         1.00      3.6±0.01µs        ? ?/sec     1.26      4.5±0.01µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_false_size_1000/vectorized_append                  1.00      2.1±0.00µs        ? ?/sec     1.00      2.1±0.01µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_false_size_1000/vectorized_equal_to_0.25 true      1.23   868.4±42.82ns        ? ?/sec     1.00    705.4±2.41ns        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_false_size_1000/vectorized_equal_to_0.5 true       1.42   999.4±34.20ns        ? ?/sec     1.00    704.4±0.88ns        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_false_size_1000/vectorized_equal_to_0.75 true      2.06  1455.8±31.09ns        ? ?/sec     1.00    705.7±2.72ns        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_false_size_1000/vectorized_equal_to_all_true       1.97   1387.0±3.29ns        ? ?/sec     1.00    704.6±1.77ns        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_false_size_10000/append_val                        1.00     32.1±0.13µs        ? ?/sec     1.31     42.1±0.14µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_false_size_10000/vectorized_append                 1.01     17.0±0.12µs        ? ?/sec     1.00     16.9±0.13µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_false_size_10000/vectorized_equal_to_0.25 true     3.37     23.8±0.06µs        ? ?/sec     1.00      7.1±0.08µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_false_size_10000/vectorized_equal_to_0.5 true      5.00     35.4±0.09µs        ? ?/sec     1.00      7.1±0.02µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_false_size_10000/vectorized_equal_to_0.75 true     3.70     26.2±0.05µs        ? ?/sec     1.00      7.1±0.02µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_false_size_10000/vectorized_equal_to_all_true      1.92     13.6±0.03µs        ? ?/sec     1.00      7.1±0.02µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_false_size_100000/append_val                       1.00    310.4±0.71µs        ? ?/sec     1.37    424.6±1.29µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_false_size_100000/vectorized_append                1.00    160.2±0.61µs        ? ?/sec     1.09    173.9±1.17µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_false_size_100000/vectorized_equal_to_0.25 true    3.61    267.4±0.79µs        ? ?/sec     1.00     74.2±1.37µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_false_size_100000/vectorized_equal_to_0.5 true     5.65    411.0±0.84µs        ? ?/sec     1.00     72.7±0.50µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_false_size_100000/vectorized_equal_to_0.75 true    4.02    292.4±0.93µs        ? ?/sec     1.00     72.7±0.46µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_false_size_100000/vectorized_equal_to_all_true     1.90    138.6±0.38µs        ? ?/sec     1.00     73.0±0.58µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_true_size_1000/append_val                          1.04      7.2±0.07µs        ? ?/sec     1.00      6.9±0.02µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_true_size_1000/vectorized_append                   1.00      2.2±0.00µs        ? ?/sec     1.02      2.2±0.07µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_true_size_1000/vectorized_equal_to_0.25 true       1.11    785.0±7.88ns        ? ?/sec     1.00    706.9±1.20ns        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_true_size_1000/vectorized_equal_to_0.5 true        1.46  1034.7±10.20ns        ? ?/sec     1.00    706.8±1.02ns        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_true_size_1000/vectorized_equal_to_0.75 true       1.93  1366.4±15.36ns        ? ?/sec     1.00    706.6±0.75ns        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_true_size_1000/vectorized_equal_to_all_true        2.34  1650.0±10.52ns        ? ?/sec     1.00    705.8±1.11ns        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_true_size_10000/append_val                         1.05     67.5±0.18µs        ? ?/sec     1.00     64.5±0.17µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_true_size_10000/vectorized_append                  1.00     18.6±0.11µs        ? ?/sec     1.00     18.6±0.28µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_true_size_10000/vectorized_equal_to_0.25 true      3.70     26.2±0.16µs        ? ?/sec     1.00      7.1±0.04µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_true_size_10000/vectorized_equal_to_0.5 true       5.67     40.1±0.18µs        ? ?/sec     1.00      7.1±0.02µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_true_size_10000/vectorized_equal_to_0.75 true      4.08     28.9±0.14µs        ? ?/sec     1.00      7.1±0.02µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_true_size_10000/vectorized_equal_to_all_true       2.31     16.3±0.02µs        ? ?/sec     1.00      7.1±0.02µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_true_size_100000/append_val                        1.04    673.4±5.10µs        ? ?/sec     1.00    647.5±4.55µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_true_size_100000/vectorized_append                 1.00    176.2±0.76µs        ? ?/sec     1.08    190.6±2.52µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_true_size_100000/vectorized_equal_to_0.25 true     3.98    289.2±2.11µs        ? ?/sec     1.00     72.7±0.31µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_true_size_100000/vectorized_equal_to_0.5 true      6.10    444.8±0.97µs        ? ?/sec     1.00     72.9±1.19µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_true_size_100000/vectorized_equal_to_0.75 true     4.37    327.1±1.65µs        ? ?/sec     1.00     74.8±0.82µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.0_nullable_true_size_100000/vectorized_equal_to_all_true      2.28    165.6±0.54µs        ? ?/sec     1.00     72.8±0.53µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.1_nullable_true_size_1000/append_val                          1.02      8.5±0.01µs        ? ?/sec     1.00      8.3±0.02µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.1_nullable_true_size_1000/vectorized_append                   1.00      7.7±0.03µs        ? ?/sec     1.00      7.7±0.01µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.1_nullable_true_size_1000/vectorized_equal_to_0.25 true       1.00   1102.2±6.86ns        ? ?/sec     1.16  1278.7±46.97ns        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.1_nullable_true_size_1000/vectorized_equal_to_0.5 true        1.00  1890.5±20.06ns        ? ?/sec     1.07      2.0±0.01µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.1_nullable_true_size_1000/vectorized_equal_to_0.75 true       1.00      2.7±0.01µs        ? ?/sec     1.03      2.8±0.01µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.1_nullable_true_size_1000/vectorized_equal_to_all_true        1.00      3.6±0.01µs        ? ?/sec     1.02      3.7±0.00µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.1_nullable_true_size_10000/append_val                         1.02     83.0±0.18µs        ? ?/sec     1.00     81.2±0.14µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.1_nullable_true_size_10000/vectorized_append                  1.00     75.8±0.21µs        ? ?/sec     1.08     82.1±0.18µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.1_nullable_true_size_10000/vectorized_equal_to_0.25 true      1.04     28.2±0.08µs        ? ?/sec     1.00     27.0±0.13µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.1_nullable_true_size_10000/vectorized_equal_to_0.5 true       1.00     52.0±0.23µs        ? ?/sec     1.02     53.0±0.26µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.1_nullable_true_size_10000/vectorized_equal_to_0.75 true      1.00     46.3±0.24µs        ? ?/sec     1.01     46.8±0.27µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.1_nullable_true_size_10000/vectorized_equal_to_all_true       1.00     40.2±0.21µs        ? ?/sec     1.02     41.2±0.10µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.1_nullable_true_size_100000/append_val                        1.00    824.8±1.78µs        ? ?/sec     1.00    822.6±1.81µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.1_nullable_true_size_100000/vectorized_append                 1.00    759.0±1.71µs        ? ?/sec     1.02    777.6±6.08µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.1_nullable_true_size_100000/vectorized_equal_to_0.25 true     1.00    346.0±1.50µs        ? ?/sec     1.01    348.3±0.80µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.1_nullable_true_size_100000/vectorized_equal_to_0.5 true      1.00    579.8±1.14µs        ? ?/sec     1.02    589.5±1.03µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.1_nullable_true_size_100000/vectorized_equal_to_0.75 true     1.00    521.2±0.97µs        ? ?/sec     1.01    527.5±2.07µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.1_nullable_true_size_100000/vectorized_equal_to_all_true      1.00    423.5±1.02µs        ? ?/sec     1.01    426.4±1.21µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.5_nullable_true_size_1000/append_val                          1.00      7.8±0.06µs        ? ?/sec     1.05      8.2±0.06µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.5_nullable_true_size_1000/vectorized_append                   1.02      8.4±0.06µs        ? ?/sec     1.00      8.2±0.05µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.5_nullable_true_size_1000/vectorized_equal_to_0.25 true       1.00  1029.3±10.48ns        ? ?/sec     1.17   1203.0±8.82ns        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.5_nullable_true_size_1000/vectorized_equal_to_0.5 true        1.00  1715.6±12.43ns        ? ?/sec     1.09  1873.0±20.02ns        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.5_nullable_true_size_1000/vectorized_equal_to_0.75 true       1.00      2.4±0.04µs        ? ?/sec     1.05      2.5±0.02µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.5_nullable_true_size_1000/vectorized_equal_to_all_true        1.00      3.1±0.01µs        ? ?/sec     1.06      3.2±0.04µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.5_nullable_true_size_10000/append_val                         1.00    114.5±0.27µs        ? ?/sec     1.02    116.6±0.35µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.5_nullable_true_size_10000/vectorized_append                  1.00    121.6±0.27µs        ? ?/sec     1.10    134.0±0.39µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.5_nullable_true_size_10000/vectorized_equal_to_0.25 true      1.01     32.9±0.13µs        ? ?/sec     1.00     32.5±0.13µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.5_nullable_true_size_10000/vectorized_equal_to_0.5 true       1.00     61.9±0.36µs        ? ?/sec     1.00     61.9±0.49µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.5_nullable_true_size_10000/vectorized_equal_to_0.75 true      1.00     62.8±0.36µs        ? ?/sec     1.01     63.5±0.38µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.5_nullable_true_size_10000/vectorized_equal_to_all_true       1.00     59.2±0.50µs        ? ?/sec     1.01     59.7±0.25µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.5_nullable_true_size_100000/append_val                        1.00   1166.9±8.76µs        ? ?/sec     1.02   1186.9±2.08µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.5_nullable_true_size_100000/vectorized_append                 1.00  1249.2±30.31µs        ? ?/sec     1.01   1261.0±6.95µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.5_nullable_true_size_100000/vectorized_equal_to_0.25 true     1.00    397.4±1.57µs        ? ?/sec     1.01    402.1±0.96µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.5_nullable_true_size_100000/vectorized_equal_to_0.5 true      1.00    677.9±1.53µs        ? ?/sec     1.01    682.5±1.56µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.5_nullable_true_size_100000/vectorized_equal_to_0.75 true     1.00    681.7±1.61µs        ? ?/sec     1.01    689.3±1.01µs        ? ?/sec
PrimitiveGroupValueBuilder_vectorized_append/null_0.5_nullable_true_size_100000/vectorized_equal_to_all_true      1.00    667.7±2.25µs        ? ?/sec     1.01    671.2±1.33µs        ? ?/sec

rluvaton · 2025-11-13T19:14:14Z

As suspected, it is up to 6 times faster.

Here are @alamb's results, parsed into a table.

All rows below are for `PrimitiveGroupValueBuilder` `vectorized_equal_to` (note that, for some reason, some other functions occasionally show a small regression).

The good thing you will notice below is that the optimized timings are consistent regardless of how many entries are true, which means the code really is not using branches.

Also, the null case, which I did not optimize, is roughly the same.

Table

null	nullable	size	equal_to_results	main (ratio)	main (time)	optimize (ratio)	optimize (time)
0.0	false	1000	0.25 true	1.23	868.4±42.82ns	1.00	705.4±2.41ns
0.0	false	1000	0.5 true	1.42	999.4±34.20ns	1.00	704.4±0.88ns
0.0	false	1000	0.75 true	2.06	1455.8±31.09ns	1.00	705.7±2.72ns
0.0	false	1000	all_true	1.97	1387.0±3.29ns	1.00	704.6±1.77ns
0.0	false	10000	0.25 true	3.37	23.8±0.06µs	1.00	7.1±0.08µs
0.0	false	10000	0.5 true	5.00	35.4±0.09µs	1.00	7.1±0.02µs
0.0	false	10000	0.75 true	3.70	26.2±0.05µs	1.00	7.1±0.02µs
0.0	false	10000	all_true	1.92	13.6±0.03µs	1.00	7.1±0.02µs
0.0	false	100000	0.25 true	3.61	267.4±0.79µs	1.00	74.2±1.37µs
0.0	false	100000	0.5 true	5.65	411.0±0.84µs	1.00	72.7±0.50µs
0.0	false	100000	0.75 true	4.02	292.4±0.93µs	1.00	72.7±0.46µs
0.0	false	100000	all_true	1.90	138.6±0.38µs	1.00	73.0±0.58µs
0.0	true	1000	0.25 true	1.11	785.0±7.88ns	1.00	706.9±1.20ns
0.0	true	1000	0.5 true	1.46	1034.7±10.20ns	1.00	706.8±1.02ns
0.0	true	1000	0.75 true	1.93	1366.4±15.36ns	1.00	706.6±0.75ns
0.0	true	1000	all_true	2.34	1650.0±10.52ns	1.00	705.8±1.11ns
0.0	true	10000	0.25 true	3.70	26.2±0.16µs	1.00	7.1±0.04µs
0.0	true	10000	0.5 true	5.67	40.1±0.18µs	1.00	7.1±0.02µs
0.0	true	10000	0.75 true	4.08	28.9±0.14µs	1.00	7.1±0.02µs
0.0	true	10000	all_true	2.31	16.3±0.02µs	1.00	7.1±0.02µs
0.0	true	100000	0.25 true	3.98	289.2±2.11µs	1.00	72.7±0.31µs
0.0	true	100000	0.5 true	6.10	444.8±0.97µs	1.00	72.9±1.19µs
0.0	true	100000	0.75 true	4.37	327.1±1.65µs	1.00	74.8±0.82µs
0.0	true	100000	all_true	2.28	165.6±0.54µs	1.00	72.8±0.53µs
0.1	true	1000	0.25 true	1.00	1102.2±6.86ns	1.16	1278.7±46.97ns
0.1	true	1000	0.5 true	1.00	1890.5±20.06ns	1.07	2.0±0.01µs
0.1	true	1000	0.75 true	1.00	2.7±0.01µs	1.03	2.8±0.01µs
0.1	true	1000	all_true	1.00	3.6±0.01µs	1.02	3.7±0.00µs
0.1	true	10000	0.25 true	1.04	28.2±0.08µs	1.00	27.0±0.13µs
0.1	true	10000	0.5 true	1.00	52.0±0.23µs	1.02	53.0±0.26µs
0.1	true	10000	0.75 true	1.00	46.3±0.24µs	1.01	46.8±0.27µs
0.1	true	10000	all_true	1.00	40.2±0.21µs	1.02	41.2±0.10µs
0.1	true	100000	0.25 true	1.00	346.0±1.50µs	1.01	348.3±0.80µs
0.1	true	100000	0.5 true	1.00	579.8±1.14µs	1.02	589.5±1.03µs
0.1	true	100000	0.75 true	1.00	521.2±0.97µs	1.01	527.5±2.07µs
0.1	true	100000	all_true	1.00	423.5±1.02µs	1.01	426.4±1.21µs
0.5	true	1000	0.25 true	1.00	1029.3±10.48ns	1.17	1203.0±8.82ns
0.5	true	1000	0.5 true	1.00	1715.6±12.43ns	1.09	1873.0±20.02ns
0.5	true	1000	0.75 true	1.00	2.4±0.04µs	1.05	2.5±0.02µs
0.5	true	1000	all_true	1.00	3.1±0.01µs	1.06	3.2±0.04µs
0.5	true	10000	0.25 true	1.01	32.9±0.13µs	1.00	32.5±0.13µs
0.5	true	10000	0.5 true	1.00	61.9±0.36µs	1.00	61.9±0.49µs
0.5	true	10000	0.75 true	1.00	62.8±0.36µs	1.01	63.5±0.38µs
0.5	true	10000	all_true	1.00	59.2±0.50µs	1.01	59.7±0.25µs
0.5	true	100000	0.25 true	1.00	397.4±1.57µs	1.01	402.1±0.96µs
0.5	true	100000	0.5 true	1.00	677.9±1.53µs	1.01	682.5±1.56µs
0.5	true	100000	0.75 true	1.00	681.7±1.61µs	1.01	689.3±1.01µs
0.5	true	100000	all_true	1.00	667.7±2.25µs	1.01	671.2±1.33µs
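
The flat timings in the optimize column are what you would expect when the comparison runs for every row instead of being skipped for rows already marked not equal. A minimal sketch of the two loop shapes follows, assuming `u32` values and plain slices; the function names are hypothetical and this is not the actual DataFusion code, only an illustration of "always evaluate" plus unchecked indexing on the non-null path.

```rust
/// Branchy shape: skip rows that an earlier column already marked as not equal.
/// The `continue` is a data-dependent branch, so the cost varies with how many
/// entries of `results` are true.
pub fn equal_to_branchy(
    lhs: &[u32],
    rhs: &[u32],
    lhs_rows: &[usize],
    rhs_rows: &[usize],
    results: &mut [bool],
) {
    for ((&l, &r), res) in lhs_rows.iter().zip(rhs_rows).zip(results.iter_mut()) {
        if !*res {
            continue;
        }
        *res = lhs[l] == rhs[r];
    }
}

/// Always-evaluate shape: compare every row and AND the outcome into `results`.
/// There is no branch on `results`, so the work per row is constant.
pub fn equal_to_always(
    lhs: &[u32],
    rhs: &[u32],
    lhs_rows: &[usize],
    rhs_rows: &[usize],
    results: &mut [bool],
) {
    assert_eq!(lhs_rows.len(), rhs_rows.len());
    assert_eq!(lhs_rows.len(), results.len());
    // Validate the indices once up front so the unchecked reads below are sound.
    // (Assumption for this sketch: real code may rely on an invariant instead.)
    assert!(lhs_rows.iter().all(|&i| i < lhs.len()));
    assert!(rhs_rows.iter().all(|&i| i < rhs.len()));

    for ((&l, &r), res) in lhs_rows.iter().zip(rhs_rows).zip(results.iter_mut()) {
        // SAFETY: every index was checked against the slice length above.
        let eq = unsafe { *lhs.get_unchecked(l) == *rhs.get_unchecked(r) };
        *res &= eq;
    }
}
```

Both functions produce the same `results`; the second only changes how the work is scheduled, which is what gives the compiler a chance to vectorize the loop.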

rluvaton · 2025-11-13T19:22:00Z

Once we have a mutable bit-packed buffer, we will also evaluate the nulls separately and apply a bitmask operation to the comparison result to make that case faster as well.
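
A rough sketch of what that could look like, assuming `u32` values, rows that are already aligned (no gather by row index), a length that is a multiple of 64, and validity masks where a set bit means "not null". It also assumes that two nulls compare as equal for grouping purposes. The function name and signature are hypothetical, not an existing DataFusion API.

```rust
/// Compare 64 rows per step and fold the outcome into a packed `results` word.
/// Bit `i` of each word corresponds to row `word_index * 64 + i`.
pub fn equal_to_packed(
    lhs: &[u32],
    rhs: &[u32],
    lhs_valid: &[u64],
    rhs_valid: &[u64],
    results: &mut [u64],
) {
    assert_eq!(lhs.len(), rhs.len());
    assert_eq!(lhs.len(), results.len() * 64);
    assert_eq!(lhs_valid.len(), results.len());
    assert_eq!(rhs_valid.len(), results.len());

    let chunks = lhs.chunks_exact(64).zip(rhs.chunks_exact(64));
    let words = lhs_valid.iter().zip(rhs_valid).zip(results.iter_mut());

    for ((l, r), ((&lv, &rv), res)) in chunks.zip(words) {
        // All 64 rows were already marked not equal: skip the whole word.
        if *res == 0 {
            continue;
        }

        // Pack the 64 value comparisons into one bitmask.
        let mut values_eq = 0u64;
        for (i, (a, b)) in l.iter().zip(r).enumerate() {
            values_eq |= ((a == b) as u64) << i;
        }

        // Handle nulls with bitwise operations instead of per-row bit lookups.
        let both_valid = lv & rv;
        let both_null = !(lv | rv);
        *res &= both_null | (both_valid & values_eq);
    }
}
```

The null handling costs a handful of bitwise operations per 64 rows here, instead of one validity lookup per row.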

alamb · 2025-11-17T17:39:37Z

Thanks again @rluvaton

…pValueBuilder` in multi group by aggregation (apache#17977)

## Which issue does this PR close?

N/A

## Rationale for this change

Making multi column aggregation even faster

## What changes are included in this PR?

In `PrimitiveGroupValueBuilder.vectorized_equal_to`, always evaluate and use unchecked access, as both of these changes are what make the code compile to SIMD.

## Are these changes tested?

Existing tests

## Are there any user-facing changes?

Nope

-----

I tried a LOT of variations [GodBolt](https://godbolt.org/z/Kc8ze6E9n), from splitting to fixed size chunks and trying to get auto-vectorization to use gather and creating a bitmask, to even testing portable SIMD (just to see what it will generate).

This version only optimizes the non-null path for the moment, as it is the easiest.

Once and if we change from `&mut [bool]` to mutable packed bits we could:

1. evaluate in chunks of `64` items (I tried different variations to see what is the best - you can tweak the godbolt above with a different type and size to check for yourself); 64 is not necessarily the best, but I think it will be the fastest for doing AND with the `equal_to_results` boolean buffer
2. add an optimization for nullable as well by just doing bitwise operations on 64 items at a time and avoiding the cost of getting each bit manually
3. skip 64 items right away if the `equal_to_results` word equals `0x00` (i.e. all false)

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

rluvaton added 11 commits October 5, 2025 23:40

change to boolean buffer builder and make primitive simd

1eb2d6a

update bench to test primitive

e5b65bd

update bench

01dd5c9

remove boolean buffer builder

cd99e29

Merge branch 'main' into optimize-primitive-multi-group-by-to-use-simd

0c6295c

update bench

825b8c7

Merge branch 'main' into optimize-primitive-multi-group-by-to-use-simd

f19d439

fix not setting correctly equal to results

f31b5c0

Merge branch 'main' into optimize-primitive-multi-group-by-to-use-simd

dd73823

format and port as_chunks

5b2704f

avoid chunks at the moment

27d8026

rluvaton added the performance Make DataFusion faster label Oct 8, 2025

github-actions bot added the physical-plan Changes to the physical-plan crate label Oct 8, 2025

rluvaton commented Oct 8, 2025

rename function and add comment

d5b5caa

Dandandan reviewed Oct 9, 2025

Merge branch 'main' into optimize-primitive-multi-group-by-to-use-simd

0a6f6d3

alamb mentioned this pull request Nov 11, 2025

Andrew Lamb Weekly-ish Open Source plan - 2025-11-03 #18486

Closed

53 tasks

alamb approved these changes Nov 13, 2025

alamb mentioned this pull request Nov 13, 2025

Potential Improved multiple column aggregation performance by using bitmasks rather than Vec<bool> #18676

Open

Merge branch 'main' into optimize-primitive-multi-group-by-to-use-simd

c201d6a

alamb added this pull request to the merge queue Nov 17, 2025

Merged via the queue into apache:main with commit af22336 Nov 17, 2025
32 checks passed

rluvaton deleted the optimize-primitive-multi-group-by-to-use-simd branch November 17, 2025 18:00
