Fix incorrect results with multiple `COUNT(DISTINCT..)` aggregates on dictionaries #9679

alamb · 2024-03-18T18:18:24Z

Which issue does this PR close?

Closes #9586

Rationale for this change

Fix a bug that was introduced in #9234

What changes are included in this PR?

Fix bug (3 lines)
Add test coverage
Update comments

Are these changes tested?

Yes, they are covered, but I think we could do better with the coverage. I will see if I can add a fuzz test here too.

Are there any user-facing changes?

Bug fix

alamb · 2024-03-18T18:19:00Z

datafusion/physical-expr/src/aggregate/count_distinct/mod.rs

    fn merge_batch(&mut self, states: &[ArrayRef]) -> Result<()> {
        if states.is_empty() {
            return Ok(());
        }
        assert_eq!(states.len(), 1, "array_agg states must be singleton!");
        let array = &states[0];
        let list_array = array.as_list::<i32>();
-        let inner_array = list_array.value(0);
-        self.update_batch(&[inner_array])
+        for inner_array in list_array.iter() {


This is the actual bug fix: merge all rows of the state list array, not just the first. The rest of this PR is tests and comment improvements.
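A minimal sketch of the difference, using only the standard library. It models the list-array state as a `Vec<Option<Vec<String>>>`, which is an assumption made for illustration and not the arrow API; `DistinctCount`, `merge_first_row_only`, `merge_all_rows`, and `example_states` are hypothetical names. Unlike the real code, the sketch silently skips null rows.

```rust
use std::collections::HashSet;

/// Simplified stand-in for the accumulator's intermediate state: one optional
/// list of values per row (this models a list array; it is not the arrow API).
type StateRows = Vec<Option<Vec<String>>>;

/// Hypothetical, simplified distinct-count accumulator.
#[derive(Default)]
struct DistinctCount {
    values: HashSet<String>,
}

impl DistinctCount {
    /// Add every value in `batch` to the set of distinct values.
    fn update_batch(&mut self, batch: &[String]) {
        self.values.extend(batch.iter().cloned());
    }

    /// Old behaviour: only the first row of the state is merged.
    fn merge_first_row_only(&mut self, states: &StateRows) {
        if let Some(Some(row)) = states.first() {
            self.update_batch(row);
        }
    }

    /// Fixed behaviour: every non-null row of the state is merged.
    fn merge_all_rows(&mut self, states: &StateRows) {
        for row in states.iter().flatten() {
            self.update_batch(row);
        }
    }

    /// Number of distinct values seen so far.
    fn count(&self) -> usize {
        self.values.len()
    }
}

/// Two partial results for the same group, one value each.
fn example_states() -> StateRows {
    vec![Some(vec!["bar".to_string()]), Some(vec!["baz".to_string()])]
}

fn main() {
    let states = example_states();

    let mut buggy = DistinctCount::default();
    buggy.merge_first_row_only(&states);

    let mut fixed = DistinctCount::default();
    fixed.merge_all_rows(&states);

    println!("first row only: {}, all rows: {}", buggy.count(), fixed.count());
}
```

With two state rows for the same group, merging only the first row undercounts, which matches the wrong `1 1` result reported for the regression test.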

alamb · 2024-03-18T18:19:20Z

datafusion/sqllogictest/test_files/dictionary.slt

+FROM m3
+GROUP BY column3;
+----
+1 2


This query returns `1 1` without the code change in this PR.

Dandandan · 2024-03-18T21:58:01Z

Nice!

alamb · 2024-03-18T22:55:34Z

Thank you for the review @Dandandan

alamb · 2024-03-18T22:56:00Z

cc @jayzhan211

jayzhan211 · 2024-03-19T00:24:01Z

datafusion/sqllogictest/test_files/dictionary.slt

@@ -280,3 +280,70 @@ ORDER BY
 2023-12-20T01:20:00 1000 f2 foo
 2023-12-20T01:30:00 1000 f1 32.0
 2023-12-20T01:30:00 1000 f2 foo
+
+# Cleanup
+statement error DataFusion error: Execution error: Table 'm1' doesn't exist\.


Why is this an error and not "ok"?

Good call -- this is a mistake and I will fix it.

jayzhan211 · 2024-03-19T00:25:58Z

datafusion/sqllogictest/test_files/dictionary.slt

+    select * from (values('foo', 'baz', 1));
+
+######
+# Now, create a table with the same data, but column2 has type `Dictionary(Int32)` to trigger the fallback code


Why does the cast to the dictionary trigger the fallback code? Does "fallback code" refer to merge_batch?

Specifically, why is the key of the dictionary the sub-group index after casting? 🤔

group 1: "a", "b"
group 2: "c"

we get
(0, "a"), (1, "b"), and (0, "c")

> Why does the cast to the dictionary trigger the fallback code?

The reason the dictionary triggers the merge is that when grouping on strings or primitive values, the DistinctCountAccumulator code path is not used. Instead, one of the specialized implementations (like BytesDistinctCountAccumulator) is used, which uses the GroupsAccumulator interface.

Dictionary-encoded columns take the DistinctCountAccumulator path:
https://github.com/apache/arrow-datafusion/blob/b0b329ba39403b9e87156d6f9b8c5464dc6d2480/datafusion/physical-expr/src/aggregate/count_distinct/mod.rs#L160-L163

> Specifically, why is the key of the dictionary the sub-group index after casting? 🤔

What is happening is that we are doing a two-phase group by (illustrated here):

https://github.com/apache/arrow-datafusion/blob/b0b329ba39403b9e87156d6f9b8c5464dc6d2480/datafusion/expr/src/accumulator.rs#L99-L131

So there are two different partial group bys happening. Each PartialGroupBy produces a set of distinct values. Using your example, I think it would be more like the following (where we have the same group in multiple partial results):

group 1 (partial): "a", "b"
group 1 (partial): "c"

The merge is then called to combine the results, with a two-element array:

("a", "b") ("c")

But I may be misunderstanding your question
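The two-phase flow described above can be sketched with plain standard-library collections. This is an illustration under stated assumptions, not DataFusion's implementation: `Row`, `partial_aggregate`, and `final_aggregate` are hypothetical names, and each partition's state is modeled as a map from group key to a set of distinct values.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// One input row: (group key, value). Simplified stand-in for a record batch row.
type Row = (i64, &'static str);

/// Per-partition state: the distinct values seen for each group.
type PartialState = BTreeMap<i64, BTreeSet<&'static str>>;

/// Partial phase: each partition independently collects distinct values per group.
fn partial_aggregate(partition: &[Row]) -> PartialState {
    let mut state = PartialState::new();
    for (group, value) in partition {
        state.entry(*group).or_default().insert(*value);
    }
    state
}

/// Final phase: merge the partial states of all partitions, then count per group.
fn final_aggregate(partials: &[PartialState]) -> BTreeMap<i64, usize> {
    let mut merged = PartialState::new();
    for partial in partials {
        for (group, values) in partial {
            merged.entry(*group).or_default().extend(values.iter().copied());
        }
    }
    merged
        .into_iter()
        .map(|(group, values)| (group, values.len()))
        .collect()
}

fn main() {
    // Group 1 appears in both partitions, so the final phase receives
    // two partial states for the same group and must merge both.
    let partition_a: Vec<Row> = vec![(1, "a"), (1, "b")];
    let partition_b: Vec<Row> = vec![(1, "c"), (1, "a")];

    let partials = vec![
        partial_aggregate(&partition_a),
        partial_aggregate(&partition_b),
    ];
    let counts = final_aggregate(&partials);
    println!("{counts:?}");
}
```

The point of the sketch is that the final phase must consume every partial state for a group; dropping all but one of them loses distinct values.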

I also filed #9695 to add some more coverage of array operations

I'm curious about how and where the DictionaryArray was built. It is quite hard to trace the previous caller of GroupedHashAggregateStream::poll_next with RUST_BACKTRACE.

https://github.com/apache/arrow-datafusion/blob/b0b329ba39403b9e87156d6f9b8c5464dc6d2480/datafusion/physical-plan/src/aggregates/row_hash.rs#L434

batch: RecordBatch { schema: Schema { fields: [Field { name: "column3", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "COUNT(DISTINCT m3.column1)[count distinct]", data_type: List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "COUNT(DISTINCT m3.column2)[count distinct]", data_type: List(Field { name: "item", data_type: Dictionary(Int32, Utf8), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [PrimitiveArray<Int64> [ 1, 1, ], ListArray [ StringArray [ "foo", ], StringArray [ "foo", ], ], ListArray [ DictionaryArray {keys: PrimitiveArray<Int32> [ 0, ] values: StringArray [ "bar", "baz", ]} , DictionaryArray {keys: PrimitiveArray<Int32> [ 1, ] values: StringArray [ "bar", "baz", ]} , ]], row_count: 2 }

> I'm curious about how and where the DictionaryArray was built.

I think it comes from emitting ScalarValue::Dictionary values that are combined into an array here:

https://github.com/apache/arrow-datafusion/blob/b87dd6143c2dc089b07f74780bd525c4369e68a3/datafusion/physical-expr/src/aggregate/groups_accumulator/adapter.rs#L304-L309

It seems to have been converted already at this point.

alamb · 2024-03-19T15:38:31Z

Thank you for the comments/reviews @jayzhan211 and @Dandandan

Omega359 · 2024-03-20T00:22:39Z

datafusion/physical-expr/src/aggregate/count_distinct/mod.rs

-        self.update_batch(&[inner_array])
+        for inner_array in list_array.iter() {
+            let inner_array = inner_array
+                .expect("counts are always non null, so are intermediate results");


I noticed this when updating my local repo: is expect something that should be used here? My understanding is that it panics on None. Given that the method returns Result, I would expect an Err to be returned instead. Am I missing something in my understanding of Rust here?

It panics, but that is fine if the value is guaranteed to be non-null. I tried to find out how the array was built (see the comment above) but failed. 😢

This is a good point, and I think it would be a better UX to avoid panicking even if something "impossible" happens. I made the change in #9712.
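As a rough illustration of the suggestion (not the actual change in #9712), an unexpected null can be turned into an error with `Option::ok_or_else` instead of `expect`. `MergeError` and `merge_rows` are hypothetical names, and the state is again modeled as a slice of optional rows rather than an arrow list array.

```rust
use std::collections::HashSet;

/// Hypothetical error type standing in for the library's own error enum.
#[derive(Debug, PartialEq)]
enum MergeError {
    Internal(String),
}

/// Merge every row of the state into `values`.
/// A null row yields an error instead of a panic.
fn merge_rows(
    values: &mut HashSet<String>,
    rows: &[Option<Vec<String>>],
) -> Result<(), MergeError> {
    for (index, row) in rows.iter().enumerate() {
        // `ok_or_else` converts `None` into an `Err`, which `?` propagates.
        let row = row.as_ref().ok_or_else(|| {
            MergeError::Internal(format!(
                "unexpected null intermediate state at row {index}"
            ))
        })?;
        values.extend(row.iter().cloned());
    }
    Ok(())
}

fn main() {
    let mut values = HashSet::new();

    let ok = merge_rows(
        &mut values,
        &[Some(vec!["bar".to_string()]), Some(vec!["baz".to_string()])],
    );
    println!("{ok:?}, distinct = {}", values.len());

    // A null row is reported to the caller rather than aborting the process.
    let err = merge_rows(&mut values, &[None]);
    println!("{err:?}");
}
```

The trade-off is a slightly longer function body in exchange for letting the caller decide how to handle a state that should never occur.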

… dictionaries (apache#9679) * Add test for multiple count distincts on a dictionary * Fix accumulator merge bug * Fix cleanup code

alamb added 2 commits March 18, 2024 14:16

Add test for multiple count distincts on a dictionary

2b2811e

Fix accumulator merge bug

14907c8

github-actions bot added physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) labels Mar 18, 2024

alamb commented Mar 18, 2024

alamb marked this pull request as ready for review March 18, 2024 18:51

Dandandan approved these changes Mar 18, 2024

jayzhan211 reviewed Mar 19, 2024

alamb added 2 commits March 19, 2024 09:32

Fix cleanup code

219cd32

Merge remote-tracking branch 'apache/main' into alamb/multi-distinct

7ece57e

alamb mentioned this pull request Mar 19, 2024

Add tests for filtering, grouping, aggregation of ARRAYs #9695

Merged

alamb merged commit 3c3b228 into apache:main Mar 19, 2024
23 checks passed

Omega359 reviewed Mar 20, 2024

alamb deleted the alamb/multi-distinct branch March 20, 2024 17:26

alamb mentioned this pull request Mar 20, 2024

Minor: return internal error rather than panic on unexpected error in COUNT DISTINCT #9712

Merged

alamb mentioned this pull request Mar 22, 2024

Branch for upgrade to DataFusion March 5 Upgrade alamb/datafusion#18

Open

This was referenced Mar 28, 2024

WIP: df patched upgrade to 2024-03-05, requiring new DF fixes wiedld/arrow-datafusion#3

Closed

WIP: df patched upgrade to 2024-03-05, requiring new DF fixes #9901

Closed

WIP: df patched upgrade to 2024-03-05 influxdata/arrow-datafusion#1

Closed
