-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regression: Wrong result when there are 2 count(distinct) #9586
Comments
cc @huaxingao |
Here is the schema:
|
An update here is that @erratic-pattern is working on creating a reproducer we can share on this ticket |
Many thanks to @erratic-pattern who has identified the PR that introduce: Issue is related to this PR for array aggregate order and distinct. the commit immediately before it works correctly with the above queries. in arrow-datafusion:
if you then try to run the same query after the above PR, you get the incorrect result:
bug.sql: SELECT COUNT(DISTINCT host) AS servers_count, count(distinct pool) as pool_count, datacenter from '/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and time < '2024-02-25T00:00:01Z' and server_role = 'mesg' GROUP BY datacenter; We are working to share the file |
I am continuing to investigate this to see if I can create a minimal reproduction. |
I have a theory about what is wrong I think this change from #9234 effectively skips all but the first intermediate result when combining data together Here is the code on main, specificaly, that I think should look at all elements in the array: So to find a reproducer it would need:
I think the best idea here is to make a fuzz test that triggers the issue (randomly send inputs). We can potentially follow the model here: I will try to do so later today if no one beats me to it |
Here is a self contained reproducer with just datafusion-cli Step 1: Create data ------
-- Create a table using UNION ALL to get 2 partitions (very important)
------
create table test as
select * from (values('foo', 'bar', 1))
UNION ALL
select * from (values('foo', 'baz', 1));
------
-- Now, create a table with the same data, but column2 has type `Dictionary(Int32)` to trigger the fallback code
-----
create table test_dict as
select
column1,
arrow_cast(column2, 'Dictionary(Int32, Utf8)') as "column2",
column3
from test; Now, run queries -- there are two distinct values of column 2: 'bar' and 'baz'
select * from test_dict;
+---------+---------+---------+
| column1 | column2 | column3 |
+---------+---------+---------+
| foo | bar | 1 |
| foo | baz | 1 |
+---------+---------+---------+
2 rows in set. Query took 0.000 seconds.
-- but this query says there is only 1 distinct value in column2
select
count(distinct column1),
count(distinct column2)
FROM test_dict
GROUP BY column3;
+-----------------------------------+-----------------------------------+
| COUNT(DISTINCT test_dict.column1) | COUNT(DISTINCT test_dict.column2) |
+-----------------------------------+-----------------------------------+
| 1 | 1 |
+-----------------------------------+-----------------------------------+
1 row in set. Query took 0.002 seconds. Running the same query against the non dictionary encoded column in select
count(distinct column1),
count(distinct column2)
FROM test_dict
GROUP BY column3;
+------------------------------+------------------------------+
| COUNT(DISTINCT test.column1) | COUNT(DISTINCT test.column2) |
+------------------------------+------------------------------+
| 1 | 2 |
+------------------------------+------------------------------+
1 row in set. Query took 0.001 seconds. |
I plan to fix this issue |
Describe the bug
Wrong result when there are 2 count distinct in the select clause,
To Reproduce
I will share there file with @alamb because I cannot attached .parquet file here
Bug
Here are other queries running in DF CLI that tell me the right results
Expected behavior
The text was updated successfully, but these errors were encountered: