Skip to content

COUNT(DISTINCT ..) is ignored in Widow Aggregates #16887

@crepererum

Description

@crepererum

Describe the bug

It seems that COUNT(DISTINCT c) OVER ... is treated as COUNT(c) OVER ..., i.e. an ordinary count.

To Reproduce

Use the following sqllogictest:

statement ok
CREATE TABLE table_test_distinct_count (
    k VARCHAR,
    v Int,
    time TIMESTAMP WITH TIME ZONE
);

statement ok
INSERT INTO table_test_distinct_count (k, v, time) VALUES
    ('a', 1, '1970-01-01T00:01:00.00Z'),
    ('a', 1, '1970-01-01T00:02:00.00Z'),
    ('a', 1, '1970-01-01T00:03:00.00Z'),
    ('a', 2, '1970-01-01T00:03:00.00Z'),
    ('a', 1, '1970-01-01T00:04:00.00Z'),
    ('b', 3, '1970-01-01T00:01:00.00Z'),
    ('b', 3, '1970-01-01T00:02:00.00Z'),
    ('b', 4, '1970-01-01T00:03:00.00Z'),
    ('b', 4, '1970-01-01T00:03:00.00Z');

query TPII
SELECT
    k,
    time,
    COUNT(v) OVER (
        PARTITION BY k
        ORDER BY time
        RANGE BETWEEN INTERVAL '2 minutes' PRECEDING AND CURRENT ROW
    ) AS normal_count,
    COUNT(DISTINCT v) OVER (
        PARTITION BY k
        ORDER BY time
        RANGE BETWEEN INTERVAL '2 minutes' PRECEDING AND CURRENT ROW
    ) AS distinct_count
FROM table_test_distinct_count
ODER BY k, time;
----
a 1970-01-01T00:01:00Z 1 1
a 1970-01-01T00:02:00Z 2 2
a 1970-01-01T00:03:00Z 4 4
a 1970-01-01T00:03:00Z 4 4
a 1970-01-01T00:04:00Z 4 4
b 1970-01-01T00:01:00Z 1 1
b 1970-01-01T00:02:00Z 2 2
b 1970-01-01T00:03:00Z 4 4
b 1970-01-01T00:03:00Z 4 4

query TT
EXPLAIN SELECT
    k,
    time,
    COUNT(v) OVER (
        PARTITION BY k
        ORDER BY time
        RANGE BETWEEN INTERVAL '2 minutes' PRECEDING AND CURRENT ROW
    ) AS normal_count,
    COUNT(DISTINCT v) OVER (
        PARTITION BY k
        ORDER BY time
        RANGE BETWEEN INTERVAL '2 minutes' PRECEDING AND CURRENT ROW
    ) AS distinct_count
FROM table_test_distinct_count
ODER BY k, time;
----
logical_plan
01)Projection: oder.k, oder.time, count(oder.v) PARTITION BY [oder.k] ORDER BY [oder.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW AS normal_count, count(oder.v) PARTITION BY [oder.k] ORDER BY [oder.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW AS distinct_count
02)--WindowAggr: windowExpr=[[count(oder.v) PARTITION BY [oder.k] ORDER BY [oder.time ASC NULLS LAST] RANGE BETWEEN IntervalMonthDayNano { months: 0, days: 0, nanoseconds: 120000000000 } PRECEDING AND CURRENT ROW AS count(oder.v) PARTITION BY [oder.k] ORDER BY [oder.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW]]
03)----SubqueryAlias: oder
04)------TableScan: table_test_distinct_count projection=[k, v, time]
physical_plan
01)ProjectionExec: expr=[k@0 as k, time@2 as time, count(oder.v) PARTITION BY [oder.k] ORDER BY [oder.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW@3 as normal_count, count(oder.v) PARTITION BY [oder.k] ORDER BY [oder.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW@3 as distinct_count]
02)--BoundedWindowAggExec: wdw=[count(oder.v) PARTITION BY [oder.k] ORDER BY [oder.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW: Field { name: "count(oder.v) PARTITION BY [oder.k] ORDER BY [oder.time ASC NULLS LAST] RANGE BETWEEN 2 minutes PRECEDING AND CURRENT ROW", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, frame: RANGE BETWEEN IntervalMonthDayNano { months: 0, days: 0, nanoseconds: 120000000000 } PRECEDING AND CURRENT ROW], mode=[Sorted]
03)----SortExec: expr=[k@0 ASC NULLS LAST, time@2 ASC NULLS LAST], preserve_partitioning=[false]
04)------DataSourceExec: partitions=1, partition_sizes=[1]

Expected behavior

normal_count and distinct_count should clearly not be the same.

Additional context

This reproduces on 3d4fdf2 .

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions