Add sql support for `DISTINCT ON` #7827

universalmind303 · 2023-10-15T18:07:22Z

Is your feature request related to a problem or challenge?

It'd be nice to be able to use `DISTINCT ON` expressions in SQL queries.

Describe the solution you'd like

SQL support for `DISTINCT ON`.

Describe alternatives you've considered

No response

Additional context

https://www.geeksforgeeks.org/postgresql-distinct-on-expression/


gruuya · 2023-10-23T13:56:46Z

I'd like to take a shot at this if there're no other takers. For reference, here's a formal description of this clause from the Postgres docs:
> `SELECT DISTINCT ON ( expression [, ...] )` keeps only the first row of each set of rows where the given expressions evaluate to equal. The `DISTINCT ON` expressions are interpreted using the same rules as for `ORDER BY` (see above). Note that the "first row" of each set is unpredictable unless `ORDER BY` is used to ensure that the desired row appears first.
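
To make the quoted semantics concrete, here is a minimal Python sketch of "sort, then keep the first row per key". The helper `distinct_on`, its parameters, and the sample `logins` rows are all made up for illustration; this is not how Postgres or DataFusion implement the clause.

```python
def distinct_on(rows, on, order_by=()):
    """Keep the first row of each group of rows that agree on the `on` columns.

    rows:     list of dicts (hypothetical in-memory table)
    on:       column names playing the role of the DISTINCT ON expressions
    order_by: (column, descending) pairs playing the role of ORDER BY
    """
    ordered = list(rows)
    # list.sort is stable, so applying the keys from least to most
    # significant yields the same order as a multi-column ORDER BY.
    for column, descending in reversed(list(order_by)):
        ordered.sort(key=lambda row: row[column], reverse=descending)

    seen = set()
    result = []
    for row in ordered:
        key = tuple(row[c] for c in on)
        if key not in seen:  # first row of this group in the sorted order
            seen.add(key)
            result.append(row)
    return result


# Illustrative data only.
logins = [
    {"username": "alice", "browser": "firefox", "logged_at": 1},
    {"username": "bob", "browser": "chrome", "logged_at": 2},
    {"username": "alice", "browser": "safari", "logged_at": 3},
]

latest = distinct_on(
    logins,
    on=["username"],
    order_by=[("username", False), ("logged_at", True)],
)
```

Without `order_by` the helper simply keeps whichever row of each group comes first in the input, which mirrors the "unpredictable unless `ORDER BY` is used" caveat in the quote.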

From what I see, the most promising options to implement this are:

1. Add a new `GroupsAccumulator` that accumulates only a single value based on a given (`ORDER BY`) expression, i.e. something like argmax/argmin. In other words, the above functionality could be obtained by projecting the `DISTINCT ON` expressions alongside the selection expressions, and then taking the argmax based on the `ORDER BY` expressions (if none are present, the problem reduces to a simple group by). This might enable re-use of some aggregation/grouping logic, and could itself be re-used somewhere else.
2. Implement a custom `DISTINCT ON` logical/physical plan that would basically do the same thing as option 1 above, but without trying to adhere to existing aggregation/grouping semantics. I think this variant is the likely route if option 1 turns out to be infeasible for some reason.
3. Based on my (potentially flawed) observation that `DISTINCT ON` is a subset of window expressions, i.e. that
```
select distinct on (username) username, browser, logged_at from logins order by username, logged_at desc
```
can be re-written as
```
select distinct 
    first_value(username) over w as username, 
    first_value(browser) over w as browser, 
    first_value(logged_at) over w as logged_at 
from logins 
window w as (partition by username order by username, logged_at desc) 
order by username, logged_at desc
```
this could be used to construct an alternative logical/physical plan representation using already existing primitives (i.e. window plans). Besides being hacky, I think this is sub-optimal, since with window functions the entire table seems to be materialized and only afterwards de-duplicated by the `DISTINCT` (which is represented via a grouping logical plan after optimizations). Alternatively, the window physical plan could be revised to take into account a limit per window, which would solve this problem and perhaps also address #6899 (Optimize "per partition" top-k: ROW_NUMBER < 5 / TopK).
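
The window-function rewrite from option 3 can be tried out on any engine with window support. The sketch below runs it through Python's built-in `sqlite3` module as a stand-in (an assumption: SQLite 3.25 or newer, and nothing here exercises DataFusion itself); the table contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table logins (username text, browser text, logged_at integer)"
)
# Illustrative data only.
conn.executemany(
    "insert into logins values (?, ?, ?)",
    [
        ("alice", "firefox", 1),
        ("bob", "chrome", 2),
        ("alice", "safari", 3),
    ],
)

# The rewrite quoted above: every row of a partition is mapped to the
# partition's first row, and DISTINCT then collapses the duplicates.
WINDOW_REWRITE = """
select distinct
    first_value(username) over w as username,
    first_value(browser) over w as browser,
    first_value(logged_at) over w as logged_at
from logins
window w as (partition by username order by username, logged_at desc)
order by username, logged_at desc
"""

window_rows = conn.execute(WINDOW_REWRITE).fetchall()
conn.close()
```

On this data the query returns one row per `username`, carrying the most recent login, which is what the `DISTINCT ON` form is meant to produce. It also makes the cost concern visible: the window values are computed for every input row before the `DISTINCT` removes the duplicates.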

UPDATE: I realized that the first_value aggregation function has a corresponding GroupsAccumulator via the GroupsAccumulatorAdapter, so that seems like the ideal way to pursue option 1.
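
As a rough illustration of the idea behind option 1, the following Python sketch keeps one "winning" row per group while consuming input batch by batch, without sorting the whole table. The class `FirstValueByGroup` and its method names are hypothetical and only loosely inspired by the accumulator idea; they are not DataFusion's `GroupsAccumulator` API.

```python
class FirstValueByGroup:
    """Hypothetical per-group argmin/argmax accumulator (illustration only)."""

    def __init__(self, descending=False):
        self.descending = descending
        self._best = {}  # group key -> (order value, row)

    def _comes_first(self, candidate, incumbent):
        # With a descending order the larger value sorts first.
        if self.descending:
            return candidate > incumbent
        return candidate < incumbent

    def update_batch(self, group_keys, order_values, rows):
        """Fold one batch into the state, keeping the first row per group."""
        for key, order_value, row in zip(group_keys, order_values, rows):
            current = self._best.get(key)
            if current is None or self._comes_first(order_value, current[0]):
                self._best[key] = (order_value, row)

    def evaluate(self):
        """Return the retained row for every group seen so far."""
        return {key: row for key, (_, row) in self._best.items()}


# Two made-up batches of (username, browser, logged_at) rows,
# grouped by username and ordered by logged_at descending.
acc = FirstValueByGroup(descending=True)
acc.update_batch(
    ["alice", "bob"], [1, 2], [("alice", "firefox", 1), ("bob", "chrome", 2)]
)
acc.update_batch(["alice"], [3], [("alice", "safari", 3)])
result = acc.evaluate()
```

The state grows with the number of distinct groups rather than with the number of input rows, which is the main appeal of this route compared to the window-based rewrite.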

universalmind303 added the enhancement New feature or request label Oct 15, 2023

universalmind303 mentioned this issue Oct 15, 2023: Failing PRQL integration tests GlareDB/glaredb#1918 (Open, 5 tasks)

gruuya mentioned this issue Oct 30, 2023: Implement DISTINCT ON from Postgres #7981 (Merged)

alamb closed this as completed in #7981 Nov 13, 2023
