Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sql support for DISTINCT ON #7827

Closed
universalmind303 opened this issue Oct 15, 2023 · 1 comment · Fixed by #7981
Closed

Add sql support for DISTINCT ON #7827

universalmind303 opened this issue Oct 15, 2023 · 1 comment · Fixed by #7981
Labels
enhancement New feature or request

Comments

@universalmind303
Copy link
Contributor

Is your feature request related to a problem or challenge?

It'd be nice to be able to use DISTINCT ON expressions

Describe the solution you'd like

sql support for DISTINCT ON

Describe alternatives you've considered

No response

Additional context

https://www.geeksforgeeks.org/postgresql-distinct-on-expression/

@gruuya
Copy link
Contributor

gruuya commented Oct 23, 2023

I'd like to take a shot at this if there're no other takers. For reference, here's a formal description of this clause from the Postgres docs:
SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the “first row” of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.

From what I see, the most promising options to implement this are:

  1. Add a new GroupsAccumulator that accumulates only a single value based on a given (ORDER BY) expression, i.e. something like argmax/argmin. In other words the above functionality could be obtained if one projects the DISTINCT ON expressions alongside the selection expressions, and then argmaxes them based on the ORDER BY expressions (if none are present the problem reduces to a simple group by). This might enable re-use of some aggregation/grouping logic, and could itself be re-used somewhere else.
  2. Implement a custom DISTINCT ON logic/physical plan, that would basically do the same thing as option 1 above, but without trying to adhere to existing aggregation/grouping semantics. I think this variant is likely if option 1 turns out to be infeasible for some reason.
  3. Based on my (potentially flawed) observation that DISTINCT ON is a subset of window expressions, i.e. that
    select distinct on (username) username, browser, logged_at from logins order by username, logged_at desc
    can be re-written as
    select distinct 
        first_value(username) over w as username, 
        first_value(browser) over w as browser, 
        first_value(logged_at) over w as logged_at 
    from logins 
    window w as (partition by username order by username, logged_at desc) 
    order by username, logged_at desc
    this could be used to construct an alternative logical/physical plan representation using the already existing primitives (i.e. window plans). Besides being hacky, I think this is sub-optimal since in the case of window functions the entire table seems to be materialized, and only after that DISTINCTd (which is represented via a grouping logical plan after optimizations). Alternatively, the window physical plan could be revised to take into account a limit per window, which solve this problem and perhaps address Optimize "per partition" top-k : ROW_NUMBER < 5 / TopK #6899.

UPDATE: I realized that the first_value aggregation function has a corresponding GroupsAccumulator via the GroupsAccumulatorAdapter, so that seems like the ideal way to pursue option 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants