Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add first_value, last_value, and nth_value built-in window functions #403

Merged
merged 1 commit into from
May 31, 2021

Conversation

jimexist
Copy link
Member

@jimexist jimexist commented May 24, 2021

Which issue does this PR close?

add first_value, last_value, and nth_value

Closes #298

Based on #375 so review that first

Rationale for this change

adding three window functions

What changes are included in this PR?

Are there any user-facing changes?

@jimexist jimexist marked this pull request as draft May 24, 2021 01:07
@codecov-commenter
Copy link

Codecov Report

Merging #403 (59663eb) into master (174226c) will increase coverage by 0.48%.
The diff coverage is 73.23%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #403      +/-   ##
==========================================
+ Coverage   74.94%   75.42%   +0.48%     
==========================================
  Files         146      148       +2     
  Lines       24314    24595     +281     
==========================================
+ Hits        18223    18552     +329     
+ Misses       6091     6043      -48     
Impacted Files Coverage Δ
datafusion/src/physical_plan/expressions/mod.rs 71.42% <ø> (ø)
...fusion/src/physical_plan/expressions/row_number.rs 0.00% <0.00%> (ø)
datafusion/src/physical_plan/hash_aggregate.rs 85.21% <ø> (ø)
datafusion/src/physical_plan/sort.rs 92.07% <ø> (ø)
datafusion/src/physical_plan/window_functions.rs 86.95% <64.70%> (-1.77%) ⬇️
datafusion/src/physical_plan/mod.rs 80.58% <68.75%> (-2.18%) ⬇️
...afusion/src/physical_plan/expressions/nth_value.rs 70.76% <70.76%> (ø)
datafusion/src/physical_plan/windows.rs 82.12% <88.09%> (+82.12%) ⬆️
datafusion/src/execution/context.rs 92.07% <100.00%> (+0.02%) ⬆️
datafusion/src/physical_plan/planner.rs 80.45% <100.00%> (+3.94%) ⬆️
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 174226c...59663eb. Read the comment docs.

@jimexist jimexist force-pushed the add-row-number-exec branch 2 times, most recently from a0d43c6 to 2355533 Compare May 24, 2021 14:36
@jimexist jimexist marked this pull request as ready for review May 24, 2021 14:56
@jimexist jimexist force-pushed the add-row-number-exec branch from 2355533 to c7cc798 Compare May 25, 2021 01:46
@jimexist jimexist marked this pull request as draft May 25, 2021 07:04
@jimexist jimexist force-pushed the add-row-number-exec branch from c7cc798 to d1f53ce Compare May 26, 2021 05:25
@jimexist jimexist marked this pull request as ready for review May 26, 2021 05:29
@alamb
Copy link
Contributor

alamb commented May 26, 2021

@jimexist I plan to review this PR later today or tomorrow

@jimexist jimexist force-pushed the add-row-number-exec branch 2 times, most recently from 1e543bb to 8daa8b7 Compare May 26, 2021 20:47
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jimexist -- the code structure looks clean to me . Nice work

However, I am concerned about the correctness of these results. As I understand it, first_value, last_value and nth_value are not well defined unless there is an ordering on the window (aka without an ordering you basically can get some arbitrary value from the window).

I wonder if it would make sense to implement ordering for windows first, so we can write tests will well defined output

I also see some change to the parquet-testing module which I wonder if that was intended


#[derive(Debug)]
struct NthValueAccumulator {
// n the target nth_value, however we'll reuse it for last_value acc, so when n == 0 it specifically
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if a more idomatic Rust way of doing this would be an enum like

enum NthValue {
  First,
  Last,
  Nth(u32)
}

And then in NthValueAccumulator::scan you would have something like

match self.n {
  NthValue::First|NthValue::Nth(1)  => {...}
  Nth::Last => {..}
}

There is nothing wrong with the special value approach either -- I just figured I would point it out

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree that sounds like a more idiomatic way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to address later: #448

use std::convert::TryFrom;
use std::sync::Arc;

/// first_value expression
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given how similar FirstValue, LastValue and NthValue are, would it possible to just use the NthValue struct rather than have three mostly repeated structs?

}

impl WindowAccumulator for NthValueAccumulator {
fn scan(&mut self, values: &[ScalarValue]) -> Result<Option<ScalarValue>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we guaranteed that scan() will see the entire window in a single call? Or would it be possible to see scan() called for two different slices?

Copy link
Contributor

@Dandandan Dandandan May 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Future performance improvement idea: rather than converting each row to ScalarValue and passing values: &[ScalarValue] we should strive to slice into the original arrays (i.e. don't do a lot of work / don't copy data) and produce arrays based on offsets and then use take to build new arrays out of all of the indices (in this case 1, but in case of smaller windows / partitions etc. this could grow by a lot).

fn scan(&mut self, values: &[ScalarValue]) -> Result<Option<ScalarValue>> {
if self.n == SPECIAL_SIZE_VALUE_FOR_LAST {
// for last_value function
self.value = values[0].clone();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be values.last() rather than the first (0th) value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here the values array is per row


#[tokio::test]
async fn window_function() -> Result<()> {
let (input, schema) = create_test_schema(1)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be valuable to create a test with more than one partition (so that the data may not arrive to the WindowFunction as a single RecordBatch)

vec!["1", "781", "7.81", "100", "125", "-117"],
vec!["1", "781", "7.81", "100", "125", "-117"],
vec!["1", "781", "7.81", "100", "125", "-117"],
vec!["1", "781", "7.81", "100", "125", "-117", "1", "30", "-40"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about these results. "first_value", "last_value" and "nth_value" over an unsorted window (as this query is is using) seems undefined to me.

For example, Using datafusion-cli:

    CREATE EXTERNAL TABLE aggregate_test_100 (
        c1  VARCHAR NOT NULL,
        c2  INT NOT NULL,
        c3  SMALLINT NOT NULL,
        c4  SMALLINT NOT NULL,
        c5  INT NOT NULL,
        c6  BIGINT NOT NULL,
        c7  SMALLINT NOT NULL,
        c8  INT NOT NULL,
        c9  BIGINT NOT NULL,
        c10 VARCHAR NOT NULL,
        c11 FLOAT NOT NULL,
        c12 DOUBLE NOT NULL,
        c13 VARCHAR NOT NULL
    )
    STORED AS CSV
    WITH HEADER ROW
    LOCATION '/Users/alamb/Software/arrow-datafusion/testing/data/csv/';

You can see there are many values for c3, and the values of first_value, last_value and nth_value seem to be picking some arbitrary rows:

> select c2, c3 from aggregate_test_100 order by c2;
+----+------+
| c2 | c3   |
+----+------+
| 1  | 29   |
| 1  | -85  |
| 1  | 38   |
| 1  | 57   |
| 1  | 54   |
| 1  | 103  |
| 1  | -98  |
| 1  | -99  |
| 1  | -25  |
| 1  | 36   |
| 1  | 41   |
| 1  | -8   |
| 1  | -24  |
| 1  | 125  |
| 1  | 70   |
| 1  | -72  |
| 1  | 71   |
| 1  | -56  |
| 1  | -5   |
| 1  | 12   |
| 1  | 83   |
| 1  | 120  |
| 2  | 1    |
| 2  | 113  |
| 2  | 49   |
| 2  | 97   |
| 2  | -29  |
| 2  | 45   |
| 2  | -60  |
| 2  | 93   |
| 2  | 63   |
| 2  | 52   |
| 2  | 31   |
| 2  | -106 |
| 2  | -60  |
| 2  | 68   |
| 2  | -61  |
| 2  | 122  |
| 2  | -48  |
| 2  | 52   |
| 2  | -117 |
| 2  | 29   |
| 2  | -107 |
| 2  | -43  |
| 3  | 104  |
| 3  | 13   |
| 3  | 112  |
| 3  | 77   |
| 3  | 17   |
| 3  | 13   |
| 3  | 73   |
| 3  | -2   |
| 3  | 22   |
| 3  | 17   |
| 3  | -76  |
| 3  | 71   |
| 3  | 14   |
| 3  | -12  |
| 3  | -72  |
| 3  | 97   |
| 3  | -101 |
| 3  | -95  |
| 3  | 123  |
| 4  | -111 |
| 4  | -38  |
| 4  | -54  |
| 4  | -56  |
| 4  | -53  |
| 4  | 123  |
| 4  | 97   |
| 4  | 102  |
| 4  | 65   |
| 4  | 17   |
| 4  | 55   |
| 4  | 73   |
| 4  | -117 |
| 4  | -101 |
| 4  | -79  |
| 4  | 74   |
| 4  | 96   |
| 4  | -90  |
| 4  | -59  |
| 4  | 3    |
| 4  | 5    |
| 4  | 47   |
| 4  | 30   |
| 5  | -40  |
| 5  | -82  |
| 5  | 36   |
| 5  | -31  |
| 5  | -5   |
| 5  | 68   |
| 5  | -59  |
| 5  | 62   |
| 5  | -94  |
| 5  | 64   |
| 5  | -86  |
| 5  | 118  |
| 5  | -101 |
| 5  | -44  |
+----+------+

It is not clear that 1, 30 and -40 are the "right" answers (there is no good answer for this dataset)

@jimexist
Copy link
Member Author

Thank you @jimexist -- the code structure looks clean to me . Nice work

However, I am concerned about the correctness of these results. As I understand it, first_value, last_value and nth_value are not well defined unless there is an ordering on the window (aka without an ordering you basically can get some arbitrary value from the window).

I wonder if it would make sense to implement ordering for windows first, so we can write tests will well defined output

I also see some change to the parquet-testing module which I wonder if that was intended

I guess it's not arbitrary but rather just take the ordering as is. When #425 is merged I'll add one test case to compare with psql so the behavior is consistent.

Of course when ordering clause is implemented then the behavior can also be tested in the same way, along with some unit test.

I plan to implement ordering after this and the lead/lag PR are merged because it requires some structural changes to the planner.

@jimexist
Copy link
Member Author

Thank you @jimexist -- the code structure looks clean to me . Nice work

However, I am concerned about the correctness of these results. As I understand it, first_value, last_value and nth_value are not well defined unless there is an ordering on the window (aka without an ordering you basically can get some arbitrary value from the window).

I wonder if it would make sense to implement ordering for windows first, so we can write tests will well defined output

I also see some change to the parquet-testing module which I wonder if that was intended

I guess it's not arbitrary but rather just take the ordering as is. When #425 is merged I'll add one test case to compare with psql so the behavior is consistent.

Of course when ordering clause is implemented then the behavior can also be tested in the same way, along with some unit test.

I plan to implement ordering after this and the lead/lag PR are merged because it requires some structural changes to the planner.

See also #429

Both can be independently merged first because with the sort clause implemented these two logic shall stay unchanged as sorting happens as a separate physical plan that precedes these and feeds immediately to these.

Regarding the submodule change it's not intended - will revert.

@Dandandan
Copy link
Contributor

Thank you @jimexist -- the code structure looks clean to me . Nice work
However, I am concerned about the correctness of these results. As I understand it, first_value, last_value and nth_value are not well defined unless there is an ordering on the window (aka without an ordering you basically can get some arbitrary value from the window).
I wonder if it would make sense to implement ordering for windows first, so we can write tests will well defined output
I also see some change to the parquet-testing module which I wonder if that was intended

I guess it's not arbitrary but rather just take the ordering as is. When #425 is merged I'll add one test case to compare with psql so the behavior is consistent.

Of course when ordering clause is implemented then the behavior can also be tested in the same way, along with some unit test.

I plan to implement ordering after this and the lead/lag PR are merged because it requires some structural changes to the planner.

I think the implementation might take the order as is, however the order as given by the underlying plan might give different result. For example, a table scan might give the results in a different order each time it runs.

@alamb
Copy link
Contributor

alamb commented May 28, 2021

I guess it's not arbitrary but rather just take the ordering as is. When #425 is merged I'll add one test case to compare with psql so the behavior is consistent.

Sounds good -- makes sense. As @Dandandan says what this means to me is that there is no well defined "correct result" for the query (and thus the test is ensuring the code doesn't crash, but doesn't really ensure it is getting the correct values)

Of course when ordering clause is implemented then the behavior can also be tested in the same way, along with some unit test.

👍

I plan to implement ordering after this and the lead/lag PR are merged because it requires some structural changes to the planner.

I think that makes sense.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should merge this PR as is. Rationale:

  1. while I have concerns about the "correctness" of this implementation, it really represents an incremental step towards full window function support
  2. I also trust that @jimexist has a plan for verifying correctness subsequently and that we aren't going to leave this in a half done state

So I think we should merge this one in and let @jimexist keep on coding 👍

@Dandandan
Copy link
Contributor

@jimexist there is some change in parquet-testing can you revert that change?
Otherwise it's looking good (added some comments), I agree with @alamb we can merge it as is and improving the implementation in following PRs.

Thanks for all the great work so far!!

@jimexist
Copy link
Member Author

@jimexist there is some change in parquet-testing can you revert that change?

Otherwise it's looking good (added some comments), I agree with @alamb we can merge it as is and improving the implementation in following PRs.

Thanks for all the great work so far!!

Thanks I'll make sure to figure out a way to consistently test these and also try to stay on par with psql.

I'm AFK until Monday so we can either way until then or feel free to help me revert that submodule change and merge

commit 7fb3640
Author: Jiayu Liu <jiayu.liu@airbnb.com>
Date:   Fri May 21 16:38:25 2021 +0800

    row number done

commit 1723926
Author: Jiayu Liu <jiayu.liu@airbnb.com>
Date:   Fri May 21 16:05:50 2021 +0800

    add row number

commit bf5b8a5
Author: Jiayu Liu <jiayu.liu@airbnb.com>
Date:   Fri May 21 15:04:49 2021 +0800

    save

commit d2ce852
Author: Jiayu Liu <jiayu.liu@airbnb.com>
Date:   Fri May 21 14:53:05 2021 +0800

    add streams

commit 0a861a7
Author: Jiayu Liu <jiayu.liu@airbnb.com>
Date:   Thu May 20 22:28:34 2021 +0800

    save stream

commit a9121af
Author: Jiayu Liu <jiayu.liu@airbnb.com>
Date:   Thu May 20 22:01:51 2021 +0800

    update unit test

commit 2af2a27
Author: Jiayu Liu <jiayu.liu@airbnb.com>
Date:   Fri May 21 14:25:12 2021 +0800

    fix unit test

commit bb57c76
Author: Jiayu Liu <jiayu.liu@airbnb.com>
Date:   Fri May 21 14:23:34 2021 +0800

    use upper case

commit 5d96e52
Author: Jiayu Liu <jiayu.liu@airbnb.com>
Date:   Fri May 21 14:16:16 2021 +0800

    fix unit test

commit 1ecae8f
Author: Jiayu Liu <jiayu.liu@airbnb.com>
Date:   Fri May 21 12:27:26 2021 +0800

    fix unit test

commit bc2271d
Author: Jiayu Liu <jiayu.liu@airbnb.com>
Date:   Fri May 21 10:04:29 2021 +0800

    fix error

commit 880b94f
Author: Jiayu Liu <jiayu.liu@airbnb.com>
Date:   Fri May 21 08:24:00 2021 +0800

    fix unit test

commit 4e792e1
Author: Jiayu Liu <jiayu.liu@airbnb.com>
Date:   Fri May 21 08:05:17 2021 +0800

    fix test

commit c36c04a
Author: Jiayu Liu <jiayu.liu@airbnb.com>
Date:   Fri May 21 00:07:54 2021 +0800

    add more tests

commit f5e64de
Author: Jiayu Liu <jiayu.liu@airbnb.com>
Date:   Thu May 20 23:41:36 2021 +0800

    update

commit a1eae86
Author: Jiayu Liu <jiayu.liu@airbnb.com>
Date:   Thu May 20 23:36:15 2021 +0800

    enrich unit test

commit 0d2a214
Author: Jiayu Liu <jiayu.liu@airbnb.com>
Date:   Thu May 20 23:25:43 2021 +0800

    adding filter by todo

commit 8b486d5
Author: Jiayu Liu <jiayu.liu@airbnb.com>
Date:   Thu May 20 23:17:22 2021 +0800

    adding more built-in functions

commit abf08cd
Author: Jiayu Liu <Jimexist@users.noreply.github.com>
Date:   Thu May 20 22:36:27 2021 +0800

    Update datafusion/src/physical_plan/window_functions.rs

    Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

commit 0cbca53
Author: Jiayu Liu <Jimexist@users.noreply.github.com>
Date:   Thu May 20 22:34:57 2021 +0800

    Update datafusion/src/physical_plan/window_functions.rs

    Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

commit 831c069
Author: Jiayu Liu <Jimexist@users.noreply.github.com>
Date:   Thu May 20 22:34:04 2021 +0800

    Update datafusion/src/logical_plan/builder.rs

    Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

commit f70c739
Author: Jiayu Liu <Jimexist@users.noreply.github.com>
Date:   Thu May 20 22:33:04 2021 +0800

    Update datafusion/src/logical_plan/builder.rs

    Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

commit 3ee87aa
Author: Jiayu Liu <jiayu.liu@airbnb.com>
Date:   Wed May 19 22:55:08 2021 +0800

    fix unit test

commit 5c4d92d
Author: Jiayu Liu <jiayu.liu@airbnb.com>
Date:   Wed May 19 22:48:26 2021 +0800

    fix clippy

commit a0b7526
Author: Jiayu Liu <jiayu.liu@airbnb.com>
Date:   Wed May 19 22:46:38 2021 +0800

    fix unused imports

commit 1d3b076
Author: Jiayu Liu <jiayu.liu@airbnb.com>
Date:   Thu May 13 18:51:14 2021 +0800

    add window expr
@jimexist jimexist force-pushed the add-row-number-exec branch from 8daa8b7 to 0c5efac Compare May 30, 2021 17:06
@jimexist
Copy link
Member Author

@jimexist there is some change in parquet-testing can you revert that change?
Otherwise it's looking good (added some comments), I agree with @alamb we can merge it as is and improving the implementation in following PRs.

Thanks for all the great work so far!!

@Dandandan this is reverted

@alamb alamb merged commit 2b5b009 into apache:master May 31, 2021
@jimexist jimexist deleted the add-row-number-exec branch May 31, 2021 14:29
@houqp houqp added datafusion Changes in the datafusion crate enhancement New feature or request labels Jul 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support window functions with empty OVER clause
6 participants