-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add first_value
, last_value
, and nth_value
built-in window functions
#403
Conversation
Codecov Report
@@ Coverage Diff @@
## master #403 +/- ##
==========================================
+ Coverage 74.94% 75.42% +0.48%
==========================================
Files 146 148 +2
Lines 24314 24595 +281
==========================================
+ Hits 18223 18552 +329
+ Misses 6091 6043 -48
Continue to review full report at Codecov.
|
a0d43c6
to
2355533
Compare
2355533
to
c7cc798
Compare
c7cc798
to
d1f53ce
Compare
@jimexist I plan to review this PR later today or tomorrow |
1e543bb
to
8daa8b7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jimexist -- the code structure looks clean to me . Nice work
However, I am concerned about the correctness of these results. As I understand it, first_value
, last_value
and nth_value
are not well defined unless there is an ordering on the window (aka without an ordering you basically can get some arbitrary value from the window).
I wonder if it would make sense to implement ordering for windows first, so we can write tests will well defined output
I also see some change to the parquet-testing
module which I wonder if that was intended
|
||
#[derive(Debug)] | ||
struct NthValueAccumulator { | ||
// n the target nth_value, however we'll reuse it for last_value acc, so when n == 0 it specifically |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if a more idomatic Rust way of doing this would be an enum like
enum NthValue {
First,
Last,
Nth(u32)
}
And then in NthValueAccumulator::scan
you would have something like
match self.n {
NthValue::First|NthValue::Nth(1) => {...}
Nth::Last => {..}
}
There is nothing wrong with the special value approach either -- I just figured I would point it out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I agree that sounds like a more idiomatic way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to address later: #448
use std::convert::TryFrom; | ||
use std::sync::Arc; | ||
|
||
/// first_value expression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given how similar FirstValue
, LastValue
and NthValue
are, would it possible to just use the NthValue
struct rather than have three mostly repeated structs?
} | ||
|
||
impl WindowAccumulator for NthValueAccumulator { | ||
fn scan(&mut self, values: &[ScalarValue]) -> Result<Option<ScalarValue>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we guaranteed that scan()
will see the entire window in a single call? Or would it be possible to see scan()
called for two different slices?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Future performance improvement idea: rather than converting each row to ScalarValue
and passing values: &[ScalarValue]
we should strive to slice into the original arrays (i.e. don't do a lot of work / don't copy data) and produce arrays based on offsets and then use take
to build new arrays out of all of the indices (in this case 1, but in case of smaller windows / partitions etc. this could grow by a lot).
fn scan(&mut self, values: &[ScalarValue]) -> Result<Option<ScalarValue>> { | ||
if self.n == SPECIAL_SIZE_VALUE_FOR_LAST { | ||
// for last_value function | ||
self.value = values[0].clone(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be values.last()
rather than the first (0
th) value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here the values array is per row
|
||
#[tokio::test] | ||
async fn window_function() -> Result<()> { | ||
let (input, schema) = create_test_schema(1)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be valuable to create a test with more than one partition (so that the data may not arrive to the WindowFunction
as a single RecordBatch
)
vec!["1", "781", "7.81", "100", "125", "-117"], | ||
vec!["1", "781", "7.81", "100", "125", "-117"], | ||
vec!["1", "781", "7.81", "100", "125", "-117"], | ||
vec!["1", "781", "7.81", "100", "125", "-117", "1", "30", "-40"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about these results. "first_value", "last_value" and "nth_value" over an unsorted window (as this query is is using) seems undefined to me.
For example, Using datafusion-cli
:
CREATE EXTERNAL TABLE aggregate_test_100 (
c1 VARCHAR NOT NULL,
c2 INT NOT NULL,
c3 SMALLINT NOT NULL,
c4 SMALLINT NOT NULL,
c5 INT NOT NULL,
c6 BIGINT NOT NULL,
c7 SMALLINT NOT NULL,
c8 INT NOT NULL,
c9 BIGINT NOT NULL,
c10 VARCHAR NOT NULL,
c11 FLOAT NOT NULL,
c12 DOUBLE NOT NULL,
c13 VARCHAR NOT NULL
)
STORED AS CSV
WITH HEADER ROW
LOCATION '/Users/alamb/Software/arrow-datafusion/testing/data/csv/';
You can see there are many values for c3
, and the values of first_value
, last_value
and nth_value
seem to be picking some arbitrary rows:
> select c2, c3 from aggregate_test_100 order by c2;
+----+------+
| c2 | c3 |
+----+------+
| 1 | 29 |
| 1 | -85 |
| 1 | 38 |
| 1 | 57 |
| 1 | 54 |
| 1 | 103 |
| 1 | -98 |
| 1 | -99 |
| 1 | -25 |
| 1 | 36 |
| 1 | 41 |
| 1 | -8 |
| 1 | -24 |
| 1 | 125 |
| 1 | 70 |
| 1 | -72 |
| 1 | 71 |
| 1 | -56 |
| 1 | -5 |
| 1 | 12 |
| 1 | 83 |
| 1 | 120 |
| 2 | 1 |
| 2 | 113 |
| 2 | 49 |
| 2 | 97 |
| 2 | -29 |
| 2 | 45 |
| 2 | -60 |
| 2 | 93 |
| 2 | 63 |
| 2 | 52 |
| 2 | 31 |
| 2 | -106 |
| 2 | -60 |
| 2 | 68 |
| 2 | -61 |
| 2 | 122 |
| 2 | -48 |
| 2 | 52 |
| 2 | -117 |
| 2 | 29 |
| 2 | -107 |
| 2 | -43 |
| 3 | 104 |
| 3 | 13 |
| 3 | 112 |
| 3 | 77 |
| 3 | 17 |
| 3 | 13 |
| 3 | 73 |
| 3 | -2 |
| 3 | 22 |
| 3 | 17 |
| 3 | -76 |
| 3 | 71 |
| 3 | 14 |
| 3 | -12 |
| 3 | -72 |
| 3 | 97 |
| 3 | -101 |
| 3 | -95 |
| 3 | 123 |
| 4 | -111 |
| 4 | -38 |
| 4 | -54 |
| 4 | -56 |
| 4 | -53 |
| 4 | 123 |
| 4 | 97 |
| 4 | 102 |
| 4 | 65 |
| 4 | 17 |
| 4 | 55 |
| 4 | 73 |
| 4 | -117 |
| 4 | -101 |
| 4 | -79 |
| 4 | 74 |
| 4 | 96 |
| 4 | -90 |
| 4 | -59 |
| 4 | 3 |
| 4 | 5 |
| 4 | 47 |
| 4 | 30 |
| 5 | -40 |
| 5 | -82 |
| 5 | 36 |
| 5 | -31 |
| 5 | -5 |
| 5 | 68 |
| 5 | -59 |
| 5 | 62 |
| 5 | -94 |
| 5 | 64 |
| 5 | -86 |
| 5 | 118 |
| 5 | -101 |
| 5 | -44 |
+----+------+
It is not clear that 1
, 30
and -40
are the "right" answers (there is no good answer for this dataset)
I guess it's not arbitrary but rather just take the ordering as is. When #425 is merged I'll add one test case to compare with psql so the behavior is consistent. Of course when ordering clause is implemented then the behavior can also be tested in the same way, along with some unit test. I plan to implement ordering after this and the lead/lag PR are merged because it requires some structural changes to the planner. |
See also #429 Both can be independently merged first because with the sort clause implemented these two logic shall stay unchanged as sorting happens as a separate physical plan that precedes these and feeds immediately to these. Regarding the submodule change it's not intended - will revert. |
I think the implementation might take the order as is, however the order as given by the underlying plan might give different result. For example, a table scan might give the results in a different order each time it runs. |
Sounds good -- makes sense. As @Dandandan says what this means to me is that there is no well defined "correct result" for the query (and thus the test is ensuring the code doesn't crash, but doesn't really ensure it is getting the correct values)
👍
I think that makes sense. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should merge this PR as is. Rationale:
- while I have concerns about the "correctness" of this implementation, it really represents an incremental step towards full window function support
- I also trust that @jimexist has a plan for verifying correctness subsequently and that we aren't going to leave this in a half done state
So I think we should merge this one in and let @jimexist keep on coding 👍
Thanks I'll make sure to figure out a way to consistently test these and also try to stay on par with psql. I'm AFK until Monday so we can either way until then or feel free to help me revert that submodule change and merge |
commit 7fb3640 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 16:38:25 2021 +0800 row number done commit 1723926 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 16:05:50 2021 +0800 add row number commit bf5b8a5 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 15:04:49 2021 +0800 save commit d2ce852 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 14:53:05 2021 +0800 add streams commit 0a861a7 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Thu May 20 22:28:34 2021 +0800 save stream commit a9121af Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Thu May 20 22:01:51 2021 +0800 update unit test commit 2af2a27 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 14:25:12 2021 +0800 fix unit test commit bb57c76 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 14:23:34 2021 +0800 use upper case commit 5d96e52 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 14:16:16 2021 +0800 fix unit test commit 1ecae8f Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 12:27:26 2021 +0800 fix unit test commit bc2271d Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 10:04:29 2021 +0800 fix error commit 880b94f Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 08:24:00 2021 +0800 fix unit test commit 4e792e1 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 08:05:17 2021 +0800 fix test commit c36c04a Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Fri May 21 00:07:54 2021 +0800 add more tests commit f5e64de Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Thu May 20 23:41:36 2021 +0800 update commit a1eae86 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Thu May 20 23:36:15 2021 +0800 enrich unit test commit 0d2a214 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Thu May 20 23:25:43 2021 +0800 adding filter by todo commit 8b486d5 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Thu May 20 23:17:22 2021 +0800 adding more built-in functions commit abf08cd Author: Jiayu Liu <Jimexist@users.noreply.github.com> Date: Thu May 20 22:36:27 2021 +0800 Update datafusion/src/physical_plan/window_functions.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> commit 0cbca53 Author: Jiayu Liu <Jimexist@users.noreply.github.com> Date: Thu May 20 22:34:57 2021 +0800 Update datafusion/src/physical_plan/window_functions.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> commit 831c069 Author: Jiayu Liu <Jimexist@users.noreply.github.com> Date: Thu May 20 22:34:04 2021 +0800 Update datafusion/src/logical_plan/builder.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> commit f70c739 Author: Jiayu Liu <Jimexist@users.noreply.github.com> Date: Thu May 20 22:33:04 2021 +0800 Update datafusion/src/logical_plan/builder.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> commit 3ee87aa Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Wed May 19 22:55:08 2021 +0800 fix unit test commit 5c4d92d Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Wed May 19 22:48:26 2021 +0800 fix clippy commit a0b7526 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Wed May 19 22:46:38 2021 +0800 fix unused imports commit 1d3b076 Author: Jiayu Liu <jiayu.liu@airbnb.com> Date: Thu May 13 18:51:14 2021 +0800 add window expr
8daa8b7
to
0c5efac
Compare
@Dandandan this is reverted |
Which issue does this PR close?
add first_value, last_value, and nth_value
Closes #298
Based on #375 so review that first
Rationale for this change
adding three window functions
What changes are included in this PR?
Are there any user-facing changes?