Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement lead and lag built-in window function #429

Merged
merged 3 commits into from
Jul 2, 2021

Conversation

jimexist
Copy link
Member

@jimexist jimexist commented May 26, 2021

Which issue does this PR close?

implement lead and lag built-in window function.

based on #520 so review that first

Closes #553

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@jimexist jimexist force-pushed the add-lag-and-lead branch from b78b355 to 26ca0fe Compare May 26, 2021 20:51
@codecov-commenter
Copy link

codecov-commenter commented May 26, 2021

Codecov Report

Merging #429 (f763205) into master (321fda4) will decrease coverage by 0.01%.
The diff coverage is 65.47%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #429      +/-   ##
==========================================
- Coverage   75.16%   75.15%   -0.02%     
==========================================
  Files         150      152       +2     
  Lines       25144    25357     +213     
==========================================
+ Hits        18899    19056     +157     
- Misses       6245     6301      +56     
Impacted Files Coverage Δ
ballista/rust/client/src/columnar_batch.rs 0.00% <ø> (ø)
ballista/rust/core/src/serde/scheduler/mod.rs 14.28% <ø> (ø)
datafusion/src/physical_plan/expressions/mod.rs 71.42% <ø> (ø)
...tafusion/src/physical_plan/expressions/lead_lag.rs 38.29% <38.29%> (ø)
...afusion/src/physical_plan/expressions/nth_value.rs 70.76% <70.76%> (ø)
datafusion/src/physical_plan/windows.rs 78.32% <72.11%> (+4.54%) ⬆️
datafusion/src/physical_plan/mod.rs 79.09% <100.00%> (+0.38%) ⬆️
datafusion/tests/sql.rs 99.89% <100.00%> (ø)
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 321fda4...f763205. Read the comment docs.

@alamb
Copy link
Contributor

alamb commented Jun 1, 2021

I plan to review this PR tomorrow

@jimexist jimexist force-pushed the add-lag-and-lead branch 2 times, most recently from 9d153a7 to 737c2dd Compare June 2, 2021 07:44
@jimexist
Copy link
Member Author

jimexist commented Jun 2, 2021

Actually let's park this pull request for a while - I plan to implement sort and partition first and then window frame, after which the window shift approach might not be relevant.

@jimexist jimexist force-pushed the add-lag-and-lead branch from 737c2dd to 225c7ec Compare June 4, 2021 15:27
@alamb alamb added the datafusion Changes in the datafusion crate label Jun 4, 2021
@jimexist jimexist force-pushed the add-lag-and-lead branch 8 times, most recently from a4523e6 to f676db8 Compare June 13, 2021 15:00
@jimexist
Copy link
Member Author

Actually let's park this pull request for a while - I plan to implement sort and partition first and then window frame, after which the window shift approach might not be relevant.

now that #520 is implemented, this PR is ready

@jimexist jimexist marked this pull request as ready for review June 13, 2021 15:03
@jimexist
Copy link
Member Author

putting this back to draft as this relies on apache/arrow-rs#388 which is not yet in arrow 4.3

@jimexist jimexist marked this pull request as draft June 15, 2021 01:32
@alamb
Copy link
Contributor

alamb commented Jun 15, 2021

putting this back to draft as this relies on apache/arrow-rs#388 which is not yet in arrow 4.3

Oh no!

Can we possibly use the API that is in Arrow 4.3 (and then we can upgrade datafusion to use the new api when the next version of Arrow comes out)?

@jimexist
Copy link
Member Author

putting this back to draft as this relies on apache/arrow-rs#388 which is not yet in arrow 4.3

Oh no!

Can we possibly use the API that is in Arrow 4.3 (and then we can upgrade datafusion to use the new api when the next version of Arrow comes out)?

I don't mind parking this one here for a while since there would be many other window frame stuff to be done before revisiting this and by that time newer version would be released

@alamb
Copy link
Contributor

alamb commented Jun 15, 2021

I don't mind parking this one here for a while since there would be many other window frame stuff to be done before revisiting this and by that time newer version would be released

Ok, thank you. The plan is to do a 4.4 release in ~ 2 weeks

@jimexist jimexist force-pushed the add-lag-and-lead branch 5 times, most recently from e2d40bc to 9f78341 Compare June 23, 2021 00:34
@jimexist jimexist force-pushed the add-lag-and-lead branch 4 times, most recently from ca475b4 to 1fae443 Compare June 28, 2021 23:54
@jimexist jimexist marked this pull request as ready for review June 28, 2021 23:59
@jimexist
Copy link
Member Author

@alamb and @Dandandan this pull request is ready now

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a great start @jimexist -- I do have a question if this will generate the correct answer with multiple partitions.

Also, I suggest an end-to-end integration test using your great harness, but I suspect you plan to do so in a subsequent PR :) 👍

impl PartitionEvaluator for WindowShiftEvaluator {
fn evaluate_partition(&self, _partition: Range<usize>) -> Result<ArrayRef> {
let value = &self.values[0];
shift(value.as_ref(), self.shift_offset).map_err(DataFusionError::ArrowError)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you need to restrict the window to the partition bounds? If the input array had 10 rows in 2 partitions, wouldn't this code produce 2 output partitions of 10 rows each (rather than 2 output partitions of 5 rows each)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb good catch, this is fixed and add with integration tests.

@jimexist jimexist force-pushed the add-lag-and-lead branch from 1fae443 to 29fdc24 Compare July 1, 2021 00:21
@alamb
Copy link
Contributor

alamb commented Jul 1, 2021

Thanks @jimexist -- I ran out of time today but will check this out tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jimexist !

@alamb alamb merged commit c0de9bb into apache:master Jul 2, 2021
@jimexist jimexist deleted the add-lag-and-lead branch July 3, 2021 03:15
@houqp houqp added the enhancement New feature or request label Jul 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

implement lead and lag window functions
5 participants