Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement window functions with order_by clause #520

Merged
merged 1 commit into from
Jun 16, 2021

Conversation

jimexist
Copy link
Member

@jimexist jimexist commented Jun 7, 2021

Which issue does this PR close?

Closes #360

for now this pull request relies on arrow 4.3.0 to merge

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@jimexist jimexist changed the title Impl window order by WIP Impl window order by Jun 7, 2021
@jimexist jimexist force-pushed the impl-window-order-by branch 8 times, most recently from 1511b62 to 6cc7f96 Compare June 10, 2021 15:18
@codecov-commenter
Copy link

codecov-commenter commented Jun 10, 2021

Codecov Report

Merging #520 (2650680) into master (d382854) will decrease coverage by 0.09%.
The diff coverage is 75.15%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #520      +/-   ##
==========================================
- Coverage   76.09%   75.99%   -0.10%     
==========================================
  Files         156      156              
  Lines       27047    27036      -11     
==========================================
- Hits        20581    20547      -34     
- Misses       6466     6489      +23     
Impacted Files Coverage Δ
datafusion/src/physical_plan/window_functions.rs 86.42% <ø> (+0.71%) ⬆️
datafusion/src/sql/planner.rs 84.85% <ø> (ø)
datafusion/src/physical_plan/planner.rs 79.35% <33.33%> (+1.81%) ⬆️
datafusion/src/physical_plan/mod.rs 76.72% <55.55%> (-2.37%) ⬇️
datafusion/src/physical_plan/windows.rs 82.57% <70.90%> (-3.90%) ⬇️
...afusion/src/physical_plan/expressions/nth_value.rs 78.78% <74.28%> (-11.69%) ⬇️
datafusion/src/execution/context.rs 92.04% <100.00%> (+0.03%) ⬆️
...fusion/src/physical_plan/expressions/row_number.rs 94.28% <100.00%> (+13.03%) ⬆️
datafusion/src/physical_plan/hash_aggregate.rs 86.54% <100.00%> (ø)
datafusion/src/scalar.rs 56.19% <100.00%> (ø)
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d382854...2650680. Read the comment docs.

@jimexist jimexist force-pushed the impl-window-order-by branch 4 times, most recently from 3d76e38 to 2a038e1 Compare June 13, 2021 09:31
@jimexist jimexist changed the title WIP Impl window order by Implement window functions with order_by clause Jun 13, 2021
@jimexist jimexist marked this pull request as ready for review June 13, 2021 10:12
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very nice @jimexist -- I went over the code and saw only goodness :)

All that this PR needs to be mergeable in my opinion is to reset the Cargo arrow* references (now that arrow 4.3.0 has been released)

ballista/rust/core/Cargo.toml Outdated Show resolved Hide resolved
4,
)
.await?;
// result in one batch, although e.g. having 2 batches do not change
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

if num_rows == 0 {
return Ok(new_empty_array(value.data_type()));
}
let index: usize = match self.kind {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

value.len()
)));
}
if num_rows == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this function could ever be passed a 0 row input? This check isn't a problem I am just wondering if my mental model is correct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will be changed in later pull request

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but you are right this would not be passed with 0 length input. this check is just being pedantic.

let arr: ArrayRef = Arc::new(Int32Array::from(vec![1, -2, 3, -4, 5, -6, 7, 8]));
let values = vec![arr];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test change shows the nice refactoring

/// the accumulator expects the same number of arguments as `expressions` and must
/// return states with the same description as `state_fields`
fn create_accumulator(&self) -> Result<Box<dyn WindowAccumulator>>;

/// expressions that are passed to the WindowAccumulator.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The WindowExpr trait is looking 👌

/// A window expression that is a built-in window function.
///
/// Note that unlike aggregation based window functions, built-in window functions normally ignore
/// window frame spec, with th expression of first_value, last_value, and nth_value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// window frame spec, with th expression of first_value, last_value, and nth_value.
/// window frame spec, with the exception of first_value, last_value, and nth_value.

/// peer based evaluation based on the fact that batch is pre-sorted given the sort columns
/// and then per partition point we'll evaluate the peer group (e.g. SUM or MAX gives the same
/// results for peers) and concatenate the results.
fn peer_based_evaluate(&self, batch: &RecordBatch) -> Result<ArrayRef> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the naming of peer here (rather than range_based_evaluate for example, to match with WindowFrameUnits::Range)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will possibly change this naming in implementing #361 but for the moment, range and groups both evaluates with peers but rows evaluates based on rows on each scan.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this is private function i guess i can leave the naming part for later changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think it is fine for now

let len = value_range.end - value_range.start;
let values = values
.iter()
.map(|v| v.slice(value_range.start, len))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

-- See the License for the specific language gOVERning permissions and
-- limitations under the License.

SELECT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice 👍

@alamb alamb added the datafusion Changes in the datafusion crate label Jun 14, 2021
@jimexist
Copy link
Member Author

jimexist commented Jun 15, 2021

This looks very nice @jimexist -- I went over the code and saw only goodness :)

All that this PR needs to be mergeable in my opinion is to reset the Cargo arrow* references (now that arrow 4.3.0 has been released)

thank you for taking time to review. the changes to arrow references are now reverted.

@jimexist
Copy link
Member Author

@alamb this pull request is ready now

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jimexist

/// peer based evaluation based on the fact that batch is pre-sorted given the sort columns
/// and then per partition point we'll evaluate the peer group (e.g. SUM or MAX gives the same
/// results for peers) and concatenate the results.
fn peer_based_evaluate(&self, batch: &RecordBatch) -> Result<ArrayRef> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think it is fine for now

@alamb alamb merged commit 51e5445 into apache:master Jun 16, 2021
@jimexist jimexist deleted the impl-window-order-by branch June 16, 2021 14:08
@houqp houqp added the enhancement New feature or request label Jul 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support window functions with order by
5 participants