-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support min max aggregates in window functions with sliding windows #4675
Conversation
The functions are run with only float64 columns now as in the test case. Support for all types will be implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution @berkaycpp -- this looks like a good start
I have some performance concerns, which I described in detail.
I also think the code needs significantly more test coverage - the sql level MIN/MAX tests are a good start but I don't think they necessarily hit all the corner cases.
The sliding_window
is a fascinating proposal
cc @Ted-Jiang
Thanks @alamb for your feedbacks. We will address them as soon as possible. @Ted-Jiang if you have time, we would like to have your feedback also regarding this design. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than the changes in datafusion/core/src/datasource/file_format/parquet.rs
and datafusion/core/src/datasource/mod.rs
I think this PR looks ready to go.
Thank you for your work on it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @berkaycpp and @mustafasrepo
Benchmark runs are scheduled for baseline = 77991a3 and contender = afb1ae2. afb1ae2 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
|
||
// The implementation is taken from https://github.com/spebern/moving_min_max/blob/master/src/lib.rs. | ||
|
||
//! Keep track of the minimum or maximum value in a sliding window. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an update here, I subsequently discovered a possible better algorithm for this work -- see #4904 for details
Which issue does this PR close?
Closes #4603 and closes #4402.
Rationale for this change
This PR adds support for running MIN-MAX accumulators using custom window frames. Since adding support for sliding window for min-max introduces performance penalty, we have added
ForwardAggregateWindowExpr
, which is a special implementation for expressions that don't require retract. By using this struct, if not needed, we do not introduce performance penalty. As an example window expression,MIN(c12) OVER (ORDER BY C12 RANGE BETWEEN 0.3 PRECEDING AND 0.2 FOLLOWING)
will useMinAccumulator
to calculate its result (where implementation is sliding and introduces additional overhead). However, window expressionMIN(c12) OVER (ORDER BY C12 RANGE BETWEEN UNBOUNDED PRECEDING AND 0.2 FOLLOWING)
will useMinRowAccumulator
which is a better implementation when retract is not a concern.What changes are included in this PR?
update_batch
andretract_batch
methods of MIN, MAX accumulators are implemented. The algorithm and data structure used are described here. If an accumulator has a support for better implementation when retract is unnecessary, we now use optimized implementation during execution.Are these changes tested?
Yes.
aggregate_min_max_w_custom_window_frames
test function is added.Are there any user-facing changes?
No.