Support IGNORE NULLS for LAG window function #9221
@mustafasrepo please review whenever you have time.
@@ -1114,6 +1116,7 @@ pub fn parse_expr(
partition_by,
order_by,
window_frame,
None
Proto can be done as followup
range.end as i64 - self.shift_offset - 1
} else {
// LEAD mode
range.start as i64 - self.shift_offset
};

if idx < 0 || idx as usize >= array.len() {
// Support LAG only for now, as LEAD requires some refactoring first
For the LEAD function we likely need to refactor the evaluator and how it works.
The problem is that for LEAD we have to adjust values that have already been emitted by the evaluator, which is not doable afaik. @mustafasrepo I would love to get your input on how we can solve this challenge. One solution is to emit not a single value as now, but the entire resulting array, which gives more control.
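To illustrate the "emit the entire resulting array" idea, here is a minimal sketch, assuming the whole partition is available at once. The function name `lead_ignore_nulls` and the use of `Option<i64>` for nullable values are hypothetical and are not DataFusion's evaluator API; the point is only that LEAD for row i depends on rows after i, so it is easy to compute over the full array and hard to emit row by row.

```rust
/// Hypothetical helper (not DataFusion's API): LEAD(value, offset) IGNORE NULLS
/// computed over a whole partition in one pass.
fn lead_ignore_nulls(values: &[Option<i64>], offset: usize) -> Vec<Option<i64>> {
    assert!(offset >= 1, "this sketch only covers offset >= 1");

    // Row indices of all non-null entries, in row order.
    let non_null: Vec<usize> = values
        .iter()
        .enumerate()
        .filter(|(_, v)| v.is_some())
        .map(|(i, _)| i)
        .collect();

    let mut out = Vec::with_capacity(values.len());
    // Position in `non_null` of the first non-null row strictly after row `i`.
    let mut next = 0;
    for i in 0..values.len() {
        while next < non_null.len() && non_null[next] <= i {
            next += 1;
        }
        // The `offset`-th non-null row after `i`, if there is one.
        let target = next + offset - 1;
        out.push(non_null.get(target).and_then(|&j| values[j]));
    }
    out
}
```

For the input `[1, NULL, 3, NULL, 5]` with offset 1 this yields `[3, 3, 5, 5, NULL]`: the answer for row 0 is only known once row 2 has been seen.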
Thanks @comphead for this PR.
This PR changes Also it seems that current
I think we can generate the correct result without changing the API by keeping a running null_count within the offset interval. However, I am not sure. I think
I can work on the support for
Thanks @mustafasrepo for the detailed feedback. I'll remove the leftovers not related to the PR.
SELECT LAG(c9, 2) OVER(ORDER BY c9), SUM(c9) OVER(order by c9 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM aggregate_test_100;
we would construct vector
where count is incremented each time a new non-null value is seen.
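A minimal sketch of the running-count idea, under assumptions: the names `running_non_null_count` and `lag_ignore_nulls_by_count` are hypothetical, values are modelled as `Option<i64>`, and the whole column is in memory. The count vector is non-decreasing, so the row holding the wanted non-null value can be located with a binary search.

```rust
/// Hypothetical helper: counts[i] = number of non-null values in rows 0..=i.
fn running_non_null_count(values: &[Option<i64>]) -> Vec<usize> {
    let mut count = 0;
    values
        .iter()
        .map(|v| {
            if v.is_some() {
                count += 1;
            }
            count
        })
        .collect()
}

/// Hypothetical helper: LAG(value, offset) IGNORE NULLS using the count vector.
fn lag_ignore_nulls_by_count(values: &[Option<i64>], offset: usize) -> Vec<Option<i64>> {
    assert!(offset >= 1, "this sketch only covers offset >= 1");
    let counts = running_non_null_count(values);
    (0..values.len())
        .map(|i| {
            // Non-null values strictly before row `i`.
            let before = if i == 0 { 0 } else { counts[i - 1] };
            if before < offset {
                return None;
            }
            // 1-based rank of the non-null value we want.
            let target = before - offset + 1;
            // First row whose running count reaches `target`; that row is non-null.
            let row = counts.partition_point(|&c| c < target);
            values[row]
        })
        .collect()
}
```

For `[1, NULL, 3, NULL, 5]` the count vector is `[1, 1, 2, 2, 3]`, and LAG with offset 2 gives `[NULL, NULL, NULL, 1, 1]`.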
After thinking about the possible solutions, another approach might be, for the table below,
constructing a vector with the same length where each entry contains the index of the previous non-null entry.
To find
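The previous-non-null-index vector described above can be sketched as follows. This is an illustration under assumptions, not the code from the PR: `prev_non_null_index` and `lag_ignore_nulls_by_prev` are hypothetical names, and the lookup simply follows the "previous non-null" links `offset` times.

```rust
/// Hypothetical helper: prev[i] = index of the nearest non-null row strictly
/// before row `i`, or None if there is no such row.
fn prev_non_null_index(values: &[Option<i64>]) -> Vec<Option<usize>> {
    let mut last: Option<usize> = None;
    let mut prev: Vec<Option<usize>> = Vec::with_capacity(values.len());
    for (i, v) in values.iter().enumerate() {
        prev.push(last);
        if v.is_some() {
            last = Some(i);
        }
    }
    prev
}

/// Hypothetical helper: LAG(value, offset) IGNORE NULLS by hopping back
/// through `prev` once per unit of offset.
fn lag_ignore_nulls_by_prev(values: &[Option<i64>], offset: usize) -> Vec<Option<i64>> {
    assert!(offset >= 1, "this sketch only covers offset >= 1");
    let prev = prev_non_null_index(values);
    (0..values.len())
        .map(|i| {
            let mut row = Some(i);
            for _ in 0..offset {
                row = row.and_then(|r| prev[r]);
            }
            row.and_then(|r| values[r])
        })
        .collect()
}
```

For `[1, NULL, 3, NULL, 5]` the index vector is `[None, 0, 0, 2, 2]`. With offset 2, row 4 hops to row 2 and then to row 0, giving the value 1. Each lookup costs `offset` hops, which is cheap for small offsets.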
@mustafasrepo thanks for the suggestions. I've implemented a similar approach that tracks non-null row indexes, so it likely works for non-default offsets. https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/physical_planner.rs#L750 is the condition that selects the preferred mode. It will always be Bounded for LAG/LEAD because
Thanks @comphead for this PR. LGTM!
I sent two commits:
- to include a test which triggers an evaluate_all call (we can work on this test in a following PR to add support for WindowAggExec), with some minor stylistic changes.
- to make the algorithm pruning friendly. The previous implementation relied on the tracked indices staying correct (when pruned, this might not be the case). Hence, I modified the implementation so that it produces correct results when pruned.
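One way to see why a pruning-friendly design matters is a streaming sketch that stores no absolute row indices at all. The following is an assumption-laden illustration, not the implementation from these commits: the `LagIgnoreNulls` struct and `run_in_batches` helper are hypothetical, and the state is simply the last `offset` non-null values, so dropping older input cannot invalidate it.

```rust
use std::collections::VecDeque;

/// Hypothetical streaming evaluator for LAG(value, offset) IGNORE NULLS.
/// It keeps only the most recent `offset` non-null values.
struct LagIgnoreNulls {
    offset: usize,
    recent: VecDeque<i64>,
}

impl LagIgnoreNulls {
    fn new(offset: usize) -> Self {
        assert!(offset >= 1, "this sketch only covers offset >= 1");
        Self { offset, recent: VecDeque::with_capacity(offset) }
    }

    /// Feed one row and get the LAG result for that row.
    fn push(&mut self, value: Option<i64>) -> Option<i64> {
        // The oldest retained value is the `offset`-th non-null before this row,
        // but only once `offset` non-null values have been seen.
        let out = if self.recent.len() == self.offset {
            self.recent.front().copied()
        } else {
            None
        };
        if let Some(v) = value {
            if self.recent.len() == self.offset {
                self.recent.pop_front();
            }
            self.recent.push_back(v);
        }
        out
    }
}

/// Hypothetical driver: feeds the input in batches of `batch_size` rows
/// (must be > 0), carrying only the evaluator state across batches.
fn run_in_batches(values: &[Option<i64>], offset: usize, batch_size: usize) -> Vec<Option<i64>> {
    let mut eval = LagIgnoreNulls::new(offset);
    let mut out = Vec::with_capacity(values.len());
    for batch in values.chunks(batch_size) {
        for &v in batch {
            out.push(eval.push(v));
        }
    }
    out
}
```

In this sketch, a batch size of 1 and a batch size covering the whole input produce the same output, which is the property the batch-size test in this PR checks for the real implementation.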
set datafusion.execution.batch_size = 1;

query I
SELECT LAG(c1, 2) IGNORE NULLS OVER()
This test triggers pruning internally. The previous implementation produced a result different from the one above, where the data is fed as a single chunk (because of the large batch size) and no pruning is done.
# LAG window function IGNORE/RESPECT NULLS support with descending order and nondefault offset.
# To trigger WindowAggExec, we added a sum window function with all of the ranges.
statement error Execution error: IGNORE NULLS mode for LAG and LEAD is not supported for WindowAggExec
select lag(a, 2, null) ignore nulls over (order by id desc) as x1,
This test triggers an evaluate_all call.
Thanks @mustafasrepo. I'll wait a couple more hours and then merge it if no other feedback shows up.
Which issue does this PR close?
Closes #.
Related #9055
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?