Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support IGNORE NULLS for LAG window function #9221

Merged
merged 7 commits into from
Feb 23, 2024
Merged

Conversation

comphead
Copy link
Contributor

@comphead comphead commented Feb 13, 2024

Which issue does this PR close?

Closes #.
Related #9055

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate substrait labels Feb 13, 2024
@comphead comphead changed the title [WIP] lag/lead ignore nulls [WIP] Support IGNORE NULLS for LAG window function Feb 19, 2024
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Feb 19, 2024
@comphead comphead marked this pull request as ready for review February 19, 2024 18:23
@comphead
Copy link
Contributor Author

@mustafasrepo please review whenever you have time.
The IGNORE NULLS is supported to LAG function only

@@ -1114,6 +1116,7 @@ pub fn parse_expr(
partition_by,
order_by,
window_frame,
None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proto can be done as followup

range.end as i64 - self.shift_offset - 1
} else {
// LEAD mode
range.start as i64 - self.shift_offset
};

if idx < 0 || idx as usize >= array.len() {
// Support LAG only for now, as LEAD requires some refactoring first
Copy link
Contributor Author

@comphead comphead Feb 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for LEAD function we likely need to refactor the evaluator and how it works.
The problem is for LEAD we have to adjust values that have already been emitted by evaluator which is not doable afaik. @mustafasrepo I would love to get your input how we can solve such challenge. One of solution is to emit not the single value like now, but the entire resulting array so it gives more control

@comphead comphead changed the title [WIP] Support IGNORE NULLS for LAG window function Support IGNORE NULLS for LAG window function Feb 19, 2024
@comphead comphead requested a review from viirya February 19, 2024 22:12
@mustafasrepo
Copy link
Contributor

mustafasrepo commented Feb 20, 2024

Thanks @comphead for this PR.
Currently we have two API for window function calculations:

  • evaluate_all
  • evaluate

evaluate_all takes all the data as single batch. In this version we have all available data for decision.
evaluate takes only absolutely necessary section for the calculation of the window function result.
The difference is that WindowAggExec uses evaluate_all, whereas BoundedWindowAggExec uses evaluate API to decrease memory usage, and possibly incremental calculations.

This PR changes evaluate implementation to support LAG. However, as far as I can see its corresponding evaluate_all handling is not done.

Also it seems that current LAG implementation only works for lag 1 (which is the default). However, for other lag values I don't think current implementation will generate correct results.

for LEAD function we likely need to refactor the evaluator and how it works.
The problem is for LEAD we have to adjust values that have already been emitted by evaluator which is not doable afaik. @mustafasrepo I would love to get your input how we can solve such challenge. One of solution is to emit not the single value like now, but the entire resulting array so it gives more control

I think, we can generate correct result without changing the API by keeping track of null_count within the offset interval in running fashion. However, I am not sure though.

I think

  • we should first add support for evaluate_all API. This support should be easier than the evaluate support. Since evaluate_all API has all the data possible. Lag, Lead support can be implemented for this API. There won't be much difference, as far as I can presume.
  • Then, we should add support for evaluate API.

I can work on the support for evaluate API. If that is Ok for you.

@comphead
Copy link
Contributor Author

Thanks @mustafasrepo for the detailed feedback. I'll remove leftovers not related to PR.
My next steps:

  • add tests for LAG with non default offset
  • I'll try to use evaluate_all, it will be easier you are right as we have all data in place. The only thing concerns me when I run tests I didn't see .evaluate_all has been called
  • For the evaluate, I appreciate your help. One idea I had is to reverse input array before calling evaluate for only LEAD function and it potentially should work, but there reversing might be expensive

@comphead comphead marked this pull request as draft February 20, 2024 18:58
@mustafasrepo
Copy link
Contributor

The only thing concerns me when I run tests I didn't see .evaluate_all has been called

evaluate_all is called from WindowAggExec. WindowAggExec works when one of the window frame boundaries include UNBOUNDED FOLLOWING. Hence for the query below

SELECT LAG(c9, 2) OVER(ORDER BY c9), SUM(c9) OVER(order by c9 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM aggregate_test_100;

WindowAggExec will be used. Unfortunately, from a SQL query I couldn't come up with a simpler reproducer.

One idea I had is to reverse input array before calling evaluate for only LEAD function and it potentially should work, but there reversing might be expensive
Another approach might be for the table, we can construct a vector with same length to track non null count. Such as for the table below

a
1
2
null
3
null
4

we would construct vector

non_null_count
1
2
2
3
3
4

where count is incremented each time a new non-null value is seen.
LAG, LEAD can work on this vector to determine the place of the value. For instance, for the row index 5 in the original table (where a=4)
LAG(2) should produce 2. We can determine this by finding non_null_count of the row (which is 4 in our case). When we determine LAG on this table we can determine that non_null_count of the result should be 2. Then we can find the index of first 2 in the non_null_count vector. (Other 2s point to null values.). Index would be 1 which has value a=2 in the table.
LEAD can work similarly.

@mustafasrepo
Copy link
Contributor

After thinking about the possible solutions, another approach might be for the table below

a
1
2
null
3
null
4

constructing a vector with same length where each entry contains the index of the previous non-null entry.
For the table above, this would be

non-null_pointers
-1
0
0
1
1
3

To find LAG(2) for the row=5 (where a=4). We would follow pointers twice (idx 5 -> 3, 3 ->1). Hence result would the value at index 1, which have a=2. For the LEAD we might need to reverse data, apply LAG as you suggest. If we encounter -1, that would indicate result is None. Because there is no more previous data.

@comphead
Copy link
Contributor Author

@mustafasrepo thanks for suggestions, I've implemented similar approach with tracking nonnull row indexes, so likely it works for non default offset.
I was not able to call LAG in .evaluate_all mode as it just not configured that way.

https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/physical_planner.rs#L750 is the condition to select preferred mode.

It will be always Bounded for LAG/LEAD because
https://github.com/apache/arrow-datafusion/blob/cf11a700eb6a5385a6ebade2b92c684380940296/datafusion/physical-expr/src/window/built_in.rs#L276 refers to evaluator.supports_bounded_execution which is true for LAG/LEAD https://github.com/apache/arrow-datafusion/blob/cf11a700eb6a5385a6ebade2b92c684380940296/datafusion/physical-expr/src/window/lead_lag.rs#L234 and uses_window_frame is false

@comphead comphead marked this pull request as ready for review February 22, 2024 01:45
Copy link
Contributor

@mustafasrepo mustafasrepo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @comphead for this PR. It is LGTM!.
I sent two commits

  • to include a test which triggers evaluate_all call (we can work on this test in following PR to add support for WindowAggExec), with some minor stylistic changes.
  • to make algorithm pruning friendly. Previous implementation relied on indices kept track to be correct (when pruned this might not be the case). Hence, I modified implementation so that it produces correct results when pruned.

set datafusion.execution.batch_size = 1;

query I
SELECT LAG(c1, 2) IGNORE NULLS OVER()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test, triggers pruning internally. previous implementation was producing different result than the above result where data is fed as single chunk (because of large batch size), where no pruning is done.

# LAG window function IGNORE/RESPECT NULLS support with descending order and nondefault offset.
# To trigger WindowAggExec, we added a sum window function with all of the ranges.
statement error Execution error: IGNORE NULLS mode for LAG and LEAD is not supported for WindowAggExec
select lag(a, 2, null) ignore nulls over (order by id desc) as x1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test triggers evaluate_all call

@comphead
Copy link
Contributor Author

Thanks @mustafasrepo I'll wait for couple of more hours and then merge it if no other feedback shows up

@comphead comphead merged commit a851ecf into apache:main Feb 23, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt) substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants