Skip to content

[SPARK-51757] Fix LEAD/LAG Function Offset Exceeds Partition Size #50552

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

xin-aurora
Copy link

@xin-aurora xin-aurora commented Apr 10, 2025

What changes were proposed in this pull request?

The current implementation of the prepare in OffsetWindowFunctionFrameBase:

  override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = {
    if (offset > rows.length) {
      fillDefaultValue(EmptyRow)
    } else {
    ...
  }

The current implementation of the write in OffsetWindowFunctionFrameBase:

   override def write(index: Int, current: InternalRow): Unit = {
    if (offset > rows.length) {
      // Already use default values in prepare.
    } else {
    ...
  }

These implementations caused the LEAD' and LAGfunctions to haveNullPointerExceptionwhen the default value is notLiteral` and the range of the default value exceeds the partition size.

This pr removes the default values in prepare and sets the default values in write.

Why are the changes needed?

Fix LEAD and LAG causes NullPointerException in the window function (SPARK-51757)

Does this PR introduce any user-facing change?

No

How was this patch tested?

Add test method in test("SPARK-51757: lead/lag column as default when offset exceeds partition") in org.apache.spark.sql.DataFrameWindowFramesSuite

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Apr 10, 2025
@@ -183,7 +183,8 @@ abstract class OffsetWindowFunctionFrameBase(
override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = {
resetStates(rows)
if (absOffset > rows.length) {
fillDefaultValue(EmptyRow)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the code is not needed, just remove them. and add some comments to explain the reason

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it the only place? Seems we should never run fillDefaultValue in prepare as the default value can be an expression that references attributes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems all the existing tests just cover the default value as Literal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I think it's not related to the partition size.
The cause is apply expression on empty row.
We can check the default expression and apply fillDefaultValue(currentRow) in write if it is not a Literal, or apply it is in prepare if it is a literal.

@beliefer
Copy link
Contributor

Could you check the description
The current implementation of the write in OffsetWindowFunctionFrameBase:
Where is it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants