Conversation

@gengliangwang (Member)

What changes were proposed in this pull request?

Fix a test failure in DataFrameSuite introduced by #29404

Why are the changes needed?

Fix the test failure.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

@gengliangwang (Member, Author)

cc @maropu @cloud-fan

val e = intercept[SparkException] {
  structDf.collect
}
assert(e.getCause.getClass.equals(classOf[ArithmeticException]))
assert(e.getCause.getMessage.contains("cannot be represented as Decimal"))
Contributor


Can we add a comment to say that we have to fail on overflow even in non-ANSI mode, and link the related JIRA tickets?

@gengliangwang gengliangwang changed the title [SPARK-32018][SQL][3.0][FOLLOWUP] Fix a test failure in DataFrameSuite [Do not merge][SPARK-32018][SQL][3.0][FOLLOWUP] Fix a test failure in DataFrameSuite Aug 17, 2020
@gengliangwang (Member, Author)

I am now thinking about reverting the previous commits.

@SparkQA

SparkQA commented Aug 17, 2020

Test build #127497 has finished for PR 29448 at commit 69278b2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

cloud-fan commented Aug 17, 2020

Actually, this is trickier than I thought.

In 3.0/2.4, without the unsafe row bug fix:

  1. For hash aggregate with GROUP BY (which needs a binary hash map), the query fails as soon as the overflow happens, due to the unsafe row bug.
  2. For hash aggregate without GROUP BY, or sort aggregate, the sum value is stored in a Decimal object, which can hold the overflowed value (see the sketch after this list).
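
A minimal sketch of the two query shapes, assuming a local SparkSession named `spark`; the column names and values here are hypothetical, not taken from the failing test:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, sum}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// decimal(38,0) is Spark's maximum precision, so sum() keeps the same
// result type and two near-max values are enough to overflow it.
val df = Seq.fill(2)("9" * 38).toDF("s")
  .select($"s".cast("decimal(38,0)").as("d"))

// (1) hash aggregate with GROUP BY: needs a binary hash map for the key.
df.withColumn("k", lit(1)).groupBy("k").agg(sum($"d")).collect()

// (2) hash aggregate without GROUP BY: the running sum is buffered in a
// Decimal object, which can hold a value beyond decimal(38,0)'s range.
df.agg(sum($"d")).collect()
```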

(2) is very tricky:

  1. If the overflow happens in the final aggregate, the final CheckOverflow operator gives us the correct result.
  2. If the overflow happens in the partial aggregate, it produces null; the final aggregate then treats that null as 0, i.e. as empty input, and returns a wrong result.

The failed test would not work even if we reverted the commit: changing the input DataFrame's partition number to 1 triggers the overflow in the partial aggregate, as illustrated below.
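
A hedged illustration of that partition trick, reusing the hypothetical `df` from the sketch above:

```scala
// All rows in one partition: the partial sum itself overflows before the
// shuffle, the partial result becomes null under non-ANSI mode, and the
// final aggregate mistakes that null for empty input.
df.repartition(1).agg(sum($"d")).collect()

// Rows spread across partitions: each partial sum stays in range and the
// overflow only happens in the final aggregate, where CheckOverflow can
// produce the correct result.
df.repartition(2).agg(sum($"d")).collect()
```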

To summarize for 3.0/2.4:

  1. For hash aggregate with GROUP BY, we always fail on overflow, even under non-ANSI mode. This is not ideal, but also not a serious bug.
  2. For hash aggregate without GROUP BY, or sort aggregate, Spark returns a wrong result if the overflow happens in the partial aggregate, but behaves correctly if it happens in the final aggregate.

That said, #29404 introduced a breaking change to (2), as it makes the query always fail on overflow. Let's revert it.

For the unsafe row bug fix #29125: it's important, as the unsafe row binary is used to check equality in many places (join keys, grouping keys, window partition keys, etc.), but it also makes (1) worse, as Spark may return a wrong result instead of failing. We can simply revert it as well, or re-apply #29404 only to hash aggregate with GROUP BY to avoid breaking changes (which can be complex if we consider distinct aggregates).

@cloud-fan (Contributor)

@maropu @viirya @dongjoon-hyun
