Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-35955][SQL] Check for overflow in Average in ANSI mode #33177

Closed
wants to merge 1 commit into from

Conversation

karenfeng
Copy link
Contributor

@karenfeng karenfeng commented Jul 1, 2021

What changes were proposed in this pull request?

Fixes decimal overflow issues for decimal average in ANSI mode, so that overflows throw an exception rather than returning null.

Why are the changes needed?

Query:

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> spark.conf.set("spark.sql.ansi.enabled", true)

scala> val df = Seq(
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]

scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(mean("decNum"))
df2: org.apache.spark.sql.DataFrame = [avg(decNum): decimal(38,22)]

scala> df2.show(40,false)

Before:

+-----------+
|avg(decNum)|
+-----------+
|null       |
+-----------+

After:

21/07/01 19:48:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 24)
java.lang.ArithmeticException: Overflow in sum of decimals.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.overflowInSumOfDecimalError(QueryExecutionErrors.scala:162)
	at org.apache.spark.sql.errors.QueryExecutionErrors.overflowInSumOfDecimalError(QueryExecutionErrors.scala)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:499)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:502)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

Signed-off-by: Karen Feng <karen.feng@databricks.com>
@github-actions github-actions bot added the SQL label Jul 1, 2021
@karenfeng karenfeng changed the title [SPARK-35855][SQL] Use CheckOverflowInSum for Average [SPARK-35855][SQL] Check for overflow in Average in ANSI mode Jul 1, 2021
@peter-toth
Copy link
Contributor

I don't think this is related to SPARK-35855.

@karenfeng karenfeng changed the title [SPARK-35855][SQL] Check for overflow in Average in ANSI mode [SPARK-35955][SQL] Check for overflow in Average in ANSI mode Jul 1, 2021
@karenfeng
Copy link
Contributor Author

I don't think this is related to SPARK-35855.

Good catch @peter-toth - sorry, meant SPARK-35955 🤦‍♀️

@SparkQA
Copy link

SparkQA commented Jul 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45050/

@SparkQA
Copy link

SparkQA commented Jul 1, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45050/

@SparkQA
Copy link

SparkQA commented Jul 2, 2021

Test build #140536 has finished for PR 33177 at commit 50b7e20.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Copy link
Member

Thanks, merging to master

@cloud-fan
Copy link
Contributor

late LGTM

gengliangwang pushed a commit that referenced this pull request Oct 5, 2021
### What changes were proposed in this pull request?

This bug was introduced by #33177

When checking overflow of the sum value in the average function, we should use the `sumDataType` instead of the input decimal type.

### Why are the changes needed?

fix a regression

### Does this PR introduce _any_ user-facing change?

Yes, the result was wrong before this PR.

### How was this patch tested?

a new test

Closes #34180 from cloud-fan/bug.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
cloud-fan added a commit to cloud-fan/spark that referenced this pull request Oct 6, 2021
This bug was introduced by apache#33177

When checking overflow of the sum value in the average function, we should use the `sumDataType` instead of the input decimal type.

fix a regression

Yes, the result was wrong before this PR.

a new test

Closes apache#34180 from cloud-fan/bug.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
gengliangwang pushed a commit that referenced this pull request Oct 6, 2021
backport #34180

### What changes were proposed in this pull request?

This bug was introduced by #33177

When checking overflow of the sum value in the average function, we should use the `sumDataType` instead of the input decimal type.

### Why are the changes needed?

fix a regression

### Does this PR introduce _any_ user-facing change?

Yes, the result was wrong before this PR.

### How was this patch tested?

a new test

Closes #34193 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
sunchao pushed a commit to sunchao/spark that referenced this pull request Dec 8, 2021
backport apache#34180

### What changes were proposed in this pull request?

This bug was introduced by apache#33177

When checking overflow of the sum value in the average function, we should use the `sumDataType` instead of the input decimal type.

### Why are the changes needed?

fix a regression

### Does this PR introduce _any_ user-facing change?

Yes, the result was wrong before this PR.

### How was this patch tested?

a new test

Closes apache#34193 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants