[SPARK-35955][SQL] Check for overflow in Average in ANSI mode #33177

karenfeng · 2021-07-01T19:52:52Z

What changes were proposed in this pull request?

Fixes decimal overflow issues for decimal average in ANSI mode, so that overflows throw an exception rather than returning null.

Why are the changes needed?

Query:

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> spark.conf.set("spark.sql.ansi.enabled", true)

scala> val df = Seq(
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]

scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(mean("decNum"))
df2: org.apache.spark.sql.DataFrame = [avg(decNum): decimal(38,22)]

scala> df2.show(40,false)

Before:

+-----------+
|avg(decNum)|
+-----------+
|null       |
+-----------+

After:

21/07/01 19:48:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 24)
java.lang.ArithmeticException: Overflow in sum of decimals.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.overflowInSumOfDecimalError(QueryExecutionErrors.scala:162)
	at org.apache.spark.sql.errors.QueryExecutionErrors.overflowInSumOfDecimalError(QueryExecutionErrors.scala)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:499)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:502)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

Signed-off-by: Karen Feng <karen.feng@databricks.com>

peter-toth · 2021-07-01T20:26:56Z

I don't think this is related to SPARK-35855.

karenfeng · 2021-07-01T20:29:06Z

I don't think this is related to SPARK-35855.

Good catch @peter-toth - sorry, meant SPARK-35955 🤦‍♀️

SparkQA · 2021-07-01T20:44:57Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45050/

SparkQA · 2021-07-01T21:18:01Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45050/

SparkQA · 2021-07-02T00:25:06Z

Test build #140536 has finished for PR 33177 at commit 50b7e20.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2021-07-02T04:40:55Z

Thanks, merging to master

cloud-fan · 2021-07-02T07:11:17Z

late LGTM

### What changes were proposed in this pull request? This bug was introduced by #33177 When checking overflow of the sum value in the average function, we should use the `sumDataType` instead of the input decimal type. ### Why are the changes needed? fix a regression ### Does this PR introduce _any_ user-facing change? Yes, the result was wrong before this PR. ### How was this patch tested? a new test Closes #34180 from cloud-fan/bug. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>

This bug was introduced by apache#33177 When checking overflow of the sum value in the average function, we should use the `sumDataType` instead of the input decimal type. fix a regression Yes, the result was wrong before this PR. a new test Closes apache#34180 from cloud-fan/bug. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>

backport #34180 ### What changes were proposed in this pull request? This bug was introduced by #33177 When checking overflow of the sum value in the average function, we should use the `sumDataType` instead of the input decimal type. ### Why are the changes needed? fix a regression ### Does this PR introduce _any_ user-facing change? Yes, the result was wrong before this PR. ### How was this patch tested? a new test Closes #34193 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>

backport apache#34180 ### What changes were proposed in this pull request? This bug was introduced by apache#33177 When checking overflow of the sum value in the average function, we should use the `sumDataType` instead of the input decimal type. ### Why are the changes needed? fix a regression ### Does this PR introduce _any_ user-facing change? Yes, the result was wrong before this PR. ### How was this patch tested? a new test Closes apache#34193 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>

Use CheckOverflowInSum for Average

50b7e20

Signed-off-by: Karen Feng <karen.feng@databricks.com>

github-actions bot added the SQL label Jul 1, 2021

karenfeng changed the title ~~[SPARK-35855][SQL] Use CheckOverflowInSum for Average~~ [SPARK-35855][SQL] Check for overflow in Average in ANSI mode Jul 1, 2021

karenfeng changed the title ~~[SPARK-35855][SQL] Check for overflow in Average in ANSI mode~~ [SPARK-35955][SQL] Check for overflow in Average in ANSI mode Jul 1, 2021

gengliangwang approved these changes Jul 2, 2021

View reviewed changes

gengliangwang closed this in 1fda011 Jul 2, 2021

cloud-fan mentioned this pull request Oct 5, 2021

[SPARK-36926][SQL] Decimal average mistakenly overflow #34180

Closed

cloud-fan mentioned this pull request Oct 6, 2021

[3.2][SPARK-36926][SQL] Decimal average mistakenly overflow #34193

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-35955][SQL] Check for overflow in Average in ANSI mode #33177

[SPARK-35955][SQL] Check for overflow in Average in ANSI mode #33177

karenfeng commented Jul 1, 2021 •

edited

Loading

peter-toth commented Jul 1, 2021

karenfeng commented Jul 1, 2021

SparkQA commented Jul 1, 2021

SparkQA commented Jul 1, 2021

SparkQA commented Jul 2, 2021

gengliangwang commented Jul 2, 2021

cloud-fan commented Jul 2, 2021

[SPARK-35955][SQL] Check for overflow in Average in ANSI mode #33177

[SPARK-35955][SQL] Check for overflow in Average in ANSI mode #33177

Conversation

karenfeng commented Jul 1, 2021 • edited Loading

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

peter-toth commented Jul 1, 2021

karenfeng commented Jul 1, 2021

SparkQA commented Jul 1, 2021

SparkQA commented Jul 1, 2021

SparkQA commented Jul 2, 2021

gengliangwang commented Jul 2, 2021

cloud-fan commented Jul 2, 2021

karenfeng commented Jul 1, 2021 •

edited

Loading