
Conversation

@maropu (Member) commented Sep 30, 2017

What changes were proposed in this pull request?

This PR fixes an overflow issue in Dataset.show, reproduced below:

scala> Seq((1, 2), (3, 4)).toDF("a", "b").show(Int.MaxValue)
org.apache.spark.sql.AnalysisException: The limit expression must be equal to or greater than 0, but got -2147483648;;
GlobalLimit -2147483648
+- LocalLimit -2147483648
   +- Project [_1#27218 AS a#27221, _2#27219 AS b#27222]
      +- LocalRelation [_1#27218, _2#27219]

  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:89)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$$checkLimitClause(CheckAnalysis.scala:70)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:234)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
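
The root cause is in Dataset.showString: it fetches numRows + 1 rows to decide whether there is more data than was requested (hasMoreData), so numRows = Int.MaxValue overflows to Int.MinValue, which then fails the limit check. A minimal sketch of the arithmetic (plain Scala, not the Spark source):

    object OverflowSketch {
      def main(args: Array[String]): Unit = {
        val numRows = Int.MaxValue
        val limit = numRows + 1   // 32-bit Int addition silently wraps around
        println(limit)            // -2147483648, the value in the AnalysisException above
        assert(limit == Int.MinValue)
      }
    }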

How was this patch tested?

Added tests in DataFrameSuite.

Review comment on the proposed showString change:

val numRows = _numRows.max(0)
val takeResult = toDF().take(numRows + 1)
val hasMoreData = takeResult.length > numRows
val numTotalRows = toDF().count()
Reviewer (Member):

You don't want to do a whole count() here -- could be quite expensive. Instead just something like:

    val takeResult = toDF().take(if (numRows == Int.MaxValue) numRows else numRows + 1)
    val hasMoreData = takeResult.length > numRows
    val data = takeResult.take(numRows)
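
For illustration, the same truncation check on a plain Seq (a hedged sketch; the helper name is made up and is not part of Spark):

    // Sketch of the "fetch one extra row" truncation check, mirroring the suggestion above.
    // firstRowsWithTruncationFlag is an illustrative name, not a Spark API.
    def firstRowsWithTruncationFlag[T](rows: Seq[T], numRows: Int): (Seq[T], Boolean) = {
      // Guard so that numRows + 1 cannot overflow when numRows == Int.MaxValue.
      val fetched = rows.take(if (numRows == Int.MaxValue) numRows else numRows + 1)
      // hasMoreData can only be true when numRows + 1 rows were actually fetched,
      // so truncation is (by design) not reported in the numRows == Int.MaxValue case.
      val hasMoreData = fetched.length > numRows
      (fetched.take(numRows), hasMoreData)
    }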

@maropu (Member Author):

ok, I'll update

@maropu (Member Author) commented Sep 30, 2017:

With the suggested change, hasMoreData becomes meaningless, so how about this?

    val (data, hasMoreData) = if (numRows < Int.MaxValue) {
      val takeResult = toDF().take(numRows + 1)
      (takeResult.take(numRows), takeResult.length > numRows)
    } else {
      val takeResult = toDF().take(numRows)
      val numTotalRows = toDF().count()
      (takeResult, numTotalRows > numRows)
    }
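
(Note that the else branch above still derives hasMoreData from a separate count() over the whole Dataset, which is the extra work the next review comment objects to.)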

@maropu force-pushed the MaxValueInShowString branch from 8ff32b3 to 340243c on September 30, 2017 10:06
@SparkQA commented Sep 30, 2017

Test build #82350 has finished for PR 19401 at commit f988766.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 30, 2017

Test build #82352 has finished for PR 19401 at commit 340243c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Review comment on the updated showString change:

(takeResult.take(numRows), takeResult.length > numRows)
} else {
val takeResult = toDF().take(numRows)
val numTotalRows = toDF().count()
Reviewer (Member):

This still calls count(). I think it's just not worth it, for a purely cosmetic difference, to print ("only showing up to 2 billion entries") in the special case that you've collected, and tried to print, 2 billion values. It will probably fail anyway, so just keep this simple.

@maropu (Member Author):

ok

@SparkQA commented Sep 30, 2017

Test build #82354 has finished for PR 19401 at commit bbf2c39.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Review comment on the showString diff (the take(...) line before and after the update):

_numRows: Int, truncate: Int = 20, vertical: Boolean = false): String = {
val numRows = _numRows.max(0)
val takeResult = toDF().take(numRows + 1)
val takeResult = toDF().take(if (numRows == Int.MaxValue) numRows else numRows + 1)
Reviewer (Member):

Normally, we split this into two lines. How about:

    val numRows = _numRows.max(0).min(Int.MaxValue - 1)
    val takeResult = toDF().take(numRows + 1)
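
As a quick sanity check of the suggested clamp, a standalone sketch (not the Spark implementation):

    object ClampSketch {
      def main(args: Array[String]): Unit = {
        // Mirrors the clamp above: floor negative inputs at 0, cap just below Int.MaxValue.
        def clamp(requested: Int): Int = requested.max(0).min(Int.MaxValue - 1)
        assert(clamp(Int.MaxValue) + 1 == Int.MaxValue)  // numRows + 1 can no longer wrap around
        assert(clamp(-5) == 0)                           // negative inputs are floored at 0
        assert(clamp(20) == 20)                          // typical values pass through unchanged
        println("clamp arithmetic holds")
      }
    }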

@maropu (Member Author):

yea, looks great. I updated.

@maropu (Member Author) commented Oct 1, 2017

retest this please.

@gatorsmile (Member)
retest this please.

@maropu (Member Author) commented Oct 1, 2017

It seems Jenkins fell asleep.

@maropu (Member Author) commented Oct 1, 2017

retest this please.

@SparkQA commented Oct 1, 2017

Test build #82366 has finished for PR 19401 at commit 99c988a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Review comment on the showString diff (the numRows line before and after the update):

private[sql] def showString(
_numRows: Int, truncate: Int = 20, vertical: Boolean = false): String = {
val numRows = _numRows.max(0)
val numRows = _numRows.max(0).min(Int.MaxValue - 1)
Reviewer (Member):

OK, but now you return one fewer row than expected when it's possible to return Int.MaxValue. Granted this is an extreme corner case, but that seems less compelling than just skipping the display of "more elements" in this case.

@maropu (Member Author) commented Oct 1, 2017:

hmm, I see. Either is okay with me; WDYT? cc: @gatorsmile
IMHO it might still be okay to set [0, Int.MaxValue) as the valid range for show, since this is a corner case.

@gatorsmile (Member) commented Oct 1, 2017:

DataFrame.show() does not work when the number of rows is close to Int.MaxValue: the driver will OOM before finishing the command. Thus, I do not think we can hit this extreme case.

@gatorsmile (Member)

LGTM

@gatorsmile (Member) commented Oct 2, 2017

Thanks! Merged to master.

@asfgit closed this in fa225da on Oct 2, 2017