
Conversation

@maropu (Member) commented Sep 30, 2017

What changes were proposed in this pull request?

This PR fixes an overflow issue in Dataset.show, reproduced below:

scala> Seq((1, 2), (3, 4)).toDF("a", "b").show(Int.MaxValue)
org.apache.spark.sql.AnalysisException: The limit expression must be equal to or greater than 0, but got -2147483648;;
GlobalLimit -2147483648
+- LocalLimit -2147483648
   +- Project [_1#27218 AS a#27221, _2#27219 AS b#27222]
      +- LocalRelation [_1#27218, _2#27219]

  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:89)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$$checkLimitClause(CheckAnalysis.scala:70)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:234)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
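
The root cause is in Dataset.showString: it fetches numRows + 1 rows to decide whether there is more data than was requested (hasMoreData), so numRows = Int.MaxValue overflows to Int.MinValue, which then fails the limit check. A minimal sketch of the arithmetic (plain Scala, not the Spark source):

    object OverflowSketch {
      def main(args: Array[String]): Unit = {
        val numRows = Int.MaxValue
        val limit = numRows + 1   // 32-bit Int addition silently wraps around
        println(limit)            // -2147483648, the value in the AnalysisException above
        assert(limit == Int.MinValue)
      }
    }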

How was this patch tested?

Added tests in DataFrameSuite.

Review comment on the proposed showString change:

val numRows = _numRows.max(0)
val takeResult = toDF().take(numRows + 1)
val hasMoreData = takeResult.length > numRows
val numTotalRows = toDF().count()
Reviewer (Member):

You don't want to do a whole count() here -- could be quite expensive. Instead just something like:

    val takeResult = toDF().take(if (numRows == Int.MaxValue) numRows else numRows + 1)
    val hasMoreData = takeResult.length > numRows
    val data = takeResult.take(numRows)
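
For illustration, the same truncation check on a plain Seq (a hedged sketch; the helper name is made up and is not part of Spark):

    // Sketch of the "fetch one extra row" truncation check, mirroring the suggestion above.
    // firstRowsWithTruncationFlag is an illustrative name, not a Spark API.
    def firstRowsWithTruncationFlag[T](rows: Seq[T], numRows: Int): (Seq[T], Boolean) = {
      // Guard so that numRows + 1 cannot overflow when numRows == Int.MaxValue.
      val fetched = rows.take(if (numRows == Int.MaxValue) numRows else numRows + 1)
      // hasMoreData can only be true when numRows + 1 rows were actually fetched,
      // so truncation is (by design) not reported in the numRows == Int.MaxValue case.
      val hasMoreData = fetched.length > numRows
      (fetched.take(numRows), hasMoreData)
    }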

@maropu (Member Author):

ok, I'll update

@maropu (Member Author) commented Sep 30, 2017:

With the suggested change, hasMoreData becomes meaningless, so how about this?

    val (data, hasMoreData) = if (numRows < Int.MaxValue) {
      val takeResult = toDF().take(numRows + 1)
      (takeResult.take(numRows), takeResult.length > numRows)
    } else {
      val takeResult = toDF().take(numRows)
      val numTotalRows = toDF().count()
      (takeResult, numTotalRows > numRows)
    }
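
(Note that the else branch above still derives hasMoreData from a separate count() over the whole Dataset, which is the extra work the next review comment objects to.)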

@maropu force-pushed the MaxValueInShowString branch from 8ff32b3 to 340243c on September 30, 2017 10:06
@SparkQA commented Sep 30, 2017

Test build #82350 has finished for PR 19401 at commit f988766.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 30, 2017

Test build #82352 has finished for PR 19401 at commit 340243c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Review comment on the updated showString change:

(takeResult.take(numRows), takeResult.length > numRows)
} else {
val takeResult = toDF().take(numRows)
val numTotalRows = toDF().count()
Reviewer (Member):

This still calls count(). I think it's just not worth it, for a purely cosmetic difference, to print ("only showing up to 2 billion entries") in the special case that you've collected, and tried to print, 2 billion values. It will probably fail anyway, so just keep this simple.

@maropu (Member Author):

ok

@SparkQA commented Sep 30, 2017

Test build #82354 has finished for PR 19401 at commit bbf2c39.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Review comment on the showString diff (the take(...) line before and after the update):

_numRows: Int, truncate: Int = 20, vertical: Boolean = false): String = {
val numRows = _numRows.max(0)
val takeResult = toDF().take(numRows + 1)
val takeResult = toDF().take(if (numRows == Int.MaxValue) numRows else numRows + 1)
Reviewer (Member):

Normally, we split this into two lines. How about:

    val numRows = _numRows.max(0).min(Int.MaxValue - 1)
    val takeResult = toDF().take(numRows + 1)
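
As a quick sanity check of the suggested clamp, a standalone sketch (not the Spark implementation):

    object ClampSketch {
      def main(args: Array[String]): Unit = {
        // Mirrors the clamp above: floor negative inputs at 0, cap just below Int.MaxValue.
        def clamp(requested: Int): Int = requested.max(0).min(Int.MaxValue - 1)
        assert(clamp(Int.MaxValue) + 1 == Int.MaxValue)  // numRows + 1 can no longer wrap around
        assert(clamp(-5) == 0)                           // negative inputs are floored at 0
        assert(clamp(20) == 20)                          // typical values pass through unchanged
        println("clamp arithmetic holds")
      }
    }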

@maropu (Member Author):

yea, looks great. I updated.

@maropu (Member Author) commented Oct 1, 2017

retest this please.

@gatorsmile (Member)
retest this please.

@maropu (Member Author) commented Oct 1, 2017

It seems Jenkins fell asleep.

@maropu (Member Author) commented Oct 1, 2017

retest this please.

@SparkQA commented Oct 1, 2017

Test build #82366 has finished for PR 19401 at commit 99c988a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Review comment on the showString diff (the numRows line before and after the update):

private[sql] def showString(
_numRows: Int, truncate: Int = 20, vertical: Boolean = false): String = {
val numRows = _numRows.max(0)
val numRows = _numRows.max(0).min(Int.MaxValue - 1)
Reviewer (Member):

OK, but now you return one fewer row than expected when it's possible to return Int.MaxValue. Granted this is an extreme corner case, but that seems less compelling than just skipping the display of "more elements" in this case.

@maropu (Member Author) commented Oct 1, 2017:

hmm, I see. Either is okay with me; WDYT? cc: @gatorsmile
IMHO it might still be okay to set [0, Int.MaxValue) as the valid range for show, since this is a corner case.

@gatorsmile (Member) commented Oct 1, 2017:

DataFrame.show() does not work when the number of rows is close to Int.MaxValue: the driver will OOM before finishing the command. Thus, I do not think we can hit this extreme case.

@gatorsmile (Member)

LGTM

@gatorsmile (Member) commented Oct 2, 2017

Thanks! Merged to master.

@asfgit closed this in fa225da on Oct 2, 2017