[SPARK-22176][SQL] Fix overflow issue in Dataset.show #19401
val numRows = _numRows.max(0)
val takeResult = toDF().take(numRows + 1)
val hasMoreData = takeResult.length > numRows
val numTotalRows = toDF().count()
You don't want to do a whole count() here; it could be quite expensive. Instead, just do something like:
val takeResult = toDF().take(if (numRows == Int.MaxValue) numRows else numRows + 1)
val hasMoreData = takeResult.length > numRows
val data = takeResult.take(numRows)
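The guard on Int.MaxValue in this suggestion exists because Int arithmetic on the JVM wraps silently. A minimal sketch of the difference follows; the object and method names are hypothetical and only model the size passed to take, not Spark's actual showString:

```scala
object TakeSizeSketch {
  // The original expression: wraps around to Int.MinValue when numRows == Int.MaxValue.
  def unguardedFetchSize(numRows: Int): Int = numRows + 1

  // The suggested guard: request the extra "is there more?" row only when it fits in an Int.
  def guardedFetchSize(numRows: Int): Int =
    if (numRows == Int.MaxValue) numRows else numRows + 1

  def main(args: Array[String]): Unit = {
    println(unguardedFetchSize(20))           // 21
    println(unguardedFetchSize(Int.MaxValue)) // -2147483648
    println(guardedFetchSize(Int.MaxValue))   // 2147483647
  }
}
```

With the guard, the value handed to take never goes negative, at the cost of not being able to detect extra rows in the Int.MaxValue case.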
ok, I'll update
In the suggested code, hasMoreData becomes meaningless when numRows is Int.MaxValue, so how about this?
val (data, hasMoreData) = if (numRows < Int.MaxValue) {
  val takeResult = toDF().take(numRows + 1)
  (takeResult.take(numRows), takeResult.length > numRows)
} else {
  val takeResult = toDF().take(numRows)
  val numTotalRows = toDF().count()
  (takeResult, numTotalRows > numRows)
}
Force-pushed from 8ff32b3 to 340243c.
Test build #82350 has finished for PR 19401 at commit
Test build #82352 has finished for PR 19401 at commit
  (takeResult.take(numRows), takeResult.length > numRows)
} else {
  val takeResult = toDF().take(numRows)
  val numTotalRows = toDF().count()
This still calls count(). I think it's just not worth it for a purely cosmetic difference: printing "only showing up to 2 billion entries" in the special case where you've collected, and tried to print, 2 billion values. It will probably fail anyway, so just keep this simple.
ok
Test build #82354 has finished for PR 19401 at commit
      _numRows: Int, truncate: Int = 20, vertical: Boolean = false): String = {
    val numRows = _numRows.max(0)
-   val takeResult = toDF().take(numRows + 1)
+   val takeResult = toDF().take(if (numRows == Int.MaxValue) numRows else numRows + 1)
Normally, we split this into two lines. How about the following?
val numRows = _numRows.max(0).min(Int.MaxValue - 1)
val takeResult = toDF().take(numRows + 1)
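As a rough illustration of what this clamp does to the requested row count, here is a small sketch. The names clampNumRows and fetchSize are hypothetical helpers introduced for the example, not part of Spark:

```scala
object ClampSketch {
  // Mirrors the proposed expression: negative requests become 0,
  // and the upper bound is capped one below Int.MaxValue.
  def clampNumRows(requested: Int): Int =
    requested.max(0).min(Int.MaxValue - 1)

  // Because of the cap, adding 1 for the "has more data" probe cannot overflow.
  def fetchSize(requested: Int): Int = clampNumRows(requested) + 1

  def main(args: Array[String]): Unit = {
    println(clampNumRows(-5))           // 0
    println(clampNumRows(20))           // 20
    println(clampNumRows(Int.MaxValue)) // 2147483646
    println(fetchSize(Int.MaxValue))    // 2147483647
  }
}
```

The clamp keeps the later numRows + 1 unconditional, which is what allows the two-line form.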
yea, looks great. I updated.
retest this please.
retest this please.
It seems Jenkins has gone to sleep.
retest this please.
Test build #82366 has finished for PR 19401 at commit
  private[sql] def showString(
      _numRows: Int, truncate: Int = 20, vertical: Boolean = false): String = {
-   val numRows = _numRows.max(0)
+   val numRows = _numRows.max(0).min(Int.MaxValue - 1)
OK, but now you return one fewer row than expected when it's possible to return Int.MaxValue. Granted this is an extreme corner case, but that seems less compelling than just skipping the display of "more elements" in this case.
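The trade-off described here can be modeled on tiny data. In the sketch below, `limit` is a hypothetical stand-in for Int.MaxValue (so the boundary can be exercised with ten rows), a plain Seq stands in for the Dataset, and both function names are made up for illustration:

```scala
object BoundarySketch {
  // Clamped variant: cap numRows at limit - 1, then always fetch one extra row.
  def clamped(rows: Seq[Int], requested: Int, limit: Int): (Seq[Int], Boolean) = {
    val numRows = requested.max(0).min(limit - 1)
    val taken = rows.take(numRows + 1)
    (taken.take(numRows), taken.length > numRows)
  }

  // Guarded variant: skip the extra row only when numRows is exactly at the limit.
  def guarded(rows: Seq[Int], requested: Int, limit: Int): (Seq[Int], Boolean) = {
    val numRows = requested.max(0)
    val taken = rows.take(if (numRows == limit) numRows else numRows + 1)
    (taken.take(numRows), taken.length > numRows)
  }

  def main(args: Array[String]): Unit = {
    val rows = 1 to 10
    val (c, cMore) = clamped(rows, requested = 5, limit = 5)
    val (g, gMore) = guarded(rows, requested = 5, limit = 5)
    println(s"clamped: ${c.length} rows, hasMoreData = $cMore") // 4 rows, true
    println(s"guarded: ${g.length} rows, hasMoreData = $gMore") // 5 rows, false
  }
}
```

At the boundary, the clamped variant shows one fewer row but still reports that more data exists, while the guarded variant shows every requested row and stays silent about the remainder.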
Hmm, I see. Either is okay with me. WDYT? cc: @gatorsmile
IMHO it might still be okay to set [0, Int.MaxValue) as the valid range for show because this is a corner case.
DataFrame.show() does not work when the number of rows is close to Int.MaxValue. The driver will run out of memory before the command finishes. Thus, I do not think we can hit this extreme case.
LGTM
Thanks! Merged to master.
What changes were proposed in this pull request?
This PR fixes an overflow issue in Dataset.show.
How was this patch tested?
Added tests in DataFrameSuite.