Skip to content

Conversation

@HyukjinKwon
Copy link
Member

What changes were proposed in this pull request?

This PR proposes to remove NumberFormat.parse use to disallow a case of partially parsed data. For example,

scala> spark.read.schema("a DOUBLE").option("mode", "FAILFAST").csv(Seq("10u12").toDS).show()
+----+
|   a|
+----+
|10.0|
+----+

How was this patch tested?

Unit tests added in UnivocityParserSuite and CSVSuite.

@HyukjinKwon
Copy link
Member Author

cc @srowen and @falaki, could you take a look and see if I understood correctly?

case options.nanValue => Float.NaN
case options.negativeInf => Float.NegativeInfinity
case options.positiveInf => Float.PositiveInfinity
case datum =>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, it looks we are not using NumberFormat.parse in schema inference -

if ((allCatch opt field.toDouble).isDefined || isInfOrNan(field, options)) {

}
}

test("Do not partially lose data when parsing float and double") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest a better description for this test and please include the JIRA number. E.g.,
SPARK-21263: Invalid float and double are handled correctly in different modes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, thanks.

@SparkQA
Copy link

SparkQA commented Jul 5, 2017

Test build #79168 has finished for PR 18532 at commit 32233dd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 5, 2017

Test build #79167 has finished for PR 18532 at commit 024dfcc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 5, 2017

Test build #79169 has finished for PR 18532 at commit a41c028.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@falaki falaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestions. Otherwise LGTM

var message = intercept[NumberFormatException] {
parser.makeConverter("_1", FloatType, options = options).apply("10u000")
}.getMessage
assert(message.contains("10u000"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some more specific error we could check for?

message = intercept[NumberFormatException] {
parser.makeConverter("_1", DoubleType, options = options).apply("10u000")
}.getMessage
assert(message.contains("10u000"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dito

.csv(Seq("10u12").toDS())
.collect()
}
assert(exception.getMessage.contains("10u12"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we check for more specific error message?

@HyukjinKwon
Copy link
Member Author

Thank you @falaki. I just updated.

@SparkQA
Copy link

SparkQA commented Jul 6, 2017

Test build #79256 has finished for PR 18532 at commit c1967f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Jul 11, 2017

Merged to master

@asfgit asfgit closed this in 7514db1 Jul 11, 2017
@HyukjinKwon HyukjinKwon deleted the SPARK-21263 branch January 2, 2018 03:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants