[SPARK-21263][SQL] Do not allow partially parsing double and floats via NumberFormat in CSV #18532

HyukjinKwon · 2017-07-04T23:45:18Z

What changes were proposed in this pull request?

This PR proposes removing the use of NumberFormat.parse so that partially parsed data is no longer accepted. For example, the input "10u12" is currently read as 10.0 even in FAILFAST mode:

scala> spark.read.schema("a DOUBLE").option("mode", "FAILFAST").csv(Seq("10u12").toDS).show()
+----+
|   a|
+----+
|10.0|
+----+
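The behavior above can be reproduced outside Spark. This is a minimal sketch, assuming the partial parse comes from java.text.NumberFormat stopping at the first character it cannot consume. The object and method names (PartialParseDemo, lenientParse, strictParse) are illustrative and not taken from the patch.

```scala
import java.text.NumberFormat
import java.util.Locale
import scala.util.Try

// Illustrative names; not part of Spark.
object PartialParseDemo {
  // NumberFormat.parse(String) parses from the start of the string and
  // silently ignores everything after the last character it can consume.
  def lenientParse(s: String): Double =
    NumberFormat.getInstance(Locale.US).parse(s).doubleValue()

  // String.toDouble delegates to java.lang.Double.parseDouble, which
  // rejects the whole string if any part of it is not a valid number.
  def strictParse(s: String): Option[Double] = Try(s.toDouble).toOption

  def main(args: Array[String]): Unit = {
    println(lenientParse("10u12")) // the trailing "u12" is dropped
    println(strictParse("10u12"))  // the input is rejected as a whole
    println(strictParse("10.5"))   // well-formed input still parses
  }
}
```

The strict variant is the kind of conversion that makes FAILFAST mode able to surface the malformed field instead of returning a truncated value.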

How was this patch tested?

Unit tests added in UnivocityParserSuite and CSVSuite.

HyukjinKwon · 2017-07-04T23:45:41Z

cc @srowen and @falaki, could you take a look and see if I understood correctly?

HyukjinKwon · 2017-07-04T23:47:30Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala

        case options.nanValue => Float.NaN
        case options.negativeInf => Float.NegativeInfinity
        case options.positiveInf => Float.PositiveInfinity
-        case datum =>


BTW, it looks like we are not using NumberFormat.parse in schema inference:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala

Line 141 in 7e5359b

if ((allCatch opt field.toDouble).isDefined || isInfOrNan(field, options)) {
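To illustrate the check quoted above, here is a small sketch of how an allCatch-based test classifies fields. The isInfOrNan stand-in below is hypothetical: it hard-codes the tokens "NaN", "Inf" and "-Inf" as an assumption, whereas the real helper takes them from the CSV options. InferDoubleSketch and looksLikeDouble are illustrative names as well.

```scala
import scala.util.control.Exception.allCatch

// Illustrative names; a simplified stand-in for the inference check.
object InferDoubleSketch {
  // Assumed special tokens; the real code reads these from CSV options.
  private val specialTokens = Set("NaN", "Inf", "-Inf")

  def isInfOrNan(field: String): Boolean = specialTokens.contains(field)

  // allCatch.opt turns any exception thrown by toDouble into None,
  // so a partially numeric field such as "10u12" is not a double here.
  def looksLikeDouble(field: String): Boolean =
    (allCatch opt field.toDouble).isDefined || isInfOrNan(field)

  def main(args: Array[String]): Unit = {
    println(looksLikeDouble("10.5"))
    println(looksLikeDouble("10u12"))
    println(looksLikeDouble("Inf"))
  }
}
```

Under this sketch, inference already treats "10u12" as non-numeric, which is the strictness the parsing path lacked while it relied on NumberFormat.parse.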

falaki · 2017-07-04T23:52:30Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

      }
  }
+
+  test("Do not partially lose data when parsing float and double") {


I suggest a better description for this test and please include the JIRA number. E.g.,
SPARK-21263: Invalid float and double are handled correctly in different modes

Sure, thanks.

SparkQA · 2017-07-05T01:44:35Z

Test build #79168 has finished for PR 18532 at commit 32233dd.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-07-05T02:01:58Z

Test build #79167 has finished for PR 18532 at commit 024dfcc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-07-05T02:36:51Z

Test build #79169 has finished for PR 18532 at commit a41c028.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

falaki

Minor suggestions. Otherwise LGTM

falaki · 2017-07-05T21:37:04Z

...ore/src/test/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParserSuite.scala

+    var message = intercept[NumberFormatException] {
+      parser.makeConverter("_1", FloatType, options = options).apply("10u000")
+    }.getMessage
+    assert(message.contains("10u000"))


Is there some more specific error we could check for?

falaki · 2017-07-05T21:37:12Z

...ore/src/test/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParserSuite.scala

+    message = intercept[NumberFormatException] {
+      parser.makeConverter("_1", DoubleType, options = options).apply("10u000")
+    }.getMessage
+    assert(message.contains("10u000"))


falaki · 2017-07-05T21:38:53Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

+        .csv(Seq("10u12").toDS())
+        .collect()
+    }
+    assert(exception.getMessage.contains("10u12"))


Can we check for a more specific error message?
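One way to make such an assertion more specific is sketched here without ScalaTest, so that it runs on its own: match on the exception type and compare against the message that java.lang.Double.parseDouble is expected to produce. The helper name failureMessage is illustrative, the exact message text is an assumption about the JDK, and this sketch does not claim to be what the updated tests check.

```scala
import scala.util.{Failure, Success, Try}

// Illustrative names; not part of the Spark test suites.
object SpecificErrorSketch {
  // Returns the NumberFormatException message if strict parsing fails,
  // None if the input parses, and rethrows any other kind of failure.
  def failureMessage(s: String): Option[String] = Try(s.toDouble) match {
    case Failure(e: NumberFormatException) => Some(e.getMessage)
    case Failure(other)                    => throw other
    case Success(_)                        => None
  }

  def main(args: Array[String]): Unit = {
    val message = failureMessage("10u000")
    // Checks the full phrase rather than only the offending token.
    assert(message.exists(_.contains("For input string: \"10u000\"")))
  }
}
```

Asserting on the full phrase guards against a test passing merely because the raw input happens to appear in some unrelated error text.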

HyukjinKwon · 2017-07-06T02:38:35Z

Thank you @falaki. I just updated.

SparkQA · 2017-07-06T04:55:51Z

Test build #79256 has finished for PR 18532 at commit c1967f8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2017-07-11T10:11:14Z

Merged to master

Do not allow partially parsing double and floats via NumberFormat in CSV

024dfcc

HyukjinKwon commented Jul 4, 2017


falaki suggested changes Jul 4, 2017


HyukjinKwon added 2 commits July 5, 2017 09:04

Rename the title of the test in CSVSuite

32233dd

Rename the title of test in UnivocityParserSuite too

a41c028

falaki reviewed Jul 5, 2017


Address comments

c1967f8

srowen approved these changes Jul 8, 2017


asfgit closed this in 7514db1 Jul 11, 2017

HyukjinKwon deleted the SPARK-21263 branch January 2, 2018 03:41
