[SPARK-14143] Options for parsing NaNs, Infinity and nulls for numeric types #11947

falaki · 2016-03-25T00:04:59Z

What changes were proposed in this pull request?

Adds the following options for parsing type-specific nulls to the CSV data source:

byteNullValue
integerNullValue
shortNullValue
longNullValue
floatNullValue
doubleNullValue
decimalNullValue

Adds the following options for parsing NaNs:

floatNaNValue
doubleNaNValue

And the following options for parsing infinity:

floatNegativeInf
floatPositiveInf
doubleNegativeInf
doublePositiveInf

How was this patch tested?

TypeCast.castTo is unit tested and an end-to-end test is added to CSVSuite
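
To make the intent of the per-type options concrete, here is a minimal sketch of how a float cast could consult them. It is not the code from this patch: `NumericTokens`, `FloatCastSketch` and the default token strings are hypothetical, and `Option` stands in for the nullable value the real cast returns.

```scala
// Hypothetical stand-in for the float-related options proposed above.
// The default token strings are assumptions made for illustration.
final case class NumericTokens(
    floatNullValue: String = "",
    floatNaNValue: String = "NaN",
    floatNegativeInf: String = "-Inf",
    floatPositiveInf: String = "Inf")

object FloatCastSketch {
  // Maps the configured tokens to null (None), NaN or infinity,
  // and falls back to ordinary float parsing for everything else.
  def castToFloat(datum: String, opts: NumericTokens): Option[Float] =
    if (datum == opts.floatNullValue) None
    else if (datum == opts.floatNaNValue) Some(Float.NaN)
    else if (datum == opts.floatNegativeInf) Some(Float.NegativeInfinity)
    else if (datum == opts.floatPositiveInf) Some(Float.PositiveInfinity)
    else Some(datum.toFloat)
}
```

A caller with a custom NaN marker would pass something like `NumericTokens(floatNaNValue = "n/a")`; all other tokens keep their defaults.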

SparkQA · 2016-03-25T00:08:56Z

Test build #54113 has finished for PR 11947 at commit 93ac6bb.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

falaki · 2016-03-28T17:42:13Z

ping @HyukjinKwon and @rxin

falaki · 2016-03-28T20:31:15Z

@cloud-fan would you take a look at this if you have time?

HyukjinKwon · 2016-03-29T00:29:17Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala

+object CSVOptions {
+
+  /** Used for convenient construction in unit tests */
+  def apply(): CSVOptions = new CSVOptions(Map.empty)


I'm a bit hesitant about this CSVOptions companion object if it is only used in unit tests.

I'd just use new CSVOptions(Map("key" -> "value")) or new CSVOptions(Map.empty) in tests.
Otherwise, if this object is needed for some reason, I'd define it in the tests, or just add a function in the tests for convenient construction.
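
The test-only helper suggested here could look roughly like the sketch below. `OptionsSketch` is a hypothetical stand-in for CSVOptions (the real class is part of Spark and has many more fields), and `OptionsTestHelper` is an invented name for something that would live under the test sources.

```scala
// Hypothetical stand-in for CSVOptions, reduced to a single setting.
class OptionsSketch(val parameters: Map[String, String]) {
  def nullValue: String = parameters.getOrElse("nullValue", "")
}

// Convenience constructors kept in test code, so the main sources
// need no companion object that exists only for tests.
object OptionsTestHelper {
  def emptyOptions: OptionsSketch = new OptionsSketch(Map.empty)

  def optionsWith(entries: (String, String)*): OptionsSketch =
    new OptionsSketch(entries.toMap)
}
```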

HyukjinKwon · 2016-03-29T03:01:05Z

Overall, the code looks good to me.

However, I don't have much experience dealing with NaN, Inf or -Inf. If these values can be written differently in many cases, I think the options are reasonable.

Nevertheless, I have doubts about the per-type options for null.

HyukjinKwon · 2016-03-29T05:00:36Z

I found that both NaN and Infinity are handled in the JSON data source, which was fixed in this PR: 7a9dcbc.

cc @yhuai

cloud-fan · 2016-03-29T11:31:53Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVTypeCastSuite.scala


 class CSVTypeCastSuite extends SparkFunSuite {

+  private def isNull(v: Any) = assert(v == null)


nit: isNull looks like something that returns a boolean; how about assertNull?

cloud-fan · 2016-03-29T11:36:06Z

I'm not sure how complicated the use cases will be, but having so many options really scares me...

If we decide to do it, I think we should also add these options to JSON to keep them consistent.

HyukjinKwon · 2016-03-29T22:44:35Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala

+        } else if (datum == params.doublePositiveInf) {
+          Double.PositiveInfinity
+        } else {
+          Try(datum.toDouble)


(Also, it looks like use of the Try API is discouraged; see scala-style-guide#exception.)

I think in this case, in a private and unexposed method, this seems OK. There are many other instances of it in CSVInferSchema.
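
For readers weighing the two styles discussed here, this is a small sketch that puts them side by side. `ParseDoubleSketch` and its method names are hypothetical and are not taken from CSVInferSchema.

```scala
import scala.util.Try

object ParseDoubleSketch {
  // Try-based variant, similar in spirit to the snippet under review.
  // Any non-fatal exception is turned into None.
  def withTry(datum: String): Option[Double] =
    Try(datum.toDouble).toOption

  // Explicit variant that only swallows the one failure we expect
  // from parsing, so unrelated errors still surface.
  def withCatch(datum: String): Option[Double] =
    try Some(datum.toDouble)
    catch { case _: NumberFormatException => None }
}
```

Both variants behave the same for well-formed and malformed numeric strings; they differ only in how broadly exceptions are caught.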

SparkQA · 2016-04-05T19:18:47Z

Test build #55010 has finished for PR 11947 at commit 180a900.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-05T22:47:56Z

Test build #55033 has finished for PR 11947 at commit 124873b.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-12T01:58:26Z

Test build #55559 has finished for PR 11947 at commit 161a3eb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

koertkuipers · 2016-04-27T18:55:24Z

hello!
why is there no stringNullValue?
basically i want a column with type string to read in all empty strings as nulls. this is what the old option "treatEmptyStringsAsNulls" used to do. it's the natural complement for writing out nulls as empty strings (without this, data does not roundtrip).
thanks

koertkuipers · 2016-04-27T19:01:17Z

do these settings roundtrip correctly? say i set doubleNaNValue to "XY", and i create a dataframe with a Double.NaN in it, does it get written out correctly as XY, and then XY gets read back in correctly as Double.NaN?

koertkuipers · 2016-04-27T19:10:50Z

i personally would have been happy with a simple single value for nulls for all datatypes.

and the usage of that single value should be consistent across reading and writing. so when that value is encountered during reading it becomes null (except for double/float columns it becomes NaN perhaps), and when writing, a null value gets written out as this value.

for example when dealing with text files dumped from hive this value is typically "\N" across all columns and datatypes. when i read this sort of data i simply want every "\N" to become null, and when writing out data that needs to be compatible with hive i would like to write out nulls across all columns as "\N".

for cascading/scalding this value is typically "" (the empty value). so again i would want all empty values to be converted to nulls when reading, and when writing i would want every null to be written out as the empty value.

a single setting "nullValue" that means when reading this becomes a null, and when writing that nulls get written as this, is basically all that's needed, i think.

i do realize some people might have custom values for NaN and infinity for numerical columns, i have no experience with this.

thanks
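
The single symmetric "nullValue" described above can be sketched as a pair of functions that share one token. The object and method names are hypothetical, and `None` stands in for a SQL null.

```scala
object NullValueRoundTripSketch {
  // Reading: the configured token becomes None.
  def readField(raw: String, nullValue: String): Option[String] =
    if (raw == nullValue) None else Some(raw)

  // Writing: None is written back out as the same token.
  def writeField(value: Option[String], nullValue: String): String =
    value.getOrElse(nullValue)
}
```

With the hive-style token from the comment, `Seq("a", "\\N", "b")` reads as `Seq(Some("a"), None, Some("b"))` and writes back to the original fields, which is the roundtrip property being asked for.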

rxin · 2016-04-30T01:58:04Z

As discussed offline, we should just have a single option for setting null, another for nan, another for inf and negative inf. Basically just 4.
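
A rough sketch of what "basically just 4" could mean in code: one set of tokens shared by every numeric type. The field names and defaults below are assumptions made for illustration, and `SharedTokens` and `SharedCastSketch` are hypothetical names.

```scala
// One set of tokens for all numeric types (names are assumed).
final case class SharedTokens(
    nullValue: String = "",
    nanValue: String = "NaN",
    positiveInf: String = "Inf",
    negativeInf: String = "-Inf")

object SharedCastSketch {
  // Generic helper: the caller supplies the type-specific constants
  // and the fallback parser.
  private def parseWith[A](
      datum: String, o: SharedTokens, nan: A, posInf: A, negInf: A)(
      parse: String => A): Option[A] =
    if (datum == o.nullValue) None
    else if (datum == o.nanValue) Some(nan)
    else if (datum == o.positiveInf) Some(posInf)
    else if (datum == o.negativeInf) Some(negInf)
    else Some(parse(datum))

  def toFloat(datum: String, o: SharedTokens): Option[Float] =
    parseWith(datum, o, Float.NaN, Float.PositiveInfinity, Float.NegativeInfinity)(_.toFloat)

  def toDouble(datum: String, o: SharedTokens): Option[Double] =
    parseWith(datum, o, Double.NaN, Double.PositiveInfinity, Double.NegativeInfinity)(_.toDouble)
}
```

Compared with per-type options, the same four strings configure both the float and the double path, so there is nothing to keep in sync between types.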

SparkQA · 2016-04-30T03:40:38Z

Test build #57394 has finished for PR 11947 at commit 698b4b4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-04-30T06:00:42Z

@falaki sorry this no longer merges cleanly. Do you mind bringing it up to date?

falaki · 2016-04-30T07:19:27Z

@rxin done.

rxin · 2016-04-30T07:23:36Z

LGTM pending tests.

SparkQA · 2016-04-30T08:42:58Z

Test build #57423 has finished for PR 11947 at commit 6facd26.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

koertkuipers · 2016-04-30T17:17:59Z

please also provide a way for strings to be converted to null upon reading

rxin · 2016-04-30T18:00:50Z

@falaki can you update the pr description?

rxin · 2016-04-30T18:04:40Z

@HyukjinKwon would be great if you can review this. Thanks.

HyukjinKwon · 2016-04-30T23:52:18Z

LGTM.

(Note to myself: let's not forget the precedence of the options when the given values are the same, for a PR on the CSV docs.)
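
The precedence question matters because the tokens are checked one after another. The sketch below assumes the null token is checked before the NaN token; that order is an assumption for illustration, not something this thread states, and all names are hypothetical.

```scala
object PrecedenceSketch {
  sealed trait Parsed
  case object IsNull extends Parsed
  case object IsNaN extends Parsed
  final case class Value(d: Double) extends Parsed

  // The null token is checked first, so it wins when both
  // options are configured with the same string.
  def classify(datum: String, nullValue: String, nanValue: String): Parsed =
    if (datum == nullValue) IsNull
    else if (datum == nanValue) IsNaN
    else Value(datum.toDouble)
}
```

With both options set to "NA", the field "NA" is classified as null and the NaN branch is never reached, which is the kind of behavior the documentation would need to spell out.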

rxin · 2016-05-01T01:07:15Z

OK I'm going to merge this in master and manually update the commit message.

Added support for null, NaN and Inf options for numeric types

93ac6bb

HyukjinKwon reviewed Mar 29, 2016

cloud-fan reviewed Mar 29, 2016

HyukjinKwon reviewed Mar 29, 2016
falaki added 2 commits April 5, 2016 11:53

Merge branch 'master' of https://github.com/apache/spark

9594ee5

Addressed comments

180a900

Using assertNull instead of isNull

124873b

Merged master

161a3eb

Merge branch 'master' into SPARK-14143

3316101

Updated to reduce number of options

698b4b4

HyukjinKwon mentioned this pull request Apr 30, 2016

[SPARK-13667][SQL] Support for specifying custom date format for date and timestamp types at CSV datasource. #11550

Closed

Merged master

6facd26

asfgit closed this in 507bea5 May 1, 2016

