Skip to content

Conversation

@falaki
Copy link
Contributor

@falaki falaki commented Mar 25, 2016

What changes were proposed in this pull request?

Adds following options for parsing type-specfic nulls to CSV data source:

  • byteNullValue
  • integerNullValue
  • shortNullValue
  • longNullValue
  • floatNullValue
  • doubleNullValue
  • decimalNullValue

Adds following options for parsing NaNs:

  • floatNaNValue
  • doubleNaNValue

And following options for parsing infinity:

  • floatNegativeInf
  • floatPositiveInf
  • doubleNegativeInf
  • doublePositiveInf

How was this patch tested?

TypeCast.castTo is unit tested and an end-to-end test is added to CSVSuite

@SparkQA
Copy link

SparkQA commented Mar 25, 2016

Test build #54113 has finished for PR 11947 at commit 93ac6bb.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@falaki
Copy link
Contributor Author

falaki commented Mar 28, 2016

ping @HyukjinKwon and @rxin

@falaki
Copy link
Contributor Author

falaki commented Mar 28, 2016

@cloud-fan would you take a look at this if you have time?

object CSVOptions {

/** Used for convenient construction in unit tests */
def apply(): CSVOptions = new CSVOptions(Map.empty)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me, I feel a bit hesitating if this CSVOptions companion object is only used in unit tests.

I'd just use new CSVOptions(Map("key" -> "value")) or new CSVOptions(Map.empty) in tests.
Otherwise, I'd just make this object in the tests if this object is required for some reasons or just make a function in tests for convenient construction.

@HyukjinKwon
Copy link
Member

For codes, overall, it looks good to me.

However, I am not used to and don't have a lot of experience of dealing with NaN, Inf or -Inf. If the values can be different in many cases, I think it is reasonable.

Nevertheless, I feel a bit questionable for the options for null for each type.

@HyukjinKwon
Copy link
Member

I found both NaN and Infinity are handled in JSON data source and it was fixed in this PR, 7a9dcbc.

cc @yhuai


class CSVTypeCastSuite extends SparkFunSuite {

private def isNull(v: Any) = assert(v == null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: isNull looks like something that return boolean, how about assertNull?

@cloud-fan
Copy link
Contributor

I'm not sure how complicated the use case will be, but it really scares me with so many options...

If we decide to do it, I think we should also add these options to JSON, to make them consistent.

} else if (datum == params.doublePositiveInf) {
Double.PositiveInfinity
} else {
Try(datum.toDouble)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Also, it looks the use of Try API is discouraged scala-style-guide#exception.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in this case, in a private and unexposed method, this seem OK. There are many other instances of it in CSVInferSchema

@SparkQA
Copy link

SparkQA commented Apr 5, 2016

Test build #55010 has finished for PR 11947 at commit 180a900.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 5, 2016

Test build #55033 has finished for PR 11947 at commit 124873b.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 12, 2016

Test build #55559 has finished for PR 11947 at commit 161a3eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@koertkuipers
Copy link
Contributor

hello!
why is there no stringNullValue?
basically i want for a column with type string to read in all empty strings as nulls. this is what the old option "treatEmptyStringsAsNulls" used to do. its the natural complement for writing out nulls as empty strings (without this data does not roundtrip).
thanks

@koertkuipers
Copy link
Contributor

do these settings roundtrip correctly? say i set doubleNaNValue to "XY", and i create a dataframe with a Double.NaN in it, does it get written out correctly as XY, and then XY gets read back in correctly as Double.NaN?

@koertkuipers
Copy link
Contributor

koertkuipers commented Apr 27, 2016

i personally would have been happy with a simple single value for nulls for all datatypes.

and the usage of that single value should be consistent across reading and writing. so when that value is encountered during reading it becomes null (except for double/float columns it becomes NaN perhaps), and when writing a null values gets written out as this value.

for example when dealing with text files dumped from hive this value is typically "\N" across all columns and datatypes. when i read this sort of data i simply want every "\N" to become null, and when writing out data that needs to be compatible with hive i would like to write out nulls across all columns as "\N".

for cascading/scalding this value is typically "" (the empty value). so again i would want all empty values to be converted to nulls when reading, and when writing i would want every null to be written out as the empty value.

a single setting "nullValue" that means when reading this becomes a null, and when writing that nulls get written as this, is basically all thats needed, i think.

i do realize some people might have custom values for NaN and infinity for numerical columns, i have no experience with this.

thanks

@rxin
Copy link
Contributor

rxin commented Apr 30, 2016

As discussed offline, we should just have a single option for setting null, another for nan, another for inf and negative inf. Basically just 4.

@SparkQA
Copy link

SparkQA commented Apr 30, 2016

Test build #57394 has finished for PR 11947 at commit 698b4b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Apr 30, 2016

@falaki sorry this no longer merges cleanly. Do you mind bringing it up to date?

@falaki
Copy link
Contributor Author

falaki commented Apr 30, 2016

@rxin done.

@rxin
Copy link
Contributor

rxin commented Apr 30, 2016

LGTM pending tests.

@SparkQA
Copy link

SparkQA commented Apr 30, 2016

Test build #57423 has finished for PR 11947 at commit 6facd26.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@koertkuipers
Copy link
Contributor

please also provide a way for strings to be converted to null upon reading

@rxin
Copy link
Contributor

rxin commented Apr 30, 2016

@falaki can you update the pr description?

@rxin
Copy link
Contributor

rxin commented Apr 30, 2016

@HyukjinKwon would be great if you can review this. Thanks.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Apr 30, 2016

LGTM.

(Note to myself, let's not forgot the precedence for the options if the given values are same for a PR of CSV doc)

@rxin
Copy link
Contributor

rxin commented May 1, 2016

OK I'm going to merge this in master and manually update the commit message.

@asfgit asfgit closed this in 507bea5 May 1, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants