[SPARK-14143] Options for parsing NaNs, Infinity and nulls for numeric types #11947
Test build #54113 has finished for PR 11947 at commit
|
|
ping @HyukjinKwon and @rxin |
|
@cloud-fan would you take a look at this if you have time? |
object CSVOptions {

  /** Used for convenient construction in unit tests */
  def apply(): CSVOptions = new CSVOptions(Map.empty)
I'm a bit hesitant about this CSVOptions companion object if it is only used in unit tests.
I'd just use new CSVOptions(Map("key" -> "value")) or new CSVOptions(Map.empty) in tests.
Otherwise, if the object is required for some reason, I'd define it in the tests, or just add a function in the tests for convenient construction.
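The alternative suggested here can be sketched as follows. `CSVOptions` below is a simplified, hypothetical stand-in for the real class (only the Map-based constructor comes from the snippet under review); `CSVTestHelpers`, `optionsFor`, and the `nullValue` accessor with its default are illustrative names, not part of the PR.

```scala
// Hypothetical stand-in for the real CSVOptions; only the Map-based
// constructor is taken from the snippet under review.
class CSVOptions(val parameters: Map[String, String]) {
  // Illustrative accessor with an assumed default; the real class may differ.
  val nullValue: String = parameters.getOrElse("nullValue", "")
}

// Would live in the test sources, so production code needs no companion object.
object CSVTestHelpers {
  // Builds options from zero or more key/value pairs.
  def optionsFor(kvs: (String, String)*): CSVOptions = new CSVOptions(kvs.toMap)
}
```

With a helper like this, a test can write `CSVTestHelpers.optionsFor()` for defaults or `CSVTestHelpers.optionsFor("nullValue" -> "-")` for a specific setting.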
|
For codes, overall, it looks good to me. However, I am not used to and don't have a lot of experience of dealing with Nevertheless, I feel a bit questionable for the options for |
|
|
class CSVTypeCastSuite extends SparkFunSuite {

  private def isNull(v: Any) = assert(v == null)
nit: isNull looks like something that returns a boolean; how about assertNull?
|
I'm not sure how complicated the use case will be, but it really scares me with so many options... If we decide to do it, I think we should also add these options to JSON, to make them consistent. |
} else if (datum == params.doublePositiveInf) {
  Double.PositiveInfinity
} else {
  Try(datum.toDouble)
(Also, it looks like use of the Try API is discouraged: scala-style-guide#exception.)
I think in this case, in a private and unexposed method, this seems OK. There are many other instances of it in CSVInferSchema.
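For comparison, the two styles under discussion can be sketched side by side. `Params`, `castWithTry`, and `castExplicit` are hypothetical names; only `doublePositiveInf` and the `Try(datum.toDouble)` call appear in the snippet above. The default token and the choice to return `None` on a parse failure are assumptions made for this sketch.

```scala
import scala.util.Try

object DoubleCastSketch {
  // Hypothetical options holder; the default token is an assumption.
  final case class Params(doublePositiveInf: String = "Inf")

  // Style used in the PR snippet: wrap the conversion in Try.
  def castWithTry(datum: String, params: Params): Option[Double] =
    if (datum == params.doublePositiveInf) Some(Double.PositiveInfinity)
    else Try(datum.toDouble).toOption

  // Explicit alternative: catch only the exception that parsing can raise.
  def castExplicit(datum: String, params: Params): Option[Double] =
    if (datum == params.doublePositiveInf) Some(Double.PositiveInfinity)
    else {
      try Some(datum.toDouble)
      catch { case _: NumberFormatException => None }
    }
}
```

Both variants behave the same for well-formed and malformed input; the difference is that `Try` also swallows any other non-fatal exception, which is easier to justify in a private method than in an exposed API.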
|
Test build #55010 has finished for PR 11947 at commit
|
|
Test build #55033 has finished for PR 11947 at commit
|
|
Test build #55559 has finished for PR 11947 at commit
|
|
hello! |
|
do these settings roundtrip correctly? say i set doubleNaNValue to "XY", and i create a dataframe with a Double.NaN in it, does it get written out correctly as XY, and then XY gets read back in correctly as Double.NaN? |
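The round-trip property being asked about can be stated as a small check. This is a standalone sketch, not the data source code: `writeDouble` and `readDouble` are hypothetical helpers, and `nanToken` plays the role of the `doubleNaNValue` setting from the question.

```scala
object NaNRoundTrip {
  // Write side: NaN becomes the configured token, other values use toString.
  def writeDouble(d: Double, nanToken: String): String =
    if (d.isNaN) nanToken else d.toString

  // Read side: the configured token becomes NaN, anything else is parsed.
  def readDouble(s: String, nanToken: String): Double =
    if (s == nanToken) Double.NaN else s.toDouble
}
```

The setting round-trips only if the writer and the reader consult the same token; if only the read path honors it, a written NaN would come back as whatever the default string form parses to.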
|
i personally would have been happy with a simple single value for nulls for all datatypes. and the usage of that single value should be consistent across reading and writing. so when that value is encountered during reading it becomes null (except for double/float columns it becomes NaN perhaps), and when writing a null value gets written out as this value. for example when dealing with text files dumped from hive this value is typically "\N" across all columns and datatypes. when i read this sort of data i simply want every "\N" to become null, and when writing out data that needs to be compatible with hive i would like to write out nulls across all columns as "\N". for cascading/scalding this value is typically "" (the empty value). so again i would want all empty values to be converted to nulls when reading, and when writing i would want every null to be written out as the empty value. a single setting "nullValue" that means when reading this becomes a null, and when writing that nulls get written as this, is basically all that's needed, i think. i do realize some people might have custom values for NaN and infinity for numerical columns, i have no experience with this. thanks
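A minimal sketch of the symmetric behavior described here, applied to a whole row of raw fields. `SingleNullValue`, `readRow`, and `writeRow` are hypothetical names, and representing nulls as `None` is a simplification for the example; the actual data source works on typed rows.

```scala
object SingleNullValue {
  // Reading: any field equal to nullValue becomes None, regardless of column type.
  def readRow(fields: Seq[String], nullValue: String): Seq[Option[String]] =
    fields.map(f => if (f == nullValue) None else Some(f))

  // Writing: every None is emitted as the same nullValue.
  def writeRow(row: Seq[Option[String]], nullValue: String): Seq[String] =
    row.map(_.getOrElse(nullValue))
}
```

For Hive-style dumps the token would be `"\\N"` in Scala source (a backslash followed by N), and for the Cascading/Scalding convention it would be the empty string.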
|
As discussed offline, we should just have a single option for setting null, another for nan, another for inf and negative inf. Basically just 4. |
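The four-option design can be sketched as a single settings holder consulted by a numeric cast. The field names and defaults in `NumericTokens` are illustrative assumptions, and `castToDouble` is a hypothetical helper rather than the method in this PR.

```scala
object FourOptionsSketch {
  // One token each for null, NaN, +Infinity and -Infinity.
  // Names and defaults are assumptions made for this sketch.
  final case class NumericTokens(
      nullValue: String = "",
      nanValue: String = "NaN",
      positiveInf: String = "Inf",
      negativeInf: String = "-Inf")

  // Returns None for a null field, otherwise the parsed double.
  def castToDouble(datum: String, t: NumericTokens): Option[Double] =
    if (datum == t.nullValue) None
    else if (datum == t.nanValue) Some(Double.NaN)
    else if (datum == t.positiveInf) Some(Double.PositiveInfinity)
    else if (datum == t.negativeInf) Some(Double.NegativeInfinity)
    else Some(datum.toDouble)
}
```

Compared with per-type options, this keeps the configuration surface small: the same four tokens would apply to every numeric column.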
|
Test build #57394 has finished for PR 11947 at commit
|
|
@falaki sorry this no longer merges cleanly. Do you mind bringing it up to date? |
|
@rxin done. |
|
LGTM pending tests. |
|
Test build #57423 has finished for PR 11947 at commit
|
|
please also provide a way for strings to be converted to null upon reading |
|
@falaki can you update the pr description? |
|
@HyukjinKwon would be great if you can review this. Thanks. |
|
LGTM. (Note to myself: let's not forget the precedence of the options when the given values are the same, for a PR on the CSV docs.)
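The precedence question arises when two options are set to the same string. A sketch of one way this could resolve, assuming the checks run in the order null, NaN, positive infinity, negative infinity (that order is an assumption for the example, not something stated in this thread; `OptionPrecedence` and `resolve` are hypothetical names):

```scala
object OptionPrecedence {
  // Rules are tried in list order, so when two options share a token
  // the earlier rule wins. The order here is an assumption.
  def resolve(
      datum: String,
      nullValue: String,
      nanValue: String,
      positiveInf: String,
      negativeInf: String): Option[Double] = {
    val rules: Seq[(String, Option[Double])] = Seq(
      nullValue -> None,
      nanValue -> Some(Double.NaN),
      positiveInf -> Some(Double.PositiveInfinity),
      negativeInf -> Some(Double.NegativeInfinity))
    // First matching token decides the result; otherwise parse the number.
    val hit = rules.collectFirst { case (token, result) if token == datum => result }
    hit.getOrElse(Some(datum.toDouble))
  }
}
```

With both `nullValue` and `nanValue` set to `"NA"`, this sketch reads `"NA"` as null, which is the kind of behavior the documentation would need to spell out.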
|
OK I'm going to merge this in master and manually update the commit message. |
What changes were proposed in this pull request?
Adds the following options for parsing type-specific nulls to the CSV data source:
Adds the following options for parsing NaNs:
And the following options for parsing infinity:
How was this patch tested?
TypeCast.castTo is unit tested and an end-to-end test is added to CSVSuite.