
Conversation

@rachelwarren

Allows dirty data to be parsed as nulls rather than causing failures. For example, if there is a String in a numeric column, rather than failing it lets that value be null. My experience is that data is often dirty, and sometimes you want an option that allows malformed numbers or dates without causing failures. I have noticed that a good deal of the issues filed against this package have to do with similar problems. My implementation is written so that it doesn't affect any of the existing functionality. Instead, it provides an extra option, "withParseExceptionAsNull".
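For illustration, usage might look like this (a sketch: the schema and file path are made up, and the assumption that the new builder method takes a boolean flag is mine):

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("year", IntegerType, nullable = true),
  StructField("cars", StringType, nullable = true)))

val df = new CsvParser()
  .withUseHeader(true)
  .withSchema(schema)
  .withParseExceptionAsNull(true) // assumed flag; malformed values parse as null
  .csvFile(sqlContext, "cars.csv")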

@codecov-io

Current coverage is 86.30%

Merging #298 into master will decrease coverage by 0.36% as of f2349c3

@@            master   #298   diff @@
=====================================
  Files           12     12       
  Stmts          540    555    +15
  Branches       160    166     +6
  Methods          0      0       
=====================================
+ Hit            468    479    +11
  Partial          0      0       
- Missed          72     76     +4

Review entire Coverage Diff as of f2349c3

Powered by Codecov. Updated on successful CI builds.

@HyukjinKwon
Member

@falaki Actually, shouldn't such behaviour be included in PERMISSIVE mode?

@HyukjinKwon
Member

@rachelwarren Would you maybe share some code and output after this change?

AFAIK the JSON data source in Spark produces null in this case. I think each mode does not necessarily have to work exactly the same across data sources, but it would be great if they were consistent.

I am trying to add parse modes to JSON in Spark: apache/spark#11756.

@rachelwarren
Author

I'm not exactly sure what you mean. The goal is that running something like this:

new CsvParser()
  .withUseHeader(header)
  .withDelimiter(delim)
  .withQuoteChar(quoteString)
  .withSchema(schema)
  .csvFile(sqlContext, path)

will not fail in the general case, even if the schema specifies, for example, a numeric column and there is a string value in that column.
For example, if we have

year, cars
1995, "Toyota"
2000, "Honda"
new, "Ford"

then this will be read in as

1995, "Toyota"
2000, "Honda"
null, "Ford"

if the year column is an integer.
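
To make the example concrete, the schema here might be defined like this (a sketch):

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("year", IntegerType, nullable = true),
  StructField("cars", StringType, nullable = true)))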

@HyukjinKwon
Member

Ah, I mean that this behaviour might have to be included in PERMISSIVE mode rather than via an option, because the JSON data source also produces results this way in PERMISSIVE mode.

For example, the data below:

{"a": {"b": 1}}
{"a": []}

with the schema below:

val schema =
  StructType(
    StructField("a", StructType(
      StructField("b", StringType) :: Nil
    )) :: Nil)

produces the results below:

+----+
|   a|
+----+
| [1]|
|null|
+----+

So I thought it might be better if this CSV data source were consistent with that.

@rachelwarren
Author

I see. I can look into making that fix.


@HyukjinKwon
Member

Maybe we should wait for feedback. That was just my opinion.

@HyukjinKwon
Member

I realised this is actually related to https://github.com/databricks/spark-csv/issues/286.

@rachelwarren
Author

It does. I actually did try adding it to PERMISSIVE mode, so you can see what that looks like on this branch: https://github.com/rachelwarren/spark-csv/tree/permissive

@dmsuehir

dmsuehir commented Jul 6, 2016

Hello,

What's the status of this pull request? I'm looking for this feature so that values that can't be parsed become null. Will it go in as a new option or as a modification to PERMISSIVE? Can we expect to see the change land in master anytime soon?

Thanks,
Dina

@rachelwarren
Author

It's still hanging out. I have implemented the permissive strategy as well:
https://github.com/rachelwarren/spark-csv/tree/permissive


@joyeshmishra

Could this be reviewed/merged? (or the permissive PR)

@falaki
Member

falaki commented Jul 12, 2016

I somewhat disagree with this pull request. For example, if there is a string in a field, the user can either

  • Use schema inference to correctly mark that field as StringType
  • Change/fix the schema passed to data source

Would you please explain why neither of those two options works in your use case?

@rachelwarren
Author

This addresses a dirty data issue. The columns are not string columns; they are integer or date columns that may have a few malformed values. Working with real-world data, I run into this all the time. Part of the trouble is that different systems represent null values differently. For example, a null value in some systems we use is represented as "N/A", which is parsed as a string when it should really be interpreted as a missing value. With user-entered data there may also be errors, especially in date formats. The dataset may have many columns, and I don't want to drop an entire row just because the value 5.0 appears in one integer column.
The alternative to this pull request is to read in every row as strings, so that parsing doesn't fail on dirty data, and then convert the values safely to their real types. In my opinion, that solution defeats the purpose of using an off-the-shelf CSV parser at all (and is why we have stopped using this parser).
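For concreteness, that string-then-cast workaround looks roughly like this (a sketch; the column names are illustrative):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Read every column as StringType so parsing never fails on dirty values.
val raw = new CsvParser()
  .withUseHeader(true)
  .withSchema(StructType(Seq(
    StructField("year", StringType),
    StructField("cars", StringType))))
  .csvFile(sqlContext, "cars.csv")

// Casting a non-numeric string to IntegerType yields null rather than an error.
val cleaned = raw.withColumn("year", col("year").cast(IntegerType))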

@joyeshmishra @dmsuehir Does this sound relevant to your use cases?

@dmsuehir

Yes, that is exactly why we want this feature as well. We don't want to stop parsing when we come across a value that can't be parsed to the specified data type, and we don't want to change the schema to accommodate bad values. It's kind of similar to the DROPMALFORMED mode, except that we don't want the entire row dropped; we just want null/missing values where the bad values were.
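To illustrate the difference on a row whose first value cannot be parsed as an integer (reusing the cars example from above):

input row:        new, "Ford"
DROPMALFORMED:    (entire row dropped)
proposed option:  null, "Ford"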

@falaki
Member

falaki commented Jul 12, 2016

I see. Thanks for the feedback. Also, do you plan to use Spark 2.0? This data source has been inlined into Spark 2.0, so this package will only work with Spark 1.x.

@rachelwarren
Author

We are not going to update for a while, so this would be fine, but I can write a 2.0 version in the next few weeks as well.


@HyukjinKwon
Member

HyukjinKwon commented Jul 12, 2016

FYI, this still happens in the CSV data source in Spark 2.0 as well. It would be great if this issue were ported to JIRA anyway.

I will create a JIRA after trying to reproduce this.
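
A reproduction attempt might look like this (an untested sketch; the path is made up and assumed to point at a CSV file with header a and rows 1 and a):

import org.apache.spark.sql.types._

val schema = StructType(StructField("a", IntegerType, nullable = true) :: Nil)
spark.read
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .schema(schema)
  .csv("/tmp/test.csv") // made-up path
  .show()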

import java.util.Locale

import org.apache.spark.sql.types._
import org.json4s.ParserUtil.ParseException

This import is not used and not needed.

@falaki
Member

falaki commented Jul 12, 2016

I did a first pass on it. @HyukjinKwon Would you also take a look?

@HyukjinKwon
Member

@falaki Sure! Thank you for cc'ing me.

@HyukjinKwon
Member

HyukjinKwon commented Jul 13, 2016

@falaki Actually, I noticed this before and have been thinking about it for a while (but I did not open an issue because I thought it was too narrow a case).

Since the Spark 2.0 JSON data source also has parse modes, just like CSV, I thought it would be great if they were consistent.

Currently, this behaviour is included in JSON's PERMISSIVE mode. For example,

val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : \"a\"}"))
val schema = StructType(StructField("a", IntegerType, nullable = true) :: Nil)
spark.read.option("mode", "PERMISSIVE").schema(schema).json(rdd).show()

produces the results below:

+----+
|   a|
+----+
|   1|
|null|
+----+

Would this make sense? I don't mind if you feel strongly that adding this as an option is correct; I just wanted to let you know, just in case.

@falaki
Member

falaki commented Jul 13, 2016

I think for a data source it makes more sense not to change the behavior of an existing configuration. In that case, it is safer to add a new mode.
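For illustration only, a new mode could be selected the same way the existing ones are (the mode name below is invented for this sketch, not taken from the PR; schema and path are assumed to be in scope):

new CsvParser()
  .withUseHeader(true)
  .withSchema(schema)
  .withParseMode("PERMISSIVE_NULL") // hypothetical new mode: malformed values become null
  .csvFile(sqlContext, path)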

@HyukjinKwon
Member

I hear you. Thanks!

.withParseMode(ParseModes.DROP_MALFORMED_MODE)
.csvFile(sqlContext, ageFile)
.select("Name")
.collect()

@HyukjinKwon Jul 13, 2016

We might have to fix the indentation here,

val r = new CsvParser()
  .withSchema(strictSchema)
  .withUseHeader(true)
  .withParserLib(parserLib)
  .withParseMode(ParseModes.DROP_MALFORMED_MODE)
  .csvFile(sqlContext, ageFile)
  .select("Name")
  .collect()


@HyukjinKwon Jul 13, 2016

(maybe val parser instead of val r just to be consistent)


@HyukjinKwon Jul 13, 2016

Sorry for the noise. I took another look. It seems this val r is not related to this PR and is not used?


@rachelwarren
Author

Thanks for all the feedback. I can address this next week, and hopefully get it in. Then I'll work on the 2.0 version after that.

@rachelwarren
Author

Sorting out all the merge conflicts associated with this PR was a bit gnarly, so I have opened a new pull request and am going to close this one.
