
Conversation

@rachelwarren

Allows dirty data to be parsed as nulls rather than causing failures. For example, if there is a String in a numeric column, rather than failing it lets that value be null. My experience is that data is often dirty, and sometimes you want an option that allows malformed numbers or dates without causing failures. I have noticed that a good deal of the issues filed against this package have to do with similar problems. My implementation is written so that it doesn't affect any of the existing functionality. Instead, it provides an extra option, "withParseExceptionAsNull".
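For illustration, usage might look like this (a sketch: the schema and file path are made up, and the assumption that the new builder method takes a boolean flag is mine):

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("year", IntegerType, nullable = true),
  StructField("cars", StringType, nullable = true)))

val df = new CsvParser()
  .withUseHeader(true)
  .withSchema(schema)
  .withParseExceptionAsNull(true) // assumed flag; malformed values parse as null
  .csvFile(sqlContext, "cars.csv")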

@codecov-io

Current coverage is 86.30%

Merging #298 into master will decrease coverage by 0.36% as of f2349c3

@@            master   #298   diff @@
=====================================
  Files           12     12       
  Stmts          540    555    +15
  Branches       160    166     +6
  Methods          0      0       
=====================================
+ Hit            468    479    +11
  Partial          0      0       
- Missed          72     76     +4

Review entire Coverage Diff as of f2349c3

Powered by Codecov. Updated on successful CI builds.

@HyukjinKwon
Member

@falaki Actually, shouldn't such behaviour be included in PERMISSIVE mode?

@HyukjinKwon
Member

@rachelwarren Would you maybe share some code and output after this change?

AFAIK the JSON data source in Spark produces null in this case. I think each mode does not necessarily have to work exactly the same across data sources, but it would be great if they were consistent.

I am trying to add parse modes to JSON in Spark: apache/spark#11756.

@rachelwarren
Author

I'm not exactly sure what you mean. The goal is that running something like this:

new CsvParser()
  .withUseHeader(header)
  .withDelimiter(delim)
  .withQuoteChar(quoteString)
  .withSchema(schema)
  .csvFile(sqlContext, path)

will not fail in the general case, even if the schema specifies, for example, a numeric column and there is a string value in that column.
For example, if we have

year, cars
1995, "Toyota"
2000, "Honda"
new, "Ford"

then this will be read in as

1995, "Toyota"
2000, "Honda"
null, "Ford"

if the year column is an integer.
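
To make the example concrete, the schema here might be defined like this (a sketch):

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("year", IntegerType, nullable = true),
  StructField("cars", StringType, nullable = true)))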

@HyukjinKwon
Member

Ah, I mean that this behaviour might have to be included in PERMISSIVE mode rather than via an option, because the JSON data source also produces results this way in PERMISSIVE mode.

For example, the data below:

{"a": {"b": 1}}
{"a": []}

with the schema below:

val schema =
  StructType(
    StructField("a", StructType(
      StructField("b", StringType) :: Nil
    )) :: Nil)

produces the results below:

+----+
|   a|
+----+
| [1]|
|null|
+----+

So I thought it might be better if this CSV data source were consistent with that.

@rachelwarren
Author

I see. I can look into making that fix.


@HyukjinKwon
Member

Maybe we should wait for feedback. That was just my opinion.

@HyukjinKwon
Member

I realised this is actually related to https://github.com/databricks/spark-csv/issues/286.

@rachelwarren
Author

It does. I actually did try adding it to PERMISSIVE mode, so you can see what that looks like on this branch: https://github.com/rachelwarren/spark-csv/tree/permissive

@dmsuehir

dmsuehir commented Jul 6, 2016

Hello,

What's the status of this pull request? I'm looking for this feature so that values that can't be parsed become null. Will it go in as a new option or as a modification to PERMISSIVE? Can we expect to see the change land in master anytime soon?

Thanks,
Dina

@rachelwarren
Author

It's still hanging out. I have implemented the permissive strategy as well:
https://github.com/rachelwarren/spark-csv/tree/permissive


@joyeshmishra

Could this be reviewed/merged? (or the permissive PR)

@falaki
Member

falaki commented Jul 12, 2016

I somewhat disagree with this pull request. For example, if there is a string in a field, the user can either

  • Use schema inference to correctly mark that field as StringType
  • Change/fix the schema passed to data source

Would you please explain why neither of those two options works in your use case?

@rachelwarren
Author

This addresses a dirty data issue. The columns are not string columns; they are integer or date columns that may have a few malformed values. Working with real-world data, I run into this all the time. Part of the trouble is that different systems represent null values differently. For example, a null value in some systems we use is represented as "N/A", which is parsed as a string when it should really be interpreted as a missing value. With user-entered data there may also be errors, especially in date formats. The dataset may have many columns, and I don't want to drop an entire row just because the value 5.0 appears in one integer column.
The alternative to this pull request is to read in every row as strings, so that parsing doesn't fail on dirty data, and then convert the values safely to their real types. In my opinion, that solution defeats the purpose of using an off-the-shelf CSV parser at all (and is why we have stopped using this parser).
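For concreteness, that string-then-cast workaround looks roughly like this (a sketch; the column names are illustrative):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Read every column as StringType so parsing never fails on dirty values.
val raw = new CsvParser()
  .withUseHeader(true)
  .withSchema(StructType(Seq(
    StructField("year", StringType),
    StructField("cars", StringType))))
  .csvFile(sqlContext, "cars.csv")

// Casting a non-numeric string to IntegerType yields null rather than an error.
val cleaned = raw.withColumn("year", col("year").cast(IntegerType))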

@joyeshmishra @dmsuehir Does this sound relevant to your use cases?

@dmsuehir

Yes, that is exactly why we want this feature as well. We don't want to stop parsing when we come across a value that can't be parsed to the specified data type, and we don't want to change the schema to accommodate bad values. It's kind of similar to the DROPMALFORMED mode, except that we don't want the entire row dropped; we just want null/missing values where the bad values were.
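To illustrate the difference on a row whose first value cannot be parsed as an integer (reusing the cars example from above):

input row:        new, "Ford"
DROPMALFORMED:    (entire row dropped)
proposed option:  null, "Ford"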

@falaki
Member

falaki commented Jul 12, 2016

I see. Thanks for the feedback. Also, do you plan to use Spark 2.0? This data source has been inlined into Spark 2.0, so this package will only work with Spark 1.x.

@rachelwarren
Author

We are not going to update for a while, so this would be fine, but I can write a 2.0 version in the next few weeks as well.


@HyukjinKwon
Member

HyukjinKwon commented Jul 12, 2016

FYI, this still happens in the CSV data source in Spark 2.0 as well. It would be great if this issue were ported to JIRA anyway.

I will create a JIRA after trying to reproduce this.
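
A reproduction attempt might look like this (an untested sketch; the path is made up and assumed to point at a CSV file with header a and rows 1 and a):

import org.apache.spark.sql.types._

val schema = StructType(StructField("a", IntegerType, nullable = true) :: Nil)
spark.read
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .schema(schema)
  .csv("/tmp/test.csv") // made-up path
  .show()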

import java.util.Locale

import org.apache.spark.sql.types._
import org.json4s.ParserUtil.ParseException

This import is not used and not needed.

@falaki
Member

falaki commented Jul 12, 2016

I did a first pass on it. @HyukjinKwon Would you also take a look?

@HyukjinKwon
Member

@falaki Sure! Thank you for cc'ing me.

@HyukjinKwon
Member

HyukjinKwon commented Jul 13, 2016

@falaki Actually, I noticed this before and have been thinking about it for a while (but I did not open an issue because I thought it was too narrow a case).

Since the Spark 2.0 JSON data source also has parse modes, just like CSV, I thought it would be great if they were consistent.

Currently, this behaviour is included in JSON's PERMISSIVE mode. For example,

val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : \"a\"}"))
val schema = StructType(StructField("a", IntegerType, nullable = true) :: Nil)
spark.read.option("mode", "PERMISSIVE").schema(schema).json(rdd).show()

produces the results below:

+----+
|   a|
+----+
|   1|
|null|
+----+

Would this make sense? I don't mind if you feel strongly that adding this as an option is correct; I just wanted to let you know, just in case.

@falaki
Member

falaki commented Jul 13, 2016

I think for a data source it makes more sense not to change the behavior of an existing configuration. In that case, it is safer to add a new mode.
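For illustration only, a new mode could be selected the same way the existing ones are (the mode name below is invented for this sketch, not taken from the PR; schema and path are assumed to be in scope):

new CsvParser()
  .withUseHeader(true)
  .withSchema(schema)
  .withParseMode("PERMISSIVE_NULL") // hypothetical new mode: malformed values become null
  .csvFile(sqlContext, path)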

@HyukjinKwon
Member

I hear you. Thanks!

.withParseMode(ParseModes.DROP_MALFORMED_MODE)
.csvFile(sqlContext, ageFile)
.select("Name")
.collect()

@HyukjinKwon Jul 13, 2016

We might have to fix the indentation here,

val r = new CsvParser()
  .withSchema(strictSchema)
  .withUseHeader(true)
  .withParserLib(parserLib)
  .withParseMode(ParseModes.DROP_MALFORMED_MODE)
  .csvFile(sqlContext, ageFile)
  .select("Name")
  .collect()


@HyukjinKwon Jul 13, 2016

(maybe val parser instead of val r just to be consistent)


@HyukjinKwon Jul 13, 2016

Sorry for the noise. I took another look. It seems this val r is not related to this PR and is not used?


@rachelwarren
Author

Thanks for all the feedback. I can address this next week, and hopefully get it in. Then I'll work on the 2.0 version after that.

@rachelwarren
Author

Sorting out all the merge conflicts associated with this PR was a bit gnarly, so I have opened a new pull request and am going to close this one.
