Adding an option to the csv parser, withParseExceptionAsNull, which allows dirty data to be parsed as nulls rather than cause failures #298
Conversation
@falaki Actually, shouldn't such behaviour be included in PERMISSIVE mode?
@rachelwarren Would you maybe share some code and outputs after this change? AFAIK the JSON data source in Spark already produces nulls in this case. I am trying to add parse modes to JSON in Spark; see apache/spark#11756.
I'm not exactly sure what you mean. The goal is that parsing will not fail in the general case, even if the schema specifies, for example, a numeric column and there is a string value in that column.
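For concreteness, a minimal sketch of that scenario (the file name, column names, and data are hypothetical; the load-with-schema call is the package's documented usage):

```scala
import org.apache.spark.sql.types._

// people.csv (hypothetical):
//   Name,Age
//   alice,31
//   bob,not-a-number
val schema = StructType(
  StructField("Name", StringType, nullable = true) ::
  StructField("Age", IntegerType, nullable = true) :: Nil)

// With this schema, casting "not-a-number" to an Int throws during parsing
// and fails the read; the goal here is for that cell to become null instead.
// (sqlContext: an existing SQLContext)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(schema)
  .load("people.csv")
```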
Ah, I mean I think this behaviour might have to be included in PERMISSIVE mode. For example, the data below:

```
{"a": {"b": 1}}
{"a": []}
```

with the schema below:

```scala
val schema =
  StructType(
    StructField("a", StructType(
      StructField("b", StringType) :: Nil
    )) :: Nil)
```

produces the results below:

```
+----+
|   a|
+----+
| [1]|
|null|
+----+
```

So, I thought it might be better if this CSV data source is also consistent.
I see. I can look into making that fix.
Maybe we should wait for feedback. That was just my opinion.
I realised this is actually related to https://github.com/databricks/spark-csv/issues/286.
It does. I actually did try adding it to PERMISSIVE mode, so you can see what that looks like on this branch: https://github.com/rachelwarren/spark-csv/tree/permissive
Hello, what's the status on this pull request? I'm looking for this feature, to get a null value when a field can't be parsed. Will it be getting in as a new option or as a modification to PERMISSIVE? Can we expect to see the change land in master anytime soon? Thanks!
It's still hanging out. I have implemented the permissive strategy as well.
Could this be reviewed/merged? (or the permissive PR)
I somewhat disagree with this pull request. For example, if there is a string in a numeric field, the user can either declare that column as a string type and clean it afterwards, or use DROPMALFORMED mode to drop such rows.

Would you please explain why neither of those two options works in your use case?
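For reference, the read-as-string workaround might look like this (a sketch; the column name is hypothetical). Spark's `cast` returns null for values it cannot convert:

```scala
import org.apache.spark.sql.functions.col

// Read with no schema so every column comes back as a string, then cast
// the dirty column; values like "N/A" become null rather than erroring.
val raw = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("people.csv")
val cleaned = raw.withColumn("Age", col("Age").cast("int"))
```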
This addresses a dirty data issue. The columns are not string columns; they are integer or date columns that may have a few malformed values. Working with real-world data, I run into this all the time. Part of the trouble is that different systems represent null values differently. For example, a null value in some systems we use is represented as "N/A", which is parsed in as a string when it should really be interpreted as a missing value. With user-entered data there may be errors, especially in date formats. The dataset may have many columns, and I don't want to drop an entire row just because the value 5.0 appears in one integer column. @joyeshmishra @dmsuehir Does this sound relevant to your use cases?
Yes, that is exactly why we want this feature as well. We don't want to stop parsing when we come across a value that can't be parsed to the specified data type, and we don't want to change the schema to accommodate bad values. It's similar to DROPMALFORMED mode, except that we don't want the entire row dropped; we just want null/missing values where the bad values were.
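To make the distinction concrete, an illustrative comparison (data and results are hypothetical, assuming the behaviour described above):

```
Input (Age: IntegerType)   DROPMALFORMED          Proposed: parse errors as null
alice,31                   [alice, 31]            [alice, 31]
bob,N/A                    (entire row dropped)   [bob, null]
carol,42                   [carol, 42]            [carol, 42]
```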
I see. Thanks for the feedback. Also, do you plan to use Spark 2.0? This data source has been inlined in Spark 2.0, so the package will only work with Spark 1.x.
We are not going to update for a while, so this would be fine.
FYI, this still happens in the CSV source in Spark 2.0 as well. It might be great if this issue is ported to JIRA anyway. I will create a JIRA after trying to reproduce this.
On the imports:

```scala
import java.util.Locale
import org.apache.spark.sql.types._
import org.json4s.ParserUtil.ParseException
```

This import (`org.json4s.ParserUtil.ParseException`) is not used and not needed.
I did a first pass on it. @HyukjinKwon Would you also take a look?
@falaki Sure! Thank you for cc'ing me.
@falaki, Actually, I noticed this before and have been thinking about it (but I did not open an issue because I thought it was too narrow a case). Since Spark 2.0, the JSON data source also has parse modes just like CSV, so I thought it might be great if they were consistent. Currently, this behaviour is included in JSON's PERMISSIVE mode. For example,

```scala
val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : \"a\"}"))
val schema = StructType(StructField("a", IntegerType, nullable = true) :: Nil)
spark.read.option("mode", "PERMISSIVE").schema(schema).json(rdd).show()
```

produces a null for the row that cannot be parsed as an integer. Would this make sense? I don't mind if you feel strongly that adding this as an option is correct; I just wanted to let you know, just in case.
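Concretely, the `show()` output for that snippet would look like the following (assuming the PERMISSIVE semantics just described, where the unparseable value surfaces as null):

```
+----+
|   a|
+----+
|   1|
|null|
+----+
```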
I think for a data source it makes more sense not to change the behavior of an existing configuration. In that case, it is safer to add a new mode.
I hear you. Thanks!
On the test code:

```scala
.withParseMode(ParseModes.DROP_MALFORMED_MODE)
.csvFile(sqlContext, ageFile)
.select("Name")
.collect()
```

We might have to fix the indentation here:

```scala
val r = new CsvParser()
  .withSchema(strictSchema)
  .withUseHeader(true)
  .withParserLib(parserLib)
  .withParseMode(ParseModes.DROP_MALFORMED_MODE)
  .csvFile(sqlContext, ageFile)
  .select("Name")
  .collect()
```

(Maybe `val parser` instead of `val r`, just to be consistent.)

Sorry for the noise. I took another look; it seems this `val r` is not related to this PR and not used?
Thanks for all the feedback. I can address this next week and hopefully get it in. Then I'll work on the 2.0 version after that.
Force-pushed from 810a713 to 30fa371.
Sorting out all the merge conflicts associated with this PR was a bit gnarly. I have made a new pull request and am going to close this one.
This option allows dirty data to be parsed as nulls rather than causing failures. For example, if there is a string in a numeric column, rather than failing, it lets that value be null. My experience is that data is often dirty, and sometimes you want an option that allows malformed numbers or dates without causing failures. I have noticed that a good number of the issues filed against this package have to do with similar problems. My implementation is written so that it doesn't affect any of the existing functionality. Instead, it provides an extra option, "withParseExceptionAsNull".
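A minimal sketch of how the new option might be used, assuming a builder-style setter named after the option (the surrounding CsvParser calls match those in the tests above; the file and schema are hypothetical):

```scala
import org.apache.spark.sql.types._
import com.databricks.spark.csv.CsvParser

val schema = StructType(
  StructField("Name", StringType, nullable = true) ::
  StructField("Age", IntegerType, nullable = true) :: Nil)

// With the option enabled, a malformed Age value parses as null instead of
// throwing and failing the whole read.
val df = new CsvParser()
  .withSchema(schema)
  .withUseHeader(true)
  .withParseExceptionAsNull(true) // hypothetical setter; option name from this PR
  .csvFile(sqlContext, "people.csv")
```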