[SPARK-16101][SQL] Refactoring CSV schema inference path to be consistent with JSON #16680
Conversation
This class might be too much. I am willing to revert this.
The only reason I moved them here is that there is similar logic for dealing with headers, comments and empty strings in the reading and schema inference paths; it looks a bit messy and makes it easy for other engineers to make a mistake (actually, they are already different even though some cases look like they should be the same, but I did not fix that here in order to keep the original behaviour).
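For illustration, a minimal self-contained sketch of the kind of comment/empty-line filtering being consolidated here (names and defaults are hypothetical, not Spark's actual code):

```scala
// Hypothetical sketch of header/comment/empty-line filtering; the object and
// method names here are illustrative only, not the actual CSVUtils code.
object CSVFilterSketch {
  // Drop blank lines and lines starting with the comment character.
  def filterCommentAndEmpty(lines: Seq[String], comment: Char = '#'): Seq[String] =
    lines.filter(line => line.trim.nonEmpty && !line.startsWith(comment.toString))

  def main(args: Array[String]): Unit =
    println(filterCommentAndEmpty(Seq("a,b,c", "", "# note", "1,2,3")))
}
```

Keeping this in one place means the reading path and the schema inference path cannot silently drift apart in how they skip comments and blanks.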
These removed blocks all moved into CSVInferSchema.infer(...).
The three removed blocks below all moved into CSVUtils.
This moved into CSVUtils.
These mainly moved into CSVUtils.
Let me double-check tomorrow just in case, and then cc someone. Regarding the increase in line count, it is mainly due to comments; I will remove them if anyone cares.
Test build #71843 has finished for PR 16680 at commit
Test build #71844 has finished for PR 16680 at commit
Test build #71845 has finished for PR 16680 at commit
Force-pushed from 44a4d93 to 0f7b9b8.
Test build #71928 has finished for PR 16680 at commit
The behaviour of CSVUtils.filterCommentAndEmptys here and below should be exactly the same, to my knowledge, but I left them as they are simply to keep the current behaviour for now.
Please send a follow-up PR for this.
@cloud-fan, could you please take a look? I tried my best not to change the current behaviour and logic, only to relocate the code. I also ran a build with Scala 2.10.
Test build #71966 has finished for PR 16680 at commit
Test build #71967 has finished for PR 16680 at commit
shall we rename
Sure!
Test build #72024 has finished for PR 16680 at commit
seems readAsLines is better?
Sure, either way is fine with me. I just modelled it on JsonFileFormat.createBaseRdd.
nit: csvLines
This one too; I just modelled it on json.InferSchema.infer ...
def infer(
json: RDD[String],
I think it's more related to CsvInferSchema?
This one was modelled on JacksonUtils.verifySchema.
@cloud-fan, I mainly modelled these on the ones in the JSON datasource, and I am pretty sure you knew this when you added some comments. But let me just rebase this as is for now, in case you are okay with the names above and missed my reasons. I know it is not always right to follow the existing implementation, but maybe we could rename them together later if the other one does not look appropriate. (I am fine with either way; I just want to be sure you know I had reasons.)
(I am fine with changing the names only for the CSV ones for now as well. I would appreciate it if you could confirm.)
Force-pushed from ad09417 to 6f7fa9b.
Test build #72446 has finished for PR 16680 at commit
Sorry, I didn't compare the CSV part with the JSON part closely; I'm OK with keeping them consistent now. To confirm, this PR just moves code around, right?
Yes, I tried my best to only move code and not touch the code path. Let me check this again tomorrow with a Scala 2.10 build and ping you again.
@cloud-fan, I just built it with Scala 2.10 and checked, line by line to the best of my ability, that it does not touch the original code path.
thanks, merging to master!
Thank you so much. |
[SPARK-16101][SQL] Refactoring CSV schema inference path to be consistent with JSON
## What changes were proposed in this pull request?
This PR refactors the CSV schema inference path to be consistent with the JSON data source and moves some filtering code with similar/identical logic into `CSVUtils`.
It makes the methods in these classes take arguments consistent with the JSON ones. (This PR renames `.../json/InferSchema.scala` → `.../json/JsonInferSchema.scala`.)
`CSVInferSchema` and `JsonInferSchema`
``` scala
private[csv] object CSVInferSchema {
...
def infer(
csv: Dataset[String],
caseSensitive: Boolean,
options: CSVOptions): StructType = {
...
```
``` scala
private[sql] object JsonInferSchema {
...
def infer(
json: RDD[String],
columnNameOfCorruptRecord: String,
configOptions: JSONOptions): StructType = {
...
```
These allow schema inference from a `Dataset[String]` directly, meaning that similar functionality using `JacksonParser`/`JsonInferSchema` for JSON can easily be implemented with `UnivocityParser`/`CSVInferSchema` for CSV.
This completes the refactoring of the CSV datasource, and the two are now fairly consistent.
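To make the inference contract concrete, here is a minimal self-contained sketch of per-value type inference and type merging, in the spirit of the `infer` methods above (the type and object names here are hypothetical, not the actual `CSVInferSchema` code):

```scala
// Hypothetical sketch of column type inference: infer a type per token,
// then merge the results into the narrowest common type. Illustrative only.
object InferSketch {
  sealed trait FieldType
  case object IntType extends FieldType
  case object DoubleType extends FieldType
  case object StringType extends FieldType

  // Infer the narrowest type for a single CSV token.
  def inferOne(v: String): FieldType =
    if (v.matches("-?\\d+")) IntType
    else if (scala.util.Try(v.toDouble).isSuccess) DoubleType
    else StringType

  // Merge two inferred types into the narrowest type covering both.
  def merge(a: FieldType, b: FieldType): FieldType = (a, b) match {
    case (x, y) if x == y => x
    case (IntType, DoubleType) | (DoubleType, IntType) => DoubleType
    case _ => StringType
  }

  def main(args: Array[String]): Unit = {
    val column = Seq("1", "2.5", "3")
    println(column.map(inferOne).reduce(merge))  // DoubleType
  }
}
```

Because the merge is a fold over per-value results, the same logic works whether the values come from an `RDD[String]` or a `Dataset[String]`, which is what makes the two `infer` signatures above interchangeable in spirit.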
## How was this patch tested?
Existing tests should cover this and
```
./dev/change-scala-version.sh 2.10
./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes apache#16680 from HyukjinKwon/SPARK-16101-schema-inference.