[SPARK-16101][SQL] Refactoring CSV schema inference path to be consistent with JSON #16680
Conversation
This class might be too much. I am willing to revert this.
The only reason I moved them here is that there is similar logic for dealing with headers, comments and empty strings in the reading and schema inference paths; it looks a bit messy and makes it easy for other engineers to make a mistake (actually, they are already different even though some cases look like they should be the same, but I did not fix that here in order to keep the original behaviour).
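For illustration, a minimal self-contained sketch of the kind of comment/empty-line filtering being consolidated here (names and defaults are hypothetical, not Spark's actual code):

```scala
// Hypothetical sketch of header/comment/empty-line filtering; the object and
// method names here are illustrative only, not the actual CSVUtils code.
object CSVFilterSketch {
  // Drop blank lines and lines starting with the comment character.
  def filterCommentAndEmpty(lines: Seq[String], comment: Char = '#'): Seq[String] =
    lines.filter(line => line.trim.nonEmpty && !line.startsWith(comment.toString))

  def main(args: Array[String]): Unit =
    println(filterCommentAndEmpty(Seq("a,b,c", "", "# note", "1,2,3")))
}
```

Keeping this in one place means the reading path and the schema inference path cannot silently drift apart in how they skip comments and blanks.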
These removed blocks all moved into CSVInferSchema.infer(...).
The three removed blocks below all moved into CSVUtils.
This moved into CSVUtils.
These mainly moved into CSVUtils.
Let me double-check tomorrow just in case, and then cc someone. Regarding the increase in line count, it is mainly due to comments; I will remove them if anyone cares.
Test build #71843 has finished for PR 16680 at commit
Test build #71844 has finished for PR 16680 at commit
Test build #71845 has finished for PR 16680 at commit
Force-pushed from 44a4d93 to 0f7b9b8.
Test build #71928 has finished for PR 16680 at commit
The behaviour of CSVUtils.filterCommentAndEmptys here and below should be exactly the same, to my knowledge, but I left them as they are simply to keep the current behaviour for now.
Please send a follow-up PR for this.
@cloud-fan, could you please take a look? I tried my best not to change the current behaviour and logic, only to relocate the code. I also ran a build with Scala 2.10.
Test build #71966 has finished for PR 16680 at commit
Test build #71967 has finished for PR 16680 at commit
shall we rename
Sure!
Test build #72024 has finished for PR 16680 at commit
seems readAsLines is better?
Sure, either way is fine with me. I just modelled it on JsonFileFormat.createBaseRdd.
nit: csvLines
This one too; I just modelled it on json.InferSchema.infer ...
def infer(
json: RDD[String],
I think it's more related to CsvInferSchema?
This one was modelled on JacksonUtils.verifySchema.
@cloud-fan, I mainly modelled these on the ones in the JSON datasource, and I am pretty sure you knew this when you added some comments. But let me just rebase this as is for now, in case you are okay with the names above and missed my reasons. I know it is not always right to follow the existing implementation, but maybe we could rename them together later if the other one does not look appropriate. (I am fine with either way; I just want to be sure you know I had reasons.)
(I am fine with changing the names only for the CSV ones for now as well. I would appreciate it if you could confirm.)
Force-pushed from ad09417 to 6f7fa9b.
Test build #72446 has finished for PR 16680 at commit
Sorry, I didn't compare the CSV part with the JSON part closely; I'm OK with keeping them consistent now. To confirm, this PR just moves code around, right?
Yes, I tried my best to only move code and not touch the code path. Let me check this again tomorrow with a Scala 2.10 build and ping you again.
@cloud-fan, I just built it with Scala 2.10 and checked, line by line to the best of my ability, that it does not touch the original code path.
thanks, merging to master!
Thank you so much. |
[SPARK-16101][SQL] Refactoring CSV schema inference path to be consistent with JSON
## What changes were proposed in this pull request?
This PR refactors the CSV schema inference path to be consistent with the JSON data source and moves some filtering code with similar/identical logic into `CSVUtils`.
It makes the methods in these classes take arguments consistent with the JSON ones. (This PR renames `.../json/InferSchema.scala` → `.../json/JsonInferSchema.scala`.)
`CSVInferSchema` and `JsonInferSchema`
``` scala
private[csv] object CSVInferSchema {
...
def infer(
csv: Dataset[String],
caseSensitive: Boolean,
options: CSVOptions): StructType = {
...
```
``` scala
private[sql] object JsonInferSchema {
...
def infer(
json: RDD[String],
columnNameOfCorruptRecord: String,
configOptions: JSONOptions): StructType = {
...
```
These allow schema inference from a `Dataset[String]` directly, meaning that similar functionality using `JacksonParser`/`JsonInferSchema` for JSON can easily be implemented with `UnivocityParser`/`CSVInferSchema` for CSV.
This completes the refactoring of the CSV datasource, and the two are now fairly consistent.
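To make the inference contract concrete, here is a minimal self-contained sketch of per-value type inference and type merging, in the spirit of the `infer` methods above (the type and object names here are hypothetical, not the actual `CSVInferSchema` code):

```scala
// Hypothetical sketch of column type inference: infer a type per token,
// then merge the results into the narrowest common type. Illustrative only.
object InferSketch {
  sealed trait FieldType
  case object IntType extends FieldType
  case object DoubleType extends FieldType
  case object StringType extends FieldType

  // Infer the narrowest type for a single CSV token.
  def inferOne(v: String): FieldType =
    if (v.matches("-?\\d+")) IntType
    else if (scala.util.Try(v.toDouble).isSuccess) DoubleType
    else StringType

  // Merge two inferred types into the narrowest type covering both.
  def merge(a: FieldType, b: FieldType): FieldType = (a, b) match {
    case (x, y) if x == y => x
    case (IntType, DoubleType) | (DoubleType, IntType) => DoubleType
    case _ => StringType
  }

  def main(args: Array[String]): Unit = {
    val column = Seq("1", "2.5", "3")
    println(column.map(inferOne).reduce(merge))  // DoubleType
  }
}
```

Because the merge is a fold over per-value results, the same logic works whether the values come from an `RDD[String]` or a `Dataset[String]`, which is what makes the two `infer` signatures above interchangeable in spirit.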
## How was this patch tested?
Existing tests should cover this and
```
./dev/change-scala-version.sh 2.10
./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes apache#16680 from HyukjinKwon/SPARK-16101-schema-inference.