Skip to content

Conversation

@HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Jan 23, 2017

What changes were proposed in this pull request?

This PR refactors CSV schema inference path to be consistent with JSON data source and moves some filtering codes having the similar/same logics into CSVUtils.

It makes the methods in classes have consistent arguments with JSON ones. (this PR renames .../json/InferSchema.scala.../json/JsonInferSchema.scala)

CSVInferSchema and JsonInferSchema

private[csv] object CSVInferSchema {
  ...
    
  def infer(
      csv: Dataset[String],
      caseSensitive: Boolean,
      options: CSVOptions): StructType = {
  ...
private[sql] object JsonInferSchema {
  ...
    
  def infer(
      json: RDD[String],
      columnNameOfCorruptRecord: String,
      configOptions: JSONOptions): StructType = {
  ...

These allow schema inference from Dataset[String] directly, meaning the similar functionalities that use JacksonParser/JsonInferSchema for JSON can be easily implemented by UnivocityParser/CSVInferSchema for CSV.

This completes refactoring CSV datasource and they are now pretty consistent.

How was this patch tested?

Existing tests should cover this and

./dev/change-scala-version.sh 2.10
./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package

Copy link
Member Author

@HyukjinKwon HyukjinKwon Jan 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class might be too much. I am willing to revert this back.

The only reason I made them into here is, there are similar logics for dealing with header, comment and empty strings in reading/schema inference path, and they look a bit messy and easily letting other engineers make a mistake (actually they are already different even though some cases look required to be the same but I did not fix it here to keep the original behaviour).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These removed block is all into CSVInferSchema.infer(...).

Copy link
Member Author

@HyukjinKwon HyukjinKwon Jan 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three removed blocks below are all into CSVUtils.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is into CSVUtils.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These mainly into CSVUtils.

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Jan 23, 2017

Let me double check tomorrow in case and then cc someone. Regarding to the increase of lines, it is mainly due to comments. I will remove if anyone cares.

@HyukjinKwon HyukjinKwon changed the title [SPARK-16101][SQL] Refactoring CSV schema inference path to be consistent with JSON [WIP][SPARK-16101][SQL] Refactoring CSV schema inference path to be consistent with JSON Jan 23, 2017
@SparkQA
Copy link

SparkQA commented Jan 23, 2017

Test build #71843 has finished for PR 16680 at commit 5562911.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 23, 2017

Test build #71844 has finished for PR 16680 at commit 2b03c9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 23, 2017

Test build #71845 has finished for PR 16680 at commit 44a4d93.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon force-pushed the SPARK-16101-schema-inference branch from 44a4d93 to 0f7b9b8 Compare January 24, 2017 10:04
@SparkQA
Copy link

SparkQA commented Jan 24, 2017

Test build #71928 has finished for PR 16680 at commit 0f7b9b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member Author

@HyukjinKwon HyukjinKwon Jan 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both behaviour of CSVUtils.filterCommentAndEmptys here and below should exactly the same up to my knowledge but I let them as are just simply to keep the behaviour for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please send a follow-up PR for this

@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-16101][SQL] Refactoring CSV schema inference path to be consistent with JSON [SPARK-16101][SQL] Refactoring CSV schema inference path to be consistent with JSON Jan 25, 2017
@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Jan 25, 2017

@cloud-fan, could you please take a look? I tried to not change the current behaviour and logics at my best but just re-locate them here. I also ran a build with scala 2.10.

@SparkQA
Copy link

SparkQA commented Jan 25, 2017

Test build #71966 has finished for PR 16680 at commit 15c4dec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 25, 2017

Test build #71967 has finished for PR 16680 at commit 37e0296.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

shall we rename InferSchema to JsonInferSchema?

@HyukjinKwon
Copy link
Member Author

Sure!

@SparkQA
Copy link

SparkQA commented Jan 26, 2017

Test build #72024 has finished for PR 16680 at commit ad09417.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems readAsLines is better?

Copy link
Member Author

@HyukjinKwon HyukjinKwon Feb 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, either way is fine with me. I just resembled it from JsonFileFormat.createBaseRdd.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: csvLines

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one too, I just resembled json.InferSchema.infer ...

  def infer(
      json: RDD[String],

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's more related to CsvInferSchema?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one resembled JacksonUtils.verifySchema.

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Feb 6, 2017

@cloud-fan, I just mainly resembled ones in JSON datasource and I am pretty sure you knew this when you added some comments. But let me just rebase this as is for now just in case maybe you are okay with them above as is and missed my reasons.

I know it is not always right to follow existing implementation but maybe we could rename them together later if the other one does not look appropriate. (I am fine with either way but just want to be sure that you know I had some reasons).

@HyukjinKwon
Copy link
Member Author

(I am fine with changing the name only for CSV ones for now as well. I would appreciate if you confirm please)

@HyukjinKwon HyukjinKwon force-pushed the SPARK-16101-schema-inference branch from ad09417 to 6f7fa9b Compare February 6, 2017 13:32
@SparkQA
Copy link

SparkQA commented Feb 6, 2017

Test build #72446 has finished for PR 16680 at commit 6f7fa9b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

Sorry I didn't compare the CSV part with JSON part closely, I'm ok to keep them consistent now. To confirm, this PR just moves codes around right?

@HyukjinKwon
Copy link
Member Author

Yes, I tried to only move and does not touch the code path at my best. Let me check this again tomorrow with Scala 2.10 build and ping you again.

@HyukjinKwon
Copy link
Member Author

@cloud-fan, I just built it with 2.10 and checked it does not touch original code path, line by line at my best.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 3d314d0 Feb 7, 2017
@HyukjinKwon
Copy link
Member Author

Thank you so much.

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
…tent with JSON

## What changes were proposed in this pull request?

This PR refactors CSV schema inference path to be consistent with JSON data source and moves some filtering codes having the similar/same logics into `CSVUtils`.

 It makes the methods in classes have consistent arguments with JSON ones. (this PR renames `.../json/InferSchema.scala` → `.../json/JsonInferSchema.scala`)

`CSVInferSchema` and `JsonInferSchema`

``` scala
private[csv] object CSVInferSchema {
  ...

  def infer(
      csv: Dataset[String],
      caseSensitive: Boolean,
      options: CSVOptions): StructType = {
  ...
```

``` scala
private[sql] object JsonInferSchema {
  ...

  def infer(
      json: RDD[String],
      columnNameOfCorruptRecord: String,
      configOptions: JSONOptions): StructType = {
  ...
```

These allow schema inference from `Dataset[String]` directly, meaning the similar functionalities that use `JacksonParser`/`JsonInferSchema` for JSON can be easily implemented by `UnivocityParser`/`CSVInferSchema` for CSV.

This completes refactoring CSV datasource and they are now pretty consistent.

## How was this patch tested?

Existing tests should cover this and

```
./dev/change-scala-version.sh 2.10
./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#16680 from HyukjinKwon/SPARK-16101-schema-inference.
@HyukjinKwon HyukjinKwon deleted the SPARK-16101-schema-inference branch January 2, 2018 03:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants