Conversation

@MaxGekk
Member

@MaxGekk MaxGekk commented May 23, 2018

What changes were proposed in this pull request?

The uniVocity parser allows specifying only the required column names or indexes for parsing, for example:

// Here we select only the columns by their indexes.
// The parser just skips the values in the other columns.
CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.selectIndexes(4, 0, 1);
CsvParser parser = new CsvParser(parserSettings);

In this PR, I propose to extract the indexes from the required schema and pass them to the CSV parser. Benchmarks show the following speed-ups when parsing 1000 columns:

Select 100 columns out of 1000: 1.76x
Select 1 column out of 1000: 2x
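The index extraction described above can be sketched as a stand-alone helper; the class and method names here are hypothetical (this is not Spark's actual code), but it shows the idea of mapping required column names to their positions in the full schema before handing them to selectIndexes:

```java
import java.util.Arrays;
import java.util.List;

public class ColumnIndexes {
    // Map each required column name to its position in the full schema,
    // preserving the order in which the columns were requested.
    static int[] requiredIndexes(List<String> fullSchema, List<String> required) {
        int[] indexes = new int[required.size()];
        for (int i = 0; i < required.size(); i++) {
            int pos = fullSchema.indexOf(required.get(i));
            if (pos < 0) {
                throw new IllegalArgumentException("Unknown column: " + required.get(i));
            }
            indexes[i] = pos;
        }
        return indexes;
    }

    public static void main(String[] args) {
        List<String> schema = Arrays.asList("f0", "f1", "f2", "f3", "f4");
        int[] idx = requiredIndexes(schema, Arrays.asList("f4", "f0", "f1"));
        System.out.println(Arrays.toString(idx)); // [4, 0, 1]
    }
}
```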

Note: Compared to the current implementation, the changes can return different results for malformed rows in the DROPMALFORMED and FAILFAST modes if only a subset of all columns is requested. To get the previous behavior, set spark.sql.csv.parser.columnPruning.enabled to false.
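A usage sketch of switching the pruning off at runtime (assuming a SparkSession named spark):

```java
// Restores the previous malformed-row behavior in DROPMALFORMED/FAILFAST modes.
spark.conf().set("spark.sql.csv.parser.columnPruning.enabled", "false");
```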

How was this patch tested?

It was tested by a new test which selects 3 columns out of 15, by the existing tests, and by new benchmarks.

@MaxGekk
Member Author

MaxGekk commented May 23, 2018

The difference between this PR and #21296 is that columnPruning is passed to CSVOptions as a parameter. It should fix the flaky UnivocityParserSuite.

@MaxGekk
Member Author

MaxGekk commented May 23, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented May 23, 2018

Test build #91061 has finished for PR 21415 at commit 0aef16b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 24, 2018

Test build #91071 has finished for PR 21415 at commit 0aef16b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented May 24, 2018

Test build #91079 has finished for PR 21415 at commit 0aef16b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented May 24, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented May 24, 2018

Test build #91087 has finished for PR 21415 at commit 0aef16b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented May 24, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented May 24, 2018

Test build #91096 has finished for PR 21415 at commit 0aef16b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented May 24, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented May 24, 2018

Test build #91110 has finished for PR 21415 at commit 0aef16b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  defaultTimeZoneId: String,
- defaultColumnNameOfCorruptRecord: String = "") = {
+ defaultColumnNameOfCorruptRecord: String = "",
+ columnPruning: Boolean = false) = {
Member

Let's not set a default value for columnPruning. We might lose the pruning opportunity when this constructor is called.

Member Author

@MaxGekk MaxGekk May 24, 2018

The constructor with disabled columnPruning is called in the CSV writer and 30 times from test suites like UnivocityParserSuite and CSVInferSchemaSuite, where the pruning is not needed.

We might lose the pruning opportunity if we call this constructor.

OK, I will enable it by default.

Member

Always enabling it is not right either. Can we remove the default?

Member Author

removed

val idf = spark.read
.schema(schema)
.csv(path.getCanonicalPath)
.select('f15, 'f10, 'f5)
Member

Could you add an extreme test case? Try count(1) on CSV files; that means zero columns are required.

Member Author

Added an assert for count(). In CSVSuite, there are a few tests with count() over malformed CSV files.
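The count() case is the extreme end of pruning: with an empty required schema, the reader never needs to convert any field values and only has to count records. A toy illustration of that degenerate case (hypothetical helper, not Spark's code):

```java
import java.util.Arrays;
import java.util.List;

public class CsvCount {
    // With zero required columns, counting records needs no field conversion:
    // every non-header line is simply one record.
    static long countRecords(List<String> lines, boolean hasHeader) {
        long n = lines.size();
        return hasHeader ? Math.max(0L, n - 1) : n;
    }

    public static void main(String[] args) {
        List<String> csv = Arrays.asList("f1,f2,f3", "1,2,3", "4,5,6");
        System.out.println(countRecords(csv, true)); // 2
    }
}
```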

--------------------------------------------------------------------------------------------
                                         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------
Select 1000 columns                          76910 / 78065          0.0       76909.8       1.0X
Select 100 columns                           28625 / 32884          0.0       28625.1       2.7X
Select one column                            22498 / 22669          0.0       22497.8       3.4X
Member

count(1) too?

Member Author

sure, added count()

@SparkQA

SparkQA commented May 24, 2018

Test build #91111 has finished for PR 21415 at commit 1ac8fea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 25, 2018

Test build #91126 has finished for PR 21415 at commit 4115058.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM

Thanks! Merged to master.

@asfgit asfgit closed this in 64fad0b May 25, 2018
@MaxGekk MaxGekk deleted the csv-column-pruning2 branch August 17, 2019 13:33