[SPARK-24244][SPARK-24368][SQL] Passing only required columns to the CSV parser #21415
Conversation
# Conflicts:
#   sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVBenchmarks.scala
#   sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
The difference between this PR and #21296 is that the

jenkins, retest this, please

Test build #91061 has finished for PR 21415 at commit

Test build #91071 has finished for PR 21415 at commit

retest this please

Test build #91079 has finished for PR 21415 at commit

jenkins, retest this, please

Test build #91087 has finished for PR 21415 at commit

jenkins, retest this, please

Test build #91096 has finished for PR 21415 at commit

jenkins, retest this, please

Test build #91110 has finished for PR 21415 at commit
      defaultTimeZoneId: String,
-     defaultColumnNameOfCorruptRecord: String = "") = {
+     defaultColumnNameOfCorruptRecord: String = "",
+     columnPruning: Boolean = false) = {
Let's not set a default value for columnPruning. We might lose the pruning opportunity if we call this constructor.
The constructor with columnPruning disabled is called in the CSV writer and 30 times from test suites like UnivocityParserSuite and CSVInferSchemaSuite, where pruning is not needed.
We might lose the pruning opportunity if we call this constructor.
OK, I will enable it by default.
Always enabling it is also not right. Can we remove the default?
Removed.
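To illustrate what removing the default means for callers, here is a hypothetical stand-in (CsvOptionsLike and the argument values below are made up for illustration, not Spark's actual CSVOptions):

```scala
// Illustrative stand-in for the discussed change: columnPruning has no default,
// so every call site must choose explicitly.
class CsvOptionsLike(
    parameters: Map[String, String],
    defaultTimeZoneId: String,
    defaultColumnNameOfCorruptRecord: String,
    columnPruning: Boolean)

// A reading path would enable pruning, while the writer and schema-inference
// paths (per the comment above) would pass false.
val forRead  = new CsvOptionsLike(Map.empty, "UTC", "_corrupt_record", columnPruning = true)
val forWrite = new CsvOptionsLike(Map.empty, "UTC", "_corrupt_record", columnPruning = false)
```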
val idf = spark.read
  .schema(schema)
  .csv(path.getCanonicalPath)
  .select('f15, 'f10, 'f5)
Could you add an extreme test case? Try count(1) on CSV files; that means zero columns are required.
Added an assert for count(). In CSVSuite, there are already a few tests with count() over malformed CSV files.
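Roughly, the extreme case looks like the sketch below (the expected count is an illustrative value, not the exact assertion added to CSVSuite):

```scala
// count() requires zero columns from the CSV files, so the parser can skip all of them.
val cnt = spark.read
  .schema(schema)
  .csv(path.getCanonicalPath)
  .count()
assert(cnt == 2)  // illustrative expected number of rows
```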
Benchmark                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------
Select 1000 columns                 76910 / 78065          0.0       76909.8       1.0X
Select 100 columns                  28625 / 32884          0.0       28625.1       2.7X
Select one column                   22498 / 22669          0.0       22497.8       3.4X
count(1) too?
Sure, added count().
Test build #91111 has finished for PR 21415 at commit

Test build #91126 has finished for PR 21415 at commit
LGTM. Thanks! Merged to master.
What changes were proposed in this pull request?
The uniVocity parser allows specifying only the required column names or indexes for parsing, for example:
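A minimal sketch of that uniVocity API (the column indexes, field names, and input line below are illustrative, not taken from this PR):

```scala
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

val settings = new CsvParserSettings()
// Ask uniVocity to materialize only the listed column positions;
// all other columns in each input line are skipped by the parser itself.
settings.selectIndexes(Integer.valueOf(0), Integer.valueOf(4), Integer.valueOf(9))
// Selecting by field name is also possible:
// settings.selectFields("f0", "f4", "f9")

val parser = new CsvParser(settings)
// The returned tokens contain only the selected columns.
val tokens = parser.parseLine("v0,v1,v2,v3,v4,v5,v6,v7,v8,v9")
```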
In this PR, I propose to extract the indexes of the required columns from the required schema and pass them to the CSV parser. Benchmarks on files with 1000 columns show the following improvements: selecting 100 columns is about 2.7x faster and selecting a single column is about 3.4x faster than selecting all 1000 columns (see the benchmark table in the review thread above).
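A rough sketch of the idea, using a hypothetical helper (requiredIndexes is illustrative, not the exact code in this PR): map each field of the required schema to its position in the full CSV schema and hand those positions to the parser settings.

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical helper: positions of the required columns within the full CSV schema.
def requiredIndexes(dataSchema: StructType, requiredSchema: StructType): Array[Integer] =
  requiredSchema.map(field => Integer.valueOf(dataSchema.fieldIndex(field.name))).toArray

// The indexes can then be passed to the uniVocity settings, e.g.:
// settings.selectIndexes(requiredIndexes(dataSchema, requiredSchema): _*)
```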
Note: Compared to the current implementation, the changes can return a different result for malformed rows in the DROPMALFORMED and FAILFAST modes if only a subset of all columns is requested. To keep the previous behavior, set spark.sql.csv.parser.columnPruning.enabled to false.
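For example, a minimal way to restore the previous behavior (assuming an active SparkSession named spark):

```scala
// Disable CSV column pruning so malformed-row handling matches the old behavior.
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false)
```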
How was this patch tested?

It was tested by a new test that selects 3 columns out of 15, by existing tests, and by new benchmarks.