[SPARK-14224] [SPARK-14223] [SPARK-14310] [SQL] fix RowEncoder and parquet reader for wide table #12047

davies · 2016-03-29T21:48:34Z

What changes were proposed in this pull request?

Fix the RowEncoder for wide tables (many columns) by splitting the generated code into multiple functions.
Separate DataSourceScan into RowDataSourceScan and BatchedDataSourceScan.
Disable returning columnar batches in the Parquet reader if there are many columns.
Add an internal config for the maximum number of fields (including nested ones) supported by whole-stage codegen.
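To illustrate the first item: the JVM limits a single method to 64 KB of bytecode, so a generated method with one statement per column can stop compiling once a table is wide enough. The sketch below is not the code in this patch. `SplitCodegenSketch`, `splitIntoFunctions`, and the `apply_N` helper names are hypothetical, and it only shows the general idea of chunking per-field statements into helper methods that one entry point calls in order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these names do not come from Spark.
final class SplitCodegenSketch {

    /**
     * Groups generated statements into helper methods holding at most
     * maxPerMethod statements each, then emits an entry point that calls
     * the helpers in order. Returns the generated Java source as text.
     */
    static String splitIntoFunctions(List<String> statements, int maxPerMethod) {
        StringBuilder helpers = new StringBuilder();
        StringBuilder calls = new StringBuilder();
        int methodCount = 0;
        for (int start = 0; start < statements.size(); start += maxPerMethod) {
            int end = Math.min(start + maxPerMethod, statements.size());
            String name = "apply_" + methodCount++;
            helpers.append("private void ").append(name).append("(Object[] row) {\n");
            for (String statement : statements.subList(start, end)) {
                helpers.append("  ").append(statement).append("\n");
            }
            helpers.append("}\n");
            calls.append("  ").append(name).append("(row);\n");
        }
        return helpers + "public void apply(Object[] row) {\n" + calls + "}\n";
    }

    public static void main(String[] args) {
        // One assumed statement per column of a five-column row.
        List<String> statements = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            statements.add("row[" + i + "] = null;");
        }
        System.out.println(splitIntoFunctions(statements, 2));
    }
}
```

With 1000 statements and a chunk size of 100, this produces ten small helpers instead of one very large method, so each method stays well under the limit.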

Closes #12098

How was this patch tested?

Added a test for a table with 1000 columns.

davies · 2016-03-29T21:48:49Z

cc @nongli @rxin @marmbrus

marmbrus · 2016-03-29T21:52:23Z

...n/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java

   */
-  public void enableReturningBatches() {
-    returnColumnarBatch = true;
+  public boolean tryEnableReturningBatches(int maxColumns) {


Why are we pushing a config value into the reader so that it can switch? Can't we just put this logic in the query planner, where the decision to call enableReturningBatches is made? I think it's clearer if there is a single place where we decide whether to use column batches.

Yes, we should do this in the planner by checking the schema. That requires some cleanup in the Parquet reader, which currently still uses a runtime exception to do the switch. cc @nongli @sameeragarwal

Should we wait for that?
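A rough sketch of what a planner-side check could look like, assuming the decision depends only on the number of fields in the schema compared with a configured maximum. `BatchDecisionSketch`, `Field`, `countFields`, and `shouldReturnBatches` are hypothetical names, and counting only leaf fields of nested structs is an assumption made for illustration, not necessarily how the patch counts them.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; these types do not come from Spark.
final class BatchDecisionSketch {

    /** A schema node: a leaf when it has no children, otherwise a struct. */
    static final class Field {
        final String name;
        final List<Field> children;

        Field(String name, List<Field> children) {
            this.name = name;
            this.children = children;
        }

        static Field leaf(String name) {
            return new Field(name, Collections.emptyList());
        }

        static Field struct(String name, Field... children) {
            return new Field(name, Arrays.asList(children));
        }
    }

    /** Counts leaf fields, descending into nested structs (assumed rule). */
    static int countFields(List<Field> fields) {
        int count = 0;
        for (Field field : fields) {
            count += field.children.isEmpty() ? 1 : countFields(field.children);
        }
        return count;
    }

    /** The single place that decides whether the scan returns columnar batches. */
    static boolean shouldReturnBatches(List<Field> schema, int maxFields) {
        return countFields(schema) <= maxFields;
    }

    public static void main(String[] args) {
        List<Field> schema = Arrays.asList(
            Field.leaf("id"),
            Field.struct("address", Field.leaf("city"), Field.leaf("zip")));
        System.out.println(countFields(schema));
        System.out.println(shouldReturnBatches(schema, 2));
    }
}
```

Keeping the check in one function means the reader never needs to see the config value; it is simply told whether to return batches.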

SparkQA · 2016-03-29T23:08:56Z

Test build #54463 has finished for PR 12047 at commit 872ecf5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-01T01:18:00Z

Test build #2720 has finished for PR 12047 at commit 872ecf5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-01T23:08:54Z

Test build #54737 has finished for PR 12047 at commit 5f8e009.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-02T01:36:21Z

Test build #54744 has finished for PR 12047 at commit e8fb619.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-02T01:41:12Z

Test build #54745 has finished for PR 12047 at commit 6eb13f3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-04-02T07:40:21Z

...n/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java

    super.initialize(inputSplit, taskAttemptContext);
    initializeInternal();
+    Configuration conf = ContextUtil.getConfiguration(taskAttemptContext);
+    returnColumnarBatch = conf.getBoolean("returning.batch", false);


where is this returning.batch set?

SparkQA · 2016-04-04T21:53:23Z

Test build #54883 has finished for PR 12047 at commit ec46638.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2016-04-04T22:01:49Z

Test build #54882 has finished for PR 12047 at commit caae5c9.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2016-04-04T22:36:41Z

Test build #54887 has finished for PR 12047 at commit d8f6e4d.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2016-04-04T22:52:08Z

Test build #54890 has finished for PR 12047 at commit f2baae5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-04-04T23:57:31Z

@marmbrus Did another round of refactoring. Does this match what you had in mind?

marmbrus · 2016-04-05T00:00:46Z

Yeah, this looks much cleaner.

SparkQA · 2016-04-06T02:01:48Z

Test build #55065 has finished for PR 12047 at commit 79c2ad5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yhuai · 2016-04-06T16:36:25Z

@marmbrus Is this PR good to go?

marmbrus · 2016-04-06T17:18:22Z

...core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala

+/**
+ * The ParquetInputFormat that create VectorizedParquetRecordReader.
+ */
+final class VectorizedParquetInputFormat extends ParquetInputFormat[InternalRow] {


@liancheng @cloud-fan let's make sure we delete this when you remove buildInternalScan

marmbrus · 2016-04-06T17:26:41Z

This is a huge improvement. A few minor comments, otherwise LGTM.

davies · 2016-04-06T18:03:54Z

@marmbrus Since we will remove buildInternalScan soon, I removed the returning-batch support for that case to simplify this patch.

SparkQA · 2016-04-06T19:33:20Z

Test build #55130 has finished for PR 12047 at commit 2dc2b75.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-06T19:55:36Z

Test build #55129 has finished for PR 12047 at commit 4b44e22.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-06T21:36:25Z

Test build #2761 has finished for PR 12047 at commit 2dc2b75.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

nongli · 2016-04-06T21:59:15Z

...n/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java

      throws IOException, InterruptedException, UnsupportedOperationException {
    super.initialize(inputSplit, taskAttemptContext);
    initializeInternal();
+    Configuration conf = ContextUtil.getConfiguration(taskAttemptContext);


what does this do?

Not needed, will remove

nongli · 2016-04-06T22:02:31Z

LGTM

davies · 2016-04-06T22:33:21Z

Since the last commit is tiny, merging this into master to unblock others.

fix RowEncoder and parquet reader for wide table

872ecf5

marmbrus reviewed Mar 29, 2016

Davies Liu added 2 commits April 1, 2016 15:44

cleanup new parquet reader

09c51fa

Merge branch 'master' of github.com:apache/spark into many_columns

5f8e009

Davies Liu added 2 commits April 1, 2016 16:57

fix style

95a4565

cleanup

6eb13f3

davies force-pushed the many_columns branch from e8fb619 to 6eb13f3 Compare April 2, 2016 00:02

rxin reviewed Apr 2, 2016

davies force-pushed the many_columns branch from caae5c9 to 1fdea1a Compare April 4, 2016 19:38

BatchedDataSourceScan

ec46638

davies force-pushed the many_columns branch from 1fdea1a to ec46638 Compare April 4, 2016 19:41

Davies Liu added 3 commits April 4, 2016 13:18

fix tests

d8f6e4d

CR

d545fa3

Merge branch 'master' of github.com:apache/spark into many_columns

f2baae5

Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SqlNewHadoopRDD.scala

davies changed the title ~~[SPARK-14224] [SPARK-14223] [SQL] fix RowEncoder and parquet reader for wide table~~ [SPARK-14224] [SPARK-14223] [SPARK-14310] [SQL] fix RowEncoder and parquet reader for wide table Apr 4, 2016

Merge branch 'master' of github.com:apache/spark into many_columns

79c2ad5

Conflicts: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

marmbrus reviewed Apr 6, 2016

address comments

2dc2b75

davies force-pushed the many_columns branch from 4b44e22 to 2dc2b75 Compare April 6, 2016 18:02

nongli reviewed Apr 6, 2016

address comments

76f31a2

asfgit closed this in 5a4b11a Apr 6, 2016

6 participants