Conversation

@davies
Contributor

@davies davies commented Mar 29, 2016

What changes were proposed in this pull request?

  1. Fix RowEncoder for wide tables (many columns) by splitting the generated code into multiple functions.
  2. Separate DataSourceScan into RowDataSourceScan and BatchedDataSourceScan.
  3. Disable returning columnar batches in the Parquet reader when there are many columns.
  4. Add an internal config for the maximum number of fields (including nested fields) supported by whole-stage codegen.
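
For readers unfamiliar with item 1: a single JVM method is limited to 64 KB of bytecode, so code generated for a table with very many columns has to be spread across several methods. The following is only a rough sketch of that idea, not the code in this patch. `CodeSplitter`, the `apply_N` naming, and the character-count threshold are all hypothetical, and source length is used as a crude stand-in for bytecode size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups per-column code snippets into several
// generated methods so that no single method grows too large.
final class CodeSplitter {

    /** Generated helper methods plus the calls that invoke them in order. */
    static final class Split {
        final List<String> functions = new ArrayList<>();
        final List<String> calls = new ArrayList<>();
    }

    /**
     * Packs snippets greedily: a new helper method is started whenever adding
     * the next snippet would push the current body past maxCharsPerFunction.
     * A snippet that is too large on its own still gets a method to itself.
     */
    static Split split(List<String> snippets, int maxCharsPerFunction) {
        Split out = new Split();
        StringBuilder body = new StringBuilder();
        for (String snippet : snippets) {
            if (body.length() > 0
                    && body.length() + snippet.length() > maxCharsPerFunction) {
                flush(out, body);
            }
            body.append(snippet).append('\n');
        }
        if (body.length() > 0) {
            flush(out, body);
        }
        return out;
    }

    // Emits one helper method as source text. "InternalRow" appears only
    // inside the generated string; this sketch never compiles that text.
    private static void flush(Split out, StringBuilder body) {
        String name = "apply_" + out.functions.size();
        out.functions.add(
            "private void " + name + "(InternalRow row) {\n" + body + "}\n");
        out.calls.add(name + "(row);");
        body.setLength(0);
    }
}
```

The caller would emit every string in `functions` as a class member and put the `calls` sequence into the original entry point, so the entry point itself stays small no matter how many columns the table has.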

Closes #12098

How was this patch tested?

Added a test for a table with 1000 columns.

@davies
Contributor Author

davies commented Mar 29, 2016

cc @nongli @rxin @marmbrus

*/
public void enableReturningBatches() {
returnColumnarBatch = true;
public boolean tryEnableReturningBatches(int maxColumns) {
Contributor

Why are we pushing a config value into the reader so that it can switch? Can't we just put this logic in the query planner, where the decision to call enableReturningBatches is made? I think it's clearer if there is a single place where we decide whether to use column batches.

Contributor Author

Yes, we should do this in the planner by checking the schema. That requires some cleanup in the Parquet reader, which currently still uses a runtime exception to do the switch. cc @nongli @sameeragarwal

Should we wait for that?
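
To make the planner-side check concrete, here is a minimal sketch of deciding from the schema alone whether to return columnar batches. It is an illustration under stated assumptions, not the code in this PR: `Field`, `BatchPlanning`, and `shouldReturnBatches` are hypothetical names, and counting only leaf fields (a struct contributes the sum of its children) is one possible counting rule.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical planner-side check: decide up front, from the schema,
// whether a scan should return columnar batches.
final class BatchPlanning {

    /** Minimal stand-in for a schema field; a leaf column has no children. */
    static final class Field {
        final String name;
        final List<Field> children;

        Field(String name, Field... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    /** Counts leaf fields; a struct counts as the sum of its children. */
    static int countLeafFields(List<Field> fields) {
        int count = 0;
        for (Field field : fields) {
            count += field.children.isEmpty() ? 1 : countLeafFields(field.children);
        }
        return count;
    }

    /** Batches are used only when the schema is narrow enough. */
    static boolean shouldReturnBatches(List<Field> schema, int maxFields) {
        return countLeafFields(schema) <= maxFields;
    }
}
```

With a check like this in one place, the reader no longer needs to know the limit or switch modes at runtime; it is simply told which mode to use.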

@SparkQA

SparkQA commented Mar 29, 2016

Test build #54463 has finished for PR 12047 at commit 872ecf5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 1, 2016

Test build #2720 has finished for PR 12047 at commit 872ecf5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 1, 2016

Test build #54737 has finished for PR 12047 at commit 5f8e009.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Davies Liu added 2 commits April 1, 2016 16:57
@SparkQA

SparkQA commented Apr 2, 2016

Test build #54744 has finished for PR 12047 at commit e8fb619.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 2, 2016

Test build #54745 has finished for PR 12047 at commit 6eb13f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

super.initialize(inputSplit, taskAttemptContext);
initializeInternal();
Configuration conf = ContextUtil.getConfiguration(taskAttemptContext);
returnColumnarBatch = conf.getBoolean("returning.batch", false);
Contributor

Where is this returning.batch set?

Davies Liu added 3 commits April 4, 2016 13:18
Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SqlNewHadoopRDD.scala
@davies davies changed the title [SPARK-14224] [SPARK-14223] [SQL] fix RowEncoder and parquet reader for wide table [SPARK-14224] [SPARK-14223] [SPARK-14310] [SQL] fix RowEncoder and parquet reader for wide table Apr 4, 2016
@SparkQA

SparkQA commented Apr 4, 2016

Test build #54883 has finished for PR 12047 at commit ec46638.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 4, 2016

Test build #54882 has finished for PR 12047 at commit caae5c9.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 4, 2016

Test build #54887 has finished for PR 12047 at commit d8f6e4d.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 4, 2016

Test build #54890 has finished for PR 12047 at commit f2baae5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor Author

davies commented Apr 4, 2016

@marmbrus Did another round of refactoring. Does this match what you had in mind?

@marmbrus
Contributor

marmbrus commented Apr 5, 2016

Yeah, this looks much cleaner.

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@SparkQA

SparkQA commented Apr 6, 2016

Test build #55065 has finished for PR 12047 at commit 79c2ad5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Apr 6, 2016

@marmbrus Is this PR good to go?

/**
* The ParquetInputFormat that creates VectorizedParquetRecordReader.
*/
final class VectorizedParquetInputFormat extends ParquetInputFormat[InternalRow] {
Contributor

@liancheng @cloud-fan Let's make sure we delete this when you remove buildInternalScan.

@marmbrus
Contributor

marmbrus commented Apr 6, 2016

This is a huge improvement. A few minor comments, otherwise LGTM.

@davies
Contributor Author

davies commented Apr 6, 2016

@marmbrus Since we will remove buildInternalScan soon, I removed the returning-batch support for that case to simplify this patch.

@SparkQA

SparkQA commented Apr 6, 2016

Test build #55130 has finished for PR 12047 at commit 2dc2b75.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 6, 2016

Test build #55129 has finished for PR 12047 at commit 4b44e22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 6, 2016

Test build #2761 has finished for PR 12047 at commit 2dc2b75.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

throws IOException, InterruptedException, UnsupportedOperationException {
super.initialize(inputSplit, taskAttemptContext);
initializeInternal();
Configuration conf = ContextUtil.getConfiguration(taskAttemptContext);
Contributor

what does this do?

Contributor Author

Not needed, will remove

@nongli
Contributor

nongli commented Apr 6, 2016

LGTM

@davies
Contributor Author

davies commented Apr 6, 2016

Since the last commit is tiny, merging this into master to unblock others.

@asfgit asfgit closed this in 5a4b11a Apr 6, 2016