-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-14224] [SPARK-14223] [SPARK-14310] [SQL] fix RowEncoder and parquet reader for wide table #12047
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| */ | ||
| public void enableReturningBatches() { | ||
| returnColumnarBatch = true; | ||
| public boolean tryEnableReturningBatches(int maxColumns) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we pushing a config value into the reader so that it can switch? Can't we just put this logic in the query planner where the decision to call enableReturningBatches is. I think its clearer if there is a single place where we decide to use column batches or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should do this in planner, by checking the schema, which requires some cleanup in parquet reader, currently which still use runtime exception to do the switch. cc @nongli @sameeragarwal
Should we wait for that?
|
Test build #54463 has finished for PR 12047 at commit
|
|
Test build #2720 has finished for PR 12047 at commit
|
|
Test build #54737 has finished for PR 12047 at commit
|
|
Test build #54744 has finished for PR 12047 at commit
|
|
Test build #54745 has finished for PR 12047 at commit
|
| super.initialize(inputSplit, taskAttemptContext); | ||
| initializeInternal(); | ||
| Configuration conf = ContextUtil.getConfiguration(taskAttemptContext); | ||
| returnColumnarBatch = conf.getBoolean("returning.batch", false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is this returning.batch set?
|
Test build #54883 has finished for PR 12047 at commit
|
|
Test build #54882 has finished for PR 12047 at commit
|
|
Test build #54887 has finished for PR 12047 at commit
|
|
Test build #54890 has finished for PR 12047 at commit
|
|
@marmbrus Did another round of refactor, does this match the things in your mind? |
|
Yeah, this looks much cleaner. |
Conflicts: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
|
Test build #55065 has finished for PR 12047 at commit
|
|
@marmbrus Is this PR good to go? |
| /** | ||
| * The ParquetInputFormat that create VectorizedParquetRecordReader. | ||
| */ | ||
| final class VectorizedParquetInputFormat extends ParquetInputFormat[InternalRow] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liancheng @cloud-fan lets make sure we delete this when you remove buildInternalScan
|
This is a huge improvement. A few minor comments, otherwise LGTM. |
|
@marmbrus Since we will remove buildInternalScan soon, removed the returning batch support for that case to simplify this patch. |
|
Test build #55130 has finished for PR 12047 at commit
|
|
Test build #55129 has finished for PR 12047 at commit
|
|
Test build #2761 has finished for PR 12047 at commit
|
| throws IOException, InterruptedException, UnsupportedOperationException { | ||
| super.initialize(inputSplit, taskAttemptContext); | ||
| initializeInternal(); | ||
| Configuration conf = ContextUtil.getConfiguration(taskAttemptContext); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does this do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed, will remove
|
LGTM |
|
Since the last commit is tiny, merging this into master to unblock others. |
What changes were proposed in this pull request?
Closes #12098
How was this patch tested?
Add a tests for table with 1000 columns.