[SPARK-14237][SQL] De-duplicate partition value appending logic in various buildReader() implementations #12866
Conversation
Force-pushed from 4c946a9 to 1f7ee5f.
cc @yhuai @cloud-fan
@cloud-fan Instead of adding a new `ReaderFunction` trait with an `initialize()` method as you suggested, I used an anonymous `Function1` class here. Not quite sure how useful the `initialize()` method can be in more general cases...
E.g. the text data source, which needs to initialize an `UnsafeRowWriter` once per reader function (not once per file).
That's a reasonable use case. But we can also use an anonymous `Function1` class there.
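(For readers following the discussion: below is a minimal, self-contained sketch of the anonymous `Function1` pattern being debated. `RowWriter` and the one-field `PartitionedFile` are hypothetical stand-ins for Spark internals such as `UnsafeRowWriter`; they are not the real API.)

```scala
// Hypothetical stand-ins for Spark's internal types, for illustration only.
case class PartitionedFile(path: String)

class RowWriter {
  // Imagine expensive one-time setup here (buffers, codegen, etc.).
  def write(line: String): String = line.toUpperCase
}

def buildReader(): PartitionedFile => Iterator[String] =
  new (PartitionedFile => Iterator[String]) {
    // Created once when the reader function is built, and reused for every
    // file this function processes -- the role an explicit `initialize()`
    // method on a ReaderFunction trait would otherwise play.
    private val writer = new RowWriter

    override def apply(file: PartitionedFile): Iterator[String] =
      Iterator(s"line from ${file.path}").map(writer.write)
  }
```

Because `writer` is created in the body of the anonymous class, one `buildReader()` call pays the setup cost exactly once, no matter how many files the returned function later processes.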
Test build #57626 has finished for PR 12866 at commit
Test build #57628 has finished for PR 12866 at commit
Force-pushed from 1f7ee5f to 1bce7db.
Test build #57702 has finished for PR 12866 at commit
LGTM

Thanks for the review! Merged this to master and branch-2.0.
Author: Cheng Lian <lian@databricks.com>
Closes #12866 from liancheng/spark-14237-simplify-partition-values-appending.
(cherry picked from commit bc3760d)
Signed-off-by: Cheng Lian <lian@databricks.com>
What changes were proposed in this pull request?
Currently, various `FileFormat` data sources share approximately the same code for partition value appending. This PR tries to eliminate this duplication.

A new method `buildReaderWithPartitionValues()` is added to `FileFormat` with a default implementation that appends partition values to `InternalRow`s produced by the reader function returned by `buildReader()`.

Special data sources like Parquet, which implements partition value appending inside `buildReader()` because of the vectorized reader, and the Text data source, which doesn't support partitioning, override `buildReaderWithPartitionValues()` and simply delegate to `buildReader()`.

This PR brings two benefits:
1. Apparently, it de-duplicates partition value appending logic.
2. Now the reader function returned by `buildReader()` is only required to produce `InternalRow`s rather than `UnsafeRow`s if the data source doesn't override `buildReaderWithPartitionValues()`, because the safe-to-unsafe conversion is also performed while appending partition values. This makes 3rd-party data sources (e.g. spark-avro) easier to implement since they no longer need to access private APIs involving `UnsafeRow`.

How was this patch tested?
Existing tests should do the work.
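As a closing illustration, here is a hedged sketch of what the default `buildReaderWithPartitionValues()` behavior described above amounts to. The free-standing helper `withPartitionValues` and its exact signature are simplifications for illustration; the actual method lives on `FileFormat` and takes more parameters.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{JoinedRow, UnsafeProjection}
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.types.StructType

// Illustrative sketch: wrap a buildReader()-style function so that every row
// it produces gets the file's partition values appended.
def withPartitionValues(
    requiredSchema: StructType,
    partitionSchema: StructType,
    dataReader: PartitionedFile => Iterator[InternalRow])
    : PartitionedFile => Iterator[InternalRow] = {

  // Output columns: data columns first, then partition columns.
  val fullSchema = StructType(requiredSchema ++ partitionSchema)

  // The safe-to-unsafe conversion happens here, which is why buildReader()
  // only needs to produce plain InternalRows.
  val toUnsafe = UnsafeProjection.create(fullSchema)
  val joinedRow = new JoinedRow()

  (file: PartitionedFile) =>
    dataReader(file).map { dataRow =>
      toUnsafe(joinedRow(dataRow, file.partitionValues))
    }
}
```

Note that the projection and the `JoinedRow` are created once per reader function rather than once per file, which is the same per-reader initialization concern raised in the review discussion above.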