[SPARK-14237][SQL] De-duplicate partition value appending logic in various buildReader() implementations #12866
Conversation
Force-pushed from 4c946a9 to 1f7ee5f.
cc @yhuai @cloud-fan
@cloud-fan Instead of adding a new `ReaderFunction` trait with an `initialize()` method as you suggested, I used an anonymous `Function1` class here. Not quite sure how useful the `initialize()` method can be in more general cases...
E.g. the text data source, which needs to initialize an `UnsafeRowWriter` once per reader function (not once per file).
That's a reasonable use case. But we can also use an anonymous `Function1` class there.
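(For readers following the discussion: below is a minimal, self-contained sketch of the anonymous `Function1` pattern being debated. `RowWriter` and the one-field `PartitionedFile` are hypothetical stand-ins for Spark internals such as `UnsafeRowWriter`; they are not the real API.)

```scala
// Hypothetical stand-ins for Spark's internal types, for illustration only.
case class PartitionedFile(path: String)

class RowWriter {
  // Imagine expensive one-time setup here (buffers, codegen, etc.).
  def write(line: String): String = line.toUpperCase
}

def buildReader(): PartitionedFile => Iterator[String] =
  new (PartitionedFile => Iterator[String]) {
    // Created once when the reader function is built, and reused for every
    // file this function processes -- the role an explicit `initialize()`
    // method on a ReaderFunction trait would otherwise play.
    private val writer = new RowWriter

    override def apply(file: PartitionedFile): Iterator[String] =
      Iterator(s"line from ${file.path}").map(writer.write)
  }
```

Because `writer` is created in the body of the anonymous class, one `buildReader()` call pays the setup cost exactly once, no matter how many files the returned function later processes.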
Test build #57626 has finished for PR 12866 at commit
Test build #57628 has finished for PR 12866 at commit
Force-pushed from 1f7ee5f to 1bce7db.
Test build #57702 has finished for PR 12866 at commit
LGTM

Thanks for the review! Merged this to master and branch-2.0.
Author: Cheng Lian <lian@databricks.com>
Closes #12866 from liancheng/spark-14237-simplify-partition-values-appending.
(cherry picked from commit bc3760d)
Signed-off-by: Cheng Lian <lian@databricks.com>
What changes were proposed in this pull request?
Currently, various `FileFormat` data sources share approximately the same code for partition value appending. This PR tries to eliminate this duplication.

A new method `buildReaderWithPartitionValues()` is added to `FileFormat` with a default implementation that appends partition values to `InternalRow`s produced by the reader function returned by `buildReader()`.

Special data sources like Parquet, which implements partition value appending inside `buildReader()` because of the vectorized reader, and the Text data source, which doesn't support partitioning, override `buildReaderWithPartitionValues()` and simply delegate to `buildReader()`.

This PR brings two benefits:
1. Apparently, it de-duplicates partition value appending logic.
2. Now the reader function returned by `buildReader()` is only required to produce `InternalRow`s rather than `UnsafeRow`s if the data source doesn't override `buildReaderWithPartitionValues()`, because the safe-to-unsafe conversion is also performed while appending partition values. This makes 3rd-party data sources (e.g. spark-avro) easier to implement since they no longer need to access private APIs involving `UnsafeRow`.

How was this patch tested?
Existing tests should do the work.
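As a closing illustration, here is a hedged sketch of what the default `buildReaderWithPartitionValues()` behavior described above amounts to. The free-standing helper `withPartitionValues` and its exact signature are simplifications for illustration; the actual method lives on `FileFormat` and takes more parameters.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{JoinedRow, UnsafeProjection}
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.types.StructType

// Illustrative sketch: wrap a buildReader()-style function so that every row
// it produces gets the file's partition values appended.
def withPartitionValues(
    requiredSchema: StructType,
    partitionSchema: StructType,
    dataReader: PartitionedFile => Iterator[InternalRow])
    : PartitionedFile => Iterator[InternalRow] = {

  // Output columns: data columns first, then partition columns.
  val fullSchema = StructType(requiredSchema ++ partitionSchema)

  // The safe-to-unsafe conversion happens here, which is why buildReader()
  // only needs to produce plain InternalRows.
  val toUnsafe = UnsafeProjection.create(fullSchema)
  val joinedRow = new JoinedRow()

  (file: PartitionedFile) =>
    dataReader(file).map { dataRow =>
      toUnsafe(joinedRow(dataRow, file.partitionValues))
    }
}
```

Note that the projection and the `JoinedRow` are created once per reader function rather than once per file, which is the same per-reader initialization concern raised in the review discussion above.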