Conversation

@liancheng commented May 3, 2016

What changes were proposed in this pull request?

Currently, various `FileFormat` data sources share approximately the same code for appending partition values. This PR eliminates that duplication.

A new method `buildReaderWithPartitionValues()` is added to `FileFormat` with a default implementation that appends partition values to the `InternalRow`s produced by the reader function returned by `buildReader()`.

Special data sources override `buildReaderWithPartitionValues()` and simply delegate to `buildReader()`: the Parquet data source, which already implements partition value appending inside `buildReader()` because of its vectorized reader, and the Text data source, which doesn't support partitioning.

This PR brings two benefits:

  1. Most obviously, it de-duplicates the partition value appending logic.

  2. The reader function returned by `buildReader()` now only needs to produce `InternalRow`s rather than `UnsafeRow`s, as long as the data source doesn't override `buildReaderWithPartitionValues()`.

    This works because the safe-to-unsafe conversion is also performed while appending partition values. It makes 3rd-party data sources (e.g. spark-avro) easier to implement, since they no longer need to access private APIs involving `UnsafeRow`.
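
As an illustration only, the appending pattern above can be sketched in plain Scala. Everything here is a hypothetical simplification: `Row` (a `Vector[Any]`) stands in for `InternalRow`, and the two functions merely mimic the shapes of `buildReader()` and `buildReaderWithPartitionValues()`, not Spark's actual signatures:

```scala
object PartitionAppendSketch {
  // Stand-in for Spark's InternalRow; purely illustrative.
  type Row = Vector[Any]

  // A buildReader()-style function: yields only the data columns of each record.
  def buildReader(fileRows: Seq[Row]): Iterator[Row] = fileRows.iterator

  // A default buildReaderWithPartitionValues()-style wrapper: appends the
  // partition values to every row the plain reader emits, so individual
  // file formats don't each have to duplicate this step.
  def buildReaderWithPartitionValues(
      fileRows: Seq[Row],
      partitionValues: Row): Iterator[Row] =
    buildReader(fileRows).map(row => row ++ partitionValues)

  def main(args: Array[String]): Unit = {
    val fileRows = Seq(Vector[Any](1, "a"), Vector[Any](2, "b"))
    val appended =
      buildReaderWithPartitionValues(fileRows, Vector[Any]("2016-05-03")).toList
    appended.foreach(println) // each row now ends with the partition value
  }
}
```

A format that already handles partition values itself (as Parquet does) would simply skip the wrapper and expose its plain reader directly.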

How was this patch tested?

Existing tests should cover this change.

@liancheng
Contributor Author

cc @yhuai @cloud-fan

Contributor Author

@cloud-fan Instead of adding a new `ReaderFunction` trait with an `initialize()` method as you suggested, I used an anonymous `Function1` class here. Not quite sure how useful the `initialize()` method would be in more general cases...

Contributor

E.g., the Text data source, which needs to initialize an `UnsafeRowWriter` once per reader function (not once per file).

Contributor Author

That's a reasonable use case. But we can also use an anonymous `Function1` class there.
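
As a hedged sketch of the two alternatives discussed here, in plain Scala with a `StringBuilder` standing in for `UnsafeRowWriter` (all names below are illustrative, not Spark's API):

```scala
object ReaderFunctionSketch {
  // Alternative 1 (the suggestion): a dedicated trait with an explicit
  // initialize() hook that the framework would call once per reader function.
  trait ReaderFunction extends (String => String) {
    def initialize(): Unit
  }

  // Alternative 2 (what the PR uses): an anonymous Function1 subclass whose
  // constructor body performs the one-time setup, so no extra trait is needed.
  val reader: String => String = new (String => String) {
    // Runs once when the reader function is created (not once per file),
    // mimicking the allocation of a per-reader UnsafeRowWriter-like buffer.
    private val writer = new StringBuilder

    def apply(line: String): String = {
      writer.clear()
      writer.append(line.toUpperCase)
      writer.toString
    }
  }
}
```

Both styles give the reader function a place for one-time setup; the anonymous class simply uses its constructor body instead of a named hook.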


SparkQA commented May 3, 2016

Test build #57626 has finished for PR 12866 at commit 4c946a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented May 3, 2016

Test build #57628 has finished for PR 12866 at commit 1f7ee5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng force-pushed the spark-14237-simplify-partition-values-appending branch from 1f7ee5f to 1bce7db on May 4, 2016 01:22

SparkQA commented May 4, 2016

Test build #57702 has finished for PR 12866 at commit 1bce7db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM

@liancheng
Contributor Author

Thanks for the review! Merged this to master and branch-2.0.

@asfgit closed this in bc3760d on May 4, 2016
asfgit pushed a commit that referenced this pull request May 4, 2016
…rious buildReader() implementations


Author: Cheng Lian <lian@databricks.com>

Closes #12866 from liancheng/spark-14237-simplify-partition-values-appending.

(cherry picked from commit bc3760d)
Signed-off-by: Cheng Lian <lian@databricks.com>
@liancheng deleted the spark-14237-simplify-partition-values-appending branch May 4, 2016 16:30