Skip to content

Conversation

@ajacques
Copy link
Contributor

@ajacques ajacques commented Jul 27, 2018

(Link to Jira: https://issues.apache.org/jira/browse/SPARK-4502)

This is a restart of #21320. Most of the discussion has taken place over there. I've only taken it as a based to make changes based on the code review to help move things along.

Due to the urgency of the upcoming 2.4 code freeze, I'm going to open this PR to collect any feedback. This can be closed if you prefer to continue to the work in the original PR.

What changes were proposed in this pull request?

One of the hallmarks of a column-oriented data storage format is the ability to read data from a subset of columns, efficiently skipping reads from other columns. Spark has long had support for pruning unneeded top-level schema fields from the scan of a parquet file. For example, consider a table, contacts, backed by parquet with the following Spark SQL schema:

root
 |-- name: struct
 |    |-- first: string
 |    |-- last: string
 |-- address: string

Parquet stores this table's data in three physical columns: name.first, name.last and address. To answer the query

select address from contacts
Spark will read only from the address column of parquet data. However, to answer the query

select name.first from contacts
Spark will read name.first and name.last from parquet.

This PR modifies Spark SQL to support a finer-grain of schema pruning. With this patch, Spark reads only the name.first column to answer the previous query.

Implementation

There are two main components of this patch. First, there is a ParquetSchemaPruning optimizer rule for gathering the required schema fields of a PhysicalOperation over a parquet file, constructing a new schema based on those required fields and rewriting the plan in terms of that pruned schema. The pruned schema fields are pushed down to the parquet requested read schema. ParquetSchemaPruning uses a new ProjectionOverSchema extractor for rewriting a catalyst expression in terms of a pruned schema.

Second, the ParquetRowConverter has been patched to ensure the ordinals of the parquet columns read are correct for the pruned schema. ParquetReadSupport has been patched to address a compatibility mismatch between Spark's built in vectorized reader and the parquet-mr library's reader.

Limitation

Among the complex Spark SQL data types, this patch supports parquet column pruning of nested sequences of struct fields only.

How was this patch tested?

Care has been taken to ensure correctness and prevent regressions. A more advanced version of this patch incorporating optimizations for rewriting queries involving aggregations and joins has been running on a production Spark cluster at VideoAmp for several years. In that time, one bug was found and fixed early on, and we added a regression test for that bug.

We forward-ported this patch to Spark master in June 2016 and have been running this patch against Spark 2.x branches on ad-hoc clusters since then.

Known Issues

Highlighted in #21320 (comment)

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Jul 27, 2018

Test build #93646 has finished for PR 21889 at commit 4b847ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Jul 27, 2018

If there's no objection within few days, let me get this in cc @cloud-fan and @gatorsmile and make other works and comments separate.

@mallman, if we are all happy here, mind taking a look #21320 (comment) and #21320 (comment)

I will fix my own comments as a followup by myself.

Will credit this to you FWIW.

@mallman
Copy link
Contributor

mallman commented Jul 27, 2018

@mallman, if we are all happy here, mind taking a look #21320 (comment) and #21320 (comment)

I'm working on the former. I don't understand what kind of "second PR" @gatorsmile is referring to in the latter comment. I know we need to do something to fix the ignored tests, but those tests are in this PR. Am I supposed to create a PR on this PR?

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Jul 27, 2018

we need to do something to fix the ignored tests, but those tests are in this PR.

If this one got merged, yes, we could fix them in a separate PR to enable them. I remember there need some fixes in ParquetReadSupport? I believe we can all work in roughly parallel after this one.

@gatorsmile
Copy link
Member

Just FYI, we are unable to merge it if it has a correctness bug.

@HyukjinKwon
Copy link
Member

@gatorsmile, just for clarification, you mean some regressions about correctness bug in existing features, right?

@gatorsmile
Copy link
Member

I mean #21320 (comment)

@HyukjinKwon
Copy link
Member

Ah, I think that's a bug when this feature is enabled. How about this: since this PR is open already and that's a bug in the new feature, we can get this in, I address my comments and he works on that comment and fixes in ParquetReadSupport.

@mallman
Copy link
Contributor

mallman commented Jul 31, 2018

Hi @gatorsmile. Where do you see us at this point? Do you still want to get this into Spark 2.4?

@gatorsmile
Copy link
Member

We are still targeting this to 2.4, but we need to fix all the identified bugs before merging it.

@gatorsmile
Copy link
Member

Normally, we change the default to false or revert the whole PR if the bugs are found during the RC (release candidate) stage.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Jul 31, 2018

In that case, I hope my comments are addressed together before merging it in. They are non trivial and I don't usually go with them if there isn't special reason for it.

@ajacques
Copy link
Contributor Author

Where does that leave both of these PRs? Do we still want this one with the code refactoring or to go back to the original? Are there any comments for this PR that would block merging? I've set the default to false in this PR.

@gatorsmile
Copy link
Member

If the revert is very risky, we normally change the default from true to false when the bug is identified in the RC stage.

@mallman
Copy link
Contributor

mallman commented Aug 1, 2018

Where does that leave both of these PRs? Do we still want this one with the code refactoring or to go back to the original? Are there any comments for this PR that would block merging? I've set the default to false in this PR.

@ajacques Do you want to rebase your PR off my last commit? Then I think it's up to @HyukjinKwon and @gatorsmile on how to proceed.

@ajacques
Copy link
Contributor Author

ajacques commented Aug 1, 2018

@mallman, sounds good I'll get this PR updated with your latest changes as soon as I can.

@SparkQA
Copy link

SparkQA commented Aug 2, 2018

Test build #93925 has finished for PR 21889 at commit 85e7759.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 2, 2018

Test build #93932 has finished for PR 21889 at commit 887330d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mallman
Copy link
Contributor

mallman commented Aug 2, 2018

This patch fails Scala style tests.

Hi @ajacques. I'm not sure if you're aware of this, but you can run the scalastyle checks locally with

sbt scalastyle

@SparkQA
Copy link

SparkQA commented Aug 2, 2018

Test build #93952 has finished for PR 21889 at commit be71cd7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ajacques
Copy link
Contributor Author

ajacques commented Aug 2, 2018

These test failures are in Spark streaming. Is this just an intermittent test failure or actually caused by this PR?

@mallman
Copy link
Contributor

mallman commented Aug 2, 2018

These test failures are in Spark streaming. Is this just an intermittent test failure or actually caused by this PR?

I was able to run the first failing test successfully. Can we get a retest, please?

@mallman
Copy link
Contributor

mallman commented Aug 2, 2018

I was able to run the first failing test successfully. Can we get a retest, please?

@ajacques I just rebased and pushed my branch off of master. Perhaps the easiest thing to do would be for you to rebase again off of my branch and do another force-push to your branch.

@SparkQA
Copy link

SparkQA commented Aug 2, 2018

Test build #94037 has finished for PR 21889 at commit 89f9e53.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 3, 2018

Test build #94073 has finished for PR 21889 at commit 37e0a97.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ajacques
Copy link
Contributor Author

ajacques commented Aug 3, 2018

Anybody else able to reproduce this failure? It succeeded on my developer machine.

@HyukjinKwon
Copy link
Member

retest this please

@mallman
Copy link
Contributor

mallman commented Aug 3, 2018

Anybody else able to reproduce this failure? It succeeded on my developer machine.

It worked for me, too. Let's see what a retest does.

@SparkQA
Copy link

SparkQA commented Aug 3, 2018

Test build #94101 has finished for PR 21889 at commit 37e0a97.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 10, 2018

Test build #94536 has finished for PR 21889 at commit 51f0dc5.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mallman
Copy link
Contributor

mallman commented Aug 10, 2018

@mallman, while we wait for the go-no-go, do you have the changes for the next PR ready? Is there anything you need help with?

I have the hack I used originally, but I haven't tried finding a better solution yet. It could take some time to understand the underlying problem/incompatibility/misunderstanding/etc.

I spent some time yesterday digging deeper into why the hack I wrote worked, and I think I understand now. Practically speaking, my follow-on PR will be about the same as the commit I removed. However, I can support it with some explanatory comments instead of just "this throws an exception sometimes".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we really avoid AttributeReference(name, _, _, _) pattern per https://github.com/databricks/scala-style-guide#pattern-matching?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For instance, these , _, _, _), _, _, looks excessive.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

private def buildPrunedDataSchema(
    fileDataSchema: StructType,
    requestedRootFields: Seq[RootField]) = {

per https://github.com/databricks/scala-style-guide#spacing-and-indentation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix all this. Seems like this is an opportunity to catch some of this with static analysis to avoid wasting the reviewer's time. Is Scalastyle not configured to catch this stuff?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I know, no. I use IntelliJ (integration with scalastyle) to catch basic style nits but rather then that, I do it manually. It took me a while to get used to it before FWIW.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[[..]] syntax is not needed in inlined comments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we have multiple relations with split schema for each? It wouldn't necessarily have one big deep nested schema which makes hard to read.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Array.empty[FullName]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a definition in a constructor, so I don't think we can do:

case class Contact(
  [...]
  friends: Array.empty[FullName],
  relatives: Map.empty[String, FullName]
)

Scala wants a colon, so I opted for:

case class Contact(
  [...]
  friends: Array[FullName] = Array.empty,
  relatives: Map[String, FullName] = Map.empty
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Map.empty[String, FullName]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would avoid unrelated refactoring and revert this back.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we really unable to make some private functions and split some logics at least?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation

@mallman
Copy link
Contributor

mallman commented Aug 13, 2018

@ajacques I added a commit to enable schema pruning by default. It's a little more complete than your commit to do the same. Please rebase off my branch and remove your commit.

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Aug 14, 2018

Test build #94731 has finished for PR 21889 at commit 51f0dc5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 15, 2018

Test build #94785 has finished for PR 21889 at commit 1c0c4bf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Aug 15, 2018

Test build #94790 has finished for PR 21889 at commit 1c0c4bf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mallman
Copy link
Contributor

mallman commented Aug 15, 2018

Due to the urgency of the upcoming 2.4 code freeze, I'm going to open this PR to collect any feedback. This can be closed if you prefer to continue to the work in the original PR.

That would be my preference, yes, especially if it means less work for you.

@SparkQA
Copy link

SparkQA commented Aug 15, 2018

Test build #94805 has finished for PR 21889 at commit 8d822ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 16, 2018

Test build #4278 has finished for PR 21889 at commit 8d822ee.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

I've only taken it as a based to make stylistic changes based on the code review to help move things along.

This PR doesn't only include stylistic changes. Since stylistic changes do not usually block a PR, mind fixing the PR description?

@mallman
Copy link
Contributor

mallman commented Aug 16, 2018

Essentially, this PR was created to take the management of #21320 out of my hands, with a view towards facilitating its incorporation into Spark 2.4. It was my suggestion, one based in frustration. In hindsight, I no longer believe this strategy is the best—or most expedient—approach towards progress. Indeed, I believe the direction of this PR has become orthogonal to its motivating goal, becoming a dispute between myself and @HyukjinKwon rather than a means to move things along.

I believe I can shepherd #21320 in a way that will promote greater progress. @ajacques, I mean no disrespect, and I thank you for volunteering your time, patience and effort for the sake of all that are interested in seeing this patch become a part of Spark. And I apologize for letting you down, letting everyone down. In my conduct leading up to the creation of this PR I did not act with the greatest maturity or patience. And I did not act in the best interests of the community.

No one has spent more time or more effort, taken more responsibility or exhibited more patience with this 2+ year patch-set-in-the-making than myself. I respectfully submit it is mine to present and manage, and no one else's. Insofar as I have expressed otherwise in the past, I admit my error—one made in frustration—and recant in hindsight.

@ajacques, at this point I respectfully assert that managing the patch set I submitted in #21320 is not your responsibility, nor is it anyone else's but mine. I ask you to close this PR so that we can resume the review in #21320. As I stated there, you are welcome to open a PR on https://github.com/VideoAmp/spark-public/tree/spark-4502-parquet_column_pruning-foundation to submit the changes you've made for review.

Thank you.

@ajacques
Copy link
Contributor Author

ajacques commented Aug 16, 2018

Thanks for the response all. @mailman If it's really your preference, I will create a PR against that branch and close this one. My intention was never to take away from your efforts, and I still consider my work here to be just minor stylistic tweaks on top of your work. I did this as service to help bridge the divide and hopefully alleviate frustrations. But this has been a bit frustrating being stuck between two sides of this PR/changing merge strategies often and don't wish to continue playing this role.

As such, I will create a PR, but hope it does not dragged out to settle any differences in opinions between maintainers and submitters. My goal is to make sure this valuable feature gets merged so many can benefit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants