[SPARK-4502][SQL] Parquet nested column pruning - foundation (2nd attempt) #21889

ajacques · 2018-07-27T00:29:33Z

(Link to Jira: https://issues.apache.org/jira/browse/SPARK-4502)

This is a restart of #21320. Most of the discussion has taken place over there. I've only taken it as a based to make changes based on the code review to help move things along.

Due to the urgency of the upcoming 2.4 code freeze, I'm going to open this PR to collect any feedback. This can be closed if you prefer to continue to the work in the original PR.

What changes were proposed in this pull request?

One of the hallmarks of a column-oriented data storage format is the ability to read data from a subset of columns, efficiently skipping reads from other columns. Spark has long had support for pruning unneeded top-level schema fields from the scan of a parquet file. For example, consider a table, contacts, backed by parquet with the following Spark SQL schema:

root
 |-- name: struct
 |    |-- first: string
 |    |-- last: string
 |-- address: string

Parquet stores this table's data in three physical columns: name.first, name.last and address. To answer the query

select address from contacts
Spark will read only from the address column of parquet data. However, to answer the query

select name.first from contacts
Spark will read name.first and name.last from parquet.

This PR modifies Spark SQL to support a finer-grain of schema pruning. With this patch, Spark reads only the name.first column to answer the previous query.

Implementation

There are two main components of this patch. First, there is a ParquetSchemaPruning optimizer rule for gathering the required schema fields of a PhysicalOperation over a parquet file, constructing a new schema based on those required fields and rewriting the plan in terms of that pruned schema. The pruned schema fields are pushed down to the parquet requested read schema. ParquetSchemaPruning uses a new ProjectionOverSchema extractor for rewriting a catalyst expression in terms of a pruned schema.

Second, the ParquetRowConverter has been patched to ensure the ordinals of the parquet columns read are correct for the pruned schema. ParquetReadSupport has been patched to address a compatibility mismatch between Spark's built in vectorized reader and the parquet-mr library's reader.

Limitation

Among the complex Spark SQL data types, this patch supports parquet column pruning of nested sequences of struct fields only.

How was this patch tested?

Care has been taken to ensure correctness and prevent regressions. A more advanced version of this patch incorporating optimizations for rewriting queries involving aggregations and joins has been running on a production Spark cluster at VideoAmp for several years. In that time, one bug was found and fixed early on, and we added a regression test for that bug.

We forward-ported this patch to Spark master in June 2016 and have been running this patch against Spark 2.x branches on ad-hoc clusters since then.

Known Issues

Highlighted in #21320 (comment)

HyukjinKwon · 2018-07-27T01:08:16Z

ok to test

SparkQA · 2018-07-27T04:55:13Z

Test build #93646 has finished for PR 21889 at commit 4b847ac.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-07-27T05:22:41Z

If there's no objection within few days, let me get this in cc @cloud-fan and @gatorsmile and make other works and comments separate.

@mallman, if we are all happy here, mind taking a look #21320 (comment) and #21320 (comment)

I will fix my own comments as a followup by myself.

Will credit this to you FWIW.

mallman · 2018-07-27T06:10:05Z

@mallman, if we are all happy here, mind taking a look #21320 (comment) and #21320 (comment)

I'm working on the former. I don't understand what kind of "second PR" @gatorsmile is referring to in the latter comment. I know we need to do something to fix the ignored tests, but those tests are in this PR. Am I supposed to create a PR on this PR?

HyukjinKwon · 2018-07-27T06:18:10Z

we need to do something to fix the ignored tests, but those tests are in this PR.

If this one got merged, yes, we could fix them in a separate PR to enable them. I remember there need some fixes in ParquetReadSupport? I believe we can all work in roughly parallel after this one.

gatorsmile · 2018-07-27T07:01:53Z

Just FYI, we are unable to merge it if it has a correctness bug.

HyukjinKwon · 2018-07-27T08:05:13Z

@gatorsmile, just for clarification, you mean some regressions about correctness bug in existing features, right?

gatorsmile · 2018-07-27T16:21:01Z

I mean #21320 (comment)

HyukjinKwon · 2018-07-27T16:31:20Z

Ah, I think that's a bug when this feature is enabled. How about this: since this PR is open already and that's a bug in the new feature, we can get this in, I address my comments and he works on that comment and fixes in ParquetReadSupport.

mallman · 2018-07-31T05:09:37Z

Hi @gatorsmile. Where do you see us at this point? Do you still want to get this into Spark 2.4?

gatorsmile · 2018-07-31T06:46:19Z

We are still targeting this to 2.4, but we need to fix all the identified bugs before merging it.

gatorsmile · 2018-07-31T06:48:42Z

Normally, we change the default to false or revert the whole PR if the bugs are found during the RC (release candidate) stage.

HyukjinKwon · 2018-07-31T07:49:59Z

In that case, I hope my comments are addressed together before merging it in. They are non trivial and I don't usually go with them if there isn't special reason for it.

ajacques · 2018-07-31T15:00:03Z

Where does that leave both of these PRs? Do we still want this one with the code refactoring or to go back to the original? Are there any comments for this PR that would block merging? I've set the default to false in this PR.

gatorsmile · 2018-07-31T15:10:16Z

If the revert is very risky, we normally change the default from true to false when the bug is identified in the RC stage.

mallman · 2018-08-01T00:50:56Z

Where does that leave both of these PRs? Do we still want this one with the code refactoring or to go back to the original? Are there any comments for this PR that would block merging? I've set the default to false in this PR.

@ajacques Do you want to rebase your PR off my last commit? Then I think it's up to @HyukjinKwon and @gatorsmile on how to proceed.

ajacques · 2018-08-01T05:17:04Z

@mallman, sounds good I'll get this PR updated with your latest changes as soon as I can.

SparkQA · 2018-08-02T02:00:32Z

Test build #93925 has finished for PR 21889 at commit 85e7759.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-02T03:00:40Z

Test build #93932 has finished for PR 21889 at commit 887330d.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

mallman · 2018-08-02T04:16:06Z

This patch fails Scala style tests.

Hi @ajacques. I'm not sure if you're aware of this, but you can run the scalastyle checks locally with

sbt scalastyle

SparkQA · 2018-08-02T07:05:01Z

Test build #93952 has finished for PR 21889 at commit be71cd7.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

ajacques · 2018-08-02T15:39:45Z

These test failures are in Spark streaming. Is this just an intermittent test failure or actually caused by this PR?

mallman · 2018-08-02T16:48:31Z

These test failures are in Spark streaming. Is this just an intermittent test failure or actually caused by this PR?

I was able to run the first failing test successfully. Can we get a retest, please?

mallman · 2018-08-02T16:51:35Z

I was able to run the first failing test successfully. Can we get a retest, please?

@ajacques I just rebased and pushed my branch off of master. Perhaps the easiest thing to do would be for you to rebase again off of my branch and do another force-push to your branch.

SparkQA · 2018-08-02T21:14:08Z

Test build #94037 has finished for PR 21889 at commit 89f9e53.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-03T01:37:33Z

Test build #94073 has finished for PR 21889 at commit 37e0a97.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

ajacques · 2018-08-03T03:04:35Z

Anybody else able to reproduce this failure? It succeeded on my developer machine.

HyukjinKwon · 2018-08-03T04:59:32Z

retest this please

mallman · 2018-08-03T05:01:48Z

Anybody else able to reproduce this failure? It succeeded on my developer machine.

It worked for me, too. Let's see what a retest does.

SparkQA · 2018-08-03T07:05:02Z

Test build #94101 has finished for PR 21889 at commit 37e0a97.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-10T06:43:06Z

Test build #94536 has finished for PR 21889 at commit 51f0dc5.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

mallman · 2018-08-10T17:30:40Z

@mallman, while we wait for the go-no-go, do you have the changes for the next PR ready? Is there anything you need help with?

I have the hack I used originally, but I haven't tried finding a better solution yet. It could take some time to understand the underlying problem/incompatibility/misunderstanding/etc.

I spent some time yesterday digging deeper into why the hack I wrote worked, and I think I understand now. Practically speaking, my follow-on PR will be about the same as the commit I removed. However, I can support it with some explanatory comments instead of just "this throws an exception sometimes".

HyukjinKwon · 2018-08-13T08:17:25Z

sql/core/src/main/scala/org/apache/spark/sql/execution/ProjectionOverSchema.scala

Can we really avoid AttributeReference(name, _, _, _) pattern per https://github.com/databricks/scala-style-guide#pattern-matching?

HyukjinKwon · 2018-08-13T08:18:12Z

sql/core/src/main/scala/org/apache/spark/sql/execution/ProjectionOverSchema.scala

For instance, these , _, _, _), _, _, looks excessive.

HyukjinKwon · 2018-08-13T08:22:12Z

...src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala

nit:

private def buildPrunedDataSchema( fileDataSchema: StructType, requestedRootFields: Seq[RootField]) = {

per https://github.com/databricks/scala-style-guide#spacing-and-indentation

Will fix all this. Seems like this is an opportunity to catch some of this with static analysis to avoid wasting the reviewer's time. Is Scalastyle not configured to catch this stuff?

As far as I know, no. I use IntelliJ (integration with scalastyle) to catch basic style nits but rather then that, I do it manually. It took me a while to get used to it before FWIW.

HyukjinKwon · 2018-08-13T08:22:40Z

...src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala

[[..]] syntax is not needed in inlined comments.

HyukjinKwon · 2018-08-13T08:24:11Z

...src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala

indentation looks weird here per https://github.com/databricks/scala-style-guide#documentation-style

HyukjinKwon · 2018-08-13T08:26:14Z

sql/core/src/test/scala/org/apache/spark/sql/execution/SelectedFieldSuite.scala

Shall we have multiple relations with split schema for each? It wouldn't necessarily have one big deep nested schema which makes hard to read.

HyukjinKwon · 2018-08-13T08:27:10Z

...est/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala

nit: Array.empty[FullName]

This is a definition in a constructor, so I don't think we can do:

case class Contact( [...] friends: Array.empty[FullName], relatives: Map.empty[String, FullName] )

Scala wants a colon, so I opted for:

case class Contact( [...] friends: Array[FullName] = Array.empty, relatives: Map[String, FullName] = Map.empty )

HyukjinKwon · 2018-08-13T08:27:26Z

...est/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala

nit: Map.empty[String, FullName]

HyukjinKwon · 2018-08-13T08:29:10Z

...est/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala

nit: avoid infix notation per https://github.com/databricks/scala-style-guide#infix

HyukjinKwon · 2018-08-13T08:30:29Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetTest.scala

I would avoid unrelated refactoring and revert this back.

HyukjinKwon · 2018-08-13T08:31:29Z

...src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala

Are we really unable to make some private functions and split some logics at least?

HyukjinKwon · 2018-08-13T08:31:56Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SelectedField.scala

indentation

mallman · 2018-08-13T17:54:57Z

@ajacques I added a commit to enable schema pruning by default. It's a little more complete than your commit to do the same. Please rebase off my branch and remove your commit.

HyukjinKwon · 2018-08-14T06:27:15Z

ok to test

SparkQA · 2018-08-14T07:05:02Z

Test build #94731 has finished for PR 21889 at commit 51f0dc5.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-15T04:55:39Z

Test build #94785 has finished for PR 21889 at commit 1c0c4bf.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-08-15T06:59:50Z

retest this please

SparkQA · 2018-08-15T07:25:07Z

Test build #94790 has finished for PR 21889 at commit 1c0c4bf.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mallman · 2018-08-15T17:41:37Z

Due to the urgency of the upcoming 2.4 code freeze, I'm going to open this PR to collect any feedback. This can be closed if you prefer to continue to the work in the original PR.

That would be my preference, yes, especially if it means less work for you.

SparkQA · 2018-08-15T19:58:55Z

Test build #94805 has finished for PR 21889 at commit 8d822ee.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-16T07:05:02Z

Test build #4278 has finished for PR 21889 at commit 8d822ee.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-08-16T10:05:47Z

I've only taken it as a based to make stylistic changes based on the code review to help move things along.

This PR doesn't only include stylistic changes. Since stylistic changes do not usually block a PR, mind fixing the PR description?

mallman · 2018-08-16T19:17:05Z

Essentially, this PR was created to take the management of #21320 out of my hands, with a view towards facilitating its incorporation into Spark 2.4. It was my suggestion, one based in frustration. In hindsight, I no longer believe this strategy is the best—or most expedient—approach towards progress. Indeed, I believe the direction of this PR has become orthogonal to its motivating goal, becoming a dispute between myself and @HyukjinKwon rather than a means to move things along.

I believe I can shepherd #21320 in a way that will promote greater progress. @ajacques, I mean no disrespect, and I thank you for volunteering your time, patience and effort for the sake of all that are interested in seeing this patch become a part of Spark. And I apologize for letting you down, letting everyone down. In my conduct leading up to the creation of this PR I did not act with the greatest maturity or patience. And I did not act in the best interests of the community.

No one has spent more time or more effort, taken more responsibility or exhibited more patience with this 2+ year patch-set-in-the-making than myself. I respectfully submit it is mine to present and manage, and no one else's. Insofar as I have expressed otherwise in the past, I admit my error—one made in frustration—and recant in hindsight.

@ajacques, at this point I respectfully assert that managing the patch set I submitted in #21320 is not your responsibility, nor is it anyone else's but mine. I ask you to close this PR so that we can resume the review in #21320. As I stated there, you are welcome to open a PR on https://github.com/VideoAmp/spark-public/tree/spark-4502-parquet_column_pruning-foundation to submit the changes you've made for review.

Thank you.

ajacques · 2018-08-16T19:37:56Z

Thanks for the response all. @mailman If it's really your preference, I will create a PR against that branch and close this one. My intention was never to take away from your efforts, and I still consider my work here to be just minor stylistic tweaks on top of your work. I did this as service to help bridge the divide and hopefully alleviate frustrations. But this has been a bit frustrating being stuck between two sides of this PR/changing merge strategies often and don't wish to continue playing this role.

As such, I will create a PR, but hope it does not dragged out to settle any differences in opinions between maintainers and submitters. My goal is to make sure this valuable feature gets merged so many can benefit.

HyukjinKwon reviewed Aug 13, 2018

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/execution/SelectedField.scala Outdated

Copy link

Member

HyukjinKwon Aug 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation

HyukjinKwon mentioned this pull request Aug 13, 2018

[SPARK-4502][SQL] Parquet nested column pruning - foundation #21320

Closed

Clean-up code based on review comments

8d822ee

ajacques mentioned this pull request Aug 16, 2018

Stylistic changes from Spark maintainers VideoAmp/spark-public#9

Merged

ajacques closed this Aug 16, 2018

[SPARK-4502][SQL] Parquet nested column pruning - foundation (2nd attempt) #21889

[SPARK-4502][SQL] Parquet nested column pruning - foundation (2nd attempt) #21889

Uh oh!

Conversation

ajacques commented Jul 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Implementation

Limitation

How was this patch tested?

Known Issues

Uh oh!

HyukjinKwon commented Jul 27, 2018

Uh oh!

SparkQA commented Jul 27, 2018

Uh oh!

HyukjinKwon commented Jul 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mallman commented Jul 27, 2018

Uh oh!

HyukjinKwon commented Jul 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gatorsmile commented Jul 27, 2018

Uh oh!

HyukjinKwon commented Jul 27, 2018

Uh oh!

gatorsmile commented Jul 27, 2018

Uh oh!

HyukjinKwon commented Jul 27, 2018

Uh oh!

mallman commented Jul 31, 2018

Uh oh!

gatorsmile commented Jul 31, 2018

Uh oh!

gatorsmile commented Jul 31, 2018

Uh oh!

HyukjinKwon commented Jul 31, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ajacques commented Jul 31, 2018

Uh oh!

gatorsmile commented Jul 31, 2018

Uh oh!

mallman commented Aug 1, 2018

Uh oh!

ajacques commented Aug 1, 2018

Uh oh!

SparkQA commented Aug 2, 2018

Uh oh!

SparkQA commented Aug 2, 2018

Uh oh!

mallman commented Aug 2, 2018

Uh oh!

SparkQA commented Aug 2, 2018

Uh oh!

ajacques commented Aug 2, 2018

Uh oh!

mallman commented Aug 2, 2018

Uh oh!

mallman commented Aug 2, 2018

Uh oh!

SparkQA commented Aug 2, 2018

Uh oh!

SparkQA commented Aug 3, 2018

Uh oh!

ajacques commented Aug 3, 2018

Uh oh!

HyukjinKwon commented Aug 3, 2018

Uh oh!

mallman commented Aug 3, 2018

Uh oh!

SparkQA commented Aug 3, 2018

Uh oh!

SparkQA commented Aug 10, 2018

Uh oh!

mallman commented Aug 10, 2018

Uh oh!

Choose a reason for hiding this comment

ajacques commented Jul 27, 2018 •

edited

Loading

HyukjinKwon commented Jul 27, 2018 •

edited

Loading

HyukjinKwon commented Jul 27, 2018 •

edited

Loading

HyukjinKwon commented Jul 31, 2018 •

edited

Loading