[SPARK-11955][SQL] Mark optional fields in merging schema for safely pushdowning filters in Parquet #9940
Conversation
Test build #46611 has finished for PR 9940 at commit
retest this please.
Test build #46626 has finished for PR 9940 at commit
Test build #46656 has finished for PR 9940 at commit
ping @liancheng @yhuai
I don't really get the idea of this one. What exactly does "oneSide" mean?
Indentation of the map block is off; please help fix it. Thanks.
@liancheng The naming might be a little confusing. It means that the field only exists in one of the schemas being merged. For example, when we merge the two schemas {a: Int, b: String, c: Double} and {a: Int, c: Double}, the field b: String exists in only one of them. In other words, this patch tries to mark the differences between the merged schemas.
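The merging-and-tagging idea described above can be sketched with a minimal model of schemas in plain Scala. This is not Spark's actual StructType/Metadata API; the `Field`/`Schema` classes and the "optional" metadata key are illustrative stand-ins:

```scala
// Minimal stand-ins for Spark's StructField/StructType (illustrative only).
case class Field(name: String, dataType: String, metadata: Map[String, Boolean] = Map.empty)
case class Schema(fields: Seq[Field])

// Merge two schemas, tagging every field that appears in only one of them,
// so that filters on such fields can later be skipped during pushdown.
def mergeAndTag(a: Schema, b: Schema): Schema = {
  val namesA = a.fields.map(_.name).toSet
  val namesB = b.fields.map(_.name).toSet
  val shared = namesA intersect namesB
  val merged = a.fields ++ b.fields.filterNot(f => namesA(f.name))
  Schema(merged.map { f =>
    if (shared(f.name)) f
    else f.copy(metadata = f.metadata + ("optional" -> true)) // only in one schema
  })
}

val s1 = Schema(Seq(Field("a", "Int"), Field("b", "String"), Field("c", "Double")))
val s2 = Schema(Seq(Field("a", "Int"), Field("c", "Double")))
val merged = mergeAndTag(s1, s2)
// Field b carries the "optional" tag; a and c, present in both schemas, do not.
```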
Oh I see. So basically it indicates that a field doesn't exist in the schemas of all part-files, and we try to skip filters that involve this kind of field when doing filter push-down. Right?
Yes, you are right.
Test build #46848 has finished for PR 9940 at commit
Is it a regression?
Hi @yhuai, can you explain more? What did you mean by a regression?
Oh, I thought it was a bug fix (so I was wondering whether it's a regression from 1.5 or not). But it is actually an improvement?
@yhuai I think so. We currently skip pushdown filters completely if schema merging is enabled. This patch improves that.
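The improvement can be sketched as a filtering step before pushdown: drop any predicate that references a field tagged as missing from some part-files. The predicate classes below are illustrative, not Spark's actual sources.Filter hierarchy:

```scala
// Minimal predicate model (illustrative; not Spark's actual Filter classes).
sealed trait Predicate { def references: Set[String] }
case class Eq(col: String, value: Any) extends Predicate { def references = Set(col) }
case class Gt(col: String, value: Any) extends Predicate { def references = Set(col) }

// A predicate is safe to push down only if it touches no field that is
// missing from some part-files (the fields tagged during schema merging).
def safeToPushDown(p: Predicate, optionalFields: Set[String]): Boolean =
  (p.references intersect optionalFields).isEmpty

val optionalFields = Set("b") // "b" is present in only some part-files
val predicates = Seq(Eq("a", 1), Gt("b", "x"), Eq("c", 2.0))
val pushed = predicates.filter(safeToPushDown(_, optionalFields))
// Only the predicates on a and c are pushed down; the one on b is left for Spark to evaluate.
```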
@yhuai Here is the background: Parquet filter push-down is disabled when schema merging is turned on because of PARQUET-389. I'm somewhat hesitant to have this change. On one hand, I do want to have filter push-down in case of schema merging since it's generally a very useful optimization. On the other hand, this change is a little bit hacky, since schema metadata is not intended to be used in this way. Anyway, at least, let's have two more updates for this PR:
Renamed. I will update this for the second point later.
Test build #46881 has finished for PR 9940 at commit
Test build #47229 has finished for PR 9940 at commit
Test build #47241 has finished for PR 9940 at commit
@viirya Thanks for your efforts! Would you mind me revisiting this after the 1.6 release? I would like to see whether we can have PARQUET-389 fixed in the Parquet community ASAP, so that we may not need to work around it in Spark 1.7/2.0.
@liancheng Sure. Thank you!
@liancheng Can we revisit this now? Or do we want to wait a bit longer?
I guess we resort to equalsIgnoreCompatibleNullability because of the extra metadata in the merged schema? Can we also add an assertion for the added metadata instead of working around it with equalsIgnoreCompatibleNullability?
Overall LGTM except for several minor issues. Another thing is that we probably want to use a more special name (something like
Test build #50190 has finished for PR 9940 at commit
ping @liancheng Please see whether the latest updates look good to you. Thanks.
Thanks! I'm going to merge this to master.
If we add clear() to MetadataBuilder, this can be lifted above the fields.map; inside the map operation we would just clear the MetadataBuilder. What do you think?
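For illustration, a builder with a clear() method along the lines suggested here might look like the sketch below. Note this is a hypothetical stand-in: Spark's actual MetadataBuilder has no clear() method, which is exactly the change being debated:

```scala
import scala.collection.mutable

// Hypothetical builder with clear(), as suggested in this thread.
// Not Spark's actual MetadataBuilder; shown only to illustrate the reuse pattern.
class SimpleMetadataBuilder {
  private val entries = mutable.Map.empty[String, Any]
  def putBoolean(key: String, value: Boolean): this.type = { entries += key -> value; this }
  def clear(): this.type = { entries.clear(); this }   // reset so one builder serves many fields
  def build(): Map[String, Any] = entries.toMap
}

val builder = new SimpleMetadataBuilder
val first = builder.putBoolean("optional", true).build()
val second = builder.clear().putBoolean("optional", false).build() // same builder, reused
```

With clear(), a single builder instance could be allocated once outside a fields.map and reset per field, instead of constructing a new builder inside the loop.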
This PR is mostly a workaround for a parquet-mr bug (PARQUET-389), and I'd assume it will be fixed in the near future; then we can remove this workaround. So it doesn't seem worth modifying MetadataBuilder, which is part of the public API.
Agreed.
This workaround may be taken out in the future. However, MetadataBuilder is used in many other places: http://pastebin.com/nVjNfrgp. I feel adding clear() to MetadataBuilder would help in current and future use cases.
Unfortunately, unless we have a timeline to actually fix the Parquet bug, I don't think we can expect the workaround to be removed in the near future. It's been almost half a year since the original patch went in, and the patch is still necessary.
Hey guys, this patch is severely under-documented. It also isn't great to introduce something in the metadata to tag a column as optional.
Also, do you actually see a real use case for this issue, or are you just thinking it might be useful to optimize this?
```scala
protected[sql] def fromAttributes(attributes: Seq[Attribute]): StructType =
  StructType(attributes.map(a => StructField(a.name, a.dataType, a.nullable, a.metadata)))

def removeMetadata(key: String, dt: DataType): DataType =
```
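A helper like removeMetadata can be sketched as a recursion over the type tree. The model below is simplified and illustrative, not Spark's actual DataType hierarchy or the real implementation under review:

```scala
// Simplified data-type model (illustrative; not Spark's actual DataType hierarchy).
sealed trait DType
case object IntT extends DType
case class StructT(fields: Seq[(String, DType, Map[String, Boolean])]) extends DType

// Recursively strip one metadata key from every field, including nested structs.
def removeMetadata(key: String, dt: DType): DType = dt match {
  case StructT(fields) =>
    StructT(fields.map { case (name, t, md) => (name, removeMetadata(key, t), md - key) })
  case other => other // leaf types carry no field metadata in this model
}

val nested = StructT(Seq(
  ("inner", StructT(Seq(("x", IntT, Map("optional" -> true)))), Map.empty),
  ("y", IntT, Map("optional" -> true))
))
val cleaned = removeMetadata("optional", nested)
// The "optional" tag is gone from both the top-level field y and the nested field x.
```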
why are we adding a public API for a patch fix?
FYI #14074. I'm actually thinking we should perhaps revert this patch, but I don't want to do it when we are so close to the RC.
This was introduced for a real use case and it is very useful. I actually developed this patch to fix a problem at my previous company, which was a heavy Spark user for machine learning and ETL. In practice, schema merging is common because of schema expansion, and under such use cases users can't get the advantages of predicate pushdown. Instead of reverting this, I would suggest we fix the issues (documenting it and hiding the API, as you just did). Without this patch, such cases certainly suffer a performance regression.
OK thanks.
JIRA: https://issues.apache.org/jira/browse/SPARK-11955
Currently we simply skip pushing down filters to Parquet if schema merging is enabled.
However, we can mark particular fields in the merged schema so that filters on the remaining fields can still be safely pushed down to Parquet.