Conversation

@viirya (Member) commented Nov 24, 2015

JIRA: https://issues.apache.org/jira/browse/SPARK-11955

Currently we simply skip pushing down filters to Parquet if schema merging is enabled.

However, we can actually mark particular fields in the merged schema so that filters can still be safely pushed down to Parquet.
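For illustration, a minimal way to reproduce the situation in a Spark shell (the path /tmp/t and the data are made up; this assumes the 1.x sqlContext API):

```scala
// Two writes with different schemas into the same directory; the second
// write appends part-files that lack column `b`.
sqlContext.range(3).selectExpr("id AS a", "CAST(id AS STRING) AS b")
  .write.parquet("/tmp/t")
sqlContext.range(3).selectExpr("id AS a")
  .write.mode("append").parquet("/tmp/t")

val df = sqlContext.read.option("mergeSchema", "true").parquet("/tmp/t")
// `a` exists in every part-file, so a filter on it is safe to push down;
// `b` exists in only some part-files, which is the PARQUET-389 hazard.
df.filter("a > 1").show()
```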

@SparkQA commented Nov 24, 2015

Test build #46611 has finished for PR 9940 at commit e24529d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Nov 24, 2015

retest this please.

@SparkQA commented Nov 25, 2015

Test build #46626 has finished for PR 9940 at commit e24529d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 25, 2015

Test build #46656 has finished for PR 9940 at commit ff4ef4c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Nov 28, 2015

ping @liancheng @yhuai

@liancheng (Contributor)

I don't really get the idea of this one. What exactly does "oneSide" mean?

Contributor

The indentation of the map block is off; please fix it. Thanks.

@viirya (Member, Author) commented Nov 29, 2015

@liancheng The naming might be a little confusing. It marks the fields that exist in only one of the schemas being merged. For example, when we merge the two schemas {a: Int, b: String, c: Double} and {a: Int, c: Double}, the field b: String exists in only one of them.

In other words, this patch marks the differences between the schemas being merged.
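A minimal sketch of the marking step, assuming a hypothetical helper markMissing and the metadata key "optional" (the patch's actual names may differ):

```scala
import org.apache.spark.sql.types._

// The two schemas being merged; `b` exists in only one of them.
val left = StructType(Seq(
  StructField("a", IntegerType),
  StructField("b", StringType),
  StructField("c", DoubleType)))
val right = StructType(Seq(
  StructField("a", IntegerType),
  StructField("c", DoubleType)))

// Tag every field missing from the other schema, so filter push-down can
// later recognize predicates on it as unsafe.
def markMissing(schema: StructType, other: StructType): StructType =
  StructType(schema.map { f =>
    if (other.fieldNames.contains(f.name)) f
    else f.copy(metadata = new MetadataBuilder()
      .withMetadata(f.metadata)
      .putBoolean("optional", true)
      .build())
  })

val marked = markMissing(left, right)  // `b` now carries optional = true
```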

@liancheng (Contributor)

Oh I see, so basically it indicates that a field doesn't exist in the schemas of all part-files, and we skip filters that involve such fields when doing filter push-down. Right?

@viirya (Member, Author) commented Nov 29, 2015

Yes, you are right.
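Concretely, the push-down side would then skip any predicate that touches a marked field; a hedged sketch using the public Filter API (function names are illustrative, not the patch's actual code):

```scala
import org.apache.spark.sql.sources._

// Columns referenced by a data source filter; None means an unhandled node
// type, which must conservatively be treated as unsafe to push down.
def referencedColumns(filter: Filter): Option[Seq[String]] = filter match {
  case EqualTo(attr, _)     => Some(Seq(attr))
  case GreaterThan(attr, _) => Some(Seq(attr))
  case LessThan(attr, _)    => Some(Seq(attr))
  case And(l, r)  => for (a <- referencedColumns(l); b <- referencedColumns(r)) yield a ++ b
  case Or(l, r)   => for (a <- referencedColumns(l); b <- referencedColumns(r)) yield a ++ b
  case Not(child) => referencedColumns(child)
  case _          => None
}

// Push a filter down only if every column it references is present in all
// part-files, i.e. none of them was marked optional.
def safeToPushDown(filter: Filter, optionalFields: Set[String]): Boolean =
  referencedColumns(filter).exists(_.forall(c => !optionalFields.contains(c)))
```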

@SparkQA commented Nov 29, 2015

Test build #46848 has finished for PR 9940 at commit 4536b72.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai (Contributor) commented Nov 29, 2015

Is it a regression?

@viirya (Member, Author) commented Nov 30, 2015

Hi @yhuai, can you explain a bit more? What did you mean by regression?

@yhuai (Contributor) commented Nov 30, 2015

Oh, I thought it was a bug fix (so I was wondering whether it's a regression from 1.5 or not). But it's actually an improvement?

@viirya (Member, Author) commented Nov 30, 2015

@yhuai I think so. We currently skip filter pushdown entirely when schema merging is enabled; this patch improves on that.

@liancheng (Contributor)
@yhuai Here is the background: Parquet filter push-down is disabled when schema merging is turned on because of PARQUET-389.

I'm somewhat hesitant to have this change. On one hand, I do want to have filter push-down in case of schema merging since it's generally a very useful optimization. On the other hand, this change is a little bit hacky, since schema metadata is not intended to be used in this way.

Anyway, let's at least make two more updates to this PR:

  1. Rename oneSide to something more intuitive
  2. Schema metadata is persisted together with the schema, so this temporary metadata entry should be cleared before saving the Parquet file (see the sketch below).
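For the second point, a minimal sketch of stripping the temporary key before the schema is persisted (again assuming the key "optional" from above; only top-level fields are handled here):

```scala
import org.apache.spark.sql.types._

// Drop the temporary marker from every top-level field so it is not
// written into the Parquet footer alongside user-defined metadata.
def clearOptionalMarker(schema: StructType): StructType =
  StructType(schema.map { f =>
    f.copy(metadata = new MetadataBuilder()
      .withMetadata(f.metadata)
      .remove("optional")
      .build())
  })
```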

@viirya viirya changed the title [SPARK-11955][SQL] Mark one side fields in merging schema for safely pushdowning filters in Parquet [SPARK-11955][SQL] Mark particular fields in merging schema for safely pushdowning filters in Parquet Nov 30, 2015
@viirya (Member, Author) commented Nov 30, 2015

Renamed oneSide to optional.

I will address the second point later.

@SparkQA commented Nov 30, 2015

Test build #46881 has finished for PR 9940 at commit 2a4e471.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya changed the title [SPARK-11955][SQL] Mark particular fields in merging schema for safely pushdowning filters in Parquet [SPARK-11955][SQL] Mark optional fields in merging schema for safely pushdowning filters in Parquet Dec 5, 2015
@SparkQA commented Dec 5, 2015

Test build #47229 has finished for PR 9940 at commit db8ffa3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 6, 2015

Test build #47241 has finished for PR 9940 at commit 40533a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • public abstract static class PrefixComputer

@liancheng (Contributor)
@viirya Thanks for your efforts! Would you mind if I revisit this after the 1.6 release? I would like to see whether we can get PARQUET-389 fixed in the Parquet community ASAP, so that we may not need to work around it in Spark 1.7/2.0.

@viirya (Member, Author) commented Dec 8, 2015

@liancheng Sure. Thank you!

@viirya (Member, Author) commented Jan 19, 2016

@liancheng Can we revisit this now, or do we want to wait a bit longer?

Contributor

I guess we resort to equalsIgnoreCompatibleNullability because of the extra metadata in the merged schema? Can we also add an assertion for the added metadata instead of working around it with equalsIgnoreCompatibleNullability?

@liancheng (Contributor)
Overall LGTM except for several minor issues.

Another thing: we probably want to use a more distinctive name (something like _OPTIONAL_) to avoid naming conflicts with user-defined metadata keys.

@SparkQA commented Jan 27, 2016

Test build #50190 has finished for PR 9940 at commit 1a11770.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented Jan 28, 2016

ping @liancheng Please see whether the latest updates look good to you. Thanks.

@liancheng (Contributor)
Thanks! I'm going to merge this to master.

@asfgit asfgit closed this in 4637fc0 Jan 29, 2016
Contributor

If we add clear() to MetadataBuilder, this can be lifted above the fields.map; inside the map operation we would just clear the MetadataBuilder.

What do you think?
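To make the suggestion concrete, here is a toy stand-in (Spark's real MetadataBuilder has no clear(); this class is purely illustrative):

```scala
import scala.collection.mutable

// Toy builder demonstrating the reuse pattern: create one instance above
// the loop and reset it per element instead of allocating a new builder.
final class ToyMetadataBuilder {
  private val entries = mutable.Map.empty[String, Any]
  def put(key: String, value: Any): this.type = { entries += key -> value; this }
  def remove(key: String): this.type = { entries -= key; this }
  def clear(): this.type = { entries.clear(); this }
  def build(): Map[String, Any] = entries.toMap
}

val builder = new ToyMetadataBuilder          // lifted above the map
val cleaned = Seq("a", "b", "c").map { name =>
  builder.clear().put("marked", true)         // reset, then rebuild per field
  name -> builder.build()
}
```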

Contributor

This PR is mostly a workaround for a parquet-mr bug (PARQUET-389), and I'd assume it will be fixed in the near future; then we can remove this workaround. So it doesn't seem worth modifying MetadataBuilder, which is part of the public API.

Member Author

Agreed.

Contributor

This workaround may be taken out in the future.

However, use of MetadataBuilder occurs in many other places:
http://pastebin.com/nVjNfrgp

I feel adding clear() to MetadataBuilder would help in current and future use cases.

Contributor

Unfortunately, unless we have a timeline for actually fixing the Parquet bug, I don't think we can expect the workaround to be removed in the near future. It's been almost half a year since the original patch went in, and it is still necessary.

@rxin (Contributor) commented Jul 6, 2016

Hey guys -- this patch is severely underdocumented. It also isn't great to introduce something in the metadata to tag a column as optional.

@rxin (Contributor) commented Jul 6, 2016

Also, do you actually see a real use case for this issue, or are you just thinking it might be useful as an optimization?

protected[sql] def fromAttributes(attributes: Seq[Attribute]): StructType =
  StructType(attributes.map(a => StructField(a.name, a.dataType, a.nullable, a.metadata)))

def removeMetadata(key: String, dt: DataType): DataType =
Contributor

Why are we adding a public API for a patch fix?

@rxin (Contributor) commented Jul 6, 2016

FYI #14074

I'm actually thinking we should perhaps revert this patch, but I don't want to do it when we are so close to the RC.

@viirya (Member, Author) commented Jul 6, 2016

This was introduced for a real use case, and it is very useful. I actually developed this patch to fix a problem at my previous company, a heavy Spark user for machine learning and ETL. In practice, schema merging is common because schemas expand over time, and under such use cases users lose the benefits of predicate pushdown entirely. Instead of reverting this, I would suggest we fix the issues (documenting it and hiding the API, as you just did). Without this patch, these workloads surely see a performance regression.

@rxin (Contributor) commented Jul 6, 2016

OK thanks.
