@rdblue
Contributor

rdblue commented Jul 13, 2016

Predicate push-down will complain when predicates reference columns that aren't in a file's schema. This makes it difficult to implement predicate push-down in engines where schemas evolve because each task needs to process the predicates and prune references to columns not in that task's file. This PR implements predicate evaluation for missing columns, where the values are all null. This allows engines to pass predicates as they are written.

A future commit should rewrite the predicates to avoid the extra work currently done in record-level filtering, but that isn't included here because it is an optimization.
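
To make the null semantics described above concrete, here is a minimal sketch, not the code in this PR. It assumes that a column missing from a file behaves as if every value were null, and that `eq(col, null)` means "is null" while `notEq(col, null)` means "is not null". The class, enum, and method names (`MissingColumnSketch`, `Op`, `matchesWhenColumnMissing`, `canDropRowGroup`) are hypothetical and exist only for illustration.

```java
// Hypothetical sketch: these names are illustrative and are not parquet-mr's API.
final class MissingColumnSketch {

    /** Comparison operators a pushed-down predicate might use. */
    enum Op { EQ, NOT_EQ, LT, LT_EQ, GT, GT_EQ }

    /**
     * Record-level view: evaluates "column op literal" when the column is
     * absent from the file, so the column value is null for every record.
     */
    static boolean matchesWhenColumnMissing(Op op, Comparable<?> literal) {
        switch (op) {
            case EQ:
                // null == null matches; null == <non-null> never matches.
                return literal == null;
            case NOT_EQ:
                // null != <non-null> matches; null != null never matches.
                return literal != null;
            default:
                // Ordering comparisons never match a null value.
                return false;
        }
    }

    /**
     * Row-group view: every value is the same (null), so the row group can be
     * skipped exactly when no record could match the predicate.
     */
    static boolean canDropRowGroup(Op op, Comparable<?> literal) {
        return !matchesWhenColumnMissing(op, literal);
    }

    public static void main(String[] args) {
        System.out.println(canDropRowGroup(Op.EQ, 5));        // true
        System.out.println(canDropRowGroup(Op.EQ, null));     // false
        System.out.println(canDropRowGroup(Op.NOT_EQ, 5));    // false
        System.out.println(canDropRowGroup(Op.NOT_EQ, null)); // true
        System.out.println(canDropRowGroup(Op.LT, 5));        // true
    }
}
```

With rules like these, an engine can pass the same predicate to every task regardless of which columns a given file contains, instead of pruning column references per file.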

@rdblue
Contributor Author

rdblue commented Jul 13, 2016

@danielcweeks, @liancheng, this is to avoid the predicate push-down problems in Spark. Can you review? Thanks!

@liancheng
Contributor

+1

@danielcweeks

+1 as well. Thanks!

asfgit closed this in 42662f8 Jul 15, 2016

@liancheng
Contributor

Is it possible to have a 1.8.2 release that includes this fix? I just checked, and there doesn't seem to be a dedicated branch for 1.8.x.

@viirya
Member

viirya commented Dec 6, 2016

I'm curious: why doesn't this patch do the same thing for the visit method for UserDefinedPredicate in StatisticsFilter?

Is it intentional?

@viirya
Member

viirya commented Dec 6, 2016

@liancheng @rdblue @danielcweeks I submitted #389 to extend this fix to UserDefinedPredicate. I'm not sure whether that is appropriate; please let me know if it is not. Thank you.

asfgit pushed a commit that referenced this pull request Dec 8, 2016
This extends the fix in #354 to UserDefinedPredicate.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #389 from viirya/PARQUET-791 and squashes the following commits:

d6be37d [Liang-Chi Hsieh] Address comment.
7e929c3 [Liang-Chi Hsieh] PARQUET-791: Add missing column support for UserDefinedPredicate.
rdblue added a commit to rdblue/parquet-mr that referenced this pull request Jan 6, 2017
Author: Ryan Blue <blue@apache.org>

Closes apache#354 from rdblue/PARQUET-389-predicate-push-down-on-missing-columns and squashes the following commits:

b4d809a [Ryan Blue] PARQUET-389: Support record-level filtering with missing columns.
91b841c [Ryan Blue] PARQUET-389: Add missing column support to StatisticsFilter.
275f950 [Ryan Blue] PARQUET-389: Add missing column support to DictionaryFilter.