PARQUET-389: Support predicate push down on missing columns. #354
@danielcweeks, @liancheng, this is to avoid the predicate push-down problems in Spark. Can you review? Thanks!

+1

+1 as well. Thanks!

Is it possible to have a 1.8.2 release that includes this fix? I just checked and it seems that there isn't a dedicated branch for 1.8.x?

I am curious, why this patch doesn't do the same thing to the Is it intentional?

@liancheng @rdblue @danielcweeks I submitted #389 to extend this kind of fixing to
This extends the fix in #354 to UserDefinedPredicate.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #389 from viirya/PARQUET-791 and squashes the following commits:

d6be37d [Liang-Chi Hsieh] Address comment.
7e929c3 [Liang-Chi Hsieh] PARQUET-791: Add missing column support for UserDefinedPredicate.
Predicate push-down will complain when predicates reference columns that aren't in a file's schema. This makes it difficult to implement predicate push-down in engines where schemas evolve because each task needs to process the predicates and prune references to columns not in that task's file.

This PR implements predicate evaluation for missing columns, where the values are all null. This allows engines to pass predicates as they are written.

A future commit should rewrite the predicates to avoid the extra work currently done in record-level filtering, but that isn't included here because it is an optimization.

Author: Ryan Blue <blue@apache.org>

Closes apache#354 from rdblue/PARQUET-389-predicate-push-down-on-missing-columns and squashes the following commits:

b4d809a [Ryan Blue] PARQUET-389: Support record-level filtering with missing columns.
91b841c [Ryan Blue] PARQUET-389: Add missing column support to StatisticsFilter.
275f950 [Ryan Blue] PARQUET-389: Add missing column support to DictionaryFilter.
Predicate push-down will complain when predicates reference columns that aren't in a file's schema. This makes it difficult to implement predicate push-down in engines where schemas evolve because each task needs to process the predicates and prune references to columns not in that task's file. This PR implements predicate evaluation for missing columns, where the values are all null. This allows engines to pass predicates as they are written.
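To make the "values are all null" rule concrete, here is a minimal sketch of how a row-group-level filter might treat a predicate on a column that the file does not contain. It is not the code from this PR: `MissingColumnSketch`, `Op`, `matchesNull`, and `canDropRowGroup` are hypothetical names, and the sketch assumes the usual two-valued filter semantics in which `eq(col, null)` and `notEq(col, v)` match null values while ordering comparisons never do.

```java
import java.util.Set;

// Hypothetical sketch, not Parquet's classes: decides whether a row group can be
// skipped when a single-column predicate references a column the file may lack.
final class MissingColumnSketch {
    enum Op { EQ, NOT_EQ, LT, LT_EQ, GT, GT_EQ }

    // Result of evaluating "null <op> literal" for one record.
    // Assumed semantics: eq matches only a null literal, notEq matches only a
    // non-null literal, and ordering comparisons never match a null value.
    static boolean matchesNull(Op op, Object literal) {
        switch (op) {
            case EQ:     return literal == null;
            case NOT_EQ: return literal != null;
            default:     return false;
        }
    }

    // Returns true when no record in the row group can satisfy the predicate.
    static boolean canDropRowGroup(Set<String> fileColumns, String column,
                                   Op op, Object literal) {
        if (fileColumns.contains(column)) {
            // A real filter would consult statistics or dictionaries here;
            // this sketch conservatively keeps the row group.
            return false;
        }
        // Missing column: every value is null, so the row group can be
        // dropped exactly when a null value cannot match the predicate.
        return !matchesNull(op, literal);
    }
}
```

With a file whose columns are `id` and `name`, a predicate such as `added_later = 5` lets the whole row group be skipped, while `added_later != 5` keeps it, so the engine can pass either predicate unchanged instead of pruning it per task.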
A future commit should rewrite the predicates to avoid the extra work currently done in record-level filtering, but that isn't included here because it is an optimization.
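One way such a rewrite could look is sketched below, purely as an illustration: each leaf that references a missing column is replaced by a constant, and the constants are folded through `and`, `or`, and `not`, so record-level filtering never has to evaluate them. The predicate tree here (`Pred`, `Leaf`, `And`, `Or`, `Not`, `Const`) is a hypothetical stand-in, not Parquet's `FilterPredicate` hierarchy, and the null-matching rules are the same assumed semantics as for row-group filtering.

```java
import java.util.Set;

// Hypothetical sketch of the proposed optimization; none of these types are Parquet's.
final class PredicateRewriteSketch {
    interface Pred {}
    enum Const implements Pred { TRUE, FALSE }
    enum Op { EQ, NOT_EQ, LT, LT_EQ, GT, GT_EQ }

    static final class Leaf implements Pred {
        final String column; final Op op; final Object literal;
        Leaf(String column, Op op, Object literal) {
            this.column = column; this.op = op; this.literal = literal;
        }
    }
    static final class And implements Pred {
        final Pred left, right;
        And(Pred left, Pred right) { this.left = left; this.right = right; }
    }
    static final class Or implements Pred {
        final Pred left, right;
        Or(Pred left, Pred right) { this.left = left; this.right = right; }
    }
    static final class Not implements Pred {
        final Pred child;
        Not(Pred child) { this.child = child; }
    }

    // Replaces leaves on missing columns with constants and folds the constants upward.
    static Pred rewrite(Pred p, Set<String> fileColumns) {
        if (p instanceof Leaf) {
            Leaf leaf = (Leaf) p;
            if (fileColumns.contains(leaf.column)) return leaf;
            // Missing column: the value is always null (assumed semantics).
            boolean matches = leaf.op == Op.EQ ? leaf.literal == null
                    : leaf.op == Op.NOT_EQ && leaf.literal != null;
            return matches ? Const.TRUE : Const.FALSE;
        }
        if (p instanceof And) {
            Pred l = rewrite(((And) p).left, fileColumns);
            Pred r = rewrite(((And) p).right, fileColumns);
            if (l == Const.FALSE || r == Const.FALSE) return Const.FALSE;
            if (l == Const.TRUE) return r;
            if (r == Const.TRUE) return l;
            return new And(l, r);
        }
        if (p instanceof Or) {
            Pred l = rewrite(((Or) p).left, fileColumns);
            Pred r = rewrite(((Or) p).right, fileColumns);
            if (l == Const.TRUE || r == Const.TRUE) return Const.TRUE;
            if (l == Const.FALSE) return r;
            if (r == Const.FALSE) return l;
            return new Or(l, r);
        }
        if (p instanceof Not) {
            Pred c = rewrite(((Not) p).child, fileColumns);
            if (c == Const.TRUE) return Const.FALSE;
            if (c == Const.FALSE) return Const.TRUE;
            return new Not(c);
        }
        return p; // already a constant
    }
}
```

If the rewritten predicate collapses to `Const.FALSE`, a reader could skip the file's records entirely; if it collapses to `Const.TRUE`, no record-level filtering is needed; otherwise only the leaves on columns that actually exist remain to be evaluated.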