-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-16371][SQL] Do not push down filters incorrectly when inner name and outer name are the same in Parquet #14067
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi, @rxin @liancheng, I hope this is not missed to 2.0.. |
|
cc @viirya as well. |
|
LGTM pending Jenkins. 2.0.0 RC2 has already been cut. We may have this in 2.0.0 if there was another RC. |
|
Oh, @liancheng I just corrected some more. Please take another look.. (sorry) |
| !f.metadata.contains(StructType.metadataKeyForOptionalField) || | ||
| !f.metadata.getBoolean(StructType.metadataKeyForOptionalField) | ||
| }.map(f => f.name -> f.dataType) ++ fields.flatMap { f => getFieldMap(f.dataType) } | ||
| }.map(f => f.name -> f.dataType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add some comment here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure!
|
The description seems incorrect. It should be a StringType, instead of IntegerType? |
|
yes it is not, I just found. I will correct them all. Thank you! |
|
Another question is when the inner field is not the same name, we still can push down the filter, right? |
|
If so, then this patch seems completely skip all such push down. |
|
Test build #61841 has finished for PR 14067 at commit
|
|
Test build #61840 has finished for PR 14067 at commit
|
|
Test build #61842 has finished for PR 14067 at commit
|
|
Test build #61843 has finished for PR 14067 at commit
|
|
Yea, currently Spark SQL doesn't support column pruning and/or filter push-down for nested fields. |
|
OK. LGTM then. |
| } | ||
| } | ||
|
|
||
| test("Do not push down filters incorrectly when inner name and outer name are the same") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the jira number here
|
I'm going to merge this and fix some comments myself with another pr. Merging in master/2.0. |
…me and outer name are the same in Parquet
## What changes were proposed in this pull request?
Currently, if there is a schema as below:
```
root
|-- _1: struct (nullable = true)
| |-- _1: integer (nullable = true)
```
and if we execute the codes below:
```scala
df.filter("_1 IS NOT NULL").count()
```
This pushes down a filter although this filter is being applied to `StructType`.(If my understanding is correct, Spark does not pushes down filters for those).
The reason is, `ParquetFilters.getFieldMap` produces results below:
```
(_1,StructType(StructField(_1,IntegerType,true)))
(_1,IntegerType)
```
and then it becomes a `Map`
```
(_1,IntegerType)
```
Now, because of ` ....lift(dataTypeOf(name)).map(_(name, value))`, this pushes down filters for `_1` which Parquet thinks is `IntegerType`. However, it is actually `StructType`.
So, Parquet filter2 produces incorrect results, for example, the codes below:
```
df.filter("_1 IS NOT NULL").count()
```
produces always 0.
This PR prevents this by not finding nested fields.
## How was this patch tested?
Unit test in `ParquetFilterSuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #14067 from HyukjinKwon/SPARK-16371.
(cherry picked from commit 4f8ceed)
Signed-off-by: Reynold Xin <rxin@databricks.com>

What changes were proposed in this pull request?
Currently, if there is a schema as below:
and if we execute the codes below:
df.filter("_1 IS NOT NULL").count()This pushes down a filter although this filter is being applied to
StructType.(If my understanding is correct, Spark does not pushes down filters for those).The reason is,
ParquetFilters.getFieldMapproduces results below:and then it becomes a
MapNow, because of
....lift(dataTypeOf(name)).map(_(name, value)), this pushes down filters for_1which Parquet thinks isIntegerType. However, it is actuallyStructType.So, Parquet filter2 produces incorrect results, for example, the codes below:
produces always 0.
This PR prevents this by not finding nested fields.
How was this patch tested?
Unit test in
ParquetFilterSuite.