
Conversation

@HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Jul 6, 2016

What changes were proposed in this pull request?

Currently, if there is a schema as below:

```
root
 |-- _1: struct (nullable = true)
 |    |-- _1: integer (nullable = true)
```

and if we execute the code below:

```scala
df.filter("_1 IS NOT NULL").count()
```

this pushes down a filter even though the filter is applied to a `StructType` (if my understanding is correct, Spark does not push down filters for those).

The reason is that `ParquetFilters.getFieldMap` produces the pairs below:

```
(_1,StructType(StructField(_1,IntegerType,true)))
(_1,IntegerType)
```

which then collapse into a `Map`:

```
(_1,IntegerType)
```

Now, because of `...lift(dataTypeOf(name)).map(_(name, value))`, a filter is pushed down for `_1`, which Parquet then treats as `IntegerType`. However, it is actually `StructType`.

So, Parquet's filter2 API produces incorrect results; for example, the code below:

```scala
df.filter("_1 IS NOT NULL").count()
```

always produces 0.

This PR prevents this by excluding nested fields when building the field map.
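The name shadowing described above can be modeled in a few lines of plain Python. This is a toy sketch with made-up names (`get_field_pairs`, nested lists standing in for structs), not Spark's actual `ParquetFilters` code:

```python
# Toy model of the pre-patch getFieldMap behavior: flatten a schema into
# (name, type) pairs, recursing into structs. A struct is modeled here as
# a nested list of (name, type) pairs.

def get_field_pairs(schema):
    pairs = []
    for name, dtype in schema:
        pairs.append((name, dtype))
        if isinstance(dtype, list):  # recurse into struct fields (the bug)
            pairs.extend(get_field_pairs(dtype))
    return pairs

# root |-- _1: struct |-- _1: integer
schema = [("_1", [("_1", "integer")])]

pairs = get_field_pairs(schema)
# Both the outer struct and the inner integer are named "_1", so when the
# pairs are folded into a dict, the inner IntegerType entry shadows the
# outer StructType entry...
field_map = dict(pairs)
# ...and a pushdown for "_1" would be built against the wrong type.
print(field_map["_1"])
```

With the duplicate keys, the later `("_1", "integer")` pair wins, which is exactly the shadowing that makes the pushed-down filter target the wrong type.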

How was this patch tested?

Unit test in `ParquetFilterSuite`.

@HyukjinKwon
Member Author

Hi @rxin @liancheng, I hope this doesn't miss 2.0.

@HyukjinKwon
Member Author

cc @viirya as well.

@liancheng
Contributor

LGTM pending Jenkins.

2.0.0 RC2 has already been cut. We may have this in 2.0.0 if there is another RC.

@HyukjinKwon
Member Author

HyukjinKwon commented Jul 6, 2016

Oh @liancheng, I just made a few more corrections. Please take another look (sorry!).

```diff
     !f.metadata.contains(StructType.metadataKeyForOptionalField) ||
       !f.metadata.getBoolean(StructType.metadataKeyForOptionalField)
-    }.map(f => f.name -> f.dataType) ++ fields.flatMap { f => getFieldMap(f.dataType) }
+    }.map(f => f.name -> f.dataType)
```
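The effect of dropping the recursive `flatMap` over nested fields can be sketched with a toy model in plain Python (illustrative names only, not Spark's actual code):

```python
# Toy model of the patched behavior: collect only top-level (name, type)
# pairs; nested struct fields are skipped entirely, mirroring the removal
# of the recursive flatMap.

def get_field_pairs_fixed(schema):
    return [(name, dtype) for name, dtype in schema]

# root |-- _1: struct |-- _1: integer  (struct modeled as a nested list)
schema = [("_1", [("_1", "integer")])]

field_map = dict(get_field_pairs_fixed(schema))
# "_1" now keeps its struct type, so no predicate-builder is found for it
# and no (incorrect) Parquet filter is generated.
print(field_map["_1"])
```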
Contributor

Could you please add a comment here?

Member Author

Sure!

@viirya
Member

viirya commented Jul 6, 2016

The description seems incorrect. Shouldn't it be `StringType` instead of `IntegerType`?

@HyukjinKwon
Member Author

HyukjinKwon commented Jul 6, 2016

Yes, it's incorrect, I just noticed. I will correct them all. Thank you!

@viirya
Member

viirya commented Jul 6, 2016

Another question: when the inner field does not have the same name, we can still push down the filter, right?

@viirya
Member

viirya commented Jul 6, 2016

If so, this patch seems to completely skip all such push-downs.

@HyukjinKwon
Member Author

HyukjinKwon commented Jul 6, 2016

Hm... I thought Spark does not support filter push-down for nested fields.

Running the code below:

```scala
df.filter("_1._1 IS NOT NULL").count()
```

pushes no filters:

[screenshot: physical plan showing no pushed filters]
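The observation above can be modeled simply: a nested reference such as `_1._1` never matches a map keyed by top-level column names, so no filter gets built. This is a toy Python sketch; `make_pushed_filter` is a made-up helper, not Spark's API:

```python
# Field map keyed by top-level column names only (toy model).
field_map = {"_1": "struct<_1:int>"}

def make_pushed_filter(column, field_map):
    """Return a pushdown predicate only when the referenced column
    exists in the field map; otherwise push nothing."""
    dtype = field_map.get(column)
    if dtype is None:
        return None  # nested references like "_1._1" fall through here
    return (column, dtype, "IS NOT NULL")

print(make_pushed_filter("_1._1", field_map))  # nested field: no pushdown
print(make_pushed_filter("_1", field_map))     # top-level field: pushed
```

This mirrors why a dotted nested name produces no pushed filters in the plan, while the real bug in this PR is the opposite case: a top-level name wrongly resolving to a nested field's type.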

@SparkQA

SparkQA commented Jul 6, 2016

Test build #61841 has finished for PR 14067 at commit 39e66ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 6, 2016

Test build #61840 has finished for PR 14067 at commit 54b8348.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 6, 2016

Test build #61842 has finished for PR 14067 at commit ffaa666.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 6, 2016

Test build #61843 has finished for PR 14067 at commit 1300042.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

Yea, currently Spark SQL doesn't support column pruning and/or filter push-down for nested fields.

@viirya
Member

viirya commented Jul 6, 2016

OK. LGTM then.

```scala
  }
}

test("Do not push down filters incorrectly when inner name and outer name are the same") {
```
Contributor

Add the JIRA number here.

@rxin
Contributor

rxin commented Jul 6, 2016

I'm going to merge this and fix some comments myself in another PR. Merging to master/2.0.

@asfgit asfgit closed this in 4f8ceed Jul 6, 2016
asfgit pushed a commit that referenced this pull request Jul 6, 2016
…me and outer name are the same in Parquet

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14067 from HyukjinKwon/SPARK-16371.

(cherry picked from commit 4f8ceed)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@HyukjinKwon HyukjinKwon deleted the SPARK-16371 branch January 2, 2018 03:40