@PenguinToast
Contributor

What changes were proposed in this pull request?

We get an NPE when a filter on a partition column has the form col in (x, null). The cause is that the filter converter in HiveShim does not handle nulls correctly. This patch fixes the bug while still pushing down as many partition pruning predicates as possible, by filtering nulls out of any in predicate. Since Hive only supports very simple partition pruning filters, this change should preserve correctness.
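
To make the idea concrete, here is a minimal sketch of dropping null literals from an IN list before rendering a filter string. It is not the code from this patch: `InFilterSketch`, `Lit`, and `convertIn` are hypothetical names, literals are modelled as `Option[String]` rather than Catalyst expressions, and the `(col = a or col = b)` output shape is an assumption about what the metastore filter looks like.

```scala
// Hypothetical, simplified model; the real converter works on Catalyst expressions.
object InFilterSketch {
  // None models a null literal; Some(v) models a non-null literal already rendered as text.
  type Lit = Option[String]

  // Drop null literals, then render the remaining values as "(col = a or col = b)".
  // Returns None when no non-null value is left, meaning nothing is pushed down.
  def convertIn(column: String, values: Seq[Lit]): Option[String] = {
    val nonNull = values.flatten
    if (nonNull.isEmpty) None
    else Some(nonNull.map(v => s"$column = $v").mkString("(", " or ", ")"))
  }
}
```

Returning `None` for an all-null list is a design choice of this sketch: the predicate is simply not pushed down and is left for Spark to evaluate after the partitions are listed.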

How was this patch tested?

Unit tests, manual tests

@gatorsmile
Member

ok to test

"""stringcol = 'p1" and q="q1' and 'p2" and q="q2' = stringcol""")

filterTest("SPARK-24879 null literals should be ignored for IN constructs",
Seq(a("intcol", IntegerType) in (Literal(1), Literal(null))),
Member

Let us add more test cases for better test coverage

object ExtractableLiterals {
def unapply(exprs: Seq[Expression]): Option[Seq[String]] = {
val extractables = exprs.map(ExtractableLiteral.unapply)
// SPARK-24879: The Hive filter parser does not support "null", but we still want to push
Member

-> Hive metastore filter parser

val extractables = exprs.map(ExtractableLiteral.unapply)
// SPARK-24879: The Hive filter parser does not support "null", but we still want to push
// down as many predicates as we can while still maintaining correctness. "x in (a, b,
// null)" can be rewritten as "x in (a, b)" for the purposes of partition pruning, so we
Member

Maybe we should write down the rules here.
1 in (2, NULL) -> NULL
1 in (1, NULL) -> true
1 in (2) -> false

NULL is not the same as FALSE. However, since all the pushed-down predicates are NULL-intolerant and connected only by AND or OR, NULL can be treated as FALSE for partition pruning.
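
The rules above can be checked with a small three-valued-logic sketch. This is illustrative only: `ThreeValuedIn`, `sqlIn`, and `keeps` are hypothetical names, `None` stands for SQL NULL, and the left-hand value is assumed to be non-null.

```scala
// Hypothetical model of SQL IN under three-valued logic; not Spark code.
object ThreeValuedIn {
  // Result is Some(true), Some(false), or None for NULL.
  // x is assumed non-null; None entries in the list model NULL literals.
  def sqlIn(x: Int, list: Seq[Option[Int]]): Option[Boolean] =
    if (list.contains(Some(x))) Some(true) // a match wins even if NULL is present
    else if (list.contains(None)) None     // no match, but a NULL makes the result unknown
    else Some(false)                       // no match and no NULL

  // A filter keeps a row only when the predicate is true, so NULL behaves like false.
  def keeps(result: Option[Boolean]): Boolean = result.contains(true)
}
```

In this model, `keeps(sqlIn(x, list))` gives the same answer whether or not the NULL entries are removed from `list`, which is why dropping them is safe for a top-level filter. The argument does not cover predicates wrapped in NOT, which is consistent with the comment's restriction to AND and OR.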

@gatorsmile
Member

Test this please

@gatorsmile
Member

add to whitelist

@SparkQA

SparkQA commented Jul 21, 2018

Test build #93369 has finished for PR 21832 at commit ce86fbe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

gatorsmile commented Jul 21, 2018

LGTM

Thanks! Merged to master/2.3

asfgit pushed a commit that referenced this pull request Jul 21, 2018
## What changes were proposed in this pull request?
We get an NPE when a filter on a partition column has the form `col in (x, null)`. The cause is that the filter converter in HiveShim does not handle `null`s correctly. This patch fixes the bug while still pushing down as many partition pruning predicates as possible, by filtering `null`s out of any `in` predicate. Since Hive only supports very simple partition pruning filters, this change should preserve correctness.

## How was this patch tested?
Unit tests, manual tests

Author: William Sheu <william.sheu@databricks.com>

Closes #21832 from PenguinToast/partition-pruning-npe.

(cherry picked from commit bbd6f0c)
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
@asfgit asfgit closed this in bbd6f0c Jul 21, 2018