Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -769,7 +769,9 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {

case InSet(child, values) if useAdvanced && values.size > inSetThreshold =>
val dataType = child.dataType
val sortedValues = values.toSeq.sorted(TypeUtils.getInterpretedOrdering(dataType))
// Skip null here is safe, more details could see at ExtractableLiterals.
val sortedValues = values.filter(_ != null).toSeq
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add a similar comment for why this is safe, similar to IN - #21832 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks better

.sorted(TypeUtils.getInterpretedOrdering(dataType))
Comment on lines +773 to +774
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this change the result of InSet which contains null?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question here. The null sematic of IN is pretty tricky.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's safe, since In has already do this.

object ExtractableLiterals {
def unapply(exprs: Seq[Expression]): Option[Seq[String]] = {
// SPARK-24879: The Hive metastore filter parser does not support "null", but we still want
// to push down as many predicates as we can while still maintaining correctness.
// In SQL, the `IN` expression evaluates as follows:
// > `1 in (2, NULL)` -> NULL
// > `1 in (1, NULL)` -> true
// > `1 in (2)` -> false
// Since Hive metastore filters are NULL-intolerant binary operations joined only by
// `AND` and `OR`, we can treat `NULL` as `false` and thus rewrite `1 in (2, NULL)` as
// `1 in (2)`.
// If the Hive metastore begins supporting NULL-tolerant predicates and Spark starts
// pushing down these predicates, then this optimization will become incorrect and need
// to be changed.
val extractables = exprs
.filter {
case Literal(null, _) => false
case _ => true
}.map(ExtractableLiteral.unapply)

convert(And(GreaterThanOrEqual(child, Literal(sortedValues.head, dataType)),
LessThanOrEqual(child, Literal(sortedValues.last, dataType))))

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -179,5 +179,13 @@ class FiltersSuite extends SparkFunSuite with Logging with PlanTest {
}
}

test("SPARK-34515: Fix NPE if InSet contains null value during getPartitionsByFilter") {
withSQLConf(SQLConf.HIVE_METASTORE_PARTITION_PRUNING_INSET_THRESHOLD.key -> "2") {
val filter = InSet(a("p", IntegerType), Set(null, 1, 2))
val converted = shim.convertFilters(testTable, Seq(filter), conf.sessionLocalTimeZone)
assert(converted == "(p >= 1 and p <= 2)")
}
}

private def a(name: String, dataType: DataType) = AttributeReference(name, dataType)()
}