Original file line number Diff line number Diff line change
@@ -327,6 +327,29 @@ public <T> Boolean eq(Bound<T> term, Literal<T> lit) {
    public <T> Boolean notEq(Bound<T> term, Literal<T> lit) {
      // because the bounds are not necessarily a min or max value, this cannot be answered using
      // them. notEq(col, X) with (X, Y) doesn't guarantee that X is a value in col.
      // However, when min == max and the file has no nulls or NaN values, we can safely prune
      // if that value equals the literal.
      int id = term.ref().fieldId();
      if (mayContainNull(id)) {
Contributor

How about including mayContainNaN(id) in this branch:

if (mayContainNull(id) || mayContainNaN(id)) {
   return ROWS_MIGHT_MATCH;
}

and leaving out the other branch:

if (nanCounts != null && nanCounts.containsKey(id) && nanCounts.get(id) != 0) {

Author
@joyhaldar Nov 20, 2025

Thank you for the suggestion, Nandor. I actually tried this initially but had to change it because of test failures.

The problem I ran into is that mayContainNaN(id), which would be defined as nanCounts == null || !nanCounts.containsKey(id) || nanCounts.get(id) != 0, returns true when nanCounts == null or when the column has no entry in the map.

Tests that failed with mayContainNaN(id):

These tests use timestamp/string columns, and mayContainNaN() returns true for them (either because nanCounts == null or the column isn't in the map), preventing the optimization from running.

The current approach checks for NaN in two ways:

  1. NaNUtil.isNaN(bounds) - returns false for timestamps/strings (they can't be NaN)
  2. nanCounts.get(id) != 0 - only evaluated when NaN counts actually exist for the column (nanCounts != null and the map has an entry for id)
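To make the difference between the two checks concrete, here is a minimal standalone sketch. It is not the Iceberg code: the class name and `hasRecordedNaN` are hypothetical, `mayContainNaN` is written from the definition quoted in this comment, and a plain `Map<Integer, Long>` stands in for the file's NaN counts. Field ids 3, 8 and 9 are only illustrative.

```java
import java.util.HashMap;
import java.util.Map;

class NanCheckSketch {
  // Conservative check as defined in the comment above: missing stats count as "may contain NaN".
  static boolean mayContainNaN(Map<Integer, Long> nanCounts, int id) {
    return nanCounts == null || !nanCounts.containsKey(id) || nanCounts.get(id) != 0;
  }

  // Explicit check used in the diff (hypothetical name): true only when a non-zero NaN count is recorded.
  static boolean hasRecordedNaN(Map<Integer, Long> nanCounts, int id) {
    return nanCounts != null && nanCounts.containsKey(id) && nanCounts.get(id) != 0;
  }

  public static void main(String[] args) {
    Map<Integer, Long> nanCounts = new HashMap<>();
    nanCounts.put(9, 2L); // float column with two NaN values
    nanCounts.put(8, 0L); // float column with no NaN values

    // Field 3 stands for a string column: it has no entry in the NaN-count map.
    System.out.println("string col, mayContainNaN:  " + mayContainNaN(nanCounts, 3));  // true
    System.out.println("string col, hasRecordedNaN: " + hasRecordedNaN(nanCounts, 3)); // false
    System.out.println("float col 9, hasRecordedNaN: " + hasRecordedNaN(nanCounts, 9)); // true
    System.out.println("float col 8, mayContainNaN:  " + mayContainNaN(nanCounts, 8));  // false
  }
}
```

With the conservative check, a string or timestamp column without NaN stats is always treated as possibly containing NaN, so the single-value pruning never runs for it. The explicit check only blocks pruning when a non-zero count is actually recorded.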

        return ROWS_MIGHT_MATCH;
      }
      T lower = lowerBound(term);
      T upper = upperBound(term);

      if (lower == null || upper == null || NaNUtil.isNaN(lower) || NaNUtil.isNaN(upper)) {
        return ROWS_MIGHT_MATCH;
      }

      if (nanCounts != null && nanCounts.containsKey(id) && nanCounts.get(id) != 0) {
        return ROWS_MIGHT_MATCH;
      }

      if (lower.equals(upper)) {
        int cmp = lit.comparator().compare(lower, lit.value());
Member

We don't need another local variable here.

Author

Thank you for the review, Manu. I was trying to keep the logic inline to match the existing code style in this file.

For example, the methods lt(), gt(), ltEq(), gtEq(), eq(), and in() all declare similar local variables, including this exact one:
int cmp = lit.comparator().compare(lower, lit.value());

I thought removing it would be inconsistent with how the other methods are structured.

However, if you feel this is important for maintainability, I'm happy to remove the local variable declaration. Let me know what you think.

        if (cmp == 0) {
          return ROWS_CANNOT_MATCH;
        }
      }
      return ROWS_MIGHT_MATCH;
    }
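The NaN guard in notEq matters because a NaN row satisfies a `!=` predicate even when the bounds suggest a single value. The sketch below illustrates this with plain Java float semantics only. The class and method names are hypothetical, and it assumes, as the test descriptions in this PR state, that the evaluator treats NaN rows as matching `!=`.

```java
class NaNMatchesNotEqSketch {
  // Counts the rows that satisfy `value != literal` under Java float comparison rules.
  static int countNotEqual(float[] rows, float literal) {
    int matches = 0;
    for (float value : rows) {
      if (value != literal) { // NaN != x is true for every x
        matches++;
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    // Every non-NaN value is 5.0, so lower == upper == 5.0 if the bounds skip NaN.
    float[] rows = {5.0f, 5.0f, Float.NaN};
    System.out.println(countNotEqual(rows, 5.0f)); // prints 1: the NaN row matches
  }
}
```

Pruning such a file for `col != 5.0` based on the bounds alone would drop a matching row, which is why a non-zero NaN count has to return ROWS_MIGHT_MATCH.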

@@ -381,6 +404,28 @@ public <T> Boolean in(Bound<T> term, Set<T> literalSet) {
    public <T> Boolean notIn(Bound<T> term, Set<T> literalSet) {
      // because the bounds are not necessarily a min or max value, this cannot be answered using
      // them. notIn(col, {X, ...}) with (X, Y) doesn't guarantee that X is a value in col.
      // However, when min == max and the file has no nulls or NaN values, we can safely prune
      // if that value is in the exclusion set.
      int id = term.ref().fieldId();
Member

We can put this similar logic into a separate method.

Author

Thank you for the suggestion, Manu. I was trying to keep the logic inline to match the existing code style in this file.

For example, the methods eq(), in(), lt(), ltEq(), gt(), and gtEq() all have similar inline checks.

I thought extracting it would be inconsistent with how the other methods are structured.

However, if you feel this is important for maintainability, I'm happy to extract it. Let me know your preference.

Member

I prefer to extract similar complex logic into a separate method.

Author

OK, I can do that. I will make the changes and request another review from you.
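One possible shape for the extraction discussed in this thread is sketched below. It is a simplified standalone model, not the change that was eventually made: `ColumnStats` and `provenSingleValue` are hypothetical names, the column type is fixed to `Float` instead of the generic `T` with a comparator, and an unknown null count is treated conservatively as "may contain null", which may differ from how mayContainNull behaves.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SingleValuePruningSketch {
  static final boolean ROWS_MIGHT_MATCH = true;
  static final boolean ROWS_CANNOT_MATCH = false;

  // Hypothetical stand-in for one column's file statistics; a null field means "not recorded".
  static final class ColumnStats {
    final Long nullCount;
    final Long nanCount;
    final Float lower;
    final Float upper;

    ColumnStats(Long nullCount, Long nanCount, Float lower, Float upper) {
      this.nullCount = nullCount;
      this.nanCount = nanCount;
      this.lower = lower;
      this.upper = upper;
    }
  }

  // Shared guard: returns the single value every row is proven to hold, or null if unproven.
  static Float provenSingleValue(ColumnStats s) {
    if (s.nullCount == null || s.nullCount != 0) {
      return null; // nulls are possible (assumption: unknown counts are treated conservatively)
    }
    if (s.lower == null || s.upper == null || s.lower.isNaN() || s.upper.isNaN()) {
      return null; // bounds are missing or unusable
    }
    if (s.nanCount != null && s.nanCount != 0) {
      return null; // recorded NaN rows would match the predicate
    }
    return s.lower.equals(s.upper) ? s.lower : null;
  }

  static boolean notEq(ColumnStats s, float literal) {
    Float value = provenSingleValue(s);
    return value != null && Float.compare(value, literal) == 0
        ? ROWS_CANNOT_MATCH
        : ROWS_MIGHT_MATCH;
  }

  static boolean notIn(ColumnStats s, Set<Float> literals) {
    Float value = provenSingleValue(s);
    return value != null && literals.contains(value) ? ROWS_CANNOT_MATCH : ROWS_MIGHT_MATCH;
  }

  public static void main(String[] args) {
    ColumnStats single = new ColumnStats(0L, 0L, 5.0f, 5.0f);
    ColumnStats withNaN = new ColumnStats(0L, 2L, 5.0f, 5.0f);
    Set<Float> excluded = new HashSet<>(Arrays.asList(5.0f, 10.0f));

    System.out.println(notEq(single, 5.0f));   // false: file can be pruned
    System.out.println(notEq(withNaN, 5.0f));  // true: NaN rows may match
    System.out.println(notIn(single, excluded)); // false: file can be pruned
  }
}
```

Both predicates then reduce to one comparison against the proven value, so the null, bound and NaN guards live in a single place.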

      if (mayContainNull(id)) {
        return ROWS_MIGHT_MATCH;
      }
      T lower = lowerBound(term);
      T upper = upperBound(term);

      if (lower == null || upper == null || NaNUtil.isNaN(lower) || NaNUtil.isNaN(upper)) {
        return ROWS_MIGHT_MATCH;
      }

      if (nanCounts != null && nanCounts.containsKey(id) && nanCounts.get(id) != 0) {
        return ROWS_MIGHT_MATCH;
      }

      if (lower.equals(upper)) {
        if (literalSet.contains(lower)) {
          return ROWS_CANNOT_MATCH;
        }
      }
      return ROWS_MIGHT_MATCH;
    }

Original file line number Diff line number Diff line change
@@ -970,4 +970,172 @@ public void testNotNullInNestedStruct() {
.as("Should not read: optional_address.optional_street2 is optional")
.isFalse();
}

@Test
public void testNotEqWithSingleValue() {
DataFile rangeOfValues =
new TestDataFile(
"range_of_values.avro",
Row.of(),
10,
ImmutableMap.of(3, 10L),
ImmutableMap.of(3, 0L),
ImmutableMap.of(3, 0L),
ImmutableMap.of(3, toByteBuffer(StringType.get(), "aaa")),
ImmutableMap.of(3, toByteBuffer(StringType.get(), "zzz")));

boolean shouldRead =
new InclusiveMetricsEvaluator(SCHEMA, notEqual("required", "aaa")).eval(rangeOfValues);
assertThat(shouldRead)
.as("Should read: file has range of values, optimization doesn't apply")
.isTrue();

DataFile singleValueFile =
new TestDataFile(
"single_value.avro",
Row.of(),
10,
ImmutableMap.of(3, 10L),
ImmutableMap.of(3, 0L),
ImmutableMap.of(3, 0L),
ImmutableMap.of(3, toByteBuffer(StringType.get(), "abc")),
ImmutableMap.of(3, toByteBuffer(StringType.get(), "abc")));

shouldRead =
new InclusiveMetricsEvaluator(SCHEMA, notEqual("required", "abc")).eval(singleValueFile);
assertThat(shouldRead)
.as("Should prune: file contains single value equal to literal")
.isFalse();

shouldRead =
new InclusiveMetricsEvaluator(SCHEMA, notEqual("required", "def")).eval(singleValueFile);
assertThat(shouldRead)
.as("Should read: file contains single value not equal to literal")
.isTrue();

DataFile singleValueWithNulls =
new TestDataFile(
"single_value_nulls.avro",
Row.of(),
10,
ImmutableMap.of(3, 10L),
ImmutableMap.of(3, 2L),
ImmutableMap.of(3, 0L),
ImmutableMap.of(3, toByteBuffer(StringType.get(), "abc")),
ImmutableMap.of(3, toByteBuffer(StringType.get(), "abc")));

shouldRead =
new InclusiveMetricsEvaluator(SCHEMA, notEqual("required", "abc"))
.eval(singleValueWithNulls);
assertThat(shouldRead).as("Should read: file has nulls which match != predicate").isTrue();

DataFile singleValueWithNaN =
new TestDataFile(
"single_value_nan.avro",
Row.of(),
10,
ImmutableMap.of(9, 10L),
ImmutableMap.of(9, 0L),
ImmutableMap.of(9, 2L),
ImmutableMap.of(9, toByteBuffer(Types.FloatType.get(), 5.0F)),
ImmutableMap.of(9, toByteBuffer(Types.FloatType.get(), 5.0F)));

shouldRead =
new InclusiveMetricsEvaluator(SCHEMA, notEqual("no_nans", 5.0F)).eval(singleValueWithNaN);
assertThat(shouldRead).as("Should read: file has NaN values which match != predicate").isTrue();

DataFile singleValueNaNBounds =
new TestDataFile(
"single_value_nan_bounds.avro",
Row.of(),
10,
ImmutableMap.of(9, 10L),
ImmutableMap.of(9, 0L),
ImmutableMap.of(9, 0L),
ImmutableMap.of(9, toByteBuffer(Types.FloatType.get(), Float.NaN)),
ImmutableMap.of(9, toByteBuffer(Types.FloatType.get(), Float.NaN)));

shouldRead =
new InclusiveMetricsEvaluator(SCHEMA, notEqual("no_nans", 5.0F)).eval(singleValueNaNBounds);
assertThat(shouldRead).as("Should read: bounds are NaN").isTrue();
}

@Test
public void testNotInWithSingleValue() {
DataFile rangeOfValues =
new TestDataFile(
"range_of_values.avro",
Row.of(),
10,
ImmutableMap.of(3, 10L),
ImmutableMap.of(3, 0L),
ImmutableMap.of(3, 0L),
ImmutableMap.of(3, toByteBuffer(StringType.get(), "aaa")),
ImmutableMap.of(3, toByteBuffer(StringType.get(), "zzz")));

boolean shouldRead =
new InclusiveMetricsEvaluator(SCHEMA, notIn("required", "aaa", "bbb")).eval(rangeOfValues);
assertThat(shouldRead)
.as("Should read: file has range of values, optimization doesn't apply")
.isTrue();

DataFile singleValueFile =
new TestDataFile(
"single_value.avro",
Row.of(),
10,
ImmutableMap.of(3, 10L),
ImmutableMap.of(3, 0L),
ImmutableMap.of(3, 0L),
ImmutableMap.of(3, toByteBuffer(StringType.get(), "abc")),
ImmutableMap.of(3, toByteBuffer(StringType.get(), "abc")));

shouldRead =
new InclusiveMetricsEvaluator(SCHEMA, notIn("required", "abc", "def"))
.eval(singleValueFile);
assertThat(shouldRead)
.as("Should prune: file contains single value in exclusion list")
.isFalse();

shouldRead =
new InclusiveMetricsEvaluator(SCHEMA, notIn("required", "def", "ghi"))
.eval(singleValueFile);
assertThat(shouldRead)
.as("Should read: file contains single value not in exclusion list")
.isTrue();

DataFile singleValueWithNulls =
new TestDataFile(
"single_value_nulls.avro",
Row.of(),
10,
ImmutableMap.of(3, 10L),
ImmutableMap.of(3, 2L),
ImmutableMap.of(3, 0L),
ImmutableMap.of(3, toByteBuffer(StringType.get(), "abc")),
ImmutableMap.of(3, toByteBuffer(StringType.get(), "abc")));

shouldRead =
new InclusiveMetricsEvaluator(SCHEMA, notIn("required", "abc", "def"))
.eval(singleValueWithNulls);
assertThat(shouldRead).as("Should read: file has nulls which match NOT IN predicate").isTrue();

DataFile singleValueWithNaN =
new TestDataFile(
"single_value_nan.avro",
Row.of(),
10,
ImmutableMap.of(9, 10L),
ImmutableMap.of(9, 0L),
ImmutableMap.of(9, 2L),
ImmutableMap.of(9, toByteBuffer(Types.FloatType.get(), 5.0F)),
ImmutableMap.of(9, toByteBuffer(Types.FloatType.get(), 5.0F)));

shouldRead =
new InclusiveMetricsEvaluator(SCHEMA, notIn("no_nans", 5.0F, 10.0F))
.eval(singleValueWithNaN);
assertThat(shouldRead)
.as("Should read: file has NaN values which match NOT IN predicate")
.isTrue();
}
}
Original file line number Diff line number Diff line change
@@ -461,8 +461,7 @@ public void testUnpartitionedYears() throws Exception {
pushFilters(builder, predicate);
scan = builder.build().toBatch();

- // notEq can't be answered using column bounds because they are not exact
- assertThat(scan.planInputPartitions()).hasSize(10);
+ assertThat(scan.planInputPartitions()).hasSize(5);
}

@TestTemplate
@@ -771,7 +770,7 @@ public void testUnpartitionedTruncateString() throws Exception {
pushFilters(builder, predicate);
Batch scan = builder.build().toBatch();

- assertThat(scan.planInputPartitions()).hasSize(10);
+ assertThat(scan.planInputPartitions()).hasSize(5);

// NOT NotEqual
builder = scanBuilder();
@@ -990,7 +989,7 @@ public void testUnpartitionedOr() throws Exception {
pushFilters(builder, predicate);
scan = builder.build().toBatch();

- assertThat(scan.planInputPartitions()).hasSize(10);
+ assertThat(scan.planInputPartitions()).hasSize(5);
}

@TestTemplate
Original file line number Diff line number Diff line change
@@ -461,8 +461,7 @@ public void testUnpartitionedYears() throws Exception {
pushFilters(builder, predicate);
scan = builder.build().toBatch();

- // notEq can't be answered using column bounds because they are not exact
- assertThat(scan.planInputPartitions()).hasSize(10);
+ assertThat(scan.planInputPartitions()).hasSize(5);
}

@TestTemplate
@@ -771,7 +770,7 @@ public void testUnpartitionedTruncateString() throws Exception {
pushFilters(builder, predicate);
Batch scan = builder.build().toBatch();

- assertThat(scan.planInputPartitions()).hasSize(10);
+ assertThat(scan.planInputPartitions()).hasSize(5);

// NOT NotEqual
builder = scanBuilder();
@@ -990,7 +989,7 @@ public void testUnpartitionedOr() throws Exception {
pushFilters(builder, predicate);
scan = builder.build().toBatch();

- assertThat(scan.planInputPartitions()).hasSize(10);
+ assertThat(scan.planInputPartitions()).hasSize(5);
}

@TestTemplate
Original file line number Diff line number Diff line change
@@ -461,8 +461,7 @@ public void testUnpartitionedYears() throws Exception {
pushFilters(builder, predicate);
scan = builder.build().toBatch();

- // notEq can't be answered using column bounds because they are not exact
- assertThat(scan.planInputPartitions()).hasSize(10);
+ assertThat(scan.planInputPartitions()).hasSize(5);
}

@TestTemplate
@@ -771,7 +770,7 @@ public void testUnpartitionedTruncateString() throws Exception {
pushFilters(builder, predicate);
Batch scan = builder.build().toBatch();

- assertThat(scan.planInputPartitions()).hasSize(10);
+ assertThat(scan.planInputPartitions()).hasSize(5);

// NOT NotEqual
builder = scanBuilder();
@@ -990,7 +989,7 @@ public void testUnpartitionedOr() throws Exception {
pushFilters(builder, predicate);
scan = builder.build().toBatch();

- assertThat(scan.planInputPartitions()).hasSize(10);
+ assertThat(scan.planInputPartitions()).hasSize(5);
}

@TestTemplate