@dejangvozdenac dejangvozdenac commented Aug 13, 2025

The producesNull check is used when binding isNull and notNull predicates. Currently, the logic treats a field as able to produce null if and only if the field itself is optional. That does not hold for required fields nested within optional structs: a required field can still produce nulls whenever its parent struct is null.
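To make the difference concrete, here is a minimal, language-neutral sketch of the two rules. It is a model only, not Iceberg's Java code: the `Field` class and both function names are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field; not an Iceberg class."""
    name: str
    required: bool
    parent: Optional["Field"] = None


def produces_null_old(field: Field) -> bool:
    # Rule described as the current behavior: only the field's own flag counts.
    return not field.required


def produces_null_fixed(field: Field) -> bool:
    # Proposed rule: a field can be null if it, or any enclosing struct, is optional.
    current: Optional[Field] = field
    while current is not None:
        if not current.required:
            return True
        current = current.parent
    return False


# Mirrors the reproduction table: `address` is optional, `address.street` is required.
address = Field("address", required=False)
street = Field("street", required=True, parent=address)
record_id = Field("id", required=True)
```

With this model, `produces_null_old(street)` is `False` while `produces_null_fixed(street)` is `True`, and a top-level required field such as `id` stays non-nullable under both rules.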

I'm able to reproduce this case in Trino and Spark by creating the following schema and adding rows to it:

spark-sql>  CREATE TABLE default.dejan_test (
  id INT NOT NULL,
  name STRING NOT NULL,
  age INT NOT NULL,
  address STRUCT<street: STRING NOT NULL, address_info: STRUCT<city: STRING NOT NULL, county: STRING NOT NULL, state: STRING NOT NULL>>)
USING iceberg;
spark-sql> INSERT INTO default.dejan_test (id, name, age, address)
VALUES (
  0, 
  'Jane Doe', 
  27, 
  NULL
);
spark-sql> INSERT INTO default.dejan_test (id, name, age, address)
VALUES (
  1, 
  'John Doe', 
  30, 
  STRUCT(
    '123 Main St',
    STRUCT('San Francisco', 'San Francisco County', 'California')
  )
);

address.street is null for row 0, but Trino and Spark disagree when they go through the Iceberg API:

trino> 
set session iceberg.projection_pushdown_enabled=true;
SET SESSION
trino> 
select
  id
from
  iceberg.default.dejan_test
where
  address.street is null;
 id 
----
(0 rows)

Query 20250613_034027_00008_xn59q, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0.36 [0 rows, 0B] [0 rows/s, 0B/s]

You can see that when Trino reads the entire file, it correctly finds the row:

trino> 
set session iceberg.projection_pushdown_enabled=false;
SET SESSION
trino> 
select
  id
from
  iceberg.default.dejan_test
where
  address.street is null;
 id 
----
  0 
(1 row)

Query 20250613_033713_00001_xn59q, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
2.85 [2 rows, 4.43KiB] [0 rows/s, 1.56KiB/s]

This also leads to unexpected behavior where an is null or is not null check behaves differently depending on the binding order:

spark-sql (default)> select
                   >   count(*)
                   > from
                   >   default.dejan_test;
2

spark-sql (default)> select
                   >   count(*)
                   > from
                   >   default.dejan_test
                   > where
                   >   address.street is null or address.street is not null;
2


spark-sql (default)> select
                   >   count(*)
                   > from
                   >   default.dejan_test
                   > where
                   >   address.street is not null;
1

spark-sql (default)> select
                   >   count(*)
                   > from
                   >   default.dejan_test
                   > where
                   >   address.street is null;
0
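One way to read the `is null` result, assuming that binding folds a null check on a field believed to be never-null into a constant (an assumption about the binding step, sketched here as a model rather than quoted from Iceberg's code). The function name and the string labels are hypothetical.

```python
def bind_null_check(op: str, produces_null: bool) -> str:
    """Model of binding IS NULL / IS NOT NULL; returns a label, not an Iceberg expression."""
    if op not in ("is_null", "not_null"):
        raise ValueError(f"unsupported operation: {op}")
    if produces_null:
        # The predicate stays in place and must be evaluated against the data.
        return op
    # A field believed to be never-null turns the predicate into a constant.
    return "always_false" if op == "is_null" else "always_true"
```

Under that assumption, treating `address.street` as never-null makes `is null` fold to always-false, which is consistent with the count of 0 above even though row 0 has a null `address`. Once the field is recognized as nullable, the predicate is kept and evaluated against the data.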

After this change, Iceberg can find the null row:

trino> 
set session iceberg.projection_pushdown_enabled=true;
SET SESSION
trino> 
select
  id
from
  iceberg.default.dejan_test
where
  address.street is null;
 id 
----
  0 
(1 row)

Query 20250813_413537_00001_qn1a9, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
2.15 [1 rows, 2.03KiB] [0 rows/s, 969B/s]

Closes #13328 (and relatedly trinodb/trino#20511)

@github-actions github-actions bot added the API label Aug 13, 2025
@dejangvozdenac dejangvozdenac marked this pull request as draft August 13, 2025 15:23
@dejangvozdenac dejangvozdenac changed the title add ancestor check to field optional is null check add ancestor check to field optional producesNull Aug 13, 2025
@dejangvozdenac dejangvozdenac changed the title add ancestor check to field optional producesNull API: required nested fields within optional structs can produce null Aug 14, 2025
@dejangvozdenac dejangvozdenac marked this pull request as ready for review August 14, 2025 19:47
stevenzwu commented Aug 14, 2025

@dejangvozdenac thanks for reporting and fixing the issue. I agree that my_struct.nested_field should be evaluated to null if my_struct is null.

Can you add a unit test in the Spark module to cover the scenario you described?

@dejangvozdenac
Contributor Author

thanks for the review @stevenzwu, I appreciate it! I addressed all the comments, let me know if anything further is needed.

@stevenzwu stevenzwu left a comment

LGTM. We will need 2 or 3 more committers' approval, since this modifies critical api code.

@dejangvozdenac
Contributor Author

> LGTM. We will need 2 or 3 more committers' approval, since this modifies critical api code.

Awesome, thanks for the quick review @stevenzwu. What's the usual process here? Should I tag 2-3 people who have reviewed code under api, or do you have suggestions?

@dejangvozdenac
Contributor Author

Thanks for the reviews @nastra / @pvary , the comments should all be addressed.

@dejangvozdenac
Contributor Author

Hey @pvary , mind taking a look? @stevenzwu mentioned we need at least 3 reviewers on this.

@dejangvozdenac
Contributor Author

Sorry wrong tag, meant to tag @nastra

@dejangvozdenac
Contributor Author

@nastra might be busy. @pvary or @singhpk234 mind taking a look?

@dejangvozdenac
Contributor Author

@stevenzwu we have three approvals now, would this be ready to merge?

@dejangvozdenac
Contributor Author

hey @stevenzwu @pvary @nastra @huaxingao , gentle ping on this

@stevenzwu
Contributor

@nastra you want to take another look before we merge this?

@nastra nastra merged commit c05f6c9 into apache:main Sep 19, 2025
43 checks passed
@huaxingao huaxingao added this to the Iceberg 1.10.1 milestone Sep 30, 2025
nastra added a commit to nastra/iceberg that referenced this pull request Oct 7, 2025
…ce nulls

This partially reverts some changes around the `Accessor` API that were introduced by apache#13804 and uses a Schema visitor
to detect whether any of the parent fields of a nested required field are optional.
This info is then used when IS_NULL / NOT_NULL is evaluated
huaxingao pushed a commit that referenced this pull request Oct 10, 2025
…ce nulls (#14270)

* API: Detect whether required fields nested within optionals can produce nulls

This partially reverts some changes around the `Accessor` API that were introduced by #13804 and uses a Schema visitor
to detect whether any of the parent fields of a nested required field are optional.
This info is then used when IS_NULL / NOT_NULL is evaluated

* only check parent fields on IS_NULL/NOT_NULL
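
The schema-visitor idea in this commit message can be sketched as a single walk that records which fields sit below an optional struct. This is a model under stated assumptions, not the code from the follow-up commit: `Node`, `ids_with_optional_ancestor`, and the field ids are all hypothetical.

```python
from dataclasses import dataclass, field as dc_field
from typing import List, Set


@dataclass
class Node:
    """Hypothetical schema node: a primitive when `children` is empty, else a struct."""
    field_id: int
    name: str
    required: bool
    children: List["Node"] = dc_field(default_factory=list)


def ids_with_optional_ancestor(roots: List[Node]) -> Set[int]:
    """Collect ids of fields that sit below at least one optional struct."""
    found: Set[int] = set()

    def visit(node: Node, ancestor_optional: bool) -> None:
        if ancestor_optional:
            found.add(node.field_id)
        for child in node.children:
            # A child inherits nullability from any optional struct above it.
            visit(child, ancestor_optional or not node.required)

    for root in roots:
        visit(root, False)
    return found


# Shape of the reproduction table; the ids are made up for this sketch.
schema = [
    Node(1, "id", True),
    Node(2, "name", True),
    Node(3, "age", True),
    Node(4, "address", False, [
        Node(5, "street", True),
        Node(6, "address_info", False, [
            Node(7, "city", True),
            Node(8, "county", True),
            Node(9, "state", True),
        ]),
    ]),
]

nullable_via_parent = ids_with_optional_ancestor(schema)
```

Computing the set once per schema keeps the per-predicate cost to a lookup, and in line with the second bullet the lookup would only be consulted when an IS_NULL or NOT_NULL predicate is bound.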
huaxingao pushed a commit to huaxingao/iceberg that referenced this pull request Oct 31, 2025
huaxingao added a commit that referenced this pull request Nov 1, 2025
…13804) (#14460)

(cherry picked from commit c05f6c9)

Co-authored-by: Dejan Gvozdenac <d.gvozdenac94@gmail.com>
huaxingao pushed a commit to huaxingao/iceberg that referenced this pull request Nov 6, 2025
…ce nulls (apache#14270)

* API: Detect whether required fields nested within optionals can produce nulls

This partially reverts some changes around the `Accessor` API that were introduced by apache#13804 and uses a Schema visitor
to detect whether any of the parent fields of a nested required field are optional.
This info is then used when IS_NULL / NOT_NULL is evaluated

* only check parent fields on IS_NULL/NOT_NULL

(cherry picked from commit a473b1c)
huaxingao added a commit that referenced this pull request Nov 6, 2025
…ce nulls (#14270) (#14512)

* API: Detect whether required fields nested within optionals can produce nulls

This partially reverts some changes around the `Accessor` API that were introduced by #13804 and uses a Schema visitor
to detect whether any of the parent fields of a nested required field are optional.
This info is then used when IS_NULL / NOT_NULL is evaluated

* only check parent fields on IS_NULL/NOT_NULL

(cherry picked from commit a473b1c)

Co-authored-by: Eduard Tudenhoefner <etudenhoefner@gmail.com>
ebyhr commented Nov 13, 2025

@dejangvozdenac Thank you for fixing this issue! I've verified that this change resolves the correctness issue in Trino. A regression test is included in trinodb/trino#26640.

Linked issue: Required fields within optional fields cause incorrect results in Trino