Skip to content

Conversation

@friendlymatthew
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

This PR modifies DataSourceExec::try_swapping_with_projection to preserve equivalence properties when creating a new DataSourceExec.

Datafusion was losing equivalence properties when projection pushdown occurred after filter pushdown.

Consider this example:

COPY (
    SELECT
        '00000000000000000000000000000001' AS trace_id,
        '2023-10-01 00:00:00'::timestamptz AS start_timestamp,
        'prod' as deployment_environment
)
TO 'data/1.parquet';

COPY (
    SELECT
        '00000000000000000000000000000002' AS trace_id,
        '2024-10-01 00:00:00'::timestamptz AS start_timestamp,
        'staging' as deployment_environment
)
TO 'data/2.parquet';

CREATE EXTERNAL TABLE t1 STORED AS PARQUET LOCATION 'data/';

SET datafusion.execution.parquet.pushdown_filters = true;

SELECT deployment_environment
FROM t1
WHERE trace_id = '00000000000000000000000000000002'
ORDER BY start_timestamp, trace_id;

/*
SanityCheckPlan
caused by
Error during planning: 
Plan: ["SortPreservingMergeExec: [start_timestamp@1 ASC NULLS LAST, trace_id@2 ASC NULLS LAST]", "  SortExec: expr=[start_timestamp@1 ASC NULLS LAST], preserve_partitioning=[true]", "    DataSourceExec: file_groups={2 groups: [[Users/adriangb/GitHub/datafusion/data/1.parquet], [Users/adriangb/GitHub/datafusion/data/2.parquet]]}, projection=[deployment_environment, start_timestamp, trace_id], file_type=parquet, predicate=trace_id@0 = 00000000000000000000000000000002, pruning_predicate=trace_id_null_count@2 != row_count@3 AND trace_id_min@0 <= 00000000000000000000000000000002 AND 00000000000000000000000000000002 <= trace_id_max@1, required_guarantees=[trace_id in (00000000000000000000000000000002)]"] does not satisfy order requirements: [start_timestamp@1 ASC NULLS LAST, trace_id@2 ASC NULLS LAST]. Child-0 order: [[start_timestamp@1 ASC NULLS LAST]]
*/
  1. Filter pushdown would create equivalence properties indicating trace_id is a constant
  2. Projection pushdown would create a new DataSourceExec that lost this cached information
  3. This caused the SanityCheckPlan to fail because it couldn't determine that ordering requirements were satisfied

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) datasource Changes to the datasource crate labels Aug 11, 2025
@friendlymatthew friendlymatthew marked this pull request as ready for review August 11, 2025 19:48
Copy link
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks solid to me!

@adriangb adriangb merged commit 7d52145 into apache:main Aug 11, 2025
27 checks passed
Comment on lines +326 to +327
if let Some(new_data_source_exec) =
new_plan.as_any().downcast_ref::<DataSourceExec>()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cry every time I see a downcast of a trait to a specific type. I understand why it's necessary in this case, I think the existing API is to blame, it's worth moving forward with this to fix the (pretty bad IMO) bug BUT this is the sort of thing I hope a future refactor will avoid the need for.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it is worth making a ticket to track the issue / proposed solution if there is not one already

Copy link
Contributor

@adriangb adriangb Aug 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, it's #15952, should have put that in the code comment

@adriangb adriangb deleted the friendlymatthew/preserve-cache-when-projecting branch August 11, 2025 21:37
@AdamGS AdamGS mentioned this pull request Aug 12, 2025
5 tasks
AdamGS pushed a commit to AdamGS/arrow-datafusion that referenced this pull request Aug 12, 2025
alamb pushed a commit that referenced this pull request Aug 12, 2025
* Preserve equivalence properties during projection pushdown (#17129)

* Adds parquet data diffs

---------

Co-authored-by: Matthew Kim <38759997+friendlymatthew@users.noreply.github.com>
'2023-10-01 00:00:00'::timestamptz AS start_timestamp,
'prod' as deployment_environment
)
TO 'data/1.parquet';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW this writes to datafusion/sqllogictest/data/1.parquet which was also checked in. I hit issues during upgrade with this here (the data is different so the files being written become different):

#17175 (comment)

I have a fix and will forward port it to main, but wanted to drop a note here in case anyone else saw it

LiaCastaneda pushed a commit to DataDog/datafusion that referenced this pull request Aug 14, 2025
* Preserve equivalence properties during projection pushdown (apache#17129)

* Adds parquet data diffs

---------

Co-authored-by: Matthew Kim <38759997+friendlymatthew@users.noreply.github.com>
LiaCastaneda added a commit to DataDog/datafusion that referenced this pull request Aug 14, 2025
* Preserve equivalence properties during projection pushdown (apache#17129)

* Adds parquet data diffs

---------

Co-authored-by: Adam Gutglick <adam@spiraldb.com>
Co-authored-by: Matthew Kim <38759997+friendlymatthew@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

datasource Changes to the datasource crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Equivalence properties lost if projection is pushed down in FileScanConfig

3 participants