Filtering on duplicate columns after lazy right-join giving incorrect results #21142

CasperTeirlinck · 2025-02-08T11:07:51Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

When filtering on a duplicate column name after a lazy join, it looks like the wrong column is used, which results in incorrect results. When collecting right after the join and then filtering, the results are correct.

To reproduce, we can start with 2 DataFrames to be joined together on join, and that have a duplicate column name duplicate:

df_1 = pl.LazyFrame({"join": [1, 2, 4], "duplicate": ["a", "b", "c"], "col1": ["d", "e", "f"]})
df_2 = pl.LazyFrame({"join": [1, 2, 3], "duplicate": ["d", "e", "f"], "col2": ["g", "h", "i"]})

print(df_1.collect())
print(df_2.collect())

shape: (3, 3)
┌──────┬───────────┬──────┐
│ join ┆ duplicate ┆ col1 │
│ ---  ┆ ---       ┆ ---  │
│ i64  ┆ str       ┆ str  │
╞══════╪═══════════╪══════╡
│ 1    ┆ a         ┆ d    │
│ 2    ┆ b         ┆ e    │
│ 4    ┆ c         ┆ f    │
└──────┴───────────┴──────┘
shape: (3, 3)
┌──────┬───────────┬──────┐
│ join ┆ duplicate ┆ col2 │
│ ---  ┆ ---       ┆ ---  │
│ i64  ┆ str       ┆ str  │
╞══════╪═══════════╪══════╡
│ 1    ┆ d         ┆ g    │
│ 2    ┆ e         ┆ h    │
│ 3    ┆ f         ┆ i    │
└──────┴───────────┴──────┘

Example 1: Right join with filtering on left column = wrong result

df_joined_right = df_1.join(df_2, on="join", how="right")
print(df_joined_right.collect())

shape: (3, 5)
┌───────────┬──────┬──────┬─────────────────┬──────┐
│ duplicate ┆ col1 ┆ join ┆ duplicate_right ┆ col2 │
│ ---       ┆ ---  ┆ ---  ┆ ---             ┆ ---  │
│ str       ┆ str  ┆ i64  ┆ str             ┆ str  │
╞═══════════╪══════╪══════╪═════════════════╪══════╡
│ a         ┆ d    ┆ 1    ┆ d               ┆ g    │
│ b         ┆ e    ┆ 2    ┆ e               ┆ h    │
│ null      ┆ null ┆ 3    ┆ f               ┆ i    │
└───────────┴──────┴──────┴─────────────────┴──────┘

# filter after `.collect()` => correct result
print(df_joined_right.collect().filter(pl.col("duplicate").is_null()))

shape: (1, 5)
┌───────────┬──────┬──────┬─────────────────┬──────┐
│ duplicate ┆ col1 ┆ join ┆ duplicate_right ┆ col2 │
│ ---       ┆ ---  ┆ ---  ┆ ---             ┆ ---  │
│ str       ┆ str  ┆ i64  ┆ str             ┆ str  │
╞═══════════╪══════╪══════╪═════════════════╪══════╡
│ null      ┆ null ┆ 3    ┆ f               ┆ i    │
└───────────┴──────┴──────┴─────────────────┴──────┘

# filter before `.collect()` => wrong result
# it appears to filter on the right column instead of the left column
print(df_joined_right.filter(pl.col("duplicate").is_null()).collect())

shape: (0, 5)
┌───────────┬──────┬──────┬─────────────────┬──────┐
│ duplicate ┆ col1 ┆ join ┆ duplicate_right ┆ col2 │
│ ---       ┆ ---  ┆ ---  ┆ ---             ┆ ---  │
│ str       ┆ str  ┆ i64  ┆ str             ┆ str  │
╞═══════════╪══════╪══════╪═════════════════╪══════╡
└───────────┴──────┴──────┴─────────────────┴──────┘

Example 2: Left join with filtering on right column = correct result

When filtering on the right column, the issue does not seem to occur:

df_joined_left = df_1.join(df_2, on="join", how="left")
print(df_joined_left.collect())

shape: (3, 5)
┌──────┬───────────┬──────┬─────────────────┬──────┐
│ join ┆ duplicate ┆ col1 ┆ duplicate_right ┆ col2 │
│ ---  ┆ ---       ┆ ---  ┆ ---             ┆ ---  │
│ i64  ┆ str       ┆ str  ┆ str             ┆ str  │
╞══════╪═══════════╪══════╪═════════════════╪══════╡
│ 1    ┆ a         ┆ d    ┆ d               ┆ g    │
│ 2    ┆ b         ┆ e    ┆ e               ┆ h    │
│ 4    ┆ c         ┆ f    ┆ null            ┆ null │
└──────┴───────────┴──────┴─────────────────┴──────┘

# filter before `.collect()` => correct result
print(df_joined_left.filter(pl.col("duplicate_right").is_null()).collect())

shape: (1, 5)
┌──────┬───────────┬──────┬─────────────────┬──────┐
│ join ┆ duplicate ┆ col1 ┆ duplicate_right ┆ col2 │
│ ---  ┆ ---       ┆ ---  ┆ ---             ┆ ---  │
│ i64  ┆ str       ┆ str  ┆ str             ┆ str  │
╞══════╪═══════════╪══════╪═════════════════╪══════╡
│ 4    ┆ c         ┆ f    ┆ null            ┆ null │
└──────┴───────────┴──────┴─────────────────┴──────┘

Different filters also give wrong results

The above examples only use .is_null(), but other filters give the same issue:

print(df_joined_right.collect().filter(pl.col("duplicate") == "a"))

shape: (1, 5)
┌───────────┬──────┬──────┬─────────────────┬──────┐
│ duplicate ┆ col1 ┆ join ┆ duplicate_right ┆ col2 │
│ ---       ┆ ---  ┆ ---  ┆ ---             ┆ ---  │
│ str       ┆ str  ┆ i64  ┆ str             ┆ str  │
╞═══════════╪══════╪══════╪═════════════════╪══════╡
│ a         ┆ d    ┆ 1    ┆ d               ┆ g    │
└───────────┴──────┴──────┴─────────────────┴──────┘

print(df_joined_right.filter(pl.col("duplicate") == "a").collect())

shape: (0, 5)
┌───────────┬──────┬──────┬─────────────────┬──────┐
│ duplicate ┆ col1 ┆ join ┆ duplicate_right ┆ col2 │
│ ---       ┆ ---  ┆ ---  ┆ ---             ┆ ---  │
│ str       ┆ str  ┆ i64  ┆ str             ┆ str  │
╞═══════════╪══════╪══════╪═════════════════╪══════╡
└───────────┴──────┴──────┴─────────────────┴──────┘

Log output

Issue description

When filtering on a duplicate column name after a lazy join, it looks like the wrong column is used, which results in incorrect results. When collecting right after the join and then filtering, the results are correct.

Expected behavior

The results of a filter on a LazyFrame is the same as for a DataFrame.

Installed versions

--------Version info---------
Polars:              1.21.0
Index type:          UInt32
Platform:            Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:              3.13.1 (main, Dec  3 2024, 17:59:52) [GCC 11.4.0]
LTS CPU:             False

The text was updated successfully, but these errors were encountered:

nameexhaustion · 2025-02-10T04:53:17Z

Regression from fix: Fix incorrect filter after right-join on LazyFrame #19775

CasperTeirlinck added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Feb 8, 2025

ritchie46 added P-high Priority: high and removed needs triage Awaiting prioritization by a maintainer labels Feb 9, 2025

nameexhaustion self-assigned this Feb 10, 2025

nameexhaustion added the regression Issue introduced by a new release label Feb 10, 2025

nameexhaustion changed the title ~~Filtering on duplicate columns after lazy join giving incorrect results~~ Filtering on duplicate columns after lazy right-join giving incorrect results Feb 10, 2025

nameexhaustion added the A-optimizer Area: plan optimization label Feb 10, 2025

nameexhaustion assigned nameexhaustion and unassigned nameexhaustion Feb 14, 2025

nameexhaustion mentioned this issue Feb 17, 2025

fix: Fix incorrect predicate pushdown for predicates referring to right-join key columns #21293

Merged

nameexhaustion assigned nameexhaustion and unassigned nameexhaustion Feb 17, 2025

ritchie46 closed this as completed in #21293 Feb 17, 2025

c-peters added the accepted Ready for implementation label Feb 24, 2025

c-peters added this to Backlog Feb 24, 2025

c-peters moved this to Done in Backlog Feb 24, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Filtering on duplicate columns after lazy right-join giving incorrect results #21142

Filtering on duplicate columns after lazy right-join giving incorrect results #21142

CasperTeirlinck commented Feb 8, 2025

nameexhaustion commented Feb 10, 2025 •

edited

Loading

Filtering on duplicate columns after lazy right-join giving incorrect results #21142

Filtering on duplicate columns after lazy right-join giving incorrect results #21142

Comments

CasperTeirlinck commented Feb 8, 2025

Checks

Reproducible example

Example 1: Right join with filtering on left column = wrong result

Example 2: Left join with filtering on right column = correct result

Different filters also give wrong results

Log output

Issue description

Expected behavior

Installed versions

nameexhaustion commented Feb 10, 2025 • edited Loading

nameexhaustion commented Feb 10, 2025 •

edited

Loading