Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filtering on duplicate columns after lazy right-join giving incorrect results #21142

Closed
2 tasks done
CasperTeirlinck opened this issue Feb 8, 2025 · 1 comment · Fixed by #21293
Closed
2 tasks done
Assignees
Labels
A-optimizer Area: plan optimization accepted Ready for implementation bug Something isn't working P-high Priority: high python Related to Python Polars regression Issue introduced by a new release

Comments

@CasperTeirlinck
Copy link

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

When filtering on a duplicate column name after a lazy join, it looks like the wrong column is used, which results in incorrect results. When collecting right after the join and then filtering, the results are correct.

To reproduce, we can start with 2 DataFrames to be joined together on join, and that have a duplicate column name duplicate:

df_1 = pl.LazyFrame({"join": [1, 2, 4], "duplicate": ["a", "b", "c"], "col1": ["d", "e", "f"]})
df_2 = pl.LazyFrame({"join": [1, 2, 3], "duplicate": ["d", "e", "f"], "col2": ["g", "h", "i"]})
print(df_1.collect())
print(df_2.collect())
shape: (3, 3)
┌──────┬───────────┬──────┐
│ join ┆ duplicate ┆ col1 │
│ ---  ┆ ---       ┆ ---  │
│ i64  ┆ str       ┆ str  │
╞══════╪═══════════╪══════╡
│ 1    ┆ a         ┆ d    │
│ 2    ┆ b         ┆ e    │
│ 4    ┆ c         ┆ f    │
└──────┴───────────┴──────┘
shape: (3, 3)
┌──────┬───────────┬──────┐
│ join ┆ duplicate ┆ col2 │
│ ---  ┆ ---       ┆ ---  │
│ i64  ┆ str       ┆ str  │
╞══════╪═══════════╪══════╡
│ 1    ┆ d         ┆ g    │
│ 2    ┆ e         ┆ h    │
│ 3    ┆ f         ┆ i    │
└──────┴───────────┴──────┘

Example 1: Right join with filtering on left column = wrong result

df_joined_right = df_1.join(df_2, on="join", how="right")
print(df_joined_right.collect())
shape: (3, 5)
┌───────────┬──────┬──────┬─────────────────┬──────┐
│ duplicate ┆ col1 ┆ join ┆ duplicate_right ┆ col2 │
│ ---       ┆ ---  ┆ ---  ┆ ---             ┆ ---  │
│ str       ┆ str  ┆ i64  ┆ str             ┆ str  │
╞═══════════╪══════╪══════╪═════════════════╪══════╡
│ a         ┆ d    ┆ 1    ┆ d               ┆ g    │
│ b         ┆ e    ┆ 2    ┆ e               ┆ h    │
│ null      ┆ null ┆ 3    ┆ f               ┆ i    │
└───────────┴──────┴──────┴─────────────────┴──────┘
# filter after `.collect()` => correct result
print(df_joined_right.collect().filter(pl.col("duplicate").is_null()))
shape: (1, 5)
┌───────────┬──────┬──────┬─────────────────┬──────┐
│ duplicate ┆ col1 ┆ join ┆ duplicate_right ┆ col2 │
│ ---       ┆ ---  ┆ ---  ┆ ---             ┆ ---  │
│ str       ┆ str  ┆ i64  ┆ str             ┆ str  │
╞═══════════╪══════╪══════╪═════════════════╪══════╡
│ null      ┆ null ┆ 3    ┆ f               ┆ i    │
└───────────┴──────┴──────┴─────────────────┴──────┘
# filter before `.collect()` => wrong result
# it appears to filter on the right column instead of the left column
print(df_joined_right.filter(pl.col("duplicate").is_null()).collect())
shape: (0, 5)
┌───────────┬──────┬──────┬─────────────────┬──────┐
│ duplicate ┆ col1 ┆ join ┆ duplicate_right ┆ col2 │
│ ---       ┆ ---  ┆ ---  ┆ ---             ┆ ---  │
│ str       ┆ str  ┆ i64  ┆ str             ┆ str  │
╞═══════════╪══════╪══════╪═════════════════╪══════╡
└───────────┴──────┴──────┴─────────────────┴──────┘

Example 2: Left join with filtering on right column = correct result

When filtering on the right column, the issue does not seem to occur:

df_joined_left = df_1.join(df_2, on="join", how="left")
print(df_joined_left.collect())
shape: (3, 5)
┌──────┬───────────┬──────┬─────────────────┬──────┐
│ join ┆ duplicate ┆ col1 ┆ duplicate_right ┆ col2 │
│ ---  ┆ ---       ┆ ---  ┆ ---             ┆ ---  │
│ i64  ┆ str       ┆ str  ┆ str             ┆ str  │
╞══════╪═══════════╪══════╪═════════════════╪══════╡
│ 1    ┆ a         ┆ d    ┆ d               ┆ g    │
│ 2    ┆ b         ┆ e    ┆ e               ┆ h    │
│ 4    ┆ c         ┆ f    ┆ null            ┆ null │
└──────┴───────────┴──────┴─────────────────┴──────┘
# filter before `.collect()` => correct result
print(df_joined_left.filter(pl.col("duplicate_right").is_null()).collect())
shape: (1, 5)
┌──────┬───────────┬──────┬─────────────────┬──────┐
│ join ┆ duplicate ┆ col1 ┆ duplicate_right ┆ col2 │
│ ---  ┆ ---       ┆ ---  ┆ ---             ┆ ---  │
│ i64  ┆ str       ┆ str  ┆ str             ┆ str  │
╞══════╪═══════════╪══════╪═════════════════╪══════╡
│ 4    ┆ c         ┆ f    ┆ null            ┆ null │
└──────┴───────────┴──────┴─────────────────┴──────┘

Different filters also give wrong results

The above examples only use .is_null(), but other filters give the same issue:

print(df_joined_right.collect().filter(pl.col("duplicate") == "a"))
shape: (1, 5)
┌───────────┬──────┬──────┬─────────────────┬──────┐
│ duplicate ┆ col1 ┆ join ┆ duplicate_right ┆ col2 │
│ ---       ┆ ---  ┆ ---  ┆ ---             ┆ ---  │
│ str       ┆ str  ┆ i64  ┆ str             ┆ str  │
╞═══════════╪══════╪══════╪═════════════════╪══════╡
│ a         ┆ d    ┆ 1    ┆ d               ┆ g    │
└───────────┴──────┴──────┴─────────────────┴──────┘
print(df_joined_right.filter(pl.col("duplicate") == "a").collect())
shape: (0, 5)
┌───────────┬──────┬──────┬─────────────────┬──────┐
│ duplicate ┆ col1 ┆ join ┆ duplicate_right ┆ col2 │
│ ---       ┆ ---  ┆ ---  ┆ ---             ┆ ---  │
│ str       ┆ str  ┆ i64  ┆ str             ┆ str  │
╞═══════════╪══════╪══════╪═════════════════╪══════╡
└───────────┴──────┴──────┴─────────────────┴──────┘

Log output

Issue description

When filtering on a duplicate column name after a lazy join, it looks like the wrong column is used, which results in incorrect results. When collecting right after the join and then filtering, the results are correct.

Expected behavior

The results of a filter on a LazyFrame is the same as for a DataFrame.

Installed versions

--------Version info---------
Polars:              1.21.0
Index type:          UInt32
Platform:            Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:              3.13.1 (main, Dec  3 2024, 17:59:52) [GCC 11.4.0]
LTS CPU:             False
@CasperTeirlinck CasperTeirlinck added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Feb 8, 2025
@ritchie46 ritchie46 added P-high Priority: high and removed needs triage Awaiting prioritization by a maintainer labels Feb 9, 2025
@nameexhaustion nameexhaustion self-assigned this Feb 10, 2025
@nameexhaustion nameexhaustion added the regression Issue introduced by a new release label Feb 10, 2025
@nameexhaustion
Copy link
Collaborator

nameexhaustion commented Feb 10, 2025

@nameexhaustion nameexhaustion changed the title Filtering on duplicate columns after lazy join giving incorrect results Filtering on duplicate columns after lazy right-join giving incorrect results Feb 10, 2025
@nameexhaustion nameexhaustion added the A-optimizer Area: plan optimization label Feb 10, 2025
@c-peters c-peters added the accepted Ready for implementation label Feb 24, 2025
@c-peters c-peters added this to Backlog Feb 24, 2025
@c-peters c-peters moved this to Done in Backlog Feb 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-optimizer Area: plan optimization accepted Ready for implementation bug Something isn't working P-high Priority: high python Related to Python Polars regression Issue introduced by a new release
Projects
Archived in project
4 participants