Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(python): Give priority to pycapsule interface in from_dataframe #21377

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

MarcoGorelli
Copy link
Collaborator

@MarcoGorelli MarcoGorelli commented Feb 20, 2025

closes #20316. As in, to clarify:

  • The first example in that issue uses pandas numpy types, which don't support missing values, and so there was no possibility of ever round-tripping
  • The second example is already fixed with pandas 2.2.2+, but not with 2.2.1. We can't anything about interchange protocol bugs in old pandas versions, but we can use something better than the interchange protocol 😎 With this PR, the second example from the issue passes with pandas 2.2.1. Given that there's possibly millions of people still using pandas pre-2.2.2, I think this is worth fixing

As suggested in #20065 (comment), this gives priority to the pycapsule interface over the interchange protocol. I did the same in pandas: pandas-dev/pandas#60739

"pl.from_dataframe" shows up a scary amount of times in https://github.com/search?q=%22pl.from_dataframe%22&type=code 🥶

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Feb 20, 2025
Comment on lines 179 to 183
with pytest.raises(
CopyNotAllowedError,
match="byte-packed boolean buffer must be converted to bit-packed boolean",
):
result = pl.from_dataframe(df, allow_copy=False)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this no longer raises with allow_copy=False - I don't know how strict (if at all?) the pycapsule interface is about copies, will take a look

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked and none of the examples in https://github.com/search?q=%22pl.from_dataframe%22&type=code use allow_copy=False, so I'm not sure how much we should be concerned?

I'd like to suggest just deprecating the argument tbh

Copy link

codecov bot commented Feb 20, 2025

Codecov Report

Attention: Patch coverage is 86.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 79.95%. Comparing base (25d83c6) to head (96a6c44).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
py-polars/polars/convert/general.py 86.66% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #21377      +/-   ##
==========================================
+ Coverage   79.84%   79.95%   +0.10%     
==========================================
  Files        1597     1597              
  Lines      228911   228956      +45     
  Branches     2618     2621       +3     
==========================================
+ Hits       182785   183051     +266     
+ Misses      45527    45300     -227     
- Partials      599      605       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@MarcoGorelli MarcoGorelli force-pushed the from-dataframe branch 6 times, most recently from 76415ba to 0adf009 Compare February 21, 2025 10:41
@MarcoGorelli MarcoGorelli changed the title feat(python): Give priority to pycapsule interface in from_dataframe feat(python!): Give priority to pycapsule interface in from_dataframe Feb 21, 2025
@github-actions github-actions bot added the breaking python Change that breaks backwards compatibility for the Python package label Feb 21, 2025
@MarcoGorelli MarcoGorelli marked this pull request as ready for review February 21, 2025 12:13
@MarcoGorelli MarcoGorelli marked this pull request as draft February 21, 2025 12:58
@MarcoGorelli MarcoGorelli changed the title feat(python!): Give priority to pycapsule interface in from_dataframe fix(python): Give priority to pycapsule interface in from_dataframe Feb 21, 2025
@github-actions github-actions bot added the fix Bug fix label Feb 21, 2025
@@ -123,7 +126,7 @@ def test_to_dataframe_pandas_zero_copy_parametric(df: pl.DataFrame) -> None:
)
def test_from_dataframe_pyarrow_parametric(df: pl.DataFrame) -> None:
df_pa = df.to_arrow()
result = pl.from_dataframe(df_pa)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these tests are very specific to the interchange protocol, so I've kept the test to those (rather than letting the pycapsule interface take precedence)

@MarcoGorelli MarcoGorelli marked this pull request as ready for review February 21, 2025 13:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking python Change that breaks backwards compatibility for the Python package enhancement New feature or an improvement of an existing feature fix Bug fix python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error converting to and from pandas Dataframe for pandas 2.2.1
1 participant