
Pandas doesn't always cast strings to int consistently when using .isin() #24918


Closed
BasilBeirouti opened this issue Jan 25, 2019 · 8 comments
Labels
Bug, isin (isin method)

Comments

@BasilBeirouti

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,10, size=(9999, 2)), columns=['key', 'value'])

# df.value contains integers 2 and 4, does not contain strings '2', '4', or 'fart' or integer 11

# example A: this returns a non-empty df
print(len(df[df.value.isin(['2'])]))
# example B: this returns a non-empty df
print(len(df[df.value.isin(['2', 4])]))
# example C: this returns a non-empty df
print(len(df[df.value.isin(['fart', 4])]))
# example D: this returns a non-empty df
print(len(df[df.value.isin(['fart', 4, '2'])]))
# example E: this returns a non-empty df
print(len(df[df.value.isin(['2', '4'])]))
# example F: this returns a non-empty df
print(len(df[df.value.isin(['2', '4', '11'])]))
# example G: this DOES NOT return a non-empty df
print(len(df[df.value.isin(['fart', '2', '4'])]))

Problem description

In the examples above, examples A-F behave as expected; only example G behaves unexpectedly.

When using df[df.mycol.isin(alist)] with an integer column mycol, Pandas appears to convert strings in alist to integers and intelligently check whether they occur in mycol. Only in the last case (example G) does it fail to do so. This seems inconsistent to me because it is able to ignore 'fart' in previous examples (examples C and D). It is also able to correctly cast '2' to an int and return matching rows (examples A, B, D). It is also able to do both of these things simultaneously (example D). It is also able to correctly cast two matching elements (example E). Only when all elements need to be cast, and one of them is uncastable, does the unexpected behavior occur (example G). In example F, all elements need to be cast, but they are all castable, so it works fine.

Basically, as long as they are all castable to the correct type, or there is at least one element of the correct type, it works.

To put it another way, if df[df.mycol.isin(alist)] returns something, then df[df.mycol.isin(alist + [newel])] should return at least that same thing. This seems reasonable to me, and I think it is how it was intended to work.
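The property stated here can be written down as a small check. The following is a minimal pure-Python sketch, not pandas code: `strict_isin` and `is_monotone` are hypothetical helper names, and `strict_isin` uses plain Python equality with no casting at all.

```python
def strict_isin(column, values):
    """Membership mask using plain Python equality (no casting)."""
    return [x in values for x in column]


def is_monotone(isin_fn, column, alist, newel):
    """Return True if adding newel to alist never drops a matched row."""
    before = isin_fn(column, alist)
    after = isin_fn(column, alist + [newel])
    # Every row that matched before must still match afterwards.
    return all(a for b, a in zip(before, after) if b)


column = [2, 4, 7, 2]
print(is_monotone(strict_isin, column, [2], "fart"))  # True
```

Any membership test that compares elements one by one satisfies this property. It can only be violated when the lookup list is transformed as a whole before the comparison.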

For my own edification, I would like to understand where the code that does the casting / comparisons is implemented. Is it in C or Python?

Expected Output

The last line should ignore the string 'fart', correctly cast the string '2' to int 2 and the string '4' to int 4, as it does in previous examples, and return the matching rows in df.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.16.0
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.2.16
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Jan 25, 2019

Thanks for the report, though I believe the expectation is wrong. Why should A, E, F or G return anything?

@WillAyd added the Bug and Indexing (related to indexing on series/frames, not to indexes themselves) labels on Jan 25, 2019
@WillAyd added this to the Contributions Welcome milestone on Jan 25, 2019
@BasilBeirouti
Author

BasilBeirouti commented Jan 25, 2019

@WillAyd A, E and F do though. Why should G be different? Either all of them should return something or none of them should.

I think it is counterintuitive for a .isin(mylist) to return some rows, and .isin(mylist + [newel]) to return no rows. Adding more elements to the list that pandas is matching on should never result in fewer elements being returned.

Do you know if there is a C subroutine that performs the comparisons or if it's in Python?

@WillAyd
Member

WillAyd commented Jan 25, 2019

> Either all of them should return something or none of them should.

That's my point. None of these should return anything.

The implementation of this for a Series is linked below if you want to take a look

def isin(self, values):

@BasilBeirouti
Author

BasilBeirouti commented Jan 26, 2019

Thanks @WillAyd. I followed that function into the isin() implementation in core/algorithms.py

except (TypeError, ValueError, OverflowError):

That function sees that the series comps (which corresponds to mycol in my example) is of type int, so it casts values (the same as alist in my example) to int. This throws ValueError with example G, but not with the others. The ValueError is handled by casting both values and comps to Python objects and then attempting the comparison one last time.

At first I thought that example D was returning matches for '2' and 4, but actually, it is only returning matching rows for 4. This actually makes sense to me.

So really what it comes down to is that .astype('int64') is all or nothing, meaning that if even one element cannot be cast, then no elements are cast and ValueError is thrown. This of course makes sense.
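The control flow described above can be modeled in a few lines. This is a rough pure-Python sketch, not the code in core/algorithms.py: `isin_model` is a hypothetical name, and a list comprehension over `int()` stands in for `.astype('int64')` applied to the whole array.

```python
def isin_model(comps, values):
    """Toy model of the cast-then-fall-back behavior described in the thread."""
    try:
        # All or nothing: one uncastable element aborts the whole cast.
        candidates = [int(v) for v in values]
    except (TypeError, ValueError, OverflowError):
        # Fall back to comparing the original objects, so '2' != 2.
        candidates = list(values)
    return [c in candidates for c in comps]


comps = [2, 4, 7]
print(isin_model(comps, ["2", "4"]))          # example E: [True, True, False]
print(isin_model(comps, ["fart", 4, "2"]))    # example D: [False, True, False]
print(isin_model(comps, ["fart", "2", "4"]))  # example G: [False, False, False]
```

In this model, example D matches only the rows equal to 4, and example G matches nothing, which is consistent with the results reported in this thread for pandas 0.23.4. Other pandas versions may behave differently.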

Iterating through each element and casting it is probably not what we want. I can also see why ValueError is handled and not returned to the user. So really, I'm not sure if anything should be changed here, although @WillAyd may prefer to make the comparisons more strict and produce fewer matches. Another option might be to just include a warning message to alert the user that attempting to cast the elements raised a ValueError, and that elements were being compared as objects instead.

It was a fun rabbit hole to go down, thanks again Will!

@WillAyd
Member

WillAyd commented Jan 27, 2019

@BasilBeirouti thanks a lot for investigating! Looking at the line you've called out, I get the impression this is an unintended consequence of that exception handling.

cc @jreback in case he knows of something I don't and/or objects to making this comparison stricter.

@jreback
Contributor

jreback commented Jan 27, 2019

This takes a pretty tricky code path, and sometimes we call out to numpy, which has some weird inference rules.

It's possible this is a bug. It's also somewhat performance sensitive.

@jbrockmendel added the isin (isin method) label on Oct 30, 2020
@ma-ji

ma-ji commented Jan 8, 2021

This is a consistency issue, and pandas should have a consistent strategy. On my end the problem is:

int isin str works, but str isin int does not.

@jbrockmendel removed the Indexing (related to indexing on series/frames, not to indexes themselves) label on Jun 12, 2021
@mroeschke removed this from the Contributions Welcome milestone on Oct 13, 2022
@jbrockmendel
Member

As described in #24918 (comment), the expected behavior is not to cast strings to ints, so the behavior we see in the OP is correct. Closing.
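Since strings are not meant to be cast implicitly, a caller who wants the result requested in the original post can convert the lookup values explicitly before calling .isin(). The following is a minimal sketch under that assumption: `to_int_candidates` is a hypothetical helper, not a pandas function. It silently drops anything `int()` rejects, and `int()` also truncates floats such as 2.5, so it may not suit every dataset.

```python
def to_int_candidates(values):
    """Convert each value to int individually, skipping uncastable ones."""
    result = []
    for v in values:
        try:
            result.append(int(v))
        except (TypeError, ValueError):
            # e.g. 'fart' is dropped instead of aborting the whole cast
            continue
    return result


print(to_int_candidates(["fart", "2", "4"]))  # [2, 4]
```

The resulting list can then be passed to df.value.isin(...), so that integers are compared with integers and no implicit casting is involved.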
