isin fails with large series/lists of tuples #17910

Closed
gpascualg opened this issue Oct 18, 2017 · 3 comments
Labels
Duplicate Report Duplicate issue or pull request

Comments

@gpascualg

Code Sample

from random import randint

import pandas as pd

def bug_report(n=2000000, idmax=22750, prodmax=3414341):
    ids = [randint(1, idmax) for _ in range(n)]
    r = lambda: randint(1, prodmax)
    # The first two products are identical on purpose, so at least one tuple has count 2
    prods = [(-1,-2,-3), (-1,-2,-3)] + [(r(), r(), r()) for _ in range(n-2)]

    df = pd.DataFrame({'ids': ids, 'products': prods})
    counts = df['products'].value_counts()
    # Tuples that appear at least twice
    counts_idxs = counts[counts >= 2].index
    # Expected: True for every row whose tuple is repeated
    idxs = df['products'].isin(counts_idxs)
    return df[idxs]

Problem description

There are several ways to trigger the bug, all of them resulting in isin returning all False where some entries should be True.

Take the example above: the tuple (-1,-2,-3) appears twice, and one can check that its entry in counts is 2 and that counts_idxs contains (-1,-2,-3). Therefore, regardless of the other products, the DataFrame selected with the isin mask should have at least 2 rows. Calling the function as is returns an empty DataFrame instead. Explanation, causes and possible solutions below:

Manually importing isin (from pandas.core.algorithms import isin) and setting idxs = isin(df['products'], counts[counts >= 2].index) results in the exact same behaviour.

I've tried to reproduce this same behaviour when not using tuples at all and I can't seem to succeed.

Proposed solution

This seems to be a regression in 0.20.x as using latest 0.19.x (0.19.2) works perfectly fine. Indeed, manually copying isin from 0.19.x and using it instead of 0.20.x works. One can see that a particular if was reversed/erased in
https://github.com/pandas-dev/pandas/blob/master/pandas/core/algorithms.py#L414
and
https://github.com/pandas-dev/pandas/blob/v0.19.2/pandas/core/algorithms.py#L144
https://github.com/pandas-dev/pandas/blob/v0.19.2/pandas/core/algorithms.py#L161

This results in 0.20.x relying on numpy.in1d whereas 0.19.x used lib.ismember, which is equivalent to htable.ismember_object in 0.20.x. One can confirm this because:

import numpy as np

htable = pd._libs.hashtable
idxs = htable.ismember_object(df['products'].values, np.asarray(counts[counts >= 2].index))
df[idxs]

works fine, whereas

idxs = np.in1d(df['products'].values, np.asarray(counts[counts >= 2].index))
df[idxs]

silently fails.

Now, either this is temporarily fixed in pandas by not relying on in1d, or an issue is submitted to numpy (which I will do once I can take a look at in1d and see what's happening). One can also work around it by not using tuples at all, for example by applying hash to them beforehand.
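The hash-based workaround mentioned above could look roughly like the following. This is only a sketch: the helper name repeated_products and its parameters are made up for illustration, and it assumes that hash collisions between distinct tuples are acceptable or rare enough for the data at hand (in CPython, for instance, hash(-1) == hash(-2), so collisions are not purely theoretical).

```python
import pandas as pd

def repeated_products(df, column="products", min_count=2):
    """Return the rows whose value in `column` occurs at least `min_count` times.

    Hypothetical helper: the tuples are replaced by their integer hashes
    before counting and before calling isin, so isin only compares ints.
    """
    keys = df[column].map(hash)          # one integer per tuple
    counts = keys.value_counts()         # occurrences of each hash
    repeated_keys = counts[counts >= min_count].index
    return df[keys.isin(repeated_keys)]

# Small sample mirroring the report: one tuple appears twice.
sample = pd.DataFrame({
    "ids": [1, 2, 3, 4],
    "products": [(-1, -2, -3), (-1, -2, -3), (1, 2, 3), (4, 5, 6)],
})
result = repeated_products(sample)
```

On this sample, result should contain only the two (-1, -2, -3) rows. Whether this also avoids the failure at the sizes reported here has not been checked in this sketch.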

I've narrowed the problem down a bit more, and it is related not only to n but also to prodmax:

Any combination with n >= 1000001 and prodmax > 1986 produces an empty dataframe:

bug_report(n=1000001, prodmax=1987)
bug_report(n=1000001)
bug_report()

Whereas having n <= 1000000 or prodmax <= 1986 works just fine. Parameter values have been deduced from:

def narrow():
    start = 256
    end = 2048
    while start + 1 < end:
        print(start, end)
        df = bug_report(n=1000001, prodmax=(start + end) // 2)
        if df.empty:
            end = (start + end) // 2
        else:
            start = (start + end) // 2
    
    return start, df.empty

narrow()
# (1896, False)
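The narrowing above is a bisection over prodmax. A generic version of the same idea, decoupled from pandas, might look like the sketch below. The function name last_passing is hypothetical, and the predicate in the example is a stand-in with an arbitrary threshold, not the real bug_report check; the search also assumes the predicate switches from True to False exactly once inside the range.

```python
def last_passing(passes, lo, hi):
    """Return the largest v in [lo, hi) for which passes(v) is True.

    Assumes passes(lo) is True, passes(hi) is False, and that the
    predicate flips from True to False only once between them.
    """
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if passes(mid):
            lo = mid   # mid still works, the threshold is at or above it
        else:
            hi = mid   # mid already fails, the threshold is below it
    return lo

# Stand-in predicate with an arbitrary threshold, for illustration only.
threshold = last_passing(lambda prodmax: prodmax <= 1000, 256, 2048)
```

With the function from the code sample, the predicate would be something like lambda p: not bug_report(n=1000001, prodmax=p).empty, keeping in mind that each call builds a DataFrame with about a million rows of random data.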

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-72-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: 3.0.5
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
s3fs: None
pandas_gbq: None
pandas_datareader: None

This has been confirmed and tested on multiple PCs and environments, always with Python 3.x.

@jorisvandenbossche
Member

There has been a related issue and fix (#16012, #16969), so this might already be fixed in master.
Would you be able to test that with the 0.21.0 release candidate? (published a few days ago, see https://groups.google.com/forum/#!topic/pydata/kxVdd6ZHjvI)

@gpascualg
Author

gpascualg commented Oct 18, 2017

I did a fresh install of Python 3.6, upgraded to 0.21.0-rc1, and can confirm the issue is no longer there; it has been solved.

For reference:

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-72-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0rc1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

I am closing the issue as it is already solved. If needed, feel free to reopen!

Thank you!

@jorisvandenbossche
Member

Thanks for testing!

@jorisvandenbossche jorisvandenbossche added the Duplicate Report Duplicate issue or pull request label Oct 18, 2017
@jorisvandenbossche jorisvandenbossche added this to the No action milestone Oct 18, 2017