pd.Series.isin has different behaviour based on number of rows #25395

Closed
@dimitar-petrov

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

ONEMIL = 1_000_000
big_ser = pd.Series([np.nan] + [1] * ONEMIL)

big_ser.isin([np.nan]).head()
0    False
1    False
2    False
3    False
4    False
dtype: bool

big_ser.isin([np.nan]).value_counts()
False    1000001
dtype: int64

small_ser = pd.Series([np.nan] + [1] * (ONEMIL - 1))

small_ser.isin([np.nan]).head()
0     True
1    False
2    False
3    False
4    False
dtype: bool

small_ser.isin([np.nan]).value_counts()
False    999999
True          1
dtype: int64

Problem description

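The behaviour is inconsistent: isin([np.nan]) matches the NaN row in a Series of 1,000,000 rows, but finds no match in a Series of 1,000,001 rows that differs only by one extra non-NaN element.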

Expected Output

The result of pd.Series.isin should not depend on the length of the Series.

Both cases should report exactly 1 row containing np.nan.
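
Until the behaviour is consistent, one possible workaround is to handle missing values separately from the ordinary membership test. The sketch below is only an illustration: isin_nan_safe is a hypothetical helper name, not part of pandas, and it assumes that NaN in the candidate list is meant to match NaN in the Series.

```python
import numpy as np
import pandas as pd


def isin_nan_safe(ser, values):
    """Hypothetical helper (not a pandas API): length-independent isin for NaN."""
    values = list(values)
    # Does the candidate list contain a missing value?
    has_nan = any(pd.isna(v) for v in values)
    # Run the ordinary membership test on the non-missing candidates only.
    non_nan = [v for v in values if not pd.isna(v)]
    result = ser.isin(non_nan)
    # Match missing rows explicitly instead of relying on isin for NaN.
    if has_nan:
        result = result | ser.isna()
    return result


ONEMIL = 1_000_000
big_ser = pd.Series([np.nan] + [1] * ONEMIL)
small_ser = pd.Series([np.nan] + [1] * (ONEMIL - 1))

big_hits = int(isin_nan_safe(big_ser, [np.nan]).sum())
small_hits = int(isin_nan_safe(small_ser, [np.nan]).sum())
print(big_hits, small_hits)
```

Because the NaN rows are matched through isna() rather than through isin, the count does not depend on which code path isin takes for a given Series length.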

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.20.8-arch1-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: en_US.utf-8
LANG: en_US.utf-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 4.0.1
pip: 18.1
setuptools: 40.6.2
Cython: None
numpy: 1.15.4
scipy: 1.2.0
pyarrow: 0.11.1
xarray: 0.11.0
IPython: 7.2.0
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.9
blosc: 1.6.2
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.2.1
pandas_gbq: None
pandas_datareader: 0.7.0

Labels: Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), Bug