Inconsistent behavior when DataFrame with strings and None is created from lists or dictionary. #32218

Closed
Tracked by #52523 ...
ali-cetin-4ss opened this issue Feb 24, 2020 · 8 comments · Fixed by #52727
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions

Comments

@ali-cetin-4ss

ali-cetin-4ss commented Feb 24, 2020

Code Sample

import pandas as pd


pd.__version__

# df created from a list: None is cast to the string "None"
df = pd.DataFrame(["1", "2", None], columns=["a"], dtype="str")
type(df.loc[2].values[0])


# Equivalent df created from a dict: None remains NoneType
df = pd.DataFrame({"a": ["1", "2", None]}, dtype="str")
type(df.loc[2].values[0])

Problem description

None is not cast consistently when a DataFrame containing None values is created with dtype set to str.

If the DataFrame is created from a list, None is cast to str: None -> "None".
If the DataFrame is created from a dict, None remains NoneType: None -> None.

IMO, the latter is the preferred behavior. I hope you consider this when continuing your work on string types and na types in future versions.

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 41.2.0
Cython : None
pytest : 5.3.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@MarcoGorelli MarcoGorelli added the Constructors Series/DataFrame/Index/pd.array Constructors label Feb 24, 2020
@prakhar987
Contributor

None is converted to 'None' when the NumPy array (made from the passed list ['1', '2', None]) is cast:

values = values.astype(dtype)

Should NumPy handle this, or should we add a workaround in pandas? I can work on a PR.
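
To illustrate the cast described above, here is a minimal NumPy-only sketch. It shows that `astype(str)` applies `str()` to every element of an object array, so None turns into the text "None". The helper `astype_str_keep_none` is a hypothetical name introduced for illustration; it is one possible workaround, not the fix pandas actually adopted.

```python
import numpy as np

# Object array like the one built from the passed list ["1", "2", None]
values = np.array(["1", "2", None], dtype=object)

# Plain cast: every element is stringified, so None becomes the text "None"
casted = values.astype(str)


def astype_str_keep_none(arr):
    """Hypothetical helper: cast entries to str but leave None untouched."""
    result = arr.astype(object)  # astype returns a copy by default
    mask = np.array([x is None for x in arr], dtype=bool)
    # Stringify only the non-None entries and write them back
    result[~mask] = arr[~mask].astype(str)
    return result


preserved = astype_str_keep_none(values)
```

The masked version keeps the result as an object array, since a fixed-width NumPy string array has no way to hold None.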

@ali-cetin-4ss
Author

But isn't the "bigger" issue here that the behavior differs between two seemingly equivalent ways of instantiating a DataFrame?

@prakhar987
Contributor

I am assuming one of them is wrong: the case where None becomes "None".

@mroeschke mroeschke added the Bug label Jun 28, 2020
@cgarciae

I tested against master (895f0b4) and the issue is now gone; I think this can be closed.

@jreback
Contributor

jreback commented Apr 11, 2021

we would take a PR with a validation test
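
A validation test along these lines might look like the sketch below. The test name is hypothetical and this is not the test that was eventually merged; it only checks that the two constructors from the original report agree and that None stays a missing value. It is expected to pass only on pandas versions where the issue is fixed, and it would fail on 1.0.1 as reported.

```python
import pandas as pd


def test_none_preserved_for_list_and_dict_input():
    # Hypothetical regression test: both constructors should agree
    from_list = pd.DataFrame(["1", "2", None], columns=["a"], dtype="str")
    from_dict = pd.DataFrame({"a": ["1", "2", None]}, dtype="str")

    # Same values and dtype regardless of how the frame was built
    pd.testing.assert_frame_equal(from_list, from_dict)

    # None must remain a missing value, not the string "None"
    assert from_list["a"].isna().tolist() == [False, False, True]
    assert from_list.loc[2, "a"] != "None"
```

Comparing with `isna()` rather than `is None` keeps the check valid whether the missing entry is stored as None or as another NA marker.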

@cgarciae

take

@simonjayhawkins simonjayhawkins added Needs Tests Unit test(s) needed to prevent regressions and removed Bug Constructors Series/DataFrame/Index/pd.array Constructors labels May 25, 2021
@mpark1994

Hey, I'm looking for first issues and I'm wondering what needs to be done here?

@harsimran44

Hey, can I do this? Could you guide me on how to do it, please?
