Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import io
import pandas as pd
csv = io.StringIO(f"""\
value,description
42,A small int
18446744073709551616.5,This is 2.0^64 + 0.5
""")
data = pd.read_csv(csv)
print(data.dtypes)
print(data["value"].values)
value object
description object
dtype: object
['42' '18446744073709551616.5']
Issue Description
If no float precedes it in the CSV column, pd.read_csv() (without dtype=) interprets a number of $2^{64}$ or larger as a str, and the column ends up with dtype object. The first two examples below work because the data stays below $2^{64}$.
Known work-arounds: One could use dtype={"value": "float64"}, but in our case we'd really prefer not to build the machinery and a type information database. Also, engine="pyarrow" seems to work correctly, but we'd prefer to avoid that dependency.
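For reference, the dtype= work-around looks roughly like this. It is a minimal sketch that reuses the data from the reproducible example and assumes the column name is known in advance, which is exactly the bookkeeping we would like to avoid:

```python
import io

import pandas as pd

# Same data as the reproducible example.
csv = io.StringIO("""\
value,description
42,A small int
18446744073709551616.5,This is 2.0^64 + 0.5
""")

# Forcing the dtype skips type inference for this column.
data = pd.read_csv(csv, dtype={"value": "float64"})
print(data.dtypes)
print(data["value"].values)
```

The "description" column is left to normal inference; only "value" is pinned to float64.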
OK, ex. 1: int smaller than $2^{64}$, interpreted as uint64:
csv = io.StringIO(f"""\
value,description
42,A small integer
18446744073709551615,This is 2^64 - 1
""")
data = pd.read_csv(csv)
print(data.dtypes)
print(data["value"].values)
value uint64
description object
dtype: object
[ 42 18446744073709551615]
OK, ex. 2: float smaller than $2^{64}$, interpreted as float64:
csv = io.StringIO(f"""\
value,description
42,A small integer
18446744073709551615.0,This is 2.0^64 - 1
""")
data = pd.read_csv(csv)
print(data.dtypes)
print(data["value"].values)
value float64
description object
dtype: object
[4.20000000e+01 1.84467441e+19]
OK, ex. 3: small float followed by $2^{64}$ as an int, interpreted as float64:
csv = io.StringIO(f"""\
value,description
4.2,A small float
18446744073709551616,This is 2^64
""")
data = pd.read_csv(csv)
print(data.dtypes)
print(data["value"].values)
value float64
description object
dtype: object
[4.20000000e+00 1.84467441e+19]
FAIL, ex. 4: small int followed by $2^{64}+0.5$ as a float, interpreted as a str:
csv = io.StringIO(f"""\
value,description
42,A small int
18446744073709551616.5,This is 2.0^64 + 0.5
""")
data = pd.read_csv(csv)
print(data.dtypes)
print(data["value"].values)
value object
description object
dtype: object
['42' '18446744073709551616.5']
FAIL, ex. 5: small int followed by $2^{64}$ as an int, interpreted as a str:
csv = io.StringIO(f"""\
value,description
42,A small int
18446744073709551616,This is 2^64
""")
data = pd.read_csv(csv)
print(data.dtypes)
print(data["value"].values)
value object
description object
dtype: object
['42' '18446744073709551616']
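The threshold in the examples above lines up with the largest value a uint64 can hold. The arithmetic can be checked independently of pandas; this short sketch only verifies the numbers and makes no claim about what the parser does internally:

```python
import numpy as np

# Largest integer representable as uint64: 2^64 - 1 (the value in ex. 1).
uint64_max = int(np.iinfo(np.uint64).max)
print(uint64_max)

# 2^64 itself no longer fits in uint64, but it is exactly representable
# as a float64, since it is a power of two.
two_pow_64 = float(2**64)
print(two_pow_64)

# Doubles near 2^64 are spaced 4096 apart, so 2^64 + 0.5 rounds to 2^64.
rounded = float("18446744073709551616.5")
print(rounded == two_pow_64)
```

So both failing values have a perfectly good float64 representation, which is why float64 is the expected dtype below.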
Expected Behavior
A small int followed by $2^{64}+0.5$ as a float should be interpreted as float64:
csv = io.StringIO(f"""\
value,description
42,A small int
18446744073709551616.5,This is 2.0^64 + 0.5
""")
data = pd.read_csv(csv)
print(data.dtypes)
print(data["value"].values)
value float64
description object
dtype: object
[4.20000000e+01 1.84467441e+19]
A small int followed by $2^{64}$ as an int should be interpreted as float64:
csv = io.StringIO(f"""\
value,description
42,A small int
18446744073709551616,This is 2^64
""")
data = pd.read_csv(csv)
print(data.dtypes)
print(data["value"].values)
value float64
description object
dtype: object
[4.20000000e+01 1.84467441e+19]
Installed Versions
INSTALLED VERSIONS
commit : 2e218d1
python : 3.11.1.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.85
Version : #1-NixOS SMP Wed Dec 21 16:36:38 UTC 2022
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : fi_FI.UTF-8
LOCALE : fi_FI.UTF-8
pandas : 1.5.3
numpy : 1.24.1
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 20.3.4
Cython : None
pytest : 7.2.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.8.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.0
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None