BUG: read_csv() fails to detect floats larger than 2.0^64 #51295

@akaihola

Description

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.


Reproducible Example

import io
import pandas as pd

csv = io.StringIO(f"""\
value,description
42,A small int
18446744073709551616.5,This is 2.0^64 + 0.5
""")
data = pd.read_csv(csv)
print(data.dtypes)
print(data["value"].values)
value          object
description    object
dtype: object
['42' '18446744073709551616.5']

Issue Description

If no earlier value in a CSV column looks like a float, pd.read_csv() (without dtype=) reads numbers $\ge 2^{64}$ as str, giving the column dtype object. The first two examples below work because the data stays below $2^{64}$. The third works because a float precedes the large value. Examples 4 and 5 fail because no preceding value looks like a float and the large value is $\ge 2^{64}$.
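
The threshold presumably coincides with the upper end of the uint64 range, since ex. 1 below is read as uint64. This is an inference from the outputs, not something confirmed in the parser source. A quick sanity check of the boundary values used in the examples, assuming only NumPy (the variable names are illustrative):

```python
import numpy as np

# Largest value representable as uint64.
uint64_max = int(np.iinfo(np.uint64).max)

largest_ok = 18446744073709551615     # value used in ex. 1
first_failing = 18446744073709551616  # value used in ex. 5

# Ex. 1 sits exactly at the top of the uint64 range ...
assert largest_ok == uint64_max == 2**64 - 1
# ... and ex. 5 is the first integer past it.
assert first_failing == 2**64

# float64 can hold 2^64 exactly; the trailing .5 of ex. 4 is lost to rounding.
as_float = float("18446744073709551616.5")
assert as_float == 2.0**64
```

So both failing values fit comfortably in float64, which is why float64 is the dtype one would expect here.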

Known workarounds: one could pass dtype={"value": "float64"}, but in our case we'd really prefer not to build the machinery and the type information database that this would require. engine="pyarrow" also seems to work correctly, but we'd prefer to avoid that dependency.
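
For reference, a minimal sketch of the first workaround applied to the failing input from ex. 4. It assumes the column name is known in advance, which is exactly the per-column type information we'd rather not maintain:

```python
import io

import pandas as pd

csv = io.StringIO("""\
value,description
42,A small int
18446744073709551616.5,This is 2.0^64 + 0.5
""")

# Forcing the dtype skips type inference for this column entirely.
data = pd.read_csv(csv, dtype={"value": "float64"})
print(data.dtypes)
print(data["value"].values)
```

With the explicit dtype, the value column should come back as float64 instead of object.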

OK, ex. 1: int smaller than $2^{64}$, interpreted as uint64:

csv = io.StringIO(f"""\
value,description
42,A small integer
18446744073709551615,This is 2^64 - 1
""")
data = pd.read_csv(csv)
print(data.dtypes)
print(data["value"].values)
value          uint64
description    object
dtype: object
[                  42 18446744073709551615]

OK, ex. 2: float smaller than $2^{64}$, interpreted as float64:

csv = io.StringIO(f"""\
value,description
42,A small integer
18446744073709551615.0,This is 2.0^64 - 1
""")
data = pd.read_csv(csv)
print(data.dtypes)
print(data["value"].values)
value          float64
description     object
dtype: object
[4.20000000e+01 1.84467441e+19]

OK, ex. 3: small float followed by $2^{64}$ as an int, interpreted as float64:

csv = io.StringIO(f"""\
value,description
4.2,A small float
18446744073709551616,This is 2^64
""")
data = pd.read_csv(csv)
print(data.dtypes)
print(data["value"].values)
value          float64
description     object
dtype: object
[4.20000000e+00 1.84467441e+19]

FAIL, ex. 4: small int followed by $2^{64}+0.5$ as a float, interpreted as a str:

csv = io.StringIO(f"""\
value,description
42,A small int
18446744073709551616.5,This is 2.0^64 + 0.5
""")
data = pd.read_csv(csv)
print(data.dtypes)
print(data["value"].values)
value          object
description    object
dtype: object
['42' '18446744073709551616.5']

FAIL, ex. 5: small int followed by $2^{64}$ as an int, interpreted as a str:

csv = io.StringIO(f"""\
value,description
42,A small int
18446744073709551616,This is 2^64
""")
data = pd.read_csv(csv)
print(data.dtypes)
print(data["value"].values)
value          object
description    object
dtype: object
['42' '18446744073709551616']

Expected Behavior

A small int followed by $2^{64}+0.5$ as a float should be interpreted as float64:

csv = io.StringIO(f"""\
value,description
42,A small int
18446744073709551616.5,This is 2.0^64 + 0.5
""")
data = pd.read_csv(csv)
print(data.dtypes)
print(data["value"].values)
value          float64
description     object
dtype: object
[4.20000000e+01 1.84467441e+19]

A small int followed by $2^{64}$ as an int should be interpreted as float64:

csv = io.StringIO(f"""\
value,description
42,A small int
18446744073709551616,This is 2^64
""")
data = pd.read_csv(csv)
print(data.dtypes)
print(data["value"].values)
value          float64
description     object
dtype: object
[4.20000000e+01 1.84467441e+19]

Installed Versions

INSTALLED VERSIONS

commit : 2e218d1
python : 3.11.1.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.85
Version : #1-NixOS SMP Wed Dec 21 16:36:38 UTC 2022
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : fi_FI.UTF-8
LOCALE : fi_FI.UTF-8

pandas : 1.5.3
numpy : 1.24.1
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 20.3.4
Cython : None
pytest : 7.2.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.8.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.0
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

Labels

Bug; Dtype Conversions (unexpected or buggy dtype conversions); IO CSV (read_csv, to_csv)
