read_csv fails with TypeError: object cannot be converted to an IntegerDtype
yet succeeds when reading chunks
#25472
Comments
Have you been able to narrow down the cause? Possibly start reading the first |
That's part of the difficulty: depending on the chunk size, the exception is raised or not. With a size of one, it succeeds. With a bigger size, the read fails, and I don't get why. |
I suspect a specific value in the CSV is causing that. I'd recommend trying
with different values of `nrows` to see what that value is.
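
The `nrows` suggestion above can be automated. The following is a minimal sketch, not code from this thread: the helper name `smallest_failing_nrows` and the sample data are hypothetical, and the binary search assumes that once a prefix of the file fails to parse, every longer prefix fails too. That assumption may not hold if the failure depends on how the parser chunks the file internally.

```python
import io

import pandas as pd


def smallest_failing_nrows(csv_text, max_rows, **read_csv_kwargs):
    """Return the smallest nrows for which read_csv raises, or None.

    Hypothetical helper: assumes failures are monotonic in nrows.
    """

    def fails(n):
        try:
            pd.read_csv(io.StringIO(csv_text), nrows=n, **read_csv_kwargs)
        except (TypeError, ValueError):
            return True
        return False

    if not fails(max_rows):
        return None

    lo, hi = 1, max_rows  # invariant: reading `hi` rows fails
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


# Made-up data: the third data row holds a value that is not an integer.
sample = "a\n1\n2\nx\n4\n"
bad_row = smallest_failing_nrows(sample, max_rows=4, dtype={"a": "int64"})
print(bad_row)
```

The returned number is the count of data rows (header excluded) at which the read first fails, so the offending value sits in that row.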
…On Thu, Feb 28, 2019 at 7:53 AM Matthieu Coudron ***@***.***> wrote:
That's part of the difficulty: depending on the chunk size, the exception
is raised or not. With a size of one, it succeeds. With a bigger size, the read fails,
and I don't get why.
|
Also, if you are able to share a file that can reproduce the issue, that would be great. |
Sorry, I definitely had uploaded it, but I may have messed up somewhere and it ended up not being visible. Anyway, I've put the file in the first post (upload.txt, but it's really a CSV). I think it's a bug because, when reading line by line, no value appears to be a problem. The .csv file is generated, so there should be no error in the values either. |
When I tried to use your code to read the file, most of the values in the column showed up as missing which might be the reason it's not reading as 'UInt64'. Reading it as default format and/or string works. |
I actually updated to pandas 0.24.1 because it supported empty rows via UInt64 (otherwise, why would it work when reading line by line?). 'UInt64' also works for other columns with empty values; there are just some columns for which it doesn't, and I can't fathom why. |
Have you had a chance to debug this @teto? |
I am not sure what else I can do, I've provided the data file and a standalone example. |
Gotcha, hopefully someone has time to take a look, but you may be the expert here as this is fairly new. cc @kprestel who implemented EA support for read_csv. |
because it was comparing values of different types. For now I encode the failing fields as str instead of UInt64 (dsnraw seems concerned as well) see pandas-dev/pandas#25472 for more details
I'll be able to take a look at this tonight hopefully. |
Sorry that I have no time to properly debug this, but I hope I can contribute a little bit of knowledge. I'm running into the same problem as OP when I read 1 of the sheets of a .xlsx file ( This gave the same problem as OP:
It is ugly, but this seems to have worked for me:
|
I just ran into this - it looks much more general than a read_csv problem to me.
I would expect that this should just work? As @NumesSanguis says above, converting via float does work, e.g.
This is using
@TomAugspurger - do you think a new issue needs to be opened for this? |
I thought we already had an issue for that (possibly search for "strictness
of _from_sequence"), but I may be wrong.
…On Tue, May 19, 2020 at 6:15 AM Luke Stanbra ***@***.***> wrote:
I just ran into this - it looks much more general than a read_csv problem
to me.
>>> pd.Series(["1", "2", "3"]).astype(pd.Int64Dtype())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 5698, in astype
new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)
File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 582, in astype
return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 442, in apply
applied = getattr(b, f)(**kwargs)
File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/internals/blocks.py", line 625, in astype
values = astype_nansafe(vals1d, dtype, copy=True)
File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/dtypes/cast.py", line 821, in astype_nansafe
return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, copy=copy)
File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/arrays/integer.py", line 354, in _from_sequence
return integer_array(scalars, dtype=dtype, copy=copy)
File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/arrays/integer.py", line 135, in integer_array
values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/arrays/integer.py", line 218, in coerce_to_array
raise TypeError(f"{values.dtype} cannot be converted to an IntegerDtype")
TypeError: object cannot be converted to an IntegerDtype
I would expect that this should just work? As @NumesSanguis
<https://github.com/NumesSanguis> says above, converting via float does
work, e.g.
>>> pd.Series(["1", "2", "3"]).astype(float).astype(pd.Int64Dtype())
0 1
1 2
2 3
dtype: Int64
This is using
>>> pd.__version__
'1.0.3'
@TomAugspurger <https://github.com/TomAugspurger> - do you think a new
issue needs to be opened for this?
|
Any ideas for a workaround if the integer (18 places) is too big for float64? |
@dekiesel |
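
For the large-integer question above, one possible approach is to skip the float step entirely. This is an illustrative sketch and not taken from the thread: the helper name `strings_to_nullable_int` is hypothetical, and it assumes the values arrive as strings, with empty strings or `None` standing for missing entries. It parses each value with Python's arbitrary-precision `int` and hands the result to `pd.array` with a nullable integer dtype.

```python
import pandas as pd


def strings_to_nullable_int(strings, dtype="Int64"):
    """Convert integer-like strings to a nullable integer array.

    Hypothetical helper: empty strings and None become missing values.
    Parsing with int() avoids the precision loss of a float64 round trip.
    """
    ints = [int(s) if s not in ("", None) else None for s in strings]
    return pd.array(ints, dtype=dtype)


big = "123456789012345679"  # 18 digits, not exactly representable as float64
result = strings_to_nullable_int([big, "", "3"])
print(result)
```

The same helper should accept `dtype="UInt64"` for unsigned columns. It loops in Python, so it is likely to be slow on very large columns.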
Still no news about this? It seems like quite a significant bug, and has been open an extremely long time! |
@alexreg you or anyone is welcome to submit a PR to patch and the core team can review |
@jreback I'm not sure I'm a good person to analyse the root of this problem, but I'll have a look anyway, and if I can figure it out, will submit a PR. |
Resolves pandas-dev#25472, resolves pandas-dev#25288.
Code Sample, a copy-pastable example if possible
Download this file upload.txt
If I read in chunks, read_csv succeeds; if I try to read the column at once, I get
Expected Output
I would like the call to read_csv to succeed without having to read in chunks (which seems to have other side effects as well).
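
The original code sample did not survive on this page. As a rough sketch of what reading in chunks and reassembling could look like (the helper name `read_csv_in_chunks`, the column name `dsn`, and the sample data are all hypothetical, not the reporter's code):

```python
import io

import pandas as pd


def read_csv_in_chunks(source, chunksize, **read_csv_kwargs):
    """Read a CSV in chunks of `chunksize` rows and concatenate the pieces."""
    chunks = pd.read_csv(source, chunksize=chunksize, **read_csv_kwargs)
    return pd.concat(list(chunks), ignore_index=True)


# Made-up data with an empty value in the nullable unsigned column.
sample = io.StringIO("dsn,label\n1,x\n,y\n3,z\n")
df = read_csv_in_chunks(sample, chunksize=2, dtype={"dsn": "UInt64"})
print(df.dtypes)
```

Columns without an explicit `dtype` are inferred per chunk, so their types can differ between chunks before concatenation; this may be one source of the side effects mentioned above.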
Output of
pd.show_versions()
pandas: 0+unknown
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.16.0
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None