BUG: iterrows encounters OverflowError with str and datetime64[ns] columns #35665

Closed
2 of 3 tasks
Tracked by #7
allentsouhuang opened this issue Aug 11, 2020 · 7 comments · Fixed by #52748
Labels: good first issue, Needs Tests (unit test(s) needed to prevent regressions), Regression (functionality that used to work in a prior pandas version), Timedelta (Timedelta data type)

Comments

@allentsouhuang

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

import pandas as pd
foo = pd.DataFrame({'some_string': ['612092d7-071f-467e832d-dd53e0f2b590-0006'], 'time': [pd.NaT]})
for _, row in foo.iterrows():
    pass

Problem description

The snippet raises OverflowError: Python int too large to convert to C long. I would expect it to iterate without any issue.

Expected Output

The snippet is not expected to show any output.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : d9fff2792bf16178d4e450fe7384244e50635733
python           : 3.8.5.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.5.0
Version          : Darwin Kernel Version 19.5.0: Tue May 26 20:41:44 PDT 2020; root:xnu-6153.121.2~2/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.0
numpy            : 1.19.1
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.2
setuptools       : 49.2.0.post20200712
Cython           : 0.29.21
pytest           : 6.0.1
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.17.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.0
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.5.2
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-7-d71373909082> in <module>
      1 foo = pd.DataFrame({'some_string': ['612092d7-071f-467e832d-dd53e0f2b590-0006'], 'time': [pd.NaT]})
----> 2 for _, row in foo.iterrows():
      3     pass

~/env/miniconda3/envs/data/lib/python3.8/site-packages/pandas/core/frame.py in iterrows(self)
   1011         klass = self._constructor_sliced
   1012         for k, v in zip(self.index, self.values):
-> 1013             s = klass(v, index=columns, name=k)
   1014             yield k, s
   1015 

~/env/miniconda3/envs/data/lib/python3.8/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    325                     data = data.copy()
    326             else:
--> 327                 data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True)
    328 
    329                 data = SingleBlockManager.from_array(data, index)

~/env/miniconda3/envs/data/lib/python3.8/site-packages/pandas/core/construction.py in sanitize_array(data, index, dtype, copy, raise_cast_failure)
    425         else:
    426             # we will try to copy be-definition here
--> 427             subarr = _try_cast(data, dtype, copy, raise_cast_failure)
    428 
    429     elif isinstance(data, ABCExtensionArray):

~/env/miniconda3/envs/data/lib/python3.8/site-packages/pandas/core/construction.py in _try_cast(arr, dtype, copy, raise_cast_failure)
    551             subarr = arr
    552         else:
--> 553             subarr = maybe_cast_to_datetime(arr, dtype)
    554 
    555         # Take care in creating object arrays (but iterators are not

~/env/miniconda3/envs/data/lib/python3.8/site-packages/pandas/core/dtypes/cast.py in maybe_cast_to_datetime(value, dtype, errors)
   1430             )
   1431         ):
-> 1432             value = maybe_infer_to_datetimelike(value)
   1433 
   1434     return value

~/env/miniconda3/envs/data/lib/python3.8/site-packages/pandas/core/dtypes/cast.py in maybe_infer_to_datetimelike(value, convert_dates)
   1290             # try timedelta first to avoid spurious datetime conversions
   1291             # e.g. '00:00:01' is a timedelta but technically is also a datetime
-> 1292             value = try_timedelta(v)
   1293             if lib.infer_dtype(value, skipna=False) in ["mixed"]:
   1294                 # cannot skip missing values, as NaT implies that the string

~/env/miniconda3/envs/data/lib/python3.8/site-packages/pandas/core/dtypes/cast.py in try_timedelta(v)
   1266 
   1267         try:
-> 1268             td_values = to_timedelta(v)
   1269         except ValueError:
   1270             return v.reshape(shape)

~/env/miniconda3/envs/data/lib/python3.8/site-packages/pandas/core/tools/timedeltas.py in to_timedelta(arg, unit, errors)
    110         arg = arg.item()
    111     elif is_list_like(arg) and getattr(arg, "ndim", 1) == 1:
--> 112         return _convert_listlike(arg, unit=unit, errors=errors)
    113     elif getattr(arg, "ndim", 1) > 1:
    114         raise TypeError(

~/env/miniconda3/envs/data/lib/python3.8/site-packages/pandas/core/tools/timedeltas.py in _convert_listlike(arg, unit, errors, name)
    149 
    150     try:
--> 151         value = sequence_to_td64ns(arg, unit=unit, errors=errors, copy=False)[0]
    152     except ValueError:
    153         if errors == "ignore":

~/env/miniconda3/envs/data/lib/python3.8/site-packages/pandas/core/arrays/timedeltas.py in sequence_to_td64ns(data, copy, unit, errors)
    926     if is_object_dtype(data.dtype) or is_string_dtype(data.dtype):
    927         # no need to make a copy, need to convert if string-dtyped
--> 928         data = objects_to_td64ns(data, unit=unit, errors=errors)
    929         copy = False
    930 

~/env/miniconda3/envs/data/lib/python3.8/site-packages/pandas/core/arrays/timedeltas.py in objects_to_td64ns(data, unit, errors)
   1036     values = np.array(data, dtype=np.object_, copy=False)
   1037 
-> 1038     result = array_to_timedelta64(values, unit=unit, errors=errors)
   1039     return result.view("timedelta64[ns]")
   1040 

pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.array_to_timedelta64()

pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.parse_timedelta_string()

pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.timedelta_from_spec()

pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.cast_from_unit()

OverflowError: Python int too large to convert to C long
@allentsouhuang allentsouhuang added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 11, 2020
@simonjayhawkins
Member

Thanks @allentsouhuang for the report. This appears to be a regression from 0.25.3

@simonjayhawkins simonjayhawkins added Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 11, 2020
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Aug 11, 2020
@simonjayhawkins simonjayhawkins added the Timedelta Timedelta data type label Aug 11, 2020
@simonjayhawkins
Member

I've labeled this as Timedelta because of the traceback (expand details in OP).

@phofl
Member

phofl commented Oct 15, 2020

This was caused by #34634

This is actually quite interesting. The first string in '612092d7-071f-467e832d-dd53e0f2b590-0006' determines the frequency in

r = timedelta_from_spec(number, frac, unit)

So when the result is too large for the inferred frequency, we get the OverflowError.
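To make the size problem concrete, here is a rough arithmetic sketch. It assumes the leading "612092d" of the string is read as 612092 days and that the result has to fit in a signed 64-bit count of nanoseconds. The helper names are made up for illustration and are not pandas functions.

```python
# Hypothetical helpers for illustration only; none of these are pandas APIs.
INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)
NS_PER_DAY = 24 * 60 * 60 * 10**9


def days_to_ns(days):
    """Convert a whole number of days to nanoseconds (unbounded Python int)."""
    return days * NS_PER_DAY


def fits_in_int64(n):
    """Return True if n can be stored in a signed 64-bit integer."""
    return INT64_MIN <= n <= INT64_MAX


# Assumption: "612092d..." is parsed as the number 612092 with unit "d".
ns = days_to_ns(612092)
print(ns, fits_in_int64(ns))

# For comparison, the largest whole number of days that still fits.
print(INT64_MAX // NS_PER_DAY)
```

Under these assumptions, any value above roughly 106751 days cannot be represented, which would explain why an ID-like string with a large leading number trips the conversion.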

For example

foo = pd.DataFrame({'some_string': ['2222Y3'], 'time': [pd.Timedelta("0 days 00:00:00.990000")]})
for _, row in foo.iterrows():
    print(row)

raises too. I am not quite sure why we even try to convert this to Timedelta. Previously, the specified unit (via the default) caused a ValueError in

if isinstance(values[i], str) and errors != "coerce":
    raise ValueError(
        "unit must not be specified if the input contains a str"
    )

which was caught by

try:
    td_values = to_timedelta(v)
except ValueError:
    return v.reshape(shape)

and the array was just reshaped.

I am not quite sure what is the intended behavior here.
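The control flow described above can be sketched without pandas. This is a minimal stand-in, not the actual pandas code: try_convert and the two converter functions are hypothetical names. It only shows that a handler for ValueError lets an OverflowError through, since OverflowError is not a subclass of ValueError.

```python
# Hypothetical stand-ins for illustration; these are not pandas functions.
def try_convert(values, converter):
    """Return converted values, or the originals if conversion raises ValueError."""
    try:
        return converter(values)
    except ValueError:
        # Only ValueError is treated as "not convertible, keep the input".
        return values


def rejects_with_value_error(values):
    # Mimics the older behavior: the string input is rejected early.
    raise ValueError("unit must not be specified if the input contains a str")


def overflows(values):
    # Mimics the newer behavior: parsing proceeds and the result overflows.
    raise OverflowError("Python int too large to convert to C long")


data = ["2222Y3"]

# The ValueError is swallowed and the original data comes back unchanged.
fallback = try_convert(data, rejects_with_value_error)

# The OverflowError is not caught by `except ValueError` and propagates.
try:
    try_convert(data, overflows)
    escaped = False
except OverflowError:
    escaped = True
```

In this sketch, fallback is the untouched input, while escaped ends up True because the second exception reaches the caller, which matches the behavior change described in this comment.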

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Dec 20, 2020
@mroeschke
Member

Looks like this doesn't raise an OverflowError on master anymore. Suppose this could use a test

In [1]: import pandas as pd
   ...: foo = pd.DataFrame({'some_string': ['612092d7-071f-467e832d-dd53e0f2b590-0006'], 'time': [pd.NaT]})
   ...: for _, row in foo.iterrows():
   ...:     pass
   ...:

In [2]: foo
Out[2]:
                                some_string time
0  612092d7-071f-467e832d-dd53e0f2b590-0006  NaT

In [3]: row
Out[3]:
some_string    612092d7-071f-467e832d-dd53e0f2b590-0006
time                                                NaT
Name: 0, dtype: object
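A regression test along these lines might look like the sketch below. It is not the test that was eventually merged; the function name is hypothetical, and it assumes a pandas version in which the OverflowError no longer occurs.

```python
import pandas as pd


def test_iterrows_str_and_nat_does_not_overflow():
    # Hypothetical test name; reuses the reproducer from the original report.
    value = "612092d7-071f-467e832d-dd53e0f2b590-0006"
    df = pd.DataFrame({"some_string": [value], "time": [pd.NaT]})

    # Iterating must not raise OverflowError.
    rows = list(df.iterrows())

    assert len(rows) == 1
    index, row = rows[0]
    assert index == 0
    # The string must come back unchanged, not parsed as a timedelta.
    assert row["some_string"] == value
    assert pd.isna(row["time"])


test_iterrows_str_and_nat_does_not_overflow()
```

The assertions check the row contents rather than the row dtype, so the sketch does not depend on how a given pandas version infers the dtype of a mixed row.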

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug labels Aug 8, 2021
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Aug 9, 2021
@simonjayhawkins
Member

fixed in [8cf3771] BUG: Timedelta(td64_out_of_bounds) silently overflowing (#38965)

@rxd330
Contributor

rxd330 commented Mar 14, 2022

take

@jreback jreback modified the milestones: Contributions Welcome, 1.5 Apr 10, 2022
@mroeschke mroeschke removed this from the 1.5 milestone Aug 15, 2022
@MarvinGravert
Contributor

take

7 participants