Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

interp with long cftime coordinates raises an error #3641

Closed
spencerkclark opened this issue Dec 18, 2019 · 8 comments · Fixed by #3631
Closed

interp with long cftime coordinates raises an error #3641

spencerkclark opened this issue Dec 18, 2019 · 8 comments · Fixed by #3631

Comments

@spencerkclark
Copy link
Member

spencerkclark commented Dec 18, 2019

MCVE Code Sample

In [1]: import xarray as xr

In [2]: times = xr.cftime_range('0001', periods=3, freq='500Y')

In [3]: da = xr.DataArray(range(3), dims=['time'], coords=[times])

In [4]: da.interp(time=['0002-05-01'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-f781cb4d500e> in <module>
----> 1 da.interp(time=['0002-05-01'])

~/Software/miniconda3/envs/xarray-tests/lib/python3.7/site-packages/xarray/core/dataarray.py in interp(self, coords, method, assume_sorted, kwargs, **coords_kwargs)
   1353             kwargs=kwargs,
   1354             assume_sorted=assume_sorted,
-> 1355             **coords_kwargs,
   1356         )
   1357         return self._from_temp_dataset(ds)

~/Software/miniconda3/envs/xarray-tests/lib/python3.7/site-packages/xarray/core/dataset.py in interp(self, coords, method, assume_sorted, kwargs, **coords_kwargs)
   2565                     if k in var.dims
   2566                 }
-> 2567                 variables[name] = missing.interp(var, var_indexers, method, **kwargs)
   2568             elif all(d not in indexers for d in var.dims):
   2569                 # keep unrelated object array

~/Software/miniconda3/envs/xarray-tests/lib/python3.7/site-packages/xarray/core/missing.py in interp(var, indexes_coords, method, **kwargs)
    607     new_dims = broadcast_dims + list(destination[0].dims)
    608     interped = interp_func(
--> 609         var.transpose(*original_dims).data, x, destination, method, kwargs
    610     )
    611

~/Software/miniconda3/envs/xarray-tests/lib/python3.7/site-packages/xarray/core/missing.py in interp_func(var, x, new_x, method, kwargs)
    683         )
    684
--> 685     return _interpnd(var, x, new_x, func, kwargs)
    686
    687

~/Software/miniconda3/envs/xarray-tests/lib/python3.7/site-packages/xarray/core/missing.py in _interpnd(var, x, new_x, func, kwargs)
    698
    699 def _interpnd(var, x, new_x, func, kwargs):
--> 700     x, new_x = _floatize_x(x, new_x)
    701
    702     if len(x) == 1:

~/Software/miniconda3/envs/xarray-tests/lib/python3.7/site-packages/xarray/core/missing.py in _floatize_x(x, new_x)
    556             # represented by float.
    557             xmin = x[i].values.min()
--> 558             x[i] = x[i]._to_numeric(offset=xmin, dtype=np.float64)
    559             new_x[i] = new_x[i]._to_numeric(offset=xmin, dtype=np.float64)
    560     return x, new_x

~/Software/miniconda3/envs/xarray-tests/lib/python3.7/site-packages/xarray/core/variable.py in _to_numeric(self, offset, datetime_unit, dtype)
   2001         """
   2002         numeric_array = duck_array_ops.datetime_to_numeric(
-> 2003             self.data, offset, datetime_unit, dtype
   2004         )
   2005         return type(self)(self.dims, numeric_array, self._attrs)

~/Software/miniconda3/envs/xarray-tests/lib/python3.7/site-packages/xarray/core/duck_array_ops.py in datetime_to_numeric(array, offset, datetime_unit, dtype)
    410     if array.dtype.kind in "mM":
    411         return np.where(isnull(array), np.nan, array.astype(dtype))
--> 412     return array.astype(dtype)
    413
    414

TypeError: float() argument must be a string or a number, not 'datetime.timedelta'

Problem Description

In principle we should be able to get this to work. The issue stems from the following logic in datetime_to_numeric:

if array.dtype.kind in "O":
# possibly convert object array containing datetime.timedelta
array = np.asarray(pd.Series(array.ravel())).reshape(array.shape)

Here we are relying on pandas to convert an array of datetime.timedelta objects to an array with dtype timedelta64[ns]. If the array of datetime.timedelta objects cannot be safely converted to timedelta64[ns] (e.g. due to an integer overflow) then this line is silently a no-op which leads to the error downstream at the dtype conversion step. This is my fault originally for suggesting this approach, #2668 (comment).

To solve this I think we'll need to write our own logic to convert datetime.timedelta objects to numeric values instead of relying on pandas/NumPy. (as @huard notes we should be able to use NumPy directly here for the conversion). We should not consider ourselves beholden to using nanosecond resolution for a couple of reasons:

  1. datetime.timedelta objects do not natively support nanosecond resolution; they have microsecond resolution natively, which corresponds with a NumPy timedelta range of +/- 2.9e5 years.
  2. One motivation/use-case for cftime dates is that they can represent long time periods that cannot be represented using a standard DatetimeIndex. We should do everything we can to support this with a CFTimeIndex.

@huard @dcherian this is an important issue we'll need to solve to be able to use a fixed offset for cftime dates for an application like polyfit/polyval.

xref: #3349 and #3631.

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 14:38:56)
[Clang 4.0.1 (tags/RELEASE_401/final)]
python-bits: 64
OS: Darwin
OS-release: 19.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: None

xarray: 0.14.1
pandas: 0.25.0
numpy: 1.17.0
scipy: 1.3.1
netCDF4: None
pydap: installed
h5netcdf: 0.7.4
h5py: 2.9.0
Nio: None
zarr: 2.3.2
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.0.25
cfgrib: 0.9.7.1
iris: None
bottleneck: 1.2.1
dask: 2.9.0+2.gd0daa5bc
distributed: 2.9.0
matplotlib: 3.1.1
cartopy: None
seaborn: 0.9.0
numbagg: installed
setuptools: 42.0.2.post20191201
pip: 19.2.2
conda: None
pytest: 5.0.1
IPython: 7.10.1
sphinx: None

@huard
Copy link
Contributor

huard commented Dec 18, 2019

Another issue with datetime_to_numeric happens with:

import xarray as xr
import cftime
i = xr.CFTimeIndex(xr.cftime_range('2000-01-01', periods=2))
xr.core.duck_array_ops.datetime_to_numeric(i, cftime.DatetimeGregorian(2, 1, 1), datetime_unit='D')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.array_to_timedelta64()

pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.parse_timedelta_string()

TypeError: object of type 'datetime.timedelta' has no len()

During handling of the above exception, another exception occurred:

OverflowError                             Traceback (most recent call last)
<ipython-input-50-b03d9c4f220d> in <module>
----> 1 xr.core.duck_array_ops.datetime_to_numeric(i, cftime.DatetimeGregorian(2, 1, 1), datetime_unit='D')

~/src/xarray/xarray/core/duck_array_ops.py in datetime_to_numeric(array, offset, datetime_unit, dtype)
    395         else:
    396             offset = min(array)
--> 397     array = array - offset
    398 
    399     if not hasattr(array, "dtype"):  # scalar is converted to 0d-array

~/src/xarray/xarray/coding/cftimeindex.py in __sub__(self, other)
    431 
    432         if isinstance(other, (CFTimeIndex, cftime.datetime)):
--> 433             return pd.TimedeltaIndex(np.array(self) - np.array(other))
    434         elif isinstance(other, pd.TimedeltaIndex):
    435             return CFTimeIndex(np.array(self) - other.to_pytimedelta())

~/.conda/envs/xclim3/lib/python3.6/site-packages/pandas/core/indexes/timedeltas.py in __new__(cls, data, unit, freq, start, end, periods, closed, dtype, copy, name, verify_integrity)
    256 
    257         tdarr = TimedeltaArray._from_sequence(
--> 258             data, freq=freq, unit=unit, dtype=dtype, copy=copy
    259         )
    260         return cls._simple_new(tdarr._data, freq=tdarr.freq, name=name)

~/.conda/envs/xclim3/lib/python3.6/site-packages/pandas/core/arrays/timedeltas.py in _from_sequence(cls, data, dtype, copy, freq, unit)
    270         freq, freq_infer = dtl.maybe_infer_freq(freq)
    271 
--> 272         data, inferred_freq = sequence_to_td64ns(data, copy=copy, unit=unit)
    273         freq, freq_infer = dtl.validate_inferred_freq(freq, inferred_freq, freq_infer)
    274 

~/.conda/envs/xclim3/lib/python3.6/site-packages/pandas/core/arrays/timedeltas.py in sequence_to_td64ns(data, copy, unit, errors)
    971     if is_object_dtype(data.dtype) or is_string_dtype(data.dtype):
    972         # no need to make a copy, need to convert if string-dtyped
--> 973         data = objects_to_td64ns(data, unit=unit, errors=errors)
    974         copy = False
    975 

~/.conda/envs/xclim3/lib/python3.6/site-packages/pandas/core/arrays/timedeltas.py in objects_to_td64ns(data, unit, errors)
   1096     values = np.array(data, dtype=np.object_, copy=False)
   1097 
-> 1098     result = array_to_timedelta64(values, unit=unit, errors=errors)
   1099     return result.view("timedelta64[ns]")
   1100 

pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.array_to_timedelta64()

pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.convert_to_timedelta64()

pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.delta_to_nanoseconds()

OverflowError: Python int too large to convert to C long

@spencerkclark
Copy link
Member Author

Yes, there's a simple workaround for that at least, #3631 (comment), but I agree it would be nice if we didn't need to worry about that.

@huard
Copy link
Contributor

huard commented Dec 18, 2019

Got it, thanks !

@huard
Copy link
Contributor

huard commented Dec 18, 2019

How about replacing
array = np.asarray(pd.Series(array.ravel())).reshape(array.shape)
by
array = array.astype("timedelta64")
?
with numpy 1.17 your example works and the test suite only fails on unrelated netcdf string errors.

@spencerkclark
Copy link
Member Author

spencerkclark commented Dec 18, 2019

That would indeed be a very clean approach (I don't know why that did not occur to me earlier!). In the past that kind of conversion used to have a bug, but it has been fixed as of NumPy 1.15 (see numpy/numpy#11096).

@huard
Copy link
Contributor

huard commented Dec 18, 2019

Note that at the moment, if we pass np.datetime64 objects that exceed the allowed time span, the function yields garbage without failing. Is this something we want to fix as well ?

One option is to convert array and offset to microseconds first, then compute the delta, but this may break people's code.

@maboualidev
Copy link

maboualidev commented Dec 20, 2019

I wanted to bring attention to a package that we are working on that originally started with remapping time axis. The package is called AxisUtilities and is available at https://github.com/coderepocenter/AxisUtilities.

It doesn’t yet have any support for CFTime yet (well it does support it now; but you need to manually convert cftime to a number for now) But we are working on it (so that the cftime to number conversion is more automatic). We have the basis there. We are now working on it to make it easier to use and remove certain steps.

It follows the ESMF or SCRIP interpolation pattern, i.e. once you make the remapper object, you could use it for multiple data set as long as the source and destination axis has not changed.

@spencerkclark
Copy link
Member Author

Thanks @maboualidev; I saw that @andersy005 posted about this too. I haven't had a chance to look deeply into your new package, but I am intrigued by the concept. I think patterns for working with data defined over intervals, be they in time or some other dimension, are something useful and should be explored.

#1475 is a good thread in particular if you are interested in ideas for how cell boundaries (and operations that depend on them) might be represented most cleanly within xarray. Discussion there seems somewhat dormant at the moment, but I'd jump in there if you have comments, ideas, or questions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants