virtual variables not available when using open_dataset #121
Yes, this is certainly related to #118. Virtual variables work by using pandas.DatetimeIndex methods, but if you're not using a standard calendar you end up with an object array of netCDF4.datetime objects instead of an array of numpy.datetime64 objects (which can be turned into a DatetimeIndex). Unfortunately, we do need a DatetimeIndex in order to use its (very fast) calculations for properties like year. The alternative is to write our own implementation, which would likely mean far slower pure-Python code. We could also write a function to cast an array of datetime objects into a DatetimeIndex, which I'm guessing would be your preferred solution, even though it has issues of its own: the dates can come out different, because DatetimeIndex objects and numpy's datetime64 always assume the standard Gregorian calendar.
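To make the contrast concrete, here is a small standalone illustration (mine, not xray internals) of the fast DatetimeIndex accessors versus the object array you get back for a non-standard calendar:

import numpy as np
import pandas as pd
from netCDF4 import num2date

# standard calendar: datetime64 values fit in a DatetimeIndex, which exposes
# fast vectorized accessors like .year, .month, .dayofyear, ...
times = pd.date_range('2001-01-01', periods=4, freq='H')
print(times.year)

# non-standard calendar: num2date hands back an object array of netCDF4
# datetime objects, with no such vectorized accessors to lean on
noleap = num2date(np.arange(4), 'hours since 2001-01-01', calendar='noleap')
print(np.asarray(noleap).dtype)  # object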
I think the simplest option would be to develop a function to cast the netCDF4.datetime objects to datetime64. Does xray rely on timedelta operations anywhere beyond decoding dates? I've run into issues like this repeatedly, and I think it would be really nice if the time coordinate were decoded to datetime64 by default.
Timedelta operations are used in exactly one place in xray: speeding up the decoding of dates from netCDF when a standard calendar is being used. Otherwise, that sort of thing is left up to the user. If dates with non-standard calendars can generally be expressed most usefully as a pandas.DatetimeIndex, then let's go ahead and default to decoding them into datetime64 arrays. The relevant function to modify is decode_cf_datetime in xray.conventions, if you'd like to make a pull request!
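For what it's worth, here is a rough sketch (my own, not xray's actual decode_cf_datetime) of what that default could look like: keep a fast vectorized pandas path for standard calendars and fall back to netCDF4's num2date otherwise.

import numpy as np
import pandas as pd
from netCDF4 import num2date

def decode_times_sketch(num_dates, units, calendar='standard'):
    # Hypothetical helper, not xray code. Fast path: parse
    # "<delta> since <reference>" and let pandas do vectorized arithmetic.
    num_dates = np.asarray(num_dates)
    if calendar in ('standard', 'gregorian', 'proleptic_gregorian'):
        delta, _, ref_date = units.partition(' since ')
        try:
            ref = pd.Timestamp(ref_date)
            return (ref + pd.to_timedelta(num_dates, unit=delta.strip())).values
        except ValueError:
            # reference dates outside the datetime64[ns] range still
            # need the slow path
            pass
    # slow path: object array of netCDF4 datetime objects
    return num2date(num_dates, units, calendar=calendar)

Note that with the units used in this thread ('days since 0001-01-01') the reference date is outside the datetime64[ns] range, so this particular sketch would still take the slow path there.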
Ok, I just spent a few minutes working through a possible (although not ideal) solution for this. It works, although it is a bit ugly and quite a bit slower than the standard-calendar option. This option returns a datetime64[ns] array.
In [1]:
import pandas as pd
from netCDF4 import num2date, date2num
import datetime
import numpy as np
from xray.conventions import decode_cf_datetime as decode
units = 'days since 0001-01-01'
# pandas time range
times = pd.date_range('2001-01-01-00', end='2001-06-30-23', freq='H')
# numpy array of numeric dates on noleap calendar
noleap_time = date2num(times.to_pydatetime(), units, calendar='noleap')
# numpy array of numeric dates on standard calendar
std_time = date2num(times.to_pydatetime(), units, calendar='standard')
# decoding function using datetime intermediary
def nctime_to_nptime(times):
new = np.empty(len(times), dtype='M8[ns]')
for i, t in enumerate(times):
new[i] = np.datetime64(datetime.datetime(*t.timetuple()[:6]))
return new
In [2]:
# decode noleap_time
%timeit nctime_to_nptime(decode(noleap_time, units, calendar='noleap'))
noleap_datetimes = nctime_to_nptime(num2date(noleap_time, units, calendar='noleap'))
print 'dtype:', noleap_datetimes.dtype
print noleap_datetimes
10 loops, best of 3: 38.8 ms per loop
dtype: datetime64[ns]
['2000-12-31T16:00:00.000000000-0800' '2000-12-31T17:00:00.000000000-0800'
'2000-12-31T18:00:00.000000000-0800' ...,
'2001-06-30T14:00:00.000000000-0700' '2001-06-30T15:00:00.000000000-0700'
'2001-06-30T16:00:00.000000000-0700']
In [3]:
# decode std_time using vectorized converter
%timeit decode(std_time, units, calendar='standard')
standard_datetimes = decode(std_time, units, calendar='standard')
print 'dtype:', standard_datetimes.dtype
print standard_datetimes
1000 loops, best of 3: 243 µs per loop
dtype: datetime64[ns]
['2000-12-31T16:00:00.000000000-0800' '2000-12-31T16:59:59.000000000-0800'
'2000-12-31T17:59:59.000000000-0800' ...,
'2001-06-30T13:59:59.000000000-0700' '2001-06-30T14:59:59.000000000-0700'
'2001-06-30T15:59:59.000000000-0700'] Two three things to notice here:
Those precision issues are unfortunate but perhaps unavoidable in this case, because you are representing dates as floating-point numbers: the units are in "days" but the spacing between time points is measured in hours.
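To put a rough number on it (my own back-of-the-envelope check, not from the thread): at around 730,485 days since year 1, float64 only resolves the fractional day to about ten microseconds, so a decoder that truncates sub-second remainders can easily land a whole second short, which is what the ...:59:59 values above look like.

import numpy as np

# roughly the number of days from 0001-01-01 to 2001-01-01
day_number = 730485.0
# spacing between adjacent float64 values at this magnitude, in days
ulp_days = np.spacing(day_number)
print(ulp_days)                # ~1.16e-10 days
print(ulp_days * 86400 * 1e6)  # ~10 microseconds of resolution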
The tutorial provides an example of how to use xray's virtual_variables. The same functionality is not available from a Dataset object created by open_dataset.
Tutorial:
However, reading a time coordinate / variable from a netCDF4 file and applying the same logic raises an error:
Is there a reason that the virtual time variables are only available if the dataset is created from a pandas date_range? Lastly, this could be related to #118.
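For reference, the pattern described above looks roughly like this; the snippet is my own reconstruction (untested against the xray API of the time), and the netCDF file name is hypothetical:

import pandas as pd
import xray

# tutorial-style Dataset built from a pandas date_range: virtual variables work
times = pd.date_range('2000-01-01', freq='H', periods=24)
ds = xray.Dataset({'time': ('time', times)})
print(ds['time.hour'])

# Dataset read from a netCDF file on a non-standard calendar: the same lookup
# fails because 'time' decodes to an object array instead of datetime64 values
ds2 = xray.open_dataset('noleap_example.nc')  # hypothetical file
print(ds2['time.hour'])                        # raises an error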