to_netcdf with decoded time can create file with inconsistent time:units and time_bounds:units #2921

klindsay28 · 2019-04-25T22:08:52Z

Code Sample, a copy-pastable example if possible

import numpy as np
import xarray as xr

# create time and time_bounds DataArrays for Jan-1850 and Feb-1850
time_bounds_vals = np.array([[0.0, 31.0], [31.0, 59.0]])
time_vals = time_bounds_vals.mean(axis=1)

time_var = xr.DataArray(time_vals, dims='time',
                        coords={'time':time_vals})
time_bounds_var = xr.DataArray(time_bounds_vals, dims=('time', 'd2'),
                               coords={'time':time_vals})

# create Dataset of time and time_bounds
ds = xr.Dataset(coords={'time':time_var}, data_vars={'time_bounds':time_bounds_var})
ds.time.attrs = {'bounds':'time_bounds', 'calendar':'noleap',
                 'units':'days since 1850-01-01'} 

# write Jan-1850 values to file 
ds.isel(time=slice(0,1)).to_netcdf('Jan-1850.nc', unlimited_dims='time')

# write Feb-1850 values to file
ds.isel(time=slice(1,2)).to_netcdf('Feb-1850.nc', unlimited_dims='time')

# use open_mfdataset to read in files, combining into 1 Dataset
decode_times = True
decode_cf = True
ds = xr.open_mfdataset(['Jan-1850.nc', 'Feb-1850.nc'],
                       decode_cf=decode_cf, decode_times=decode_times)

# write combined Dataset out
ds.to_netcdf('JanFeb-1850.nc', unlimited_dims='time')

Problem description

The above code initially creates 2 netCDF files, for Jan-1850 and Feb-1850, that have the variables time and time_bounds, and time:bounds='time_bounds'. It then reads the 2 files back in as a single Dataset, using open_mfdataset, and this Dataset is written back out to a netCDF file. ncdump of this final file is

netcdf JanFeb-1850 {
dimensions:
	time = UNLIMITED ; // (2 currently)
	d2 = 2 ;
variables:
	int64 time(time) ;
		time:bounds = "time_bounds" ;
		time:units = "hours since 1850-01-16 12:00:00.000000" ;
		time:calendar = "noleap" ;
	double time_bounds(time, d2) ;
		time_bounds:_FillValue = NaN ;
		time_bounds:units = "days since 1850-01-01" ;
		time_bounds:calendar = "noleap" ;
data:

 time = 0, 708 ;

 time_bounds =
  0, 31,
  31, 59 ;
}

The problem is that the units attribute for time and time_bounds are different in this file, contrary to what CF conventions requires.

The final call to to_netcdf is creating a file where time's units (and type) differ from what they are in the intermediate files. These transformations are not being applied to time_bounds.

While the change to time's type is not necessarily an issue, I do find it surprising.

This inconsistency goes away if either of decode_times or decode_cf is set to False in the python code above. In particular, the transformations to time's units and type do not happen.

The inconsistency also goes away if open_mfdataset opens a single file. In this scenario also, the transformations to time's units and type do not happen.

I think that the desired behavior is to either not apply the units and type transformations to time, or to also apply them to time_bounds. The first option would be consistent with the current single-file behavior.

INSTALLED VERSIONS ------------------ commit: None python: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 3.12.62-60.64.8-default machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.4 libnetcdf: 4.6.2

xarray: 0.12.1
pandas: 0.24.2
numpy: 1.16.2
scipy: 1.2.1
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 1.1.5
distributed: 1.26.1
matplotlib: 3.0.3
cartopy: None
seaborn: None
setuptools: 40.8.0
pip: 19.0.3
conda: None
pytest: 4.3.1
IPython: 7.4.0
sphinx: None

The text was updated successfully, but these errors were encountered:

klindsay28 · 2019-05-06T19:52:47Z

It looks like ds.time.encoding is not getting set when open_mfdataset is opening multiple files. I suspect that this is leading to the surprising unit for time when the ds is written out. The code below demonstrates that ds.time.encoding is set by open_mfdataset in the single-file case and is not set in the multi-file case. However, ds.time_bounds.encoding is set in both the single- and multi-file cases.

The possibility of this is alluded to the in a comment in #2436, which relates the issue to #1614.

import numpy as np
import xarray as xr

# create time and time_bounds DataArrays for Jan-1850 and Feb-1850
time_bounds_vals = np.array([[0.0, 31.0], [31.0, 59.0]])
time_vals = time_bounds_vals.mean(axis=1)

time_var = xr.DataArray(time_vals, dims='time',
                        coords={'time':time_vals})
time_bounds_var = xr.DataArray(time_bounds_vals, dims=('time', 'd2'),
                               coords={'time':time_vals})

# create Dataset of time and time_bounds
ds = xr.Dataset(coords={'time':time_var}, data_vars={'time_bounds':time_bounds_var})
ds.time.attrs = {'bounds':'time_bounds', 'calendar':'noleap',
                 'units':'days since 1850-01-01'}

# write Jan-1850 values to file
ds.isel(time=slice(0,1)).to_netcdf('Jan-1850.nc', unlimited_dims='time')

# write Feb-1850 values to file
ds.isel(time=slice(1,2)).to_netcdf('Feb-1850.nc', unlimited_dims='time')

# use open_mfdataset to read in files, combining into 1 Dataset
decode_times = True
decode_cf = True
ds = xr.open_mfdataset(['Jan-1850.nc'],
                       decode_cf=decode_cf, decode_times=decode_times)

print('time and time_bounds encoding, single-file open_mfdataset')
print(ds.time.encoding)
print(ds.time_bounds.encoding)

# use open_mfdataset to read in files, combining into 1 Dataset
decode_times = True
decode_cf = True
ds = xr.open_mfdataset(['Jan-1850.nc', 'Feb-1850.nc'],
                       decode_cf=decode_cf, decode_times=decode_times)

print('--------------------')
print('time and time_bounds encoding, multi-file open_mfdataset')
print(ds.time.encoding)
print(ds.time_bounds.encoding)

produces

time and time_bounds encoding, single-file open_mfdataset
{'zlib': False, 'shuffle': False, 'complevel': 0, 'fletcher32': False, 'contiguous': False, 'chunksizes': (512,), 'source': '/gpfs/fs1/work/klindsay/analysis/CESM2_coup_carb_cycle_JAMES/Jan-1850.nc', 'original_shape': (1,), 'dtype': dtype('float64'), '_FillValue': nan, 'units': 'days since 1850-01-01', 'calendar': 'noleap'}
{'zlib': False, 'shuffle': False, 'complevel': 0, 'fletcher32': False, 'contiguous': False, 'chunksizes': (1, 2), 'source': '/gpfs/fs1/work/klindsay/analysis/CESM2_coup_carb_cycle_JAMES/Jan-1850.nc', 'original_shape': (1, 2), 'dtype': dtype('float64'), '_FillValue': nan, 'units': 'days since 1850-01-01', 'calendar': 'noleap'}
--------------------
time and time_bounds encoding, multi-file open_mfdataset
{}
{'zlib': False, 'shuffle': False, 'complevel': 0, 'fletcher32': False, 'contiguous': False, 'chunksizes': (1, 2), 'source': '/gpfs/fs1/work/klindsay/analysis/CESM2_coup_carb_cycle_JAMES/Jan-1850.nc', 'original_shape': (1, 2), 'dtype': dtype('float64'), '_FillValue': nan, 'units': 'days since 1850-01-01', 'calendar': 'noleap'}

dcherian · 2019-05-07T16:04:58Z

Thanks for the great report @klindsay28.

Looks like CF recommends that time_bounds (boundary variable) not have the units or calendar attributes:

Since a boundary variable is considered to be part of a coordinate variable’s metadata, it is not necessary to provide it with attributes such as long_name and units.

Boundary variable attributes which determine the coordinate type (units, standard_name, axis and positive) or those which affect the interpretation of the array values (units, calendar, leap_month, leap_year and month_lengths) must always agree exactly with the same attributes of its associated coordinate, scalar coordinate or auxiliary coordinate variable. To avoid duplication, however, it is recommended that these are not provided to a boundary variable.

We already have special treatment for time_bounds in the decode step. It makes sense to treat it specially in the encode step. In fact it looks like @spencerkclark identified this issue in #2571 👏

@klindsay28 Any interest in putting together a PR that would avoid setting these attributes on time_bounds during the encode step?

Ping @fmaussion for feedback.

spencerkclark · 2019-05-07T17:37:28Z

We already have special treatment for time_bounds in the decode step. It makes sense to treat it specially in the encode step.

+1 I didn't fully appreciate how strict the CF conventions were regarding this at the time of #2571. Where it is unambiguous I agree we should make an effort to comply (or preserve compliance).

mathause · 2019-05-15T15:52:12Z

Today @lukasbrunner and me ran into this problem. Opening an mfdataset and saving to_netcdfled to the following units for time and time_units:

time_bnds:units = "days since 1850-01-01" ;
time:units = "days since 2000-01-01" ;

Opening the dataset in xarray works fine, but ncdump and ncview use the units from time for time_bnds. Thus if I do

ncdump -tv time_bnds test.nc

the first date is '2150-01-01' (instead of '2000-01-01') - very confusing. (panoply shows the correct time_bnds).

Workaround

import xarray as xr
filenames = ['file1.nc', 'file2.nc']

ds = xr.open_mfdataset(fNs)
ds.load()

# make sure the encoding is really empty
assert not ds.time.encoding

# assign encoding, such that they are equal
ds.time.encoding.update(ds.time_bnds.encoding)

# save
ds.to_netcdf('~/F/test.nc')

Btw. thanks to @klindsay28 for the nice error report.

Fixes pydata#2921 1. Removes certain attributes from bounds variables on encode. 2. open_mfdataset: Sets encoding on variables based on encoding in first file.

* Don't set attributes on bounds variables. Fixes #2921 1. Removes certain attributes from bounds variables on encode. 2. open_mfdataset: Sets encoding on variables based on encoding in first file. * remove whitespace stuff. * Make sure variable exists in first file before assigning encoding * Make sure we iterate over coords too. * lint fix. * docs/comment fixes. * mfdataset encoding test. * time_bounds attrs test + allow for slight CF non-compliance. * I need to deal with encoding! * minor fixes. * another minor fix. * review fixes. * lint fixes. * Remove encoding changes and xfail test. * Update whats-new.rst

Change output format from NETCDF4 to NETCDF4_CLASSIC (for backwards compatibility; *believe* but haven't tested that this still permits writing >2Gb output files). If variable has a time axis, make it unlimited. Set units on the time bounds variable, if present, to address issue noted in pydata/xarray#2921 (comment)

Issue pydata#2921 is about mismatching time units between a time variable and its "bounds" companion. However, pydata#2965 does more than fixing pydata#2921, it removes all double attributes from "bounds" variables which has the undesired side effect that there is currently no way to save them to netcdf with xarray. Since the mentioned link is a recommendation and not a hard requirement for CF compliance, these attributes should be left to the caller to prepare the dataset variables appropriately if required. Reduces the amount of surprise that attributes are not written to disk and fixes pydata#8368.

dcherian added bug topic-CF conventions and removed bug labels May 7, 2019

dcherian added a commit to dcherian/xarray that referenced this issue May 15, 2019

Don't set attributes on bounds variables.

ff2fd49

Fixes pydata#2921 1. Removes certain attributes from bounds variables on encode. 2. open_mfdataset: Sets encoding on variables based on encoding in first file.

dcherian mentioned this issue May 15, 2019

Don't set encoding attributes on bounds variables. #2965

Merged

3 tasks

dcherian closed this as completed in #2965 Jun 25, 2019

tomvothecoder mentioned this issue Dec 1, 2021

Additional fixes for PR #158 xCDAT/xcdat#173

Merged

9 tasks

st-bender mentioned this issue Apr 9, 2024

to_netcdf: Unexpected drop of "units" attribute of attached "bounds" #8368

Open

5 tasks

st-bender mentioned this issue Apr 10, 2024

Keep attributes for "bounds" variables #8924

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

to_netcdf with decoded time can create file with inconsistent time:units and time_bounds:units #2921

to_netcdf with decoded time can create file with inconsistent time:units and time_bounds:units #2921

klindsay28 commented Apr 25, 2019

klindsay28 commented May 6, 2019

Uh oh!

dcherian commented May 7, 2019 •

edited

Loading

Uh oh!

spencerkclark commented May 7, 2019

Uh oh!

mathause commented May 15, 2019

Uh oh!

Uh oh!

to_netcdf with decoded time can create file with inconsistent time:units and time_bounds:units #2921

to_netcdf with decoded time can create file with inconsistent time:units and time_bounds:units #2921

Comments

klindsay28 commented Apr 25, 2019

Code Sample, a copy-pastable example if possible

Problem description

klindsay28 commented May 6, 2019

Uh oh!

dcherian commented May 7, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

spencerkclark commented May 7, 2019

Uh oh!

mathause commented May 15, 2019

Workaround

Uh oh!

dcherian commented May 7, 2019 •

edited

Loading