
Commit 12b4480

DWesl and dcherian authored
Read grid mapping and bounds as coords (#2844)
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Co-authored-by: dcherian <deepak@cherian.net>
1 parent a8ed7ed commit 12b4480

File tree

7 files changed: +249 / -17 lines changed


doc/conf.py

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@
     subprocess.run(["conda", "list"])
 else:
     print("pip environment:")
-    subprocess.run(["pip", "list"])
+    subprocess.run([sys.executable, "-m", "pip", "list"])

 print(f"xarray: {xarray.__version__}, {xarray.__file__}")

doc/weather-climate.rst

Lines changed: 30 additions & 0 deletions
@@ -12,6 +12,36 @@ Weather and climate data

 .. _Climate and Forecast (CF) conventions: http://cfconventions.org

+.. _cf_variables:
+
+Related Variables
+-----------------
+
+Several CF variable attributes contain lists of other variables
+associated with the variable with the attribute. A few of these are
+now parsed by XArray, with the attribute value popped to encoding on
+read and the variables in that value interpreted as non-dimension
+coordinates:
+
+- ``coordinates``
+- ``bounds``
+- ``grid_mapping``
+- ``climatology``
+- ``geometry``
+- ``node_coordinates``
+- ``node_count``
+- ``part_node_count``
+- ``interior_ring``
+- ``cell_measures``
+- ``formula_terms``
+
+This decoding is controlled by the ``decode_coords`` kwarg to
+:py:func:`open_dataset` and :py:func:`open_mfdataset`.
+
+The CF attribute ``ancillary_variables`` was not included in the list
+due to the variables listed there being associated primarily with the
+variable with the attribute, rather than with the dimensions.
+
 .. _metpy_accessor:

 CF-compliant coordinate variables
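For orientation, here is a minimal usage sketch of the behaviour documented in the new section above. The file name and variable name are hypothetical placeholders, not part of this commit:

import xarray as xr

# Hypothetical CF-compliant file whose data variable "variable" references
# "latlon" via grid_mapping and "latitude_bnds"/"longitude_bnds" via bounds.
ds = xr.open_dataset("cf_example.nc", decode_coords="all")

# The referenced variables are now non-dimension coordinates rather than data
# variables, and the linking attributes live in .encoding instead of .attrs.
print(ds.coords)
print(ds["variable"].encoding.get("grid_mapping"))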

doc/whats-new.rst

Lines changed: 7 additions & 1 deletion
@@ -39,6 +39,13 @@ Breaking changes
   always be set such that ``int64`` values can be used. In the past, no units
   finer than "seconds" were chosen, which would sometimes mean that ``float64``
   values were required, which would lead to inaccurate I/O round-trips.
+- Variables referred to in attributes like ``bounds`` and ``grid_mapping``
+  can be set as coordinate variables. These attributes
+  are moved to :py:attr:`DataArray.encoding` from
+  :py:attr:`DataArray.attrs`. This behaviour is controlled by the
+  ``decode_coords`` kwarg to :py:func:`open_dataset` and
+  :py:func:`open_mfdataset`. The full list of decoded attributes is in
+  :ref:`weather-climate` (:pull:`2844`, :issue:`3689`)
 - remove deprecated ``autoclose`` kwargs from :py:func:`open_dataset` (:pull:`4725`).
   By `Aureliana Barghini <https://github.com/aurghs>`_.

@@ -347,7 +354,6 @@ New Features
 - Expose ``use_cftime`` option in :py:func:`~xarray.open_zarr` (:issue:`2886`, :pull:`3229`)
   By `Samnan Rahee <https://github.com/Geektrovert>`_ and `Anderson Banihirwe <https://github.com/andersy005>`_.

-
 Bug fixes
 ~~~~~~~~~
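As a quick sketch of the breaking change described in the entry above, assume a hypothetical file whose variable ``t2m`` carries ``grid_mapping = "latlon"`` on disk (all names here are illustrative, not from this commit):

import xarray as xr

# Default decoding: only the 'coordinates' attribute is handled, so
# grid_mapping stays in .attrs and "latlon" remains a data variable.
ds_default = xr.open_dataset("example.nc")

# decode_coords="all": the attribute moves to .encoding and the referenced
# variable is promoted to a (non-dimension) coordinate.
ds_all = xr.open_dataset("example.nc", decode_coords="all")
assert "grid_mapping" not in ds_all["t2m"].attrs
assert ds_all["t2m"].encoding["grid_mapping"] == "latlon"
assert "latlon" in ds_all.coords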

xarray/backends/api.py

Lines changed: 16 additions & 6 deletions
@@ -354,9 +354,14 @@ def open_dataset(
         form string arrays. Dimensions will only be concatenated over (and
         removed) if they have no corresponding variable and if they are only
         used as the last dimension of character arrays.
-    decode_coords : bool, optional
-        If True, decode the 'coordinates' attribute to identify coordinates in
-        the resulting dataset.
+    decode_coords : bool or {"coordinates", "all"}, optional
+        Controls which variables are set as coordinate variables:
+
+        - "coordinates" or True: Set variables referred to in the
+          ``'coordinates'`` attribute of the datasets or individual variables
+          as coordinate variables.
+        - "all": Set variables referred to in ``'grid_mapping'``, ``'bounds'`` and
+          other attributes as coordinate variables.
     engine : {"netcdf4", "scipy", "pydap", "h5netcdf", "pynio", "cfgrib", \
         "pseudonetcdf", "zarr"}, optional
         Engine to use when reading files. If not provided, the default engine
@@ -613,9 +618,14 @@ def open_dataarray(
         form string arrays. Dimensions will only be concatenated over (and
         removed) if they have no corresponding variable and if they are only
         used as the last dimension of character arrays.
-    decode_coords : bool, optional
-        If True, decode the 'coordinates' attribute to identify coordinates in
-        the resulting dataset.
+    decode_coords : bool or {"coordinates", "all"}, optional
+        Controls which variables are set as coordinate variables:
+
+        - "coordinates" or True: Set variables referred to in the
+          ``'coordinates'`` attribute of the datasets or individual variables
+          as coordinate variables.
+        - "all": Set variables referred to in ``'grid_mapping'``, ``'bounds'`` and
+          other attributes as coordinate variables.
     engine : {"netcdf4", "scipy", "pydap", "h5netcdf", "pynio", "cfgrib"}, \
         optional
         Engine to use when reading files. If not provided, the default engine

xarray/backends/apiv2.py

Lines changed: 8 additions & 4 deletions
@@ -195,10 +195,14 @@ def open_dataset(
         removed) if they have no corresponding variable and if they are only
         used as the last dimension of character arrays.
         This keyword may not be supported by all the backends.
-    decode_coords : bool, optional
-        If True, decode the 'coordinates' attribute to identify coordinates in
-        the resulting dataset. This keyword may not be supported by all the
-        backends.
+    decode_coords : bool or {"coordinates", "all"}, optional
+        Controls which variables are set as coordinate variables:
+
+        - "coordinates" or True: Set variables referred to in the
+          ``'coordinates'`` attribute of the datasets or individual variables
+          as coordinate variables.
+        - "all": Set variables referred to in ``'grid_mapping'``, ``'bounds'`` and
+          other attributes as coordinate variables.
     drop_variables: str or iterable, optional
         A variable or list of variables to exclude from the dataset parsing.
         This may be useful to drop variables with problems or

xarray/conventions.py

Lines changed: 74 additions & 5 deletions
@@ -11,6 +11,23 @@
 from .core.pycompat import is_duck_dask_array
 from .core.variable import IndexVariable, Variable, as_variable

+CF_RELATED_DATA = (
+    "bounds",
+    "grid_mapping",
+    "climatology",
+    "geometry",
+    "node_coordinates",
+    "node_count",
+    "part_node_count",
+    "interior_ring",
+    "cell_measures",
+    "formula_terms",
+)
+CF_RELATED_DATA_NEEDS_PARSING = (
+    "cell_measures",
+    "formula_terms",
+)
+

 class NativeEndiannessArray(indexing.ExplicitlyIndexedNDArrayMixin):
     """Decode arrays on the fly from non-native to native endianness
@@ -256,6 +273,9 @@ def encode_cf_variable(var, needs_copy=True, name=None):
     var = maybe_default_fill_value(var)
     var = maybe_encode_bools(var)
     var = ensure_dtype_not_object(var, name=name)
+
+    for attr_name in CF_RELATED_DATA:
+        pop_to(var.encoding, var.attrs, attr_name)
     return var


@@ -499,7 +519,7 @@ def stackable(dim):
             use_cftime=use_cftime,
             decode_timedelta=decode_timedelta,
         )
-        if decode_coords:
+        if decode_coords in [True, "coordinates", "all"]:
             var_attrs = new_vars[k].attrs
             if "coordinates" in var_attrs:
                 coord_str = var_attrs["coordinates"]
@@ -509,6 +529,38 @@ def stackable(dim):
                     del var_attrs["coordinates"]
                     coord_names.update(var_coord_names)

+        if decode_coords == "all":
+            for attr_name in CF_RELATED_DATA:
+                if attr_name in var_attrs:
+                    attr_val = var_attrs[attr_name]
+                    if attr_name not in CF_RELATED_DATA_NEEDS_PARSING:
+                        var_names = attr_val.split()
+                    else:
+                        roles_and_names = [
+                            role_or_name
+                            for part in attr_val.split(":")
+                            for role_or_name in part.split()
+                        ]
+                        if len(roles_and_names) % 2 == 1:
+                            warnings.warn(
+                                f"Attribute {attr_name:s} malformed", stacklevel=5
+                            )
+                        var_names = roles_and_names[1::2]
+                    if all(var_name in variables for var_name in var_names):
+                        new_vars[k].encoding[attr_name] = attr_val
+                        coord_names.update(var_names)
+                    else:
+                        referenced_vars_not_in_variables = [
+                            proj_name
+                            for proj_name in var_names
+                            if proj_name not in variables
+                        ]
+                        warnings.warn(
+                            f"Variable(s) referenced in {attr_name:s} not in variables: {referenced_vars_not_in_variables!s}",
+                            stacklevel=5,
+                        )
+                    del var_attrs[attr_name]
+
     if decode_coords and "coordinates" in attributes:
         attributes = dict(attributes)
         coord_names.update(attributes.pop("coordinates").split())
@@ -542,9 +594,14 @@ def decode_cf(
     decode_times : bool, optional
        Decode cf times (e.g., integers since "hours since 2000-01-01") to
        np.datetime64.
-    decode_coords : bool, optional
-        Use the 'coordinates' attribute on variable (or the dataset itself) to
-        identify coordinates.
+    decode_coords : bool or {"coordinates", "all"}, optional
+        Controls which variables are set as coordinate variables:
+
+        - "coordinates" or True: Set variables referred to in the
+          ``'coordinates'`` attribute of the datasets or individual variables
+          as coordinate variables.
+        - "all": Set variables referred to in ``'grid_mapping'``, ``'bounds'`` and
+          other attributes as coordinate variables.
     drop_variables : str or iterable, optional
         A variable or list of variables to exclude from being parsed from the
         dataset. This may be useful to drop variables with problems or
@@ -664,6 +721,7 @@ def _encode_coordinates(variables, attributes, non_dim_coord_names):

     global_coordinates = non_dim_coord_names.copy()
     variable_coordinates = defaultdict(set)
+    not_technically_coordinates = set()
     for coord_name in non_dim_coord_names:
         target_dims = variables[coord_name].dims
         for k, v in variables.items():
@@ -674,6 +732,13 @@ def _encode_coordinates(variables, attributes, non_dim_coord_names):
             ):
                 variable_coordinates[k].add(coord_name)

+            if any(
+                attr_name in v.encoding and coord_name in v.encoding.get(attr_name)
+                for attr_name in CF_RELATED_DATA
+            ):
+                not_technically_coordinates.add(coord_name)
+                global_coordinates.discard(coord_name)
+
     variables = {k: v.copy(deep=False) for k, v in variables.items()}

     # keep track of variable names written to file under the "coordinates" attributes
@@ -691,7 +756,11 @@ def _encode_coordinates(variables, attributes, non_dim_coord_names):
         # we get support for attrs["coordinates"] for free.
         coords_str = pop_to(encoding, attrs, "coordinates")
         if not coords_str and variable_coordinates[name]:
-            attrs["coordinates"] = " ".join(map(str, variable_coordinates[name]))
+            attrs["coordinates"] = " ".join(
+                str(coord_name)
+                for coord_name in variable_coordinates[name]
+                if coord_name not in not_technically_coordinates
+            )
         if "coordinates" in attrs:
             written_coords.update(attrs["coordinates"].split())

xarray/tests/test_backends.py

Lines changed: 113 additions & 0 deletions
@@ -56,6 +56,7 @@
     requires_dask,
     requires_fsspec,
     requires_h5netcdf,
+    requires_iris,
     requires_netCDF4,
     requires_pseudonetcdf,
     requires_pydap,
@@ -858,6 +859,118 @@ def test_roundtrip_mask_and_scale(self, decoded_fn, encoded_fn):
             assert decoded.variables[k].dtype == actual.variables[k].dtype
         assert_allclose(decoded, actual, decode_bytes=False)

+    @staticmethod
+    def _create_cf_dataset():
+        original = Dataset(
+            dict(
+                variable=(
+                    ("ln_p", "latitude", "longitude"),
+                    np.arange(8, dtype="f4").reshape(2, 2, 2),
+                    {"ancillary_variables": "std_devs det_lim"},
+                ),
+                std_devs=(
+                    ("ln_p", "latitude", "longitude"),
+                    np.arange(0.1, 0.9, 0.1).reshape(2, 2, 2),
+                    {"standard_name": "standard_error"},
+                ),
+                det_lim=(
+                    (),
+                    0.1,
+                    {"standard_name": "detection_minimum"},
+                ),
+            ),
+            dict(
+                latitude=("latitude", [0, 1], {"units": "degrees_north"}),
+                longitude=("longitude", [0, 1], {"units": "degrees_east"}),
+                latlon=((), -1, {"grid_mapping_name": "latitude_longitude"}),
+                latitude_bnds=(("latitude", "bnds2"), [[0, 1], [1, 2]]),
+                longitude_bnds=(("longitude", "bnds2"), [[0, 1], [1, 2]]),
+                areas=(
+                    ("latitude", "longitude"),
+                    [[1, 1], [1, 1]],
+                    {"units": "degree^2"},
+                ),
+                ln_p=(
+                    "ln_p",
+                    [1.0, 0.5],
+                    {
+                        "standard_name": "atmosphere_ln_pressure_coordinate",
+                        "computed_standard_name": "air_pressure",
+                    },
+                ),
+                P0=((), 1013.25, {"units": "hPa"}),
+            ),
+        )
+        original["variable"].encoding.update(
+            {"cell_measures": "area: areas", "grid_mapping": "latlon"},
+        )
+        original.coords["latitude"].encoding.update(
+            dict(grid_mapping="latlon", bounds="latitude_bnds")
+        )
+        original.coords["longitude"].encoding.update(
+            dict(grid_mapping="latlon", bounds="longitude_bnds")
+        )
+        original.coords["ln_p"].encoding.update({"formula_terms": "p0: P0 lev : ln_p"})
+        return original
+
+    def test_grid_mapping_and_bounds_are_not_coordinates_in_file(self):
+        original = self._create_cf_dataset()
+        with create_tmp_file() as tmp_file:
+            original.to_netcdf(tmp_file)
+            with open_dataset(tmp_file, decode_coords=False) as ds:
+                assert ds.coords["latitude"].attrs["bounds"] == "latitude_bnds"
+                assert ds.coords["longitude"].attrs["bounds"] == "longitude_bnds"
+                assert "latlon" not in ds["variable"].attrs["coordinates"]
+                assert "coordinates" not in ds.attrs
+
+    def test_coordinate_variables_after_dataset_roundtrip(self):
+        original = self._create_cf_dataset()
+        with self.roundtrip(original, open_kwargs={"decode_coords": "all"}) as actual:
+            assert_identical(actual, original)
+
+        with self.roundtrip(original) as actual:
+            expected = original.reset_coords(
+                ["latitude_bnds", "longitude_bnds", "areas", "P0", "latlon"]
+            )
+            # equal checks that coords and data_vars are equal which
+            # should be enough
+            # identical would require resetting a number of attributes
+            # skip that.
+            assert_equal(actual, expected)
+
+    def test_grid_mapping_and_bounds_are_coordinates_after_dataarray_roundtrip(self):
+        original = self._create_cf_dataset()
+        # The DataArray roundtrip should have the same warnings as the
+        # Dataset, but we already tested for those, so just go for the
+        # new warnings. It would appear that there is no way to tell
+        # pytest "This warning and also this warning should both be
+        # present".
+        # xarray/tests/test_conventions.py::TestCFEncodedDataStore
+        # needs the to_dataset. The other backends should be fine
+        # without it.
+        with pytest.warns(
+            UserWarning,
+            match=(
+                r"Variable\(s\) referenced in bounds not in variables: "
+                r"\['l(at|ong)itude_bnds'\]"
+            ),
+        ):
+            with self.roundtrip(
+                original["variable"].to_dataset(), open_kwargs={"decode_coords": "all"}
+            ) as actual:
+                assert_identical(actual, original["variable"].to_dataset())
+
+    @requires_iris
+    def test_coordinate_variables_after_iris_roundtrip(self):
+        original = self._create_cf_dataset()
+        iris_cube = original["variable"].to_iris()
+        actual = DataArray.from_iris(iris_cube)
+        # Bounds will be missing (xfail)
+        del original.coords["latitude_bnds"], original.coords["longitude_bnds"]
+        # Ancillary vars will be missing
+        # Those are data_vars, and will be dropped when grabbing the variable
+        assert_identical(actual, original["variable"])
+
     def test_coordinates_encoding(self):
         def equals_latlon(obj):
             return obj == "lat lon" or obj == "lon lat"
