-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Xarray: add indexes options and better define band names #764
base: main
Are you sure you want to change the base?
Conversation
ds = self.input | ||
band_names = self.band_names | ||
if indexes := cast_to_sequence(indexes): | ||
assert all(v > 0 for v in indexes), "Indexes value must be >= 1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xarray won't complain when we pass data[-1]
so we need this tests
if indexes != (1,): | ||
raise ValueError( | ||
f"Invalid indexes {indexes} for array of shape {ds.shape}" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for 2D array we still allow indexes=1
@@ -1060,7 +1060,7 @@ def _get_reader(self, asset_info: AssetInfo) -> Tuple[Type[BaseReader], Dict]: | |||
assert info["netcdf"].crs | |||
|
|||
img = stac.preview(assets=["netcdf"]) | |||
assert img.band_names == ["netcdf_value"] | |||
assert img.band_names == ["netcdf_dataset"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
band value default to the DataArray's name ✨
if not self._dims: | ||
coords_name = list(self.input.coords) | ||
if len(coords_name) > 3 and (coord := coords_name[2]): | ||
return [str(self.input.coords[coord].data)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm hesitant to put something like {coord_name}={coord_value}}
🤷♂️
but we don't do this for the other band names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not super easy to understand what this code is trying to accomplish, but I think it's problematic that band names are based on dimensions if _dims
is set as an attribute and the names are based on coordinates if not. I think it should always be based on non-spatial (as defined by rioxarray) dimensions. Since all dimensions have names, this should also make dealing with defaults simpler. Some documentation about how to map Xarray's data model into rio-tiler's assumptions would really help in general.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but I think it's problematic that band names are based on dimensions if _dims is set as an attribute and the names are based on coordinates if not. I think it should always be based on non-spatial (as defined by rioxarray) dimensions. Since all dimensions have names, this should also make dealing with defaults simpler. Some documentation about how to map Xarray's data model into rio-tiler's assumptions would really help in general.
@maxrjones I'm not sure to get this
The main issue is that when we pass a 2D dataarray (thinks about if you select the first time dim) then dims
is empty but the coordinates has the time
value which correspond to the name of the array slice (maybe I'm mistaken) so I felt we needed to have a way to surface this.
from datetime import datetime
import numpy
import xarray
import rioxarray
arr = numpy.arange(0.0, 33 * 35 * 2).reshape(2, 33, 35)
data = xarray.DataArray(
arr,
dims=("time", "y", "x"),
coords={
"x": numpy.arange(-170, 180, 10),
"y": numpy.arange(-80, 85, 5),
"time": [datetime(2022, 1, 1), datetime(2022, 1, 2)],
},
)
data.attrs.update({"valid_min": arr.min(), "valid_max": arr.max()})
ds = data.to_dataset(name="dataset")
da = ds["dataset"][0]
da.rio.write_crs("epsg:4326", inplace=True)
da
# >>> <xarray.DataArray 'dataset' (y: 33, x: 35)> Size: 9kB
# array([[0.000e+00, 1.000e+00, 2.000e+00, ..., 3.200e+01, 3.300e+01,
# 3.400e+01],
# [3.500e+01, 3.600e+01, 3.700e+01, ..., 6.700e+01, 6.800e+01,
# 6.900e+01],
# [7.000e+01, 7.100e+01, 7.200e+01, ..., 1.020e+02, 1.030e+02,
# 1.040e+02],
# ...,
# [1.050e+03, 1.051e+03, 1.052e+03, ..., 1.082e+03, 1.083e+03,
# 1.084e+03],
# [1.085e+03, 1.086e+03, 1.087e+03, ..., 1.117e+03, 1.118e+03,
# 1.119e+03],
# [1.120e+03, 1.121e+03, 1.122e+03, ..., 1.152e+03, 1.153e+03,
# 1.154e+03]])
# Coordinates:
# * x (x) int64 280B -170 -160 -150 -140 -130 ... 130 140 150 160 170
# * y (y) int64 264B -80 -75 -70 -65 -60 -55 ... 55 60 65 70 75 80
# time datetime64[ns] 8B 2022-01-01
# spatial_ref int64 8B 0
# Attributes:
# valid_min: 0.0
# valid_max: 2309.0
_dims = [
d
for d in da.dims
if d not in [da.rio.x_dim, da.rio.y_dim]
]
_dims
# >>> []
coords_name = list(da.coords)
coords_name
# >>> ['x', 'y', 'time', 'spatial_ref']
if len(coords_name) > 3 and (coord := coords_name[2]):
print(str(da.coords[coord].data))
# >>> 2022-01-01T00:00:00.000000000
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maxrjones could you double check the logic here?
@@ -105,6 +110,7 @@ def __attrs_post_init__(self): | |||
for d in self.input.dims | |||
if d not in [self.input.rio.x_dim, self.input.rio.y_dim] | |||
] | |||
assert len(self._dims) in [0, 1], "Can't handle >=4D DataArray" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a check to make sure we don't have 4D arrays
@vincentsarago I'll try to look through this in a bit more detail today but it might be worth considering another naming convention for "indexes" in this context. In |
I've used But I think we can improve the documentation 🙏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments, and have a couple of nits that you could feel free to ignore:
- I think you should use da or dataarray throughout rather than ds to make it more clear that this is all only applicable to Xarray DataArrays and not usable for Xarray Datasets.
- I find it more robust to use by-name based integer lookup rather than strictly positional (https://docs.xarray.dev/en/stable/user-guide/indexing.html#quick-overview).
if not self._dims: | ||
coords_name = list(self.input.coords) | ||
if len(coords_name) > 3 and (coord := coords_name[2]): | ||
return [str(self.input.coords[coord].data)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not super easy to understand what this code is trying to accomplish, but I think it's problematic that band names are based on dimensions if _dims
is set as an attribute and the names are based on coordinates if not. I think it should always be based on non-spatial (as defined by rioxarray) dimensions. Since all dimensions have names, this should also make dealing with defaults simpler. Some documentation about how to map Xarray's data model into rio-tiler's assumptions would really help in general.
with XarrayReader(data) as dst: | ||
img = dst.info() | ||
print(img.band_descriptions)[0] | ||
>> ("b1", "2022-01-01T00:00:00.000000000") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why have the band name returned as b1
rather than time
or time1
(corresponding to <dimension-name><1-based index>
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was mostly for compatibility but I think having <dim_name><idx>
would be 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I will have the same issue when there are only 2 Dimensions what should the band name be? For now I would use the same code I'm using to get the band name from the first non-geo coordinate 🤷
Co-authored-by: Max Jones <14077947+maxrjones@users.noreply.github.com>
But that means we need to have the name of the dimension , I think rioxarray enforce the data array to be in OR we could change the |
I'm not sure this would need to be an either or decision, in that we could support positional indexing now and potentially label-based indexing later on. I have a few questions about the vision for titiler.xarray that would help inform responses to your comments above. I could open these as separate issues:
|
Right now, XarrayReader expect a DataArray, and in titiler.xarray we make sure to select the first With the proposed change, I'm trying to ease the case where no
Absolutely
🤔 I'm not sure how the we could have from datetime import datetime
import numpy
import xarray
da = xarray.DataArray(
numpy.arange(0.0, 33 * 35 * 2).reshape(2, 33, 35),
dims=("time", "y", "x"),
coords={
"x": numpy.arange(-170, 180, 10),
"y": numpy.arange(-80, 85, 5),
"time": [datetime(2022, 1, 1), datetime(2022, 1, 2)],
},
)
>>> da[0]
<xarray.DataArray (y: 33, x: 35)> Size: 9kB
array([[0.000e+00, 1.000e+00, 2.000e+00, ..., 3.200e+01, 3.300e+01,
3.400e+01],
[3.500e+01, 3.600e+01, 3.700e+01, ..., 6.700e+01, 6.800e+01,
6.900e+01],
[7.000e+01, 7.100e+01, 7.200e+01, ..., 1.020e+02, 1.030e+02,
1.040e+02],
...,
[1.050e+03, 1.051e+03, 1.052e+03, ..., 1.082e+03, 1.083e+03,
1.084e+03],
[1.085e+03, 1.086e+03, 1.087e+03, ..., 1.117e+03, 1.118e+03,
1.119e+03],
[1.120e+03, 1.121e+03, 1.122e+03, ..., 1.152e+03, 1.153e+03,
1.154e+03]])
Coordinates:
* x (x) int64 280B -170 -160 -150 -140 -130 ... 130 140 150 160 170
* y (y) int64 264B -80 -75 -70 -65 -60 -55 -50 ... 50 55 60 65 70 75 80
time datetime64[ns] 8B 2022-01-01
>>> da.loc["2022-01-01"]
<xarray.DataArray (y: 33, x: 35)> Size: 9kB
array([[0.000e+00, 1.000e+00, 2.000e+00, ..., 3.200e+01, 3.300e+01,
3.400e+01],
[3.500e+01, 3.600e+01, 3.700e+01, ..., 6.700e+01, 6.800e+01,
6.900e+01],
[7.000e+01, 7.100e+01, 7.200e+01, ..., 1.020e+02, 1.030e+02,
1.040e+02],
...,
[1.050e+03, 1.051e+03, 1.052e+03, ..., 1.082e+03, 1.083e+03,
1.084e+03],
[1.085e+03, 1.086e+03, 1.087e+03, ..., 1.117e+03, 1.118e+03,
1.119e+03],
[1.120e+03, 1.121e+03, 1.122e+03, ..., 1.152e+03, 1.153e+03,
1.154e+03]])
Coordinates:
* x (x) int64 280B -170 -160 -150 -140 -130 ... 130 140 150 160 170
* y (y) int64 264B -80 -75 -70 -65 -60 -55 -50 ... 50 55 60 65 70 75 80
time datetime64[ns] 8B 2022-01-0 🤔 BUT thinking about titiler, it would be pretty hard to know when to convert the input to string/interger. Having |
Quick Update: In this state we have XarrayReader which is compatible with other Reader (having This PR also updates the |
This PR does:
indexes
options (following Rasterio convention, starts at 1)