-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/rasterio #1070
Feature/rasterio #1070
Conversation
This is awesome! For 2: I think you don't have much more to do than make |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to do anything special to enable lazy loading. That logic is all handled in the xarray.Variable
object.
except AttributeError: | ||
pass | ||
|
||
self.coords = _try_to_get_latlon_coords(self.coords, self._attrs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to return these in the dict returned by get_variables()
. DataStore
objects don't have a coords
attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. This now pulls in lat/long if the projection was successful, otherwise it just returns the basic x/y indices. Should we also be returning the original projection coords? (currently they are lost).
@@ -3,7 +3,7 @@ channels: | |||
- conda-forge | |||
dependencies: | |||
- python=2.7 | |||
- cdat-lite | |||
# - cdat-lite |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems unrelated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it has been removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like tests still need some work to pass on Travis-CI.
coords = dict(y=coords['y'], x=coords['x']) | ||
dims = ('y', 'x') | ||
|
||
coords_out['lat'] = DataArray( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be just xarray.Variable
objects, not DataArray objects
proj = pyproj.Proj(attrs['crs']) | ||
x, y = np.meshgrid(coords['x'], coords['y']) | ||
proj_out = pyproj.Proj("+init=EPSG:4326", preserve_units=True) | ||
xc, yc = _transform_proj(proj, proj_out, x, y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might try to do this calculation lazily, e.g., by making a utils.NDArrayMixin
subclass in conventions.py and wrapping with indexing.LazilyIndexedArray
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I don't understand how to do this. Is there an example I can go off of?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have some examples of making NDArrayMixin subclasses in conventions.py, e.g.,
https://github.com/pydata/xarray/blob/master/xarray/conventions.py#L311
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option would be to use dask.array
to build the coordinate arrays, though the NDArrayMixin
would avoid the need for dask.
except AttributeError: | ||
pass | ||
|
||
# def get_vardata(self, var_id=1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be removed
|
||
self._attrs = OrderedDict() | ||
for attr_name in ['crs', 'transform', 'proj']: | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use with pycompat.suppress(AttributeError):
def _try_to_get_latlon_coords(coords, attrs): | ||
coords_out = {} | ||
try: | ||
import pyproj |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little surprised rasterio doesn't have projections built in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They do: https://mapbox.github.io/rasterio/topics/reproject.html
I will take a stab at doing it the rasterio way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a reprojection, which does seem useful, but it seems separate from giving you lat
and lon
coordinates
@@ -11,6 +11,7 @@ dependencies: | |||
- pandas | |||
- seaborn | |||
- scipy | |||
- rasterio |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless rasterio only supports Python 3, should add it to other test suites, too (especially Python 2.7).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rasterio supports 2.7 and 3.3-3.5 (https://mapbox.github.io/rasterio/)
I added this requirement to all other test suites, was that correct? Version 1.0 is needed as many function names changed in rasterio at V1.0.
Tested open_mfdataset() on 100+ geotiffs and lazyloading with rasterio does appear to be working. |
Travis-ci fail is because it can't find rasterio, which comes through the conda-forge channel (https://github.com/conda-forge/rasterio-feedstock). I think it needs to be added as described here (http://conda.pydata.org/docs/travis.html#additional-steps). But I am new to Travis-ci, so don't want to mess up the current .travis.yml file. |
@@ -12,5 +12,6 @@ dependencies: | |||
- scipy | |||
- pytest-cov | |||
- cyordereddict | |||
- rasterio>=1.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no official release of rasterio >=1.0
. However, I did merge the latest alpha
, so if you remove the >=
you should get it.
Add
|
Any idea why Segmentation faults occur for 3.4 and 4.5? |
I would start by adding Line 72 in a247d93
|
class RasterioArrayWrapper(NDArrayMixin): | ||
def __init__(self, ds): | ||
self._ds = ds | ||
self.array = ds.read() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless the behavior of rasterio changed with v1.0, this loads the data in memory. I might be wrong, but I think the call to ds.read()
should happen in __getitem__
, ideally with a window
kwarg.
I also wonder if the call to read
shouldn't happen in a with rasterio.Env():
environment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shoyer, do we need to maintain the array
attribute here? Would it make more sense to just populate the _ds
attribute and set the array in __getitem__
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to set array
-- that's merely for the benefit of NDArrayMixin
in core/utils.py
. You might even skip NDArrayMixin
altogether and just use NdimSizeLenMixin
, though you'll want to define properties/methods for each of those defined by NDArrayMixin
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can basically drop this in replacing the current RasterioArrayWrapper
:
class RasterioArrayWrapper(NdimSizeLenMixin):
"""Mixin class RasterIO datasets for making wrappers of N-dimensional
arrays that conform to the ndarray interface required for the data
argument to Variable objects.
A subclass should set the `array` property and override one or more of
`dtype`, `shape` and `__getitem__`.
"""
def __init__(self, ds):
self.ds = ds
@property
def dtype(self):
if len(set(self.ds.dtypes[0])) != 1:
raise ValueError(
'Can only handle Rastio dataset with all bands having the same type')
return np.dtype(self.ds.dtypes[0])
@property
def shape(self):
return self.ds.shape
def __array__(self, dtype=None):
'''Not sure if this will work as is'''
return np.asarray(self[...], dtype=dtype)
def __getitem__(self, key):
band = range(self.shape[0])[key[0]]
window = []
for win in key[1:]:
if instance(win, slice):
window.append((win.start, win.stop))
elif isinstance(win, int):
window.append((win, win + 1))
else: # integer ndarray
window.append((win.min(), win.max()))
raw_data = self.ds.read(band, window=window)
# now, fix up raw_data to conform to numpy indexing conventions
# - drop axes for integer band/windows
# - stride if windows are slices with win.step != 1
# - subset if window is an integer ndarray
def __repr__(self):
return '%s(array=%r)' % (type(self).__name__, self.ds)
I'm not sure the __array__
will work as it is until we get the __get_item__
fully functional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also worth consider how/if you want to handle automatically masking missing values -- it looks like ds.read()
can optionally return a MaskedArray
. The standard xarray approach would be to automatically promote the dtype and fill these with NaNs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to confirm, the above code is waiting for the decision from rasterio group?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest getting something working here, with an eye toward keeping it self-contained. Later, we can try to port it upstream to rasterio.
def __getitem__(self, key): | ||
if key == () and self.ndim == 0: | ||
return self.array.get_value() | ||
return self.array[key] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on what @fmaussion said above, I think this should be something like:
def __getitem__(self, key):
if key == () and self.ndim == 0:
return self._ds.read()
return self._ds.read(band, window=window)
Where band
and window
describe the slice of data to be read. Off the top of my head, I'm not exactly sure how to parse the key here though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the time indexers get here, they should have already passed through canonicalize_indexer, which means you only need to handle a key
that is a tuple of the appropriate number of dimensions (i.e., 3) composed of integers, slices and integer ndarrays.
Based on the docstring for read, we want something like:
def __getitem__(self, key):
band = range(self.shape[0])[key[0]]
window = []
for win in key[1:]:
if instance(win, slice):
window.append((win.start, win.stop))
elif isinstance(win, int):
window.append((win, win + 1))
else: # integer ndarray
window.append((win.min(), win.max()))
raw_data = self._ds.read(band, window=window)
# now, fix up raw_data to conform to numpy indexing conventions
# - drop axes for integer band/windows
# - stride if windows are slices with win.step != 1
# - subset if window is an integer ndarray
Honestly, this logic should probably live in rasterio if possible. I'm a little surprised that they have never implemented a __getitem__
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See also: https://gist.github.com/lpinner/bd57b54a5c6903e4a6a2 (can't reuse this directly, though, because it doesn't have a license).
Anyways, I would definitely see if the rasterio folks are up for implementing a __getitem__
method. There's no reason why this is xarray specific -- you would need this for just using dask.array with rasterio, as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See also rasterio/rasterio#920 for related discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NicWayand - this is coming along nicely. The remaining issues as I see them are:
- unit tests
-
__getitem__
wrapper method - coordinate projection
dependencies: | ||
- python=2.7 | ||
- pytest | ||
- numpy==1.9.3 | ||
- pandas==0.15.0 | ||
- rasterio |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can roll back these changes. This test setup is just for required dependencies which rasterio
will not be.
@@ -8,6 +10,7 @@ dependencies: | |||
- numpy | |||
- pandas | |||
- scipy | |||
- rasterio |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can also roll back these changes as well
dependencies: | ||
- python=3.3 | ||
- pytest | ||
- pandas | ||
- rasterio |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
roll back these changes
@@ -8,6 +10,7 @@ dependencies: | |||
- numpy | |||
- pandas | |||
- scipy | |||
- rasterio |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to test rasterio along with pydap so we can roll back these changes.
@@ -7,6 +9,7 @@ dependencies: | |||
- pandas | |||
- scipy | |||
- toolz | |||
- rasterio |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
roll back these changes
@@ -7,6 +9,7 @@ dependencies: | |||
- netcdf4=1.1.9 | |||
- scipy=0.16.0 | |||
- toolz | |||
- rasterio |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
roll back these changes
class RasterioArrayWrapper(NDArrayMixin): | ||
def __init__(self, ds): | ||
self._ds = ds | ||
self.array = ds.read() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shoyer, do we need to maintain the array
attribute here? Would it make more sense to just populate the _ds
attribute and set the array in __getitem__
?
self._attrs = OrderedDict() | ||
with suppress(AttributeError): | ||
for attr_name in ['crs', 'transform', 'proj']: | ||
self._attrs[attr_name] = getattr(self.ds, attr_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is some extra indentation here -- should be 4 spaces.
coords = _try_to_get_latlon_coords(self.coords, self._attrs) | ||
vars = {__rio_varname__: self.open_store_variable(__rio_varname__)} | ||
vars.update(coords) | ||
return FrozenOrderedDict(vars) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should use another name other than vars
since it is a python builtin.
on_disk = open_dataset(tmp_file, engine='rasterio') | ||
actual = on_disk.rename({'foo': 'bar', 'x': 'y'}) | ||
del on_disk # trigger garbage collection | ||
self.assertDatasetIdentical(actual, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NicWayand - take a look at the PyNio tests. They are probably the closest analog for what we need to test here.
@@ -69,7 +69,7 @@ install: | |||
- python setup.py install | |||
|
|||
script: | |||
- py.test xarray --cov=xarray --cov-report term-missing | |||
- py.test xarray --cov=xarray --cov-report term-missing -v |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change can actually be permanent -- it's valuable for tracking down segfaults on Travis more broadly.
Hi, unless there are objections I'd like to try to finish this PR. I just forked @NicWayand 's branch and started to look at it, and I hope to get something done during next week or so... |
Hi @fmaussion, no objections here. I got it working just barely for my project, and won't have time in the near future to devote to wrap this up. |
@gidden you might be interested in this |
@byersiiasa indeed I am, in fact I'm reviewing #1260 which overtook this one =) |
Closing this in favor of #1260 |
@jhamman started a backend for RasterIO that I have been working on. There are two issues I am stuck on that I could use some help:
Lat/long coords are not being decoded correctly (missing from output dataset). Lat/lon projection are correctly calculated and added here (https://github.com/NicWayand/xray/blob/feature/rasterio/xarray/backends/rasterio_.py#L117). But, it appears (with my limited knowledge of xarray) that the lat/long coords contained within
obj
are lost at this line (https://github.com/NicWayand/xray/blob/feature/rasterio/xarray/conventions.py#L930).Lazy-loading needs to be enabled. How can I setup/test this? Are there examples from other backends I could follow?
#790