Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/rasterio #1070

Closed
wants to merge 16 commits into from
Closed

Feature/rasterio #1070

wants to merge 16 commits into from

Conversation

NicWayand
Copy link

@jhamman started a backend for RasterIO that I have been working on. There are two issues I am stuck on that I could use some help:

  1. Lat/long coords are not being decoded correctly (missing from output dataset). Lat/lon projection are correctly calculated and added here (https://github.com/NicWayand/xray/blob/feature/rasterio/xarray/backends/rasterio_.py#L117). But, it appears (with my limited knowledge of xarray) that the lat/long coords contained within obj are lost at this line (https://github.com/NicWayand/xray/blob/feature/rasterio/xarray/conventions.py#L930).

  2. Lazy-loading needs to be enabled. How can I setup/test this? Are there examples from other backends I could follow?

#790

@fmaussion
Copy link
Member

fmaussion commented Oct 31, 2016

This is awesome!

For 2: I think you don't have much more to do than make __getitem__ use the indexer to read the data with a window (example here)

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to do anything special to enable lazy loading. That logic is all handled in the xarray.Variable object.

except AttributeError:
pass

self.coords = _try_to_get_latlon_coords(self.coords, self._attrs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to return these in the dict returned by get_variables(). DataStore objects don't have a coords attribute.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. This now pulls in lat/long if the projection was successful, otherwise it just returns the basic x/y indices. Should we also be returning the original projection coords? (currently they are lost).

@@ -3,7 +3,7 @@ channels:
- conda-forge
dependencies:
- python=2.7
- cdat-lite
# - cdat-lite
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems unrelated?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it has been removed.

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like tests still need some work to pass on Travis-CI.

coords = dict(y=coords['y'], x=coords['x'])
dims = ('y', 'x')

coords_out['lat'] = DataArray(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be just xarray.Variable objects, not DataArray objects

proj = pyproj.Proj(attrs['crs'])
x, y = np.meshgrid(coords['x'], coords['y'])
proj_out = pyproj.Proj("+init=EPSG:4326", preserve_units=True)
xc, yc = _transform_proj(proj, proj_out, x, y)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might try to do this calculation lazily, e.g., by making a utils.NDArrayMixin subclass in conventions.py and wrapping with indexing.LazilyIndexedArray.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I don't understand how to do this. Is there an example I can go off of?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have some examples of making NDArrayMixin subclasses in conventions.py, e.g.,
https://github.com/pydata/xarray/blob/master/xarray/conventions.py#L311

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to use dask.array to build the coordinate arrays, though the NDArrayMixin would avoid the need for dask.

except AttributeError:
pass

# def get_vardata(self, var_id=1):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be removed


self._attrs = OrderedDict()
for attr_name in ['crs', 'transform', 'proj']:
try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use with pycompat.suppress(AttributeError):

def _try_to_get_latlon_coords(coords, attrs):
coords_out = {}
try:
import pyproj
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little surprised rasterio doesn't have projections built in.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They do: https://mapbox.github.io/rasterio/topics/reproject.html

I will take a stab at doing it the rasterio way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a reprojection, which does seem useful, but it seems separate from giving you lat and lon coordinates

@@ -11,6 +11,7 @@ dependencies:
- pandas
- seaborn
- scipy
- rasterio
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless rasterio only supports Python 3, should add it to other test suites, too (especially Python 2.7).

Copy link
Author

@NicWayand NicWayand Oct 31, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rasterio supports 2.7 and 3.3-3.5 (https://mapbox.github.io/rasterio/)

I added this requirement to all other test suites, was that correct? Version 1.0 is needed as many function names changed in rasterio at V1.0.

@NicWayand
Copy link
Author

Tested open_mfdataset() on 100+ geotiffs and lazyloading with rasterio does appear to be working.

@NicWayand
Copy link
Author

Travis-ci fail is because it can't find rasterio, which comes through the conda-forge channel (https://github.com/conda-forge/rasterio-feedstock). I think it needs to be added as described here (http://conda.pydata.org/docs/travis.html#additional-steps). But I am new to Travis-ci, so don't want to mess up the current .travis.yml file.

@@ -12,5 +12,6 @@ dependencies:
- scipy
- pytest-cov
- cyordereddict
- rasterio>=1.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no official release of rasterio >=1.0. However, I did merge the latest alpha, so if you remove the >= you should get it.

@shoyer
Copy link
Member

shoyer commented Oct 31, 2016

Add conda-forge it to the channels section of conda requirements yml files:

@NicWayand
Copy link
Author

Any idea why Segmentation faults occur for 3.4 and 4.5?

@shoyer
Copy link
Member

shoyer commented Oct 31, 2016

I would start by adding -v to our invocation of py.test in .travis.yml:

- py.test xarray --cov=xarray --cov-report term-missing

class RasterioArrayWrapper(NDArrayMixin):
def __init__(self, ds):
self._ds = ds
self.array = ds.read()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless the behavior of rasterio changed with v1.0, this loads the data in memory. I might be wrong, but I think the call to ds.read() should happen in __getitem__, ideally with a window kwarg.

I also wonder if the call to read shouldn't happen in a with rasterio.Env(): environment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shoyer, do we need to maintain the array attribute here? Would it make more sense to just populate the _ds attribute and set the array in __getitem__?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to set array -- that's merely for the benefit of NDArrayMixin in core/utils.py. You might even skip NDArrayMixin altogether and just use NdimSizeLenMixin, though you'll want to define properties/methods for each of those defined by NDArrayMixin.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NicWayand -

I think you can basically drop this in replacing the current RasterioArrayWrapper:

class RasterioArrayWrapper(NdimSizeLenMixin):
    """Mixin class RasterIO datasets for making wrappers of N-dimensional 
    arrays that conform to the ndarray interface required for the data
    argument to Variable objects.
    A subclass should set the `array` property and override one or more of
    `dtype`, `shape` and `__getitem__`.
    """
    def __init__(self, ds):
        self.ds = ds

    @property
    def dtype(self):
        if len(set(self.ds.dtypes[0])) != 1:
            raise ValueError(
                'Can only handle Rastio dataset with all bands having the same type')
        return np.dtype(self.ds.dtypes[0])

    @property
    def shape(self):
        return self.ds.shape

    def __array__(self, dtype=None):
        '''Not sure if this will work as is'''
        return np.asarray(self[...], dtype=dtype)

    def __getitem__(self, key):
        band = range(self.shape[0])[key[0]]
        window = []
        for win in key[1:]:
            if instance(win, slice):
                window.append((win.start, win.stop))
            elif isinstance(win, int):
                window.append((win, win + 1))
            else:  # integer ndarray
                window.append((win.min(), win.max()))
        raw_data = self.ds.read(band, window=window)
        # now, fix up raw_data to conform to numpy indexing conventions
        # - drop axes for integer band/windows
        # - stride if windows are slices with win.step != 1
        # - subset if window is an integer ndarray

    def __repr__(self):
        return '%s(array=%r)' % (type(self).__name__, self.ds)

I'm not sure the __array__ will work as it is until we get the __get_item__ fully functional.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also worth consider how/if you want to handle automatically masking missing values -- it looks like ds.read() can optionally return a MaskedArray. The standard xarray approach would be to automatically promote the dtype and fill these with NaNs.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm, the above code is waiting for the decision from rasterio group?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest getting something working here, with an eye toward keeping it self-contained. Later, we can try to port it upstream to rasterio.

def __getitem__(self, key):
if key == () and self.ndim == 0:
return self.array.get_value()
return self.array[key]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on what @fmaussion said above, I think this should be something like:

    def __getitem__(self, key):
        if key == () and self.ndim == 0:
            return self._ds.read()
        return self._ds.read(band, window=window)

Where band and window describe the slice of data to be read. Off the top of my head, I'm not exactly sure how to parse the key here though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the time indexers get here, they should have already passed through canonicalize_indexer, which means you only need to handle a key that is a tuple of the appropriate number of dimensions (i.e., 3) composed of integers, slices and integer ndarrays.

Based on the docstring for read, we want something like:

def __getitem__(self, key):
    band = range(self.shape[0])[key[0]]
    window = []
    for win in key[1:]:
        if instance(win, slice):
            window.append((win.start, win.stop))
        elif isinstance(win, int):
            window.append((win, win + 1))
        else:  # integer ndarray
            window.append((win.min(), win.max()))
    raw_data = self._ds.read(band, window=window)
    # now, fix up raw_data to conform to numpy indexing conventions
    # - drop axes for integer band/windows
    # - stride if windows are slices with win.step != 1
    # - subset if window is an integer ndarray

Honestly, this logic should probably live in rasterio if possible. I'm a little surprised that they have never implemented a __getitem__ method.

Copy link
Member

@shoyer shoyer Oct 31, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See also: https://gist.github.com/lpinner/bd57b54a5c6903e4a6a2 (can't reuse this directly, though, because it doesn't have a license).

Anyways, I would definitely see if the rasterio folks are up for implementing a __getitem__ method. There's no reason why this is xarray specific -- you would need this for just using dask.array with rasterio, as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See also rasterio/rasterio#920 for related discussion.

Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NicWayand - this is coming along nicely. The remaining issues as I see them are:

  • unit tests
  • __getitem__ wrapper method
  • coordinate projection

dependencies:
- python=2.7
- pytest
- numpy==1.9.3
- pandas==0.15.0
- rasterio
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can roll back these changes. This test setup is just for required dependencies which rasterio will not be.

@@ -8,6 +10,7 @@ dependencies:
- numpy
- pandas
- scipy
- rasterio
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can also roll back these changes as well

dependencies:
- python=3.3
- pytest
- pandas
- rasterio
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

roll back these changes

@@ -8,6 +10,7 @@ dependencies:
- numpy
- pandas
- scipy
- rasterio
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to test rasterio along with pydap so we can roll back these changes.

@@ -7,6 +9,7 @@ dependencies:
- pandas
- scipy
- toolz
- rasterio
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

roll back these changes

@@ -7,6 +9,7 @@ dependencies:
- netcdf4=1.1.9
- scipy=0.16.0
- toolz
- rasterio
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

roll back these changes

class RasterioArrayWrapper(NDArrayMixin):
def __init__(self, ds):
self._ds = ds
self.array = ds.read()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shoyer, do we need to maintain the array attribute here? Would it make more sense to just populate the _ds attribute and set the array in __getitem__?

self._attrs = OrderedDict()
with suppress(AttributeError):
for attr_name in ['crs', 'transform', 'proj']:
self._attrs[attr_name] = getattr(self.ds, attr_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is some extra indentation here -- should be 4 spaces.

coords = _try_to_get_latlon_coords(self.coords, self._attrs)
vars = {__rio_varname__: self.open_store_variable(__rio_varname__)}
vars.update(coords)
return FrozenOrderedDict(vars)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should use another name other than vars since it is a python builtin.

on_disk = open_dataset(tmp_file, engine='rasterio')
actual = on_disk.rename({'foo': 'bar', 'x': 'y'})
del on_disk # trigger garbage collection
self.assertDatasetIdentical(actual, expected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NicWayand - take a look at the PyNio tests. They are probably the closest analog for what we need to test here.

@@ -69,7 +69,7 @@ install:
- python setup.py install

script:
- py.test xarray --cov=xarray --cov-report term-missing
- py.test xarray --cov=xarray --cov-report term-missing -v
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change can actually be permanent -- it's valuable for tracking down segfaults on Travis more broadly.

@fmaussion
Copy link
Member

Hi, unless there are objections I'd like to try to finish this PR. I just forked @NicWayand 's branch and started to look at it, and I hope to get something done during next week or so...

@NicWayand
Copy link
Author

Hi @fmaussion, no objections here. I got it working just barely for my project, and won't have time in the near future to devote to wrap this up.

@fmaussion fmaussion mentioned this pull request Feb 10, 2017
6 tasks
@byersiiasa
Copy link

@gidden you might be interested in this

@gidden
Copy link
Contributor

gidden commented May 22, 2017

@byersiiasa indeed I am, in fact I'm reviewing #1260 which overtook this one =)

@fmaussion
Copy link
Member

Closing this in favor of #1260

@fmaussion fmaussion closed this May 22, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants