
Feature/pickle rasterio #2131
Merged: 15 commits, Jun 7, 2018
doc/whats-new.rst (3 additions, 0 deletions)
@@ -86,6 +86,9 @@ Bug fixes
  By `Deepak Cherian <https://github.com/dcherian>`_.
- Colorbar limits are now determined by excluding ±Infs too.
  By `Deepak Cherian <https://github.com/dcherian>`_.
- Fixed a bug in the ``rasterio`` backend which prevented use with ``distributed``.
  The ``rasterio`` backend now returns pickleable objects (:issue:`2121`).
  By `Joe Hamman <https://github.com/jhamman>`_.
- Fixed ``to_iris`` to maintain lazy dask array after conversion (:issue:`2046`).
  By `Alex Hilson <https://github.com/AlexHilson>`_ and `Stephan Hoyer <https://github.com/shoyer>`_.

xarray/backends/common.py (40 additions, 0 deletions)
@@ -503,3 +503,43 @@ def assert_open(self):
        if not self._isopen:
            raise AssertionError('internal failure: file must be open '
                                 'if `autoclose=True` is used.')


class PickleByReconstructionWrapper(object):
Member: It would be nice to add a unit test that verifies this works properly, independently of any concrete datastore. Maybe something simple with open()?

Member Author: Done in 8b286c0
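The suggested datastore-independent check might look like the following sketch, which pairs a condensed standalone copy of the wrapper from this diff (mode handling omitted) with the built-in open(); the file path and contents are made up for illustration:

```python
import os
import pickle
import tempfile


class PickleByReconstructionWrapper(object):
    # Condensed copy of the wrapper in this diff, so the test does not
    # depend on any concrete xarray datastore.
    def __init__(self, opener, *args, **kwargs):
        self.opener = opener
        self.open_args = args
        self.open_kwargs = kwargs
        self._ds = None
        self._isopen = False

    @property
    def value(self):
        if self._ds is not None and self._isopen:
            return self._ds
        self._ds = self.opener(*self.open_args, **self.open_kwargs)
        self._isopen = True
        return self._ds

    def __getstate__(self):
        state = self.__dict__.copy()
        # Drop the unpicklable file handle; it is reconstructed on access.
        del state['_ds']
        del state['_isopen']
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._ds = None
        self._isopen = False


tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'data.txt')
with open(path, 'w') as f:
    f.write('contents')

# The wrapper should yield the same data before and after a pickle
# round-trip, since the handle is reopened lazily on the restored copy.
wrapper = PickleByReconstructionWrapper(open, path)
first = wrapper.value.read()

restored = pickle.loads(pickle.dumps(wrapper))
second = restored.value.read()
```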


    def __init__(self, opener, *args, **kwargs):
        self.opener = opener
        self.open_args = args
        self.open_kwargs = kwargs
Member: For simplicity, let's pool these into a single functools.partial object. Note that you can get the keyword arguments back out of a partial via partial.keywords, if desired.

Member Author: Should we just have the backends pass in a partial function, then?

Member: If we don't anticipate adding any arguments, then I think it's just as easy to make the partial function here. I suppose there's some potential overlap in needs with auto-closing a file, so I'm OK either way.
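The partial-pooling suggestion can be sketched like this; fake_open is a stand-in for an opener such as rasterio.open, purely for illustration:

```python
import functools


def fake_open(path, mode='r'):
    # Stand-in for a real opener like rasterio.open.
    return (path, mode)


# Pool the opener, positional args, and keyword args into one object
# instead of storing them as three separate attributes.
opener = functools.partial(fake_open, '/tmp/example.tif', mode='w')

# Keyword arguments remain recoverable from the partial, e.g. so
# __getstate__ could still rewrite mode='w' to mode='a' before pickling.
mode = opener.keywords['mode']
result = opener()
```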


        self._ds = None
        self._isopen = False

    @property
    def value(self):
Member: I'm a little concerned about the complexity of making file opening lazy; state makes things harder to reason about. For example, what happens with the mode argument if we never actually accessed the file? For now, can we stick with constructing value eagerly?

Member Author: I'm actually not a big fan of dropping the lazy opening. For one, there will be a non-negligible performance penalty. Perhaps a better argument is that we've been using this logic successfully in xarray for a while now. That said, it's not clear to me what the potential problems with this approach are, so I'm open to being convinced otherwise.

Member: Don't we always open files immediately to pull out metadata (e.g., variables and dimensions)?

Member Author: I think you are right. I've removed most of this in 6669035

        if self._ds is not None and self._isopen:
            return self._ds
        self._ds = self.opener(*self.open_args, **self.open_kwargs)
        self._isopen = True
        return self._ds

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['_ds']
Member: Shouldn't we close the file here? state['_ds'].close()

Contributor: It might be used after it is pickled.

Member: Ah OK, maybe I misunderstood. I thought the purpose of this wrapper was to dereference (and close) the file before pickle.dump() and re-open it after pickle.load().

Member Author: The basic idea is that most open file handles can't be pickled, so we need to provide a mechanism that removes the existing handle and generates a new one during the dump/load steps of pickling. In most cases, we do want to keep the original file open.
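The premise that open file handles don't pickle can be checked directly; this small sketch shows the TypeError CPython raises when serializing a file object, which is what the wrapper works around:

```python
import pickle
import tempfile

# Attempting to pickle an open file handle fails, motivating the
# approach of reconstructing the handle on load instead of serializing it.
f = tempfile.TemporaryFile(mode='w+')
try:
    pickle.dumps(f)
    raised = False
except TypeError:
    raised = True
finally:
    f.close()
```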

        del state['_isopen']
        if self.open_kwargs.get('mode', None) == 'w':
            # file has already been created, don't override when restoring
            state['open_kwargs']['mode'] = 'a'
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._ds = None
Member: Here too?

        self._isopen = False

    def __getitem__(self, key):
        return self.value[key]

    def __getattr__(self, name):
Member: This magic is convenient, but I think it could be a source of bugs. I would stick to requiring pulling out .value.

Member Author: Yeah, I was thinking someone would call me on this :). It was the path of least resistance for getting this working, but I think a lot of care would be required not to trample on attributes. Do you think I should drop both __getitem__/__getattr__?

Member Author: Removed in 097e264

        return getattr(self.value, name)
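The mode handling in __getstate__ above can be illustrated in isolation. This sketch mimics (but does not import) the wrapper's rewrite of mode='w' to mode='a' on the pickled state; the dictionary contents are illustrative:

```python
import copy


def prepare_state(open_kwargs):
    # A file opened with mode='w' already exists by the time the object
    # is pickled, so the restored copy should reopen it in append mode
    # rather than truncate it; other modes pass through unchanged.
    state = {'open_kwargs': copy.deepcopy(open_kwargs)}
    if state['open_kwargs'].get('mode', None) == 'w':
        state['open_kwargs']['mode'] = 'a'
    return state


write_state = prepare_state({'mode': 'w'})
read_state = prepare_state({'mode': 'r'})
```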
xarray/backends/rasterio_.py (3 additions, 2 deletions)
@@ -8,7 +8,7 @@
from .. import DataArray
from ..core import indexing
from ..core.utils import is_scalar
-from .common import BackendArray
+from .common import BackendArray, PickleByReconstructionWrapper

try:
    from dask.utils import SerializableLock as Lock
@@ -194,7 +194,8 @@ def open_rasterio(filename, parse_coordinates=None, chunks=None, cache=None,
"""

import rasterio
riods = rasterio.open(filename, mode='r')

riods = PickleByReconstructionWrapper(rasterio.open, filename, mode='r')

    if cache is None:
        cache = chunks is None
xarray/tests/test_backends.py (8 additions, 0 deletions)
@@ -2649,6 +2649,14 @@ def test_chunks(self):
        ex = expected.sel(band=1).mean(dim='x')
        assert_allclose(ac, ex)

    def test_pickle_rio(self):
        # regression test for https://github.com/pydata/xarray/issues/2121
        with create_tmp_geotiff() as (tmp_file, expected):
            with xr.open_rasterio(tmp_file) as rioda:
                temp = pickle.dumps(rioda)
                with pickle.loads(temp) as actual:
                    assert_equal(actual, rioda)
Contributor: Does it make sense to also add an integration test reading rasterio data with dask.distributed?

Member Author: Done in 55a3abc.


    def test_ENVI_tags(self):
        rasterio = pytest.importorskip('rasterio', minversion='1.0a')
        from rasterio.transform import from_origin