
Feature/pickle rasterio #2131

Merged (15 commits) on Jun 7, 2018

Conversation

@jhamman (Member) commented May 14, 2018

cc @rsignell-usgs


self.opener = opener
self.open_args = args
self.open_kwargs = kwargs
Member

For simplicity, let's pool these into a single functools.partial object. Note that you can retrieve the keyword arguments, if desired, via partial.keywords.
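A minimal sketch of the suggestion (the file name here is a hypothetical placeholder):

```python
from functools import partial

# Bundle a hypothetical opener with its arguments into one picklable
# callable, instead of storing opener/args/kwargs as three attributes.
opener = partial(open, 'example.txt', mode='r', encoding='utf-8')

# The pieces remain inspectable if a backend needs them later:
assert opener.func is open
assert opener.args == ('example.txt',)
assert opener.keywords == {'mode': 'r', 'encoding': 'utf-8'}
```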

Member Author

Should we just have the backends pass in a partial function, then?

Member

If we don't anticipate adding any arguments, then I think it's just as easy to make the partial function here. I suppose there's some potential overlap in needs with auto-closing a file, so I'm OK either way.

self._isopen = False

@property
def value(self):
Member

I'm a little concerned about the complexity of making file opening lazy. State makes things harder to reason about. For example, what happens with the mode argument if we never actually accessed the file?

For now can we stick with constructing value eagerly?

Member Author

I'm actually not a big fan of dropping the lazy opening. For one, there will be a non-negligible performance penalty. Perhaps a better argument is that we've been using this logic successfully in xarray for a while now. That said, it's not clear to me what the potential problems with this approach are, so I'm open to being convinced otherwise.

Member

Don't we always open files immediately to pull out metadata? (e.g., variables and dimensions)

Member Author

I think you are right. I've removed most of this in 6669035.

def __getitem__(self, key):
    return self.value[key]

def __getattr__(self, name):
Member

This magic is convenient, but I think it could be a source of bugs. I would stick to requiring that callers pull out .value explicitly.
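One way this kind of magic can bite is during unpickling: pickle looks up __setstate__ on a half-initialized instance, the lookup falls through to __getattr__, and the delegation recurses. A hypothetical sketch, not the PR's actual code:

```python
import pickle

class DelegatingWrapper:
    """Hypothetical wrapper that forwards unknown attributes to .value."""
    def __init__(self, value):
        self.value = value

    def __getattr__(self, name):
        # Only runs when normal lookup fails. During unpickling the
        # instance dict is still empty, so looking up 'value' here
        # re-enters __getattr__ and recurses.
        return getattr(self.value, name)

w = DelegatingWrapper([1, 2, 3])
assert w.count(2) == 1  # convenient delegation onto the wrapped list

try:
    pickle.loads(pickle.dumps(w))
except RecursionError:
    print('unpickling recursed into __getattr__')
```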

Member Author

Yeah, I was thinking someone would call me on this :). It was the path of least resistance for getting this working, but I think a lot of care would be required not to trample on attributes. Do you think I should drop both __getitem__ and __getattr__?

Member Author

Removed in 097e264

@@ -503,3 +503,43 @@ def assert_open(self):
        if not self._isopen:
            raise AssertionError('internal failure: file must be open '
                                 'if `autoclose=True` is used.')


class PickleByReconstructionWrapper(object):
Member

It would be nice to add a unit test that verifies this works properly, independently of any concrete datastore.

Maybe something simple with open()?
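A sketch of what such a test could look like, using a minimal illustrative stand-in for the wrapper (the class name and signature here are assumptions, not the PR's exact API):

```python
import os
import pickle
import tempfile

class ReconstructionWrapper:
    """Illustrative stand-in: pickles by re-running the opener on load."""
    def __init__(self, opener, *args, **kwargs):
        self.opener = opener
        self.args = args
        self.kwargs = kwargs
        self._ds = opener(*args, **kwargs)

    @property
    def value(self):
        if self._ds is None:
            # reopen lazily after unpickling
            self._ds = self.opener(*self.args, **self.kwargs)
        return self._ds

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['_ds']  # file handles can't be pickled
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._ds = None

# Round-trip through pickle using plain open(), no datastore involved.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('foo bar spam eggs')
    path = f.name
try:
    original = ReconstructionWrapper(open, path)
    restored = pickle.loads(pickle.dumps(original))
    roundtrip_text = restored.value.read()
    original.value.close()
    restored.value.close()
finally:
    os.remove(path)

assert roundtrip_text == 'foo bar spam eggs'
```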

Member Author

Done in 8b286c0

with create_tmp_geotiff() as (tmp_file, expected):
    with xr.open_rasterio(tmp_file) as rioda:
        temp = pickle.dumps(rioda)
        actual = pickle.loads(temp)
Member

Something to think about here -- I'm pretty sure this leaks a file descriptor. I'm not quite sure what the best way to deal with this is (maybe it's not worth worrying about yet), but in the past this has caused issues with tests on various platforms (especially Windows).

Member

And indeed, it's causing a test failure on Windows :).

I think it should work to explicitly open actual in a context manager, e.g.,

with pickle.loads(temp) as actual:
    ...

@fmaussion (Member) left a comment

Thanks for digging into this! I don't fully understand what's going on here but I wondered if we shouldn't close the file instead of just dereferencing it.


def __getstate__(self):
    state = self.__dict__.copy()
    del state['_ds']
Member

Shouldn't we close the file here, i.e. state['_ds'].close()?

Contributor

It might be used after it is pickled.

Member

Ah OK, maybe I misunderstood. I thought the purpose of this wrapper was to dereference (and close) the file before dump() and re-open it after pickle.load().

Member Author

The basic idea is that most open file handles can't be pickled, so we need a mechanism to remove the existing handle and generate a new one during the dump/load steps of pickling. In most cases, we do want to keep the original file open.


def __setstate__(self, state):
    self.__dict__.update(state)
    self._ds = None
Member

Here too?

@mrocklin (Contributor) left a comment

Glad to see work in this direction.


def __getstate__(self):
    state = self.__dict__.copy()
    del state['_ds']
Contributor

It might be used after it is pickled.

with xr.open_rasterio(tmp_file) as rioda:
    temp = pickle.dumps(rioda)
    with pickle.loads(temp) as actual:
        assert_equal(actual, rioda)
Contributor

Does it make sense to also add an integration test reading rasterio data with dask.distributed?

Member Author

Done in 55a3abc.

@rsignell-usgs

@jhamman should I test this out on my original workflow or wait a bit?

@jhamman
Member Author

jhamman commented May 15, 2018

@rsignell-usgs - I think this should work in its current form. I'm writing some additional tests now and have some cleanup tasks as indicated in the reviews but I do not expect the behavior to change.

@shoyer
Member

shoyer commented May 15, 2018

Thinking about this a little more, maybe we should consolidate pickling with autoclosing into a single, composable wrapper class. Usage would look something like:

wrapper = FileWrapper(netCDF4.Dataset, path, mode='r', autoclose=False, kwargs=kwargs)
with wrapper:
    # will close/open wrapped file
    with wrapper.acquire() as ds:
        # context manager is dummy, simply returning the open file object
        ... # do stuff with ds, a netCDF4.Dataset instance
# or could write wrapper.close() instead of using the context manager

wrapper = FileWrapper(netCDF4.Dataset, path, mode='r', autoclose=True, kwargs=kwargs)
with wrapper:
    # explicit opening/closing the wrapper is a no-op if autoclose=True
    with wrapper.acquire() as ds:
        # context manager opens/closes file
        # if file is already open, it's a no-op
        ... # do stuff with ds, a netCDF4.Dataset instance

We could then expose xarray.backends.FileWrapper as part of our external API for backend authors (without coupling it into the Datastore class directly).

Eventually, we might even change the default for autoclose to automatically use LRU caching (but that obviously would come later).

@jhamman
Member Author

jhamman commented May 15, 2018

@shoyer - I like the idea of a wrapper object that achieves both of these tasks. I'm incrementally working toward that and have just updated the pickle wrapper for now.

@rsignell-usgs

rsignell-usgs commented Jun 5, 2018

@jhamman , still very much interested in this -- could the existing functionality be merged and enhanced later?


lines = ['foo bar spam eggs']

with create_tmp_file() as tmp:
Member

This needs to add allow_cleanup_failure=ON_WINDOWS for the test to pass on Windows.

Member Author

Great, that did fix the Windows failure.

@jhamman
Member Author

jhamman commented Jun 7, 2018

All the tests are passing here. It's been a while, so perhaps it would be good to get another review from @pydata/xarray.

@rsignell-usgs - can you give this another whirl on your use case?

@fmaussion (Member) left a comment

LGTM, thanks!

@rsignell-usgs

rsignell-usgs commented Jun 7, 2018

@jhamman , although I'm getting distributed workers to compute the mean from a bunch of images, I'm getting a "Failed to Serialize" error in cell [23] of this notebook:
https://gist.github.com/rsignell-usgs/90f15e2da918e3c6ba6ee5bb6095d594
If this is a bug, I think it was there before the recent updates.

You should be able to run this notebook without modification.

@shoyer (Member) left a comment

There's one bug to fix here, but otherwise I agree that we can merge this soon and clean it up later.

        return self._ds

    def __getstate__(self):
        del self._ds
Member

You definitely need to delete this from a copy of the state rather than from the live object, like what we do in DataStorePickleMixin:

def __getstate__(self):
    state = self.__dict__.copy()
    del state['_ds']
    del state['_isopen']
    if self._mode == 'w':
        # file has already been created, don't override when restoring
        state['_mode'] = 'a'
    return state

Otherwise, this will break when you try to pickle this more than once, which is the error that shows up in @rsignell-usgs's notebook.
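A hypothetical minimal reproduction of the difference: deleting the handle from the live instance in __getstate__ makes a second pickle fail, while deleting from a copy of __dict__ keeps the object intact.

```python
import pickle

class BadWrapper:
    """Deletes the handle from the live instance: breaks on re-pickle."""
    def __init__(self):
        self._ds = object()  # stand-in for an unpicklable file handle

    def __getstate__(self):
        del self._ds  # mutates self!
        return self.__dict__

class GoodWrapper:
    """Drops the handle from a copy of __dict__: safe to pickle repeatedly."""
    def __init__(self):
        self._ds = object()

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['_ds']
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._ds = None  # to be reopened lazily

bad = BadWrapper()
pickle.dumps(bad)  # first dump succeeds (and silently drops _ds)
try:
    pickle.dumps(bad)  # second dump: _ds is already gone
except AttributeError as err:
    print('second dump failed:', err)

good = GoodWrapper()
pickle.dumps(good)
pickle.dumps(good)  # fine the second time
```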

Member Author

Thanks, I had just noticed this problem; it should be fixed and tested in 35520c0.

@rsignell-usgs

rsignell-usgs commented Jun 7, 2018

@jhamman woohoo! Cell [20] completes nicely now:
https://gist.github.com/rsignell-usgs/90f15e2da918e3c6ba6ee5bb6095d594
I'm getting some errors in Cell [20], but I think those are unrelated and didn't affect the successful completion of the tasks, right? (this is on an HPC system)

@shoyer (Member) left a comment

LGTM

@jhamman merged commit 21a9f3d into pydata:master on Jun 7, 2018
@jhamman deleted the feature/pickle_rasterio branch on June 7, 2018 18:03
@rsignell-usgs

Might this PR warrant a new minor release?

@shoyer
Member

shoyer commented Jun 7, 2018

@rsignell-usgs Sure, though interp() (#2104) is almost ready, too, so we might wait for that.

@rsignell-usgs

Sounds good. Thanks @shoyer!

@shoyer
Member

shoyer commented Jun 8, 2018

v0.10.7 is out
