Add a GRIB backend via ECMWF cfgrib / ecCodes #2476

Merged 29 commits on Oct 17, 2018

@alexamici alexamici commented Oct 9, 2018

This is currently a WIP PR for review.

The implementation depends on the python module cfgrib and the C-library ecCodes to be installed.

Work in progress items:

  • the coordinate rename doesn't really belong here, move it to cfgrib,
  • port cfgrib backend to use the new CachingFileManager interface
  • implement proper locking
  • test dask support (real performance with dask depends on saving the external index)

cc @StephanSiemen @iainrussell

@pep8speaks

pep8speaks commented Oct 9, 2018

Hello @alexamici! Thanks for updating the PR.

Comment last updated on October 15, 2018 at 13:59 Hours UTC

@shoyer
Member

shoyer commented Oct 9, 2018

@alexamici great to see this!

I'm about to merge a refactor of xarray's backends for v0.11 in #2261. You'll need to refactor a little bit to accommodate this, but hopefully that should be straightforward. The new interface should be a little easier to use, especially for handling many files or dask-distributed support.

return array


FLAVOURS = {
Member

Is this configuration dict something that could live in cfgrib rather than xarray?

Collaborator Author

Yes, will do.

"""
    def __init__(self, ds, variable_map={}, autoclose=False):
        self.ds = ds
        self.variable_map = variable_map.copy()
Member

Is there a particular reason for including variable_map in the interface on the DataStore class, rather than letting users change variable names later with Dataset.rename?

Collaborator Author

It is the same issue as above. The whole coordinate rename doesn't belong here and will be moved to cfgrib.

@jhamman jhamman mentioned this pull request Oct 9, 2018
2 tasks
@alexamici alexamici force-pushed the feature/grib-support-via-cfgrib branch from c69b1e8 to 72606f7 on October 9, 2018 08:34
@alexamici
Collaborator Author

alexamici commented Oct 9, 2018

@shoyer at long last! :)

I quickly sync'ed with the new backend API. I did some light testing.

Note that I didn't test with dask at all and I'm not using the passed lock. Pointers to how to properly use the new backend interface are welcome.

@shoyer
Member

shoyer commented Oct 9, 2018

The appropriate lock to use depends on cfgrib. Is the library thread-safe? If not, it's probably best to use a global (per process) lock.
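
A minimal sketch of the global (per process) lock idea, assuming a library that is not thread-safe. `LIBRARY_LOCK`, `read_message` and `stats` are hypothetical names used only for illustration; they are not part of cfgrib, ecCodes or xarray.

```python
import threading

# One lock shared by every caller in this process (hypothetical name).
LIBRARY_LOCK = threading.Lock()

# Bookkeeping used only to show that calls never overlap.
stats = {"in_flight": 0, "max_in_flight": 0}


def read_message(index):
    """Stand-in for a call into a library that is not thread-safe."""
    with LIBRARY_LOCK:
        stats["in_flight"] += 1
        stats["max_in_flight"] = max(stats["max_in_flight"], stats["in_flight"])
        result = index * 2  # placeholder for the real, unsafe library call
        stats["in_flight"] -= 1
    return result


threads = [threading.Thread(target=read_message, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every call takes the same module-level lock, at most one thread is inside the library at a time. The cost is that reads within one process are serialized.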

pynio is probably the simplest example of how to write a new backend:
https://github.com/pydata/xarray/blob/master/xarray/backends/pynio_.py

The main difference is that you should make use of CachingFileManager to add support for dask and opening many files at once. File managers make it possible to pickle a datastore, which is what we need to make dask-distributed work.
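
To illustrate why a file manager helps with pickling, here is a much-simplified sketch. It is not xarray's CachingFileManager: `ReopeningFileManager` is a hypothetical stand-in that stores the opener and its arguments rather than the open handle, so the handle can be dropped before pickling and recreated afterwards.

```python
import os
import pickle
import tempfile


class ReopeningFileManager:
    """Hypothetical, simplified stand-in for a file manager.

    It remembers how to open a file instead of holding on to the handle.
    """

    def __init__(self, opener, *args):
        self._opener = opener
        self._args = args
        self._file = None  # opened lazily on first acquire()

    def acquire(self):
        if self._file is None:
            self._file = self._opener(*self._args)
        return self._file

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None

    def __getstate__(self):
        # Open handles cannot be pickled; keep only the recipe.
        return {"opener": self._opener, "args": self._args}

    def __setstate__(self, state):
        self._opener = state["opener"]
        self._args = state["args"]
        self._file = None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.txt")
        with open(path, "w") as f:
            f.write("hello")

        manager = ReopeningFileManager(open, path)
        manager.acquire()  # the manager now holds an open handle
        clone = pickle.loads(pickle.dumps(manager))  # the handle is left out
        print(clone.acquire().read())
        clone.close()
        manager.close()
```

The unpickled copy reopens the file on demand, which is the property a distributed worker needs. The real class adds caching and locking that this sketch leaves out.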

@alexamici
Collaborator Author

The cfgrib_.py backend now uses the new CachingFileManager interface and I added a global process lock, just in case. The code is very simple and I mimicked the PyNIO backend, so I expect it to work correctly with dask, but I'm not sure how to test for it.

Furthermore I expect dask performance to be abysmal until I implement ecmwf/cfgrib#20.

@alexamici
Collaborator Author

@shoyer BTW what timeframe do you expect for the v0.11 release? And would you consider merging this Pull Request before the release, assuming that we do a cfgrib release with read support declared beta?

        if lock is None:
            lock = ECCODES_LOCK
        self.lock = ensure_lock(lock)
        backend_kwargs['filter_by_keys'] = tuple(backend_kwargs.get('filter_by_keys', {}).items())
Member

This looks a little surprising to me -- can we simply pass on backend_kwargs directly to cfgrib?

Collaborator Author

@alexamici alexamici Oct 10, 2018

This is an ugly hack around the fact that filter_by_keys is a dict but CachingFileManager accepts only hashable backend_kwargs because they are passed to _HashedSequence.

filter_by_keys has a very nice dict semantics, so I'd prefer not to change it.

Member

OK, that makes sense. I think the need to hash arguments used to open files with CachingFileManager is unavoidable, so this is a reasonable workaround. But please add a comment explaining this.
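
The workaround discussed here can be sketched as follows. `freeze_kwargs` and `thaw_filter` are hypothetical helper names, and sorting the items is an added assumption (the line under review does not sort) so that equal dicts give equal keys regardless of insertion order. The GRIB keys are only example values.

```python
def freeze_kwargs(kwargs):
    """Return a hashable copy of kwargs, turning dict values into item tuples."""
    frozen = {}
    for name, value in kwargs.items():
        if isinstance(value, dict):
            # dicts are unhashable; a sorted tuple of (key, value) pairs is not
            value = tuple(sorted(value.items()))
        frozen[name] = value
    return tuple(sorted(frozen.items()))


def thaw_filter(items):
    """Rebuild the original dict from the tuple of items."""
    return dict(items)


backend_kwargs = {"filter_by_keys": {"typeOfLevel": "surface", "shortName": "2t"}}
key = freeze_kwargs(backend_kwargs)

# The frozen form can be used as a dictionary key, e.g. in a cache of open files.
cache = {key: "opened dataset placeholder"}
```

Converting back with `dict(...)` right before calling the library keeps the dict semantics for users while satisfying the hashing requirement internally.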

Collaborator Author

Ok, I added a comment and made the code more explicit.

        self.backend_array = backend_array

    def __getattr__(self, item):
        return getattr(self.backend_array, item)
Member

This sort of forwarding is really error prone. Let's avoid it in favor of an explicit solution if at all possible.
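
One common way such forwarding goes wrong, shown as an illustration and not necessarily the problem the reviewer had in mind: if `__getattr__` runs before `backend_array` has been set, looking up `backend_array` calls `__getattr__` again and recurses. The class names below are hypothetical.

```python
from types import SimpleNamespace


class ForwardingWrapper:
    """Forwards every unknown attribute to the wrapped array."""

    def __init__(self, backend_array):
        self.backend_array = backend_array

    def __getattr__(self, item):
        return getattr(self.backend_array, item)


class ExplicitWrapper:
    """Exposes only the attributes the wrapper is meant to provide."""

    def __init__(self, backend_array):
        self.backend_array = backend_array

    @property
    def shape(self):
        return self.backend_array.shape

    @property
    def dtype(self):
        return self.backend_array.dtype


def probe(cls):
    """Look up an attribute on an instance whose __init__ never ran."""
    bare = cls.__new__(cls)  # copying and unpickling create instances this way
    try:
        bare.shape
    except RecursionError:
        return "RecursionError"
    except AttributeError:
        return "AttributeError"
    return "ok"


fake_array = SimpleNamespace(shape=(2, 3), dtype="float32")
wrapped = ExplicitWrapper(fake_array)
```

With explicit properties a half-initialized object fails with an ordinary `AttributeError`, and a misspelled attribute name is reported against the wrapper instead of being silently passed to the wrapped object.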

Collaborator Author

Done.

@shoyer
Member

shoyer commented Oct 9, 2018 via email

Member

@shoyer shoyer left a comment

This looks pretty reasonable but it needs tests, ideally including an integration test that verifies we can actually read data from a grib file. I will see if I can dig up a good example from xarray/tests/test_backends.py, but the pynio or pseudonetcdf tests are likely a good start.
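
A rough sketch of what such an integration test could look like. It uses the standard library's `unittest` to stay self-contained and is not modeled on xarray's own test helpers; the file path `example.grib` and the class name are assumptions. Only the `xr.open_dataset(..., engine='cfgrib')` call comes from this PR.

```python
import importlib.util
import os
import unittest

# Hypothetical path; adjust to wherever a sample GRIB file actually lives.
EXAMPLE_GRIB = "example.grib"

HAS_CFGRIB = (
    importlib.util.find_spec("xarray") is not None
    and importlib.util.find_spec("cfgrib") is not None
)


@unittest.skipUnless(HAS_CFGRIB, "requires xarray and cfgrib")
@unittest.skipUnless(os.path.exists(EXAMPLE_GRIB), "requires a sample GRIB file")
class TestCfGribRead(unittest.TestCase):
    def test_open_dataset(self):
        import xarray as xr  # imported lazily so this module loads without it

        with xr.open_dataset(EXAMPLE_GRIB, engine="cfgrib") as ds:
            # Smoke check only: the file opens and exposes at least one variable.
            self.assertGreater(len(ds.data_vars), 0)
```

The skip decorators keep the suite green on machines without the optional dependencies, which matters because ecCodes is a C library that not every environment will have installed.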

    def __getitem__(self, key):
        return indexing.explicit_indexing_adapter(
            key, self.shape, indexing.IndexingSupport.BASIC, self._getitem)
Member

Can you verify which forms of indexing cfgrib actually supports? In your previous commit this was OUTER_1VECTOR. See the docstring on IndexingSupport for details.

Collaborator Author

Now that I checked, I can say that cfgrib supports indexing.IndexingSupport.OUTER. I don't remember how I ended up declaring OUTER_1VECTOR, and since it sounded wrong, in the last commit I intended to play it as safe as possible.

Code updated.
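
For readers unfamiliar with the terms: as I read the IndexingSupport docstring, BASIC means integers and slices only, OUTER also allows an array indexer per axis applied independently, OUTER_1VECTOR allows at most one such array, and VECTORIZED allows full pointwise indexing. The pure-Python functions below are a hypothetical illustration of the difference between outer and pointwise indexing on a 2-D nested list, not code from xarray or cfgrib.

```python
def outer_index(rows, row_idx, col_idx):
    """Outer (orthogonal) indexing: every selected row with every selected column."""
    return [[rows[i][j] for j in col_idx] for i in row_idx]


def pointwise_index(rows, row_idx, col_idx):
    """Vectorized (pointwise) indexing: pair the indexers element by element."""
    return [rows[i][j] for i, j in zip(row_idx, col_idx)]


data = [
    [0, 1, 2],
    [10, 11, 12],
    [20, 21, 22],
]

outer = outer_index(data, [0, 2], [1, 2])          # 2 x 2 block
pointwise = pointwise_index(data, [0, 2], [1, 2])  # 2 individual points
```

A backend that declares OUTER only has to implement the first behaviour; the indexing adapter is responsible for expressing other requests in terms of what the backend supports.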

@alexamici
Collaborator Author

@shoyer thank you very much for the guidance, it is very appreciated!

I started working on the tests, but I was blocked immediately by something that looks trivial. I cannot get the test class PyNioTest to run:

$ python setup.py test --addopts -v | grep Test
...
xarray/tests/test_backends.py::TestCommon::test_robust_getitem PASSED    [  2%]
xarray/tests/test_backends.py::TestRasterio::test_serialization SKIPPED  [  3%]
xarray/tests/test_backends.py::TestRasterio::test_utm SKIPPED            [  4%]
xarray/tests/test_backends.py::TestRasterio::test_non_rectilinear SKIPPED [  4%]
xarray/tests/test_backends.py::TestRasterio::test_platecarree SKIPPED    [  4%]
xarray/tests/test_backends.py::TestRasterio::test_notransform SKIPPED    [  4%]
xarray/tests/test_backends.py::TestRasterio::test_indexing SKIPPED       [  4%]
xarray/tests/test_backends.py::TestRasterio::test_caching SKIPPED        [  4%]
xarray/tests/test_backends.py::TestRasterio::test_chunks SKIPPED         [  4%]
xarray/tests/test_backends.py::TestRasterio::test_pickle_rasterio SKIPPED [  4%]
xarray/tests/test_backends.py::TestRasterio::test_ENVI_tags SKIPPED      [  4%]
xarray/tests/test_backends.py::TestRasterio::test_no_mftime SKIPPED      [  4%]
xarray/tests/test_backends.py::TestRasterio::test_http_url SKIPPED       [  4%]
xarray/tests/test_backends.py::TestEncodingInvalid::test_extract_nc4_variable_encoding PASSED [  4%]
xarray/tests/test_backends.py::TestEncodingInvalid::test_extract_h5nc_encoding PASSED [  4%]
xarray/tests/test_backends.py::TestValidateAttrs::test_validating_attrs PASSED [  4%]
xarray/tests/test_backends.py::TestDataArrayToNetCDF::test_dataarray_to_netcdf_no_name PASSED [  4%]
xarray/tests/test_backends.py::TestDataArrayToNetCDF::test_dataarray_to_netcdf_with_name PASSED [  4%]
xarray/tests/test_backends.py::TestDataArrayToNetCDF::test_dataarray_to_netcdf_coord_name_clash PASSED [  4%]
xarray/tests/test_backends.py::TestDataArrayToNetCDF::test_open_dataarray_options PASSED [  4%]
xarray/tests/test_backends.py::TestDataArrayToNetCDF::test_dataarray_to_netcdf_return_bytes PASSED [  4%]
xarray/tests/test_backends.py::TestDataArrayToNetCDF::test_dataarray_to_netcdf_no_name_pathlib PASSED [  4%]
...

How do you make pytest run test classes that do not start with Test?

@shoyer
Member

shoyer commented Oct 11, 2018

@alexamici oops, we accidentally disabled most of our backend unit tests -- see #2479 for the fix.

@shoyer
Member

shoyer commented Oct 12, 2018

Tests should be fixed if you merge in master now.

@alexamici
Collaborator Author

alexamici commented Oct 14, 2018

@shoyer I added the TestCfGrib class with basic read tests, a test GRIB file and ecCodes / cfgrib to requirements-py36.yml.

Questions:

  1. I failed to reach 100% coverage because I didn't find a way to test CfGribDataStore.get_dimensions. Any hint? Done.
  2. Can/should I test dask and dask.distributed support in the unit tests? How? Any pointers?

It looks like we are very close, what do you think? Shall I move to the documentation?

@alexamici
Collaborator Author

alexamici commented Oct 14, 2018

Tests added and passing with 100% coverage.
Added minimal documentation.

BTW, I did a 0.9.0 beta release of cfgrib and I plan to give the public announcement tomorrow. :)

@alexamici alexamici changed the title WIP: Add a GRIB backend via ECMWF cfgrib / ecCodes Add a GRIB backend via ECMWF cfgrib / ecCodes Oct 14, 2018
@alexamici
Collaborator Author

@shoyer I'm ready to integrate more feedback (especially on the documentation), but I removed the WIP: prefix as I'd consider the PR good to go as soon as you like it.

Do you usually keep the history of PRs as it is or do you prefer me to rebase?

@shoyer
Member

shoyer commented Oct 14, 2018

> Do you usually keep the history of PRs as it is or do you prefer me to rebase?

We squash commits upon merging, so please leave things as is :)

Member

@shoyer shoyer left a comment

I'm happy with the way this looks! If you want to verify that things work with dask-distributed (which is probably a good idea!), I would suggest adding an integration test in test_distributed.py. The rasterio test in there is probably a good example.

@@ -0,0 +1,97 @@
#
# Copyright 2017-2018 European Centre for Medium-Range Weather Forecasts.
Member

I don't really object to this notice, but we don't include it in other source code files for xarray so it looks a little out of place (perhaps we should?). Everything is Apache 2 licensed already, and owned by contributors (or whoever they assign it to, such as an employer).

Collaborator Author

@alexamici alexamici Oct 14, 2018

I'm not sure how to handle this one. ECMWF is quite sensitive to license and IPR matters so I added the licence boilerplate to all cfgrib files and later I simply copied the existing backend code as part of the PR.

I'm the material author of the code and my name will appear in the contributors, but I've been working fully funded by ECMWF as an external contractor, so it looks like proper attribution would be lost in this case.

I need to ask @StephanSiemen if they object to removing the copyright notice or what else they propose. To my knowledge this is the first contribution to an external Open Source project funded by ECMWF this way and we are learning how to handle these kinds of details as we go.

Member

It says "Copyright 2014-2018, xarray Developers" in our README.rst file, which was modeled off projects like NumPy: https://github.com/numpy/numpy/blob/master/LICENSE.txt

I see now that our LICENSE file just has the original Apache license text. Perhaps we should add in the more specific "Copyright xarray developers" line, as a project like TensorFlow does: https://github.com/tensorflow/tensorflow/blob/master/LICENSE

Member

@alexamici (and ECMWF by proxy I suppose) - first, I want to make sure you understand that I'm very encouraged by your recent developments of cfgrib as an open source package and your contributions to xarray. They are much appreciated and they stand to benefit a wide swath of the geoscience community.

That said, at this point, I'm somewhat against adding this copyright/license header for these reasons:

  • Xarray already has a clear copyright statement that is permissive enough to allow you (and ECMWF) to maintain copyrights over your contributions.
  • We don't do this anywhere else in the xarray code base. Though many of the contributors that have developed xarray have done so as employees/contractors of various organizations, we've not adopted this level of documentation with regard to the original author or subsequent editors.

Now, I understand that it can be important for organizations to make visible their open source contributions so we want to make sure this sort of engagement can continue to happen. We've recently joined NUMFOCUS, in part to give us access to some proper legal advice when necessary. We can certainly solicit their advice here if we have technical questions.

It's probably worth pinging the rest of @pydata/xarray to get their thoughts here.

@alexamici, @shoyer, @jhamman,
We at ECMWF are very happy for the copyright to be adjusted according to other contributions in xarray. Any acknowledgement of the contribution is much appreciated. Thanks

Collaborator Author

Looks like I was being overzealous :)

I removed the copyright notice and licence boilerplate.

@alexamici
Collaborator Author

alexamici commented Oct 14, 2018

@shoyer I added a test for dask.distributed and it passes but please check that it is meaningful, as I'm not completely sure what to test.

doc/io.rst Outdated
to :py:func:`~xarray.open_dataset`:

.. ipython:: python

    ds_grib = xr.open_dataset('example.grib', engine='cfgrib')
Member

If example.grib doesn't actually exist when we build the docs, this will give a nasty error message. It would be better to use a :verbatim: directive here -- scroll open to the opendap examples to see what that looks like.

Collaborator Author

Ouch! I noticed that there was no docs testing in Travis-CI and wondered if you were testing the docs at all. I should have built the docs myself. What is the intended way to build the docs? python setup.py build_sphinx fails on my setup due to missing numpydoc.

I'm fixing it as you suggest.

Collaborator Author

I added the :verbatim: directive, but I was not able to test the build of the documentation due to several errors when building the gallery (a number of core dumps in GEOS, maybe due to the setup on my macOS machine).

The fix looks trivial enough that it may work.

Member

I think this would work:

.. ipython::
    :verbatim:

    In [3]: ds_grib = xr.open_dataset('example.grib', engine='cfgrib')

We actually do test the doc build on Travis-CI, but unfortunately there's no easy way to see generated docs and also errors in ipython directive blocks don't stop the build (there's a bug we filed about this somewhere).

Collaborator Author

@alexamici alexamici Oct 17, 2018

Fixed. I mis-read the example.

@shoyer shoyer merged commit 9f4474d into pydata:master Oct 17, 2018
@shoyer
Member

shoyer commented Oct 17, 2018

Thanks @alexamici and ECMWF!

@iainrussell

Congratulations!

@alexamici alexamici deleted the feature/grib-support-via-cfgrib branch October 25, 2018 20:57
6 participants