
Writeable backends via entrypoints #5954

Open
5 tasks
rabernat opened this issue Nov 8, 2021 · 7 comments


rabernat commented Nov 8, 2021

The backend refactor has gone a long way towards making it easier to implement custom backend readers via entry points. However, it is still not clear how to implement a writeable backend from a third party package as an entry point. Some of the reasons for this are:

  • While our reading function (open_dataset) has a generic name, our writing functions (Dataset.to_netcdf / Dataset.to_zarr) are still format specific. (Related to load_store and dump_to_store #3638). I propose we introduce a generic Dataset.to method and deprecate the others.
  • The BackendEntrypoint base class does not have a writing method, just open_dataset:
    ```python
    class BackendEntrypoint:
        """
        ``BackendEntrypoint`` is a class container and it is the main interface
        for the backend plugins, see :ref:`RST backend_entrypoint`.
        It shall implement:

        - ``open_dataset`` method: it shall implement reading from file, variables
          decoding and it returns an instance of :py:class:`~xarray.Dataset`.
          It shall take in input at least ``filename_or_obj`` argument and
          ``drop_variables`` keyword argument.
          For more details see :ref:`RST open_dataset`.
        - ``guess_can_open`` method: it shall return ``True`` if the backend is able to open
          ``filename_or_obj``, ``False`` otherwise. The implementation of this
          method is not mandatory.
        """
    ```

    (Related to API Design for Xarray Backends #1970)
  • As a result, writing is implemented ad-hoc for each backend.
  • This makes it impossible for a third-party package to implement writing.

We should fix this situation! Here are the steps I would take.

  • Decide on the desired API for writeable backends.
  • Formalize this in the BackendEntrypoint base class.
  • Refactor the existing writeable backends (netcdf4-python, h5netcdf, scipy, Zarr) to use this API
  • Maybe deprecate to_zarr and to_netcdf (or at least refactor to make a shallow call to a generic method)
  • Encourage third party implementors to try it (e.g. TileDB)
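A rough sketch of what such a generic entry point and `Dataset.to`-style dispatch could look like. All names here (`save_dataset`, `BACKENDS`, `save`) are invented for illustration and are not the actual xarray API; the toy dict-based backend just stands in for a real storage format:

```python
# Hypothetical sketch: a BackendEntrypoint base class extended with a
# ``save_dataset`` hook, plus a generic dispatcher that a ``Dataset.to``
# method could call.  All names are invented for illustration.

class BackendEntrypoint:
    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        raise NotImplementedError

    def save_dataset(self, dataset, target, **kwargs):
        # Writing is optional: backends that don't override this
        # remain read-only, preserving backwards compatibility.
        raise NotImplementedError(
            f"{type(self).__name__} does not support writing"
        )


class DictBackendEntrypoint(BackendEntrypoint):
    """Toy in-memory backend demonstrating the write hook."""

    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        return dict(filename_or_obj)

    def save_dataset(self, dataset, target, **kwargs):
        target.update(dataset)


# Stand-in for the entry-point registry xarray builds from installed plugins.
BACKENDS = {"dict": DictBackendEntrypoint()}


def save(dataset, target, *, engine):
    """What a generic ``ds.to(target, engine=...)`` might dispatch to."""
    return BACKENDS[engine].save_dataset(dataset, target)


store = {}
save({"temperature": [1, 2, 3]}, store, engine="dict")
```

The key design point is that writing becomes a method on the same entry-point class that already handles reading, so a third-party package registers one object for both directions.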

jhamman commented Nov 8, 2021

Thanks @rabernat for opening up this issue. I think now that the refactor for read support is completed, it is a great time to discuss the opportunities for adding write support to the plugin interface.

pinging @aurghs and @alexamici since I know they have some thoughts developed here.

@abkfenris

Is `ds.to(...)` the most discoverable method for users?

What about making it so that backends can add methods to `ds.to`, so `ds.to.netcdf()` or `ds.to.tile_db()` would be available based on which backends are installed? That way users would not have to guess as much about which engines and file types can be written.
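A toy sketch of how such a dynamically populated `ds.to` namespace might work. The accessor class, the `WRITERS` registry, and the minimal `Dataset` are all hypothetical stand-ins for xarray's real plugin machinery:

```python
# Hypothetical sketch: a ``ds.to`` namespace whose methods are populated
# from whichever writeable backends are installed, so users can
# tab-complete ``ds.to.netcdf()`` / ``ds.to.tile_db()``.

class ToNamespace:
    def __init__(self, data, writers):
        # Bind one method per registered writeable backend.
        for name, writer in writers.items():
            setattr(self, name, lambda target, _w=writer: _w(data, target))


# Toy writer registry standing in for the entry-point lookup.
WRITERS = {
    "netcdf": lambda data, target: target.update({"format": "netcdf", **data}),
    "tile_db": lambda data, target: target.update({"format": "tiledb", **data}),
}


class Dataset:
    def __init__(self, data):
        self.data = data

    @property
    def to(self):
        # Rebuilt on access, so newly installed backends appear automatically.
        return ToNamespace(self.data, WRITERS)


ds = Dataset({"temperature": [1, 2, 3]})
out = {}
ds.to.netcdf(out)  # discoverable via tab completion on ``ds.to``
```

One trade-off of this design: methods that only exist when a plugin is installed are harder to document statically, which is part of the discoverability question raised above.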

@Illviljan

Another option is using a similarly named store function to mirror the read functions:

xr.open_dataset(...)
xr.store_dataset(...)


keewis commented Nov 9, 2021

If we do that, I'd call it save_dataset to be consistent with {open,save}_mfdataset


alexamici commented Nov 9, 2021

@rabernat and all, at the time of the read-only backend refactor @aurghs and I spent quite some time analysing write support and thinking of a unifying strategy. This is my interpretation of our findings:

  1. one of the big advantages of the unified xr.open_dataset API is that you don't need to specify the engine of the input data and can rely on xarray guessing it. This is in general not true when you write your data, as you care about the format you are storing it in.

  2. another advantage of xr.open_dataset is that xarray manages all the functionality related to dask and to in-memory caching, so backends only need to know how to lazily read from the storage. The current (rather complex) implementation has support for writing from dask and distributed workers, but most backends serialise writes anyway, so the advantage is limited. This is not to say that it is not worth doing, but the cost / benefit ratio of supporting potentially distributed writes is much lower than for read support.

  3. that said, I'd really welcome a unified write API like ds.save(engine=...) or even xr.save_dataset(ds, engine=...) with an engine keyword argument and possibly other common options. Adding support for a single save_dataset entry point to the backend API is trivial, but adding full support for possibly distributed writes looks like much more work.
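The "trivial" case in point 3 can be illustrated with a toy serialising backend: the backend receives fully realised in-memory data and writes it in one shot, leaving the hard lazy/distributed machinery out of the picture. The JSON backend and the `save_dataset`/`open_dataset` signatures here are invented for illustration:

```python
# Sketch of the simple, serialising write path: compute everything,
# then write once.  A JSON "backend" stands in for a real format.

import json
import os
import tempfile


def json_save_dataset(dataset, filename, **kwargs):
    # The easy case: all data is already in memory, one shot to disk.
    with open(filename, "w") as f:
        json.dump(dataset, f)


def json_open_dataset(filename, *, drop_variables=None):
    with open(filename) as f:
        data = json.load(f)
    for name in drop_variables or ():
        data.pop(name, None)
    return data


path = os.path.join(tempfile.mkdtemp(), "example.json")
json_save_dataset({"temperature": [1, 2, 3], "units": "K"}, path)
roundtripped = json_open_dataset(path, drop_variables=["units"])
```

What this sketch deliberately omits, per the comment above, is exactly the hard part: chunked, lazy, or distributed writes where the backend must accept data incrementally.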

Also note that ATM @aurghs and I are overloaded at work and we would have very little time that we can spend on this :/


rabernat commented Nov 9, 2021

Thanks for the info @alexamici!

> 2. but most backends serialise writes anyway, so the advantage is limited.

I'm not sure I understand this comment, specifically what is meant by "serialise writes". I often use Xarray to do distributed writes to Zarr stores using 100+ distributed dask workers. It works great. We would need the same thing from a TileDB backend.

We are focusing on the user-facing API, but in the end, whether we call it .to, .to_dataset, or .store_dataset is not really a difficult or important question. It's clear we need some generic writing method. The much harder question is the back-end API. As Alessandro says:

> Adding support for a single save_dataset entry point to the backend API is trivial, but adding full support for possibly distributed writes looks like it is much more work.
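The reason Zarr-style concurrent writes work can be illustrated with a toy example: each worker owns a disjoint region of the store, so no global lock is needed. A thread pool stands in here for distributed dask workers; real distributed writes would go through dask and `ds.to_zarr`:

```python
# Toy illustration of region-based concurrent writes (the Zarr model):
# workers write disjoint slices of a chunked store, so they never
# contend and need no global write lock.

from concurrent.futures import ThreadPoolExecutor

store = [None] * 8  # stand-in for a chunked on-disk array

def write_region(start, stop):
    # Each task owns slice [start, stop) exclusively -> lock-free safety.
    for i in range(start, stop):
        store[i] = i * 10

regions = [(0, 2), (2, 4), (4, 6), (6, 8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(lambda r: write_region(*r), regions))
```

Supporting this model generically in the plugin API means the backend must be able to receive arbitrary regions from arbitrary workers, which is exactly the "much more work" part.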

@alexamici

> 2. but most backends serialise writes anyway, so the advantage is limited.

> I'm not sure I understand this comment, specifically what is meant by "serialise writes". I often use Xarray to do distributed writes to Zarr stores using 100+ distributed dask workers. It works great. We would need the same thing from a TileDB backend.

I should have added "except Zarr" 😅 .

All netCDF writers use xr.backends.locks.get_write_lock to get a scheduler-appropriate write lock. The code is intricate and I can't point you to the exact spot, but as I recall the lock is used so that only one worker/process/thread can write to disk at a time.

Concurrent writes a la Zarr are awesome and xarray supports them now, so my point was: we can add non-concurrent write support to the plugin architecture quite easily and that will serve a lot of users. But supporting Zarr and other advanced backends via the plugin architecture is a lot more work.
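The serialised-write pattern described above can be illustrated with a toy lock, a stand-in for what `xr.backends.locks.get_write_lock` provides in scheduler-appropriate form (the "file" here is just a log list):

```python
# Minimal illustration of serialised writes: a single lock ensures only
# one worker touches the file at a time, the model most netCDF backends
# follow.  The lock and the log-based "file" are toys for illustration.

import threading

write_lock = threading.Lock()
log = []

def locked_write(chunk):
    with write_lock:  # one writer at a time, like the netCDF backends
        log.append(("start", chunk))
        log.append(("end", chunk))

threads = [threading.Thread(target=locked_write, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Because of the lock, every "start" is immediately followed by its "end":
serialized = all(log[i][1] == log[i + 1][1] for i in range(0, len(log), 2))
```

This is the "easy" plugin-architecture target: correct, simple, and sufficient for many users, at the cost of throughput compared to the lock-free region writes Zarr supports.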
