
Initialise zarr metadata without computing dask graph #6084

Open
dougiesquire opened this issue Dec 17, 2021 · 6 comments
Labels
topic-backends topic-zarr Related to zarr storage library

Comments

@dougiesquire

Is your feature request related to a problem? Please describe.
On writing large zarr stores, the xarray docs recommend first creating an initial Zarr store without writing all of its array data. The recommended approach is to first create a dummy dask-backed Dataset, and then call to_zarr with compute=False to write only metadata to Zarr. This works great.

It seems that in one common use case for this approach (including the example in the above docs), the entire dataset to be written to Zarr is already represented in a Dataset (let's call this ds). Thus, rather than creating a dummy Dataset with exactly the same metadata as ds, it is more convenient to initialise the Zarr store with ds.to_zarr(..., compute=False). See for example:

https://discourse.pangeo.io/t/many-netcdf-to-single-zarr-store-using-concurrent-futures/2029
https://discourse.pangeo.io/t/map-blocks-and-to-zarr-region/2019
https://discourse.pangeo.io/t/netcdf-to-zarr-best-practices/1119/12
https://discourse.pangeo.io/t/best-practice-for-memory-management-to-iteratively-write-a-large-dataset-with-xarray/1989

However, calling to_zarr with compute=False still constructs the full dask graph for writing the Zarr store. That graph is never executed in this use case, but building it can take a very long time for large datasets.

Describe the solution you'd like
Is there scope to add an option to to_zarr to initialise the store without computing the dask graph? Or perhaps an initialise_zarr method would be cleaner?

@TomNicholas TomNicholas added the topic-zarr Related to zarr storage library label Dec 19, 2021
@shoyer
Member

shoyer commented Dec 21, 2021

The challenge is that Xarray needs some way to represent the "schema" for the desired entire dataset. I'm very open to alternatives, but so far, the most convenient way to do this has been to load Dask arrays into an xarray.Dataset.

It's worth noting that any dask arrays with the desired chunking scheme will do -- you don't need to use the same dask arrays that you want to compute. When I do this sort of thing, I will often use xarray.zeros_like() to create low overhead versions of dask arrays, e.g., in this example from Xarray-Beam:
https://github.com/google/xarray-beam/blob/0.2.0/examples/era5_climatology.py#L61-L68

@dcherian
Contributor

What metadata is being determined by computing the whole array?

@dougiesquire
Author

Thanks @shoyer. I understand the need for the schema, but is there a need to actually generate the dask graph when all the user wants to do is initialise an empty zarr store? E.g., I think skipping this line would save some of the users in my original post a lot of time.

Regardless, your suggestion to just create a low-overhead version of the array being initialised is probably better/cleaner than adding a specific option or method. Would it be worth adding the xarray.zeros_like(ds) recommendation to the docs?

@shoyer
Member

shoyer commented Jan 12, 2022

E.g., I think skipping this line would save some of the users in my original post a lot of time.

I don't think that line adds any measurable overhead. It's just telling dask to delay computation of a single function.

For sure this would be worth elaborating on in the Xarray docs! I wrote a little bit about this in the docs for Xarray-Beam: see "One recommended pattern" in https://xarray-beam.readthedocs.io/en/latest/read-write.html#writing-data-to-zarr

@sehoffmann

sehoffmann commented Apr 3, 2024

As a user, I find this topic very unclear and would hope that there will be a clear and concise way to do this in the future.
In essence, I should just be expected to pass in my coords, dims, shapes, and chunks and let xarray handle the rest, similar to how the xr.Dataset or xr.DataArray constructor already needs this meta information from me. This could also be an extra function such as xr.create_zarr(shapes, dims, coords, chunks), for instance.

In general, incrementally writing a zarr array with xarray seems to be very convoluted in my opinion, especially if compared with the actual python zarr API.

@dougiesquire

Regardless, your suggestion to just create a low-overhead version of the array being initialised is probably better/cleaner than adding a specific option or method. Would it be worth adding the xarray.zeros_like(ds) recommendation to the docs?

zeros_like() allocates memory, doesn't it? Same with empty_like. If you are in the business of incrementally writing data to disk, it's usually because memory is no longer big enough. So this is unfortunately not an option for me.

@max-sixty
Collaborator

FYI I think #8460 should solve most of this. Or would anything remain?

zeros_like() allocates memory, doesn't it? Same with empty_like. If you are in the business of incrementally writing data to disk, it's usually because memory is no longer big enough. So this is unfortunately not an option for me.

If we create the array with chunks, then it doesn't allocate memory! There's more context in the linked PR / some links from there...
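To illustrate this point with a hypothetical sketch (names and sizes invented): building the template on top of lazy dask zeros records only a small task graph, so no element buffers are allocated regardless of the array's nominal size.

```python
import dask.array as da
import xarray as xr

# da.zeros is lazy, so this "1000 x 1000" float64 array costs only a
# handful of graph nodes, not ~8 MB of memory (and the same holds at
# terabyte scale).
template = xr.Dataset(
    {"big": (("time", "x"), da.zeros((1_000, 1_000), chunks=(100, 1_000)))}
)

# The underlying data is a dask array, not a materialised numpy buffer;
# template.to_zarr(path, mode="w", compute=False) would then write
# metadata only.
print(type(template["big"].data))
```

The same applies to `xr.zeros_like(ds)` when `ds` is itself dask-backed: the result stays lazy.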
