Initialise zarr metadata without computing dask graph #6084
The challenge is that Xarray needs some way to represent the "schema" for the entire desired dataset. I'm very open to alternatives, but so far the most convenient way to do this has been to load Dask arrays into an xarray.Dataset. It's worth noting that any dask arrays with the desired chunking scheme will do -- you don't need to use the same dask arrays that you want to compute. When I do this sort of thing, I will often use
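A minimal sketch of the "schema dataset" idea from the comment above: the variable, dimension names, shape, and chunking below are illustrative assumptions, not taken from the original discussion. `da.zeros` builds only a small task graph, so no full-size array is allocated.

```python
import dask.array as da
import xarray as xr

# Any dask arrays with the desired chunking scheme will do; da.zeros
# is lazy, so this does not allocate the (365, 720, 1440) array.
schema = xr.Dataset(
    {
        "t2m": (
            ("time", "lat", "lon"),
            da.zeros((365, 720, 1440), chunks=(10, 720, 1440), dtype="f4"),
        )
    }
)
```

The schema dataset carries exactly the metadata (dims, dtype, chunks) that a later `to_zarr(..., compute=False)` call needs.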
What metadata is being determined by computing the whole array?
Thanks @shoyer. I understand the need for the schema, but is there a need to actually generate the dask graph when all the user wants to do is initialise an empty zarr store? E.g., I think skipping this line would save some of the users in my original post a lot of time. Regardless, your suggestion to just create a low-overhead version of the array being initialised is probably better/cleaner than adding a specific option or method. Would it be worth adding the
I don't think that line adds any measurable overhead. It's just telling dask to delay computation of a single function. For sure this would be worth elaborating on in the Xarray docs! I wrote a little bit about this in the docs for Xarray-Beam: see "One recommended pattern" in https://xarray-beam.readthedocs.io/en/latest/read-write.html#writing-data-to-zarr
As a user, I find this topic very unclear and would hope that there were a clear and concise way to do this in the future. In general, incrementally writing a zarr array with xarray seems very convoluted in my opinion, especially compared with the actual Python zarr API.
FYI I think #8460 should solve most of this. Or would anything remain?
If we create the array with chunks, then it doesn't allocate memory! There's more context in the linked PR / some links from there...
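As a rough illustration of the point above, using dask directly (the shapes are illustrative assumptions): a chunked, lazily created array has a nominal size far larger than any memory actually allocated, because only the task graph exists until it is computed.

```python
import dask.array as da

# A nominally 8 TB array: creating it with chunks builds only the task
# graph; no element storage is allocated up front.
huge = da.zeros((1_000_000, 1_000_000), chunks=(10_000, 10_000), dtype="f8")
```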
**Is your feature request related to a problem? Please describe.**

When writing large Zarr stores, the xarray docs recommend first creating an initial Zarr store without writing all of its array data. The recommended approach is to first create a dummy dask-backed `Dataset`, and then call `to_zarr` with `compute=False` to write only metadata to Zarr. This works great.

It seems that in one common use case for this approach (including the example in the docs above), the entire dataset to be written to Zarr is already represented in a `Dataset` (let's call this `ds`). Thus, rather than creating a dummy `Dataset` with exactly the same metadata as `ds`, it is more convenient to initialise the Zarr store with `ds.to_zarr(..., compute=False)`. See for example:

- https://discourse.pangeo.io/t/many-netcdf-to-single-zarr-store-using-concurrent-futures/2029
- https://discourse.pangeo.io/t/map-blocks-and-to-zarr-region/2019
- https://discourse.pangeo.io/t/netcdf-to-zarr-best-practices/1119/12
- https://discourse.pangeo.io/t/best-practice-for-memory-management-to-iteratively-write-a-large-dataset-with-xarray/1989

However, calling `to_zarr` with `compute=False` still computes the dask graph for writing the Zarr store. The graph is never used in this use case, but computing it can take a really long time for large graphs.

**Describe the solution you'd like**

Is there scope to add an option to `to_zarr` that initialises the store without computing the dask graph? Or would a dedicated `initialise_zarr` method be cleaner?