Read + Write Zarr with consolidated metadata #3066

Draft
wants to merge 25 commits into
base: main


mannreis
Contributor

This PR is motivated by #2987 and is a follow-up to the closed PR2992. It infers whether the dataset is consolidated and acts accordingly. The implementation was inspired by the Zarr3 support work (by @DennisHeimbigner), which could simplify adding the same feature in the next version.

In short, this PR adds a layer (NCZMD, for NetCDF ZarrMetaData) that implements:

  • Listing of variables, groups and attributes
  • Fetching the content of the Zarr metadata JSON files (.zattrs, .zgroup, .zarray)
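
As a rough illustration of the listing operation, here is a minimal Python sketch (not the netcdf-c code) that derives groups, arrays and attributes purely from the keys of a Zarr v2 consolidated document. The function names and the trimmed `example` document are hypothetical; only the `zarr_consolidated_format`/`metadata` layout comes from Zarr v2 itself.

```python
# Trimmed, hand-written stand-in for a Zarr v2 .zmetadata document.
example = {
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zattrs": {"Description": "Consolidated zarr test"},
        ".zgroup": {"zarr_format": 2},
        "G1/.zattrs": {"Details": "Variables are chunked"},
        "G1/.zgroup": {"zarr_format": 2},
        "G1/subg1/.zgroup": {"zarr_format": 2},
        "G1/subg1/myarray/.zarray": {"shape": [6, 15], "chunks": [6, 15]},
        "G2/.zgroup": {"zarr_format": 2},
        "G2/other variable with spaces/.zarray": {"shape": [3, 5]},
    },
}


def list_children(consolidated, group=""):
    """Return (subgroups, arrays) directly under `group`, using only the keys."""
    prefix = group.strip("/")
    prefix = prefix + "/" if prefix else ""
    groups, arrays = set(), set()
    for key in consolidated["metadata"]:
        if not key.startswith(prefix):
            continue
        parts = key[len(prefix):].split("/")
        if len(parts) != 2:
            continue  # the group's own files, or a deeper descendant
        name, leaf = parts
        if leaf == ".zgroup":
            groups.add(name)
        elif leaf == ".zarray":
            arrays.add(name)
    return sorted(groups), sorted(arrays)


def list_attributes(consolidated, path=""):
    """Return the .zattrs content of a group or array, or {} if it has none."""
    path = path.strip("/")
    key = f"{path}/.zattrs" if path else ".zattrs"
    return consolidated["metadata"].get(key, {})


print(list_children(example))        # children of the root group
print(list_children(example, "G1"))  # children of G1
print(list_attributes(example, "G1"))
```

No storage access is needed once the document is loaded, which is the point of consolidation.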

This layer would be extended in the same way for writing (updating the internal consolidated JSON and syncing it on closure).

Depending on whether /.zmetadata exists, the operations above are either served from its content or performed directly on the storage via zmap.
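
The dispatch described here can be sketched in a few lines of Python. This is only an illustration of the idea for a directory store: `MetadataReader` is a hypothetical name, the real layer is written in C, and zmap is replaced by plain file reads.

```python
import json
from pathlib import Path


class MetadataReader:
    """Hypothetical stand-in for the dispatch idea; not the NCZMD implementation."""

    def __init__(self, root):
        self.root = Path(root)
        zmeta = self.root / ".zmetadata"
        # Load the consolidated document once if present;
        # otherwise stay in per-object mode.
        self.consolidated = (
            json.loads(zmeta.read_text())["metadata"] if zmeta.is_file() else None
        )

    def fetch(self, key):
        """Return parsed JSON for a key such as 'G1/.zattrs', or None if absent."""
        if self.consolidated is not None:
            return self.consolidated.get(key)
        path = self.root / key
        return json.loads(path.read_text()) if path.is_file() else None
```

Callers use `fetch("G1/.zattrs")` the same way in both modes; only the constructor decides where the answer comes from.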

The feature above allows the S3 client implementation to be used against vanilla HTTP servers when authentication is out of the picture. This is only possible because of 0832d450d207223fe43a9ee619bb722f9a29bff8, which avoids the S3 ListObjects call.
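
To illustrate why no listing call is needed, the sketch below fetches `.zmetadata` with a single GET on a known key and builds every other object URL from the keys it contains. It uses only the Python standard library; `object_url` and `fetch_consolidated` are hypothetical helpers, and the base URL in the usage note is a placeholder.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def object_url(base_url, key):
    """Join a dataset base URL and a Zarr key, percent-encoding spaces etc."""
    return base_url.rstrip("/") + "/" + quote(key)


def fetch_consolidated(base_url, timeout=10.0):
    """GET <base_url>/.zmetadata and return its 'metadata' mapping.

    One request on a well-known key replaces a directory/bucket listing.
    """
    with urlopen(object_url(base_url, ".zmetadata"), timeout=timeout) as resp:
        return json.load(resp)["metadata"]
```

For example, `fetch_consolidated("https://example.org/data/test-2.18.2")` would return the full key-to-JSON mapping from any server that can serve static files.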


As an example of how to produce a consolidated dataset in Python:

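import zarr
import numpy as np

# Name the store after the zarr version in use, e.g. test-2.18.2
name = f'test-{zarr.__version__}'
z = zarr.open(name, mode='w')
print(name)
z.attrs['Description'] = 'Consolidated zarr test'

# Group with attributes, a nested subgroup and a chunked array
G1 = z.create_group('G1')
G1.attrs['Details'] = 'Variables are chunked'
v1 = G1.create_group('subg1')
v1.array('myarray', np.arange(90, dtype='i4').reshape(6, 15), chunks=(6, 15))

# Second group holding an array whose name contains spaces
G2 = z.create_group('G2')
G2.array('other variable with spaces', np.arange(15).reshape(3, 5))

# Write /.zmetadata, gathering every .zgroup/.zarray/.zattrs into one document
zarr.consolidate_metadata(z.store)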

This can be used to check that the read output remains the same after (re)moving the .zmetadata file:

ncdump file://test-2.18.2#mode=zarr > csl.out
mv test-2.18.2/.zmetadata .
diff csl.out <(ncdump file://test-2.18.2#mode=zarr)
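
A complementary check, independent of ncdump, is to confirm that every entry in `.zmetadata` matches the individual metadata file on disk. The following is a minimal sketch for a directory store using only the Python standard library; `consolidated_mismatches` is a hypothetical helper, not part of zarr or netcdf-c.

```python
import json
from pathlib import Path

META_NAMES = (".zattrs", ".zgroup", ".zarray")


def consolidated_mismatches(root):
    """Return the sorted keys whose consolidated entry differs from the file on disk.

    A key present on only one side is also reported.
    """
    root = Path(root)
    consolidated = json.loads((root / ".zmetadata").read_text())["metadata"]
    on_disk = {}
    for name in META_NAMES:
        # rglob also matches the file sitting directly in the root directory
        for path in root.rglob(name):
            on_disk[path.relative_to(root).as_posix()] = json.loads(path.read_text())
    return sorted(
        key
        for key in consolidated.keys() | on_disk.keys()
        if consolidated.get(key) != on_disk.get(key)
    )
```

An empty list means both read paths see the same metadata, so the two ncdump outputs should agree as far as metadata is concerned.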

Something similar is done in 6346e91, taking into account zip and file modes. Integrated tests exercising S3 are limited on my side (I'll try to add some here). However, I have used it against my own endpoints and it seems to be functional.

@WardF
Member

WardF commented Dec 17, 2024

@DennisHeimbigner Failures in the code preventing compilation aside, I'd be interested in your thoughts on this, particularly in advance of our scheduled conversation with @mannreis and Flo re: consolidated metadata. Thanks!

Member

@WardF WardF left a comment

@mannreis I will take a look at the compilation failures in the next couple of days and pitch in where I can. I'm going to convert this to a draft PR for the time being, until we have the compilation and tests passing. Thanks!

@WardF WardF marked this pull request as draft December 17, 2024 18:12
@DennisHeimbigner
Collaborator

In our meeting this morning, you indicated that you had modified ncjson
to make dict insertions faster. Can you point me to that code?

@mannreis
Contributor Author

mannreis commented Dec 18, 2024

I mentioned that with respect to my work on write operations. The main concern wasn't speed but key-value duplication when inserting a value under an already existing key: 57bf0b9. I'll merge the write functionality into this branch and rename the PR to Read + Write then.
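
To make the duplication issue concrete, here is a small Python sketch. It assumes a dict stored as a flat list of alternating keys and values; that representation and both function names are assumptions for illustration and are not taken from commit 57bf0b9.

```python
def dict_insert_append(contents, key, value):
    """Naive insert: always appends, so an existing key ends up stored twice."""
    contents.extend([key, value])


def dict_insert_or_replace(contents, key, value):
    """Replace the value if the key already exists, otherwise append a new pair."""
    for i in range(0, len(contents), 2):  # keys sit at even positions
        if contents[i] == key:
            contents[i + 1] = value
            return
    contents.extend([key, value])


naive = ["G1/.zattrs", {"Details": "old"}]
dict_insert_append(naive, "G1/.zattrs", {"Details": "new"})
print(len(naive) // 2)  # number of stored pairs after the naive insert

fixed = ["G1/.zattrs", {"Details": "old"}]
dict_insert_or_replace(fixed, "G1/.zattrs", {"Details": "new"})
print(len(fixed) // 2)  # number of stored pairs after insert-or-replace
```

The linear scan trades some insertion speed for uniqueness, which matters when the consolidated JSON is updated repeatedly before being written out.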

@mannreis mannreis changed the title Read Zarr with consolidated metadata Read + Write Zarr with consolidated metadata Dec 18, 2024
@WardF
Member

WardF commented Dec 23, 2024

I've merged the latest main into this branch to pick up the recent updates to the GitHub Actions workflows.

@DennisHeimbigner
Collaborator

See draft pr #3068

3 participants