
Read Zarr with consolidated metadata #3066

Draft
wants to merge 13 commits into base: main
Conversation

mannreis
Contributor

This PR is motivated by #2987 and is a follow-up to the closed PR #2992. It infers whether the dataset is consolidated and acts accordingly. The implementation was inspired by the work on Zarr v3 support (by @DennisHeimbigner), which should make it simpler to add the same feature for the next format version.

In short, this PR adds a layer (NCZMD, for NetCDF ZarrMetaData) that implements:

  • Listing of variables, groups, and attributes
  • Fetching the content of the various Zarr metadata JSON files (.zattrs, .zgroup, .zarray)

This layer could be extended in the same way for writing (updating the internal consolidated JSON and syncing it on closure).

Depending on the existence of /.zmetadata, the operations above are either processed from its content or performed directly on the storage via zmap (see the sketch of the consolidated document below).
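For reference, in the Zarr v2 format the consolidated document is a single JSON object whose "metadata" member maps every .zgroup/.zarray/.zattrs key in the store to its content. A minimal sketch of inspecting it in Python, assuming the dataset produced by the snippet further down:

import json

# Path assumes zarr-python 2.18.2, matching the ncdump example below.
with open('test-2.18.2/.zmetadata') as f:
    consolidated = json.load(f)

print(consolidated['zarr_consolidated_format'])  # 1 for the Zarr v2 consolidated format
# One lookup yields every metadata key, so a reader never has to list the storage.
print(sorted(consolidated['metadata']))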

The feature above also allows the S3 client implementation to be used against vanilla HTTP servers, when authentication is out of the picture. This is only possible because of 0832d450d207223fe43a9ee619bb722f9a29bff8, which avoids the S3 ListObjects call.


As an example of how to produce a consolidated dataset in Python:

import zarr
import numpy as np

# zarr-python 2.x: open a directory store as the root group
name = f'test-{zarr.__version__}'
z = zarr.open(name, mode='w')
print(name)
z.attrs['Description'] = 'Consolidated zarr test'
G1 = z.create_group('G1')
G1.attrs['Details'] = 'Variables are chunked'
v1 = G1.create_group('subg1')
v1.array('myarray', np.arange(90, dtype='i4').reshape(6, 15), chunks=(6, 15))
G2 = z.create_group('G2')
G2.array('other variable with spaces', np.arange(15).reshape(3, 5))
# Write the consolidated .zmetadata document at the store root
zarr.consolidate_metadata(z.store)

This can be used to check that the reading output remains the same after (re)moving the .zmetadata:

ncdump file://test-2.18.2#mode=zarr > csl.out
mv test-2.18.2/.zmetadata .
diff csl.out <(ncdump file://test-2.18.2#mode=zarr)

Something similar is done in 6346e91, taking into account the zip and file modes. Integration tests exercising S3 are limited on my side (I'll try to add some here); however, I have used it against my own endpoints and it seems to be functional.
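For completeness, the zip case can be exercised the same way. A minimal sketch of producing a consolidated zip store with zarr-python 2.x (the file and array names are placeholders):

import zarr
import numpy as np

# ZipStore writes the whole dataset into a single zip archive.
store = zarr.ZipStore('test-zip.zip', mode='w')
root = zarr.group(store=store)
root.attrs['Description'] = 'Consolidated zarr test (zip)'
root.array('myarray', np.arange(90, dtype='i4').reshape(6, 15), chunks=(6, 15))
zarr.consolidate_metadata(store)  # writes .zmetadata into the archive
store.close()

The archive should then be readable through the zip mode flag, e.g. something like ncdump 'file://test-zip.zip#mode=zarr,zip'.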

@WardF
Member

WardF commented Dec 17, 2024

@DennisHeimbigner Failures in the code preventing compilation aside, I'd be interested in your thoughts on this, particularly in advance of our scheduled conversation with @mannreis and Flo re: consolidated metadata. Thanks!

Member
@WardF left a comment

@mannreis I will take a look at the compilation failures in the next couple of days and pitch in where I can. I'm going to convert this to a draft PR for the time being, until we have the compilation and tests passing. Thanks!

WardF marked this pull request as draft December 17, 2024 18:12