allow ReferenceFileSystem to hold dicts, which are treated as JSON files #1562

Merged (3 commits) on May 30, 2024


Contributor

@bendichter bendichter commented Apr 4, 2024

Currently, for a ReferenceFileSystem to hold an inline JSON file, the reference value must be a JSON-encoded string, e.g.

{
  "version": 1,
  "refs": {
    ".zgroup": "{\"zarr_format\": 2}",
    "data/.zarray": "{\"chunks\": [100, 100], \"compressor\": null, \"dtype\": \"<i8\", \"fill_value\": null, \"filters\": null, \"order\": \"C\", \"shape\": [100, 100], \"zarr_format\": 2}",
    "data/.zattrs": "{\"_ARRAY_DIMENSIONS\": [\"a\", \"b\"]}",
    "data/0.0": [
      "example4.h5",
      2048,
      80000
    ]
  }
}
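
For context, here is a minimal sketch of how a string-valued reference set like the one above might be assembled in Python. It uses only the standard library; the variable names are illustrative and are not taken from this PR or from fsspec.

```python
import json

# Zarr metadata as ordinary Python dicts (values copied from the example above).
zgroup = {"zarr_format": 2}
zattrs = {"_ARRAY_DIMENSIONS": ["a", "b"]}

# Each inline metadata file has to be stored as a JSON *string*,
# so every dict is serialized before it goes into "refs".
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps(zgroup),
        "data/.zattrs": json.dumps(zattrs),
        # Chunk reference: [path, offset, length] into the source file.
        "data/0.0": ["example4.h5", 2048, 80000],
    },
}

# Serializing the whole reference set escapes the inner quotes a second time,
# which is what produces the backslash-heavy text shown above.
text = json.dumps(refs, indent=2)
```

The double encoding is the part that makes the file awkward to read and edit by hand.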

The proposed change allows these values to be dicts instead of JSON strings, so the same RFS could be written as:

{
  "version": 1,
  "refs": {
    ".zgroup": {
      "zarr_format": 2
    },
    "data/.zarray": {
      "chunks": [
        100,
        100
      ],
      "compressor": null,
      "dtype": "<i8",
      "fill_value": null,
      "filters": null,
      "order": "C",
      "shape": [
        100,
        100
      ],
      "zarr_format": 2
    },
    "data/.zattrs": {
      "_ARRAY_DIMENSIONS": [
        "a",
        "b"
      ]
    },
    "data/0.0": [
      "example4.h5",
      2048,
      80000
    ]
  }
}

This makes the references easier to read, write, and manipulate, and easier to query with JSON-specific search tools.
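
As an illustration of the relationship between the two forms, the sketch below converts an existing string-valued reference set into the dict-valued layout. `inline_json_to_dicts` and `ZARR_METADATA_NAMES` are hypothetical names, not part of fsspec or this PR, and the helper assumes that only the three Zarr v2 metadata file names should be parsed.

```python
import json

# Assumption: only these Zarr v2 metadata files are inline JSON worth parsing.
ZARR_METADATA_NAMES = {".zgroup", ".zarray", ".zattrs"}


def inline_json_to_dicts(reference_set):
    """Return a copy of a version-1 reference set with Zarr metadata parsed into dicts."""
    converted = {}
    for key, value in reference_set["refs"].items():
        name = key.rsplit("/", 1)[-1]  # last path component, e.g. ".zattrs"
        if name in ZARR_METADATA_NAMES and isinstance(value, str):
            value = json.loads(value)
        converted[key] = value
    return {**reference_set, "refs": converted}


string_form = {
    "version": 1,
    "refs": {
        ".zgroup": "{\"zarr_format\": 2}",
        "data/.zattrs": "{\"_ARRAY_DIMENSIONS\": [\"a\", \"b\"]}",
        "data/0.0": ["example4.h5", 2048, 80000],
    },
}

dict_form = inline_json_to_dicts(string_form)
# Pretty-printing now yields nested JSON rather than escaped strings.
print(json.dumps(dict_form, indent=2))
```

Chunk references such as `data/0.0` pass through unchanged, since they are lists rather than strings.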

@bendichter
Contributor Author

This would clearly benefit from tests, but I just wanted to see if there was interest in supporting this feature before continuing.

@rly
Contributor

rly commented Apr 4, 2024

The Neurodata Without Borders project is moving toward using Zarr and ReferenceFileSystem for accessing large-scale neurophysiology data stored in the cloud and locally. This feature would make reading/inspecting, writing, editing, and querying these data much easier.

@martindurant
Member

I am fine with this, but a test would be nice. Since it only activates for the small JSON metadata files within a zarr dataset, the cost at runtime should be minimal. I suppose this JSON-as-dict representation doesn't survive loading into ReferenceFileSystem and saving again; preserving it on save could be a valid option you might want to provide.
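
The round-trip point can be illustrated with a minimal sketch. This is not the merged implementation; `normalize_refs` is a hypothetical helper that assumes dict values are serialized to JSON strings when the references are loaded.

```python
import json


def normalize_refs(refs):
    """Serialize dict-valued references to JSON strings; leave other values alone."""
    return {
        key: json.dumps(value) if isinstance(value, dict) else value
        for key, value in refs.items()
    }


original = {
    ".zgroup": {"zarr_format": 2},
    "data/0.0": ["example4.h5", 2048, 80000],
}
loaded = normalize_refs(original)

# After loading, the entry is an ordinary JSON string, indistinguishable from
# one that was written as a string in the first place.
saved = json.dumps({"version": 1, "refs": loaded})
```

Under that assumption, writing the references back out yields the string form, so keeping the dict form would need an explicit option at save time.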

Would love to hear more about the Neurodata Without Borders use case.

add tests json-as-dict for RFS ver0 and ver1
@magland

magland commented Apr 4, 2024

> Would love to hear more about the Neurodata Without Borders use case.

@martindurant This is the early-stage project that uses reference file systems (.zarr.json) for NWB. NWB is traditionally built on HDF5, but there are advantages to using Zarr as the backend and to using the kerchunk approach to read data chunks from existing files on DANDI.

@martindurant
Member

Sorry I forgot about this! Looks good.


4 participants