Conversation

@neilSchroeder
Contributor

@neilSchroeder neilSchroeder commented Oct 31, 2025

Checklist

How zarr.py Handles Zarr V2 Stores

ZarrParser should now support both Zarr V2 and V3 stores by normalizing V2 stores to appear as V3. This approach ensures that all parsers produce V3-compatible outputs, and confines modifications to zarr.py.

V2 → V3 Normalization Strategy

The parser performs a two-part normalization:

1. Chunk Key Mapping (get_chunk_mapping_prefix)

For V2 arrays:

  • Chunk files are stored directly under the array path: array_name/0, array_name/0.1.2
  • Metadata files (.zarray, .zattrs, etc.) are filtered out
  • Chunk coordinates are normalized to dot-separated format: "0.1.2"
  • File paths in the manifest point to the actual V2 chunk locations
  • Manifest keys contain only chunk coordinates (no path structure)
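For reviewers who want to see the key mapping concretely, here is a minimal sketch of the V2 step described above. It is not the code in zarr.py: `build_v2_chunk_mapping` and `V2_METADATA_FILES` are hypothetical names, and the store listing is just a list of strings.

```python
# Hypothetical sketch, not VirtualiZarr's implementation.
# Names of V2 metadata files that must not be treated as chunks.
V2_METADATA_FILES = {".zarray", ".zattrs", ".zgroup", ".zmetadata"}


def build_v2_chunk_mapping(
    array_path: str, listed_keys: list[str], separator: str = "."
) -> dict[str, str]:
    """Map normalized chunk coordinates ("0.1.2") to the real V2 chunk paths."""
    prefix = array_path.rstrip("/") + "/" if array_path else ""
    mapping: dict[str, str] = {}
    for key in listed_keys:
        if not key.startswith(prefix):
            continue  # belongs to a different array
        relative = key[len(prefix):]
        if relative in V2_METADATA_FILES:
            continue  # metadata file, not a chunk
        parts = relative.split(separator)
        if not all(part.isdigit() for part in parts):
            continue  # anything that is not a pure chunk coordinate
        # Manifest key: coordinates only, always dot-separated.
        # Manifest value: the untouched V2 location of the chunk.
        mapping[".".join(parts)] = key
    return mapping


keys = ["temp/.zarray", "temp/.zattrs", "temp/0.0", "temp/0.1"]
chunk_map = build_v2_chunk_mapping("temp", keys)
# chunk_map == {"0.0": "temp/0.0", "0.1": "temp/0.1"}
```

The `separator` argument is there because V2 arrays can declare "/" as their dimension separator; the manifest key is dot-separated either way.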

2. Metadata Conversion (get_metadata())

After converting V2 metadata to V3 using _convert_array_metadata, we have to replace the chunk_key_encoding with DefaultChunkKeyEncoding, for the following reasons:

  • The automatic converter preserves V2ChunkKeyEncoding in the V3 metadata
  • When zarr/xarray sees V2ChunkKeyEncoding, it requests chunks using V2-style paths: array/0
  • With DefaultChunkKeyEncoding, zarr requests chunks using V3-style paths: array/c/0
  • ManifestStore.get() expects V3-style paths and uses parse_manifest_index() to extract chunk coordinates
  • parse_manifest_index() requires the /c/ component to correctly parse the path
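To illustrate why the encoding has to be swapped, here is a small pure-Python sketch of the two key styles and of a parsing step that depends on the /c/ component. The function names are hypothetical, and `manifest_index_from_v3_path` is only a stand-in for what parse_manifest_index() does, not a copy of it.

```python
# Hypothetical sketch of the two chunk key styles; no zarr import needed.
def v2_chunk_key(coords: tuple[int, ...], separator: str = ".") -> str:
    """V2-style key: coordinates joined directly, e.g. "0.1.2"."""
    return separator.join(str(c) for c in coords)


def v3_default_chunk_key(coords: tuple[int, ...], separator: str = "/") -> str:
    """V3 default-style key: a "c" prefix, then the coordinates, e.g. "c/0/1/2"."""
    return separator.join(["c", *(str(c) for c in coords)])


def manifest_index_from_v3_path(path: str) -> str:
    """Stand-in parser: return the dot-separated coordinates found after "/c/"."""
    _, marker, tail = path.rpartition("/c/")
    if not marker:
        raise ValueError(f"no '/c/' component in chunk path: {path!r}")
    return ".".join(tail.split("/"))


v3_path = "array/" + v3_default_chunk_key((0, 1, 2))  # "array/c/0/1/2"
index = manifest_index_from_v3_path(v3_path)          # "0.1.2"
```

A V2-style request such as "array/0" has no /c/ component, so a parser of this shape cannot locate the coordinates in it. That is the failure the encoding swap avoids.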

Additional metadata handling

  • None fill values: Converted to appropriate dtype defaults
  • Dimension names: Extracted from _ARRAY_DIMENSIONS attribute or generated as {array_name}_dim_{i}
  • All other metadata: Converted using zarr's standard V2→V3 migration utilities
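A rough sketch of the first two rules above, for illustration only. The helper names are hypothetical, and the fill-value table is an assumption about what "appropriate dtype defaults" could mean; the real defaults in zarr.py may differ.

```python
# Hypothetical sketch; helper names and the defaults table are assumptions.
# Assumed defaults keyed by NumPy-style dtype kind (bool, int, uint, float).
ASSUMED_FILL_DEFAULTS = {"b": False, "i": 0, "u": 0, "f": 0.0}


def fill_value_or_default(fill_value, dtype_kind: str):
    """Keep an explicit fill value; replace None with the assumed dtype default."""
    if fill_value is not None:
        return fill_value
    return ASSUMED_FILL_DEFAULTS[dtype_kind]


def resolve_dimension_names(array_name: str, attrs: dict, ndim: int) -> list[str]:
    """Use _ARRAY_DIMENSIONS when present, otherwise generate {array_name}_dim_{i}."""
    names = attrs.get("_ARRAY_DIMENSIONS")
    if names is not None and len(names) == ndim:
        return list(names)
    return [f"{array_name}_dim_{i}" for i in range(ndim)]


named = resolve_dimension_names("temp", {"_ARRAY_DIMENSIONS": ["time", "lat"]}, 2)
generated = resolve_dimension_names("temp", {}, 2)
# named == ["time", "lat"]; generated == ["temp_dim_0", "temp_dim_1"]
```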

Implementation Notes

I'm not convinced I've done a particularly elegant implementation here, but adding another class for V2 parsing didn't seem like it would be particularly extensible. Very happy to hear thoughts on perhaps a better implementation.

@TomNicholas thank you very much for your feedback, it definitely helped me wrap my head around the right approach to take here.

Edit: I've done a bit of re-design to use a strategy pattern for dispatching to parsing v2 and v3 arrays. This should make future integrations of zarr array version parsing a lot more maintainable. This is also just a lot easier to read than my original implementation. Tests and documentation are also up to date.
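For anyone skimming, the strategy-pattern dispatch looks roughly like the sketch below. Only get_strategy and get_chunk_mapping are names that appear in the PR diff; the classes, the `FakeArray` stand-in and the registry dict are hypothetical and simplified.

```python
# Hypothetical sketch of version dispatch; not the code merged in zarr.py.
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class FakeArray:
    """Stand-in for a zarr array: just a format version and the store's keys."""
    zarr_format: int
    keys: list[str]


class ChunkMappingStrategy(ABC):
    @abstractmethod
    async def get_chunk_mapping(self, zarr_array: FakeArray, path: str) -> dict[str, str]:
        """Return {dot-separated chunk coordinates: actual chunk location}."""


class V2Strategy(ChunkMappingStrategy):
    async def get_chunk_mapping(self, zarr_array, path):
        prefix = f"{path}/"
        # Chunks sit directly under the array path; dot-files are metadata.
        return {
            key[len(prefix):]: key
            for key in zarr_array.keys
            if key.startswith(prefix) and not key[len(prefix):].startswith(".")
        }


class V3Strategy(ChunkMappingStrategy):
    async def get_chunk_mapping(self, zarr_array, path):
        prefix = f"{path}/c/"
        # Chunks sit under "<array>/c/"; normalize "0/1" to "0.1".
        return {
            key[len(prefix):].replace("/", "."): key
            for key in zarr_array.keys
            if key.startswith(prefix)
        }


_STRATEGIES = {2: V2Strategy, 3: V3Strategy}


def get_strategy(zarr_array: FakeArray) -> ChunkMappingStrategy:
    """Pick the parsing strategy from the array's format version."""
    try:
        return _STRATEGIES[zarr_array.zarr_format]()
    except KeyError:
        raise NotImplementedError(
            f"unsupported zarr_format: {zarr_array.zarr_format}"
        ) from None


v2_array = FakeArray(zarr_format=2, keys=["a/.zarray", "a/0.0", "a/0.1"])
v2_map = asyncio.run(get_strategy(v2_array).get_chunk_mapping(v2_array, "a"))
# v2_map == {"0.0": "a/0.0", "0.1": "a/0.1"}
```

With this shape, supporting another format version means adding one class and one registry entry, without touching the caller.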

@neilSchroeder
Contributor Author

neilSchroeder commented Oct 31, 2025

@TomNicholas thank you very much for your feedback up there, definitely helped me wrap my head around the right approach to take here.

Edit: I've done a bit of re-design to use a strategy pattern for dispatching to parsing v2 and v3 arrays. This should make future integrations of zarr array version parsing a lot more maintainable. This is also just a lot easier to read than my original implementation. Tests and such are also up to date. I'm also going to move this into the PR notes instead of leaving a huge comment here.

@TomNicholas
Member

Let me know when you would like a review of this @neilSchroeder !

@neilSchroeder neilSchroeder marked this pull request as ready for review November 3, 2025 22:22
@neilSchroeder
Contributor Author

@TomNicholas I think it's ready for a review.

Member

@TomNicholas TomNicholas left a comment

Thanks for working on this @neilSchroeder ! I mostly have a bunch of small gripes 😁

@codecov

codecov bot commented Nov 6, 2025

Codecov Report

❌ Patch coverage is 99.13793% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 88.31%. Comparing base (cb2912e) to head (a4a271f).
⚠️ Report is 1 commits behind head on main.

Files with missing lines        Patch %   Lines
virtualizarr/parsers/zarr.py    99.13%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #822      +/-   ##
==========================================
+ Coverage   87.71%   88.31%   +0.60%     
==========================================
  Files          35       35              
  Lines        1880     1968      +88     
==========================================
+ Hits         1649     1738      +89     
+ Misses        231      230       -1     
Files with missing lines        Coverage Δ
virtualizarr/parsers/zarr.py    99.33% <99.13%> (+2.55%) ⬆️

@neilSchroeder
Contributor Author

@TomNicholas I'm ready for another review whenever you've got time

Member

@TomNicholas TomNicholas left a comment

This looks great, thank you so much @neilSchroeder !

strategy = get_strategy(zarr_array)
chunk_map = await strategy.get_chunk_mapping(zarr_array, path)

if not chunk_map:
Member

Actually ignore me, I think what you've done here is good.

I'm a little unclear about order of operations and whether or not these scenarios are realistic.

These scenarios are definitely plausible.

Or maybe handled differently?

There might be a way to refactor to have a few fewer levels of functions, but this is good.

@neilSchroeder
Contributor Author

@TomNicholas what are the next steps here? Will this be merged whenever someone has time to do the next release? Do we need another reviewer?

@TomNicholas TomNicholas merged commit acb0bb6 into zarr-developers:main Nov 11, 2025
13 checks passed
@neilSchroeder neilSchroeder deleted the parse-zarr-v2 branch November 11, 2025 23:03

Development

Successfully merging this pull request may close these issues.

Virtualize Native Zarr V2 format

2 participants