Zarr Sprint Topics #33
As a Zarr Sprint track focused on enabling support for V3 in Zarr-Python we are joining an ongoing effort working toward Zarr-Python version 3.0 (roadmap).
The skills needed to complete this task are moderate to advanced familiarity with Python and Zarr.
A Zarr Sprint track focused on geospatial multi-scales / pyramids. Our focus is on identifying and addressing shortcomings of the ndpyramid utility, either through development in that library or by deciding where else development would need to happen.
We anticipate that this sprint will progress development towards geospatial pyramids in Zarr that can be used broadly for dynamic client visualization approaches, tiling servers, and multi-scale analysis. This will serve data providers, front-end developers, and researchers. This will be achieved when:
Some potentially more attainable goals for the short sprint:
The skills needed to complete this task are moderate familiarity with Python and Zarr. We would especially encourage participation from those familiar with geospatial projections, multiscale representations, and metadata conventions. We expect the level of difficulty to be medium.
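For readers new to the idea, a multiscale pyramid is a stack of progressively coarser copies of the same array. The following is a minimal sketch under stated assumptions: it uses plain Python lists and 2x2 block averaging, and the names `coarsen_2x` and `build_pyramid` are hypothetical. It is not ndpyramid's API, which operates on Xarray objects.

```python
from statistics import fmean


def coarsen_2x(grid):
    """Halve each dimension by averaging non-overlapping 2x2 blocks."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            fmean([grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1]])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def build_pyramid(grid, levels):
    """Return a list of `levels` grids: full resolution, then 1/2, 1/4, ..."""
    pyramid = [grid]
    for _ in range(levels - 1):
        pyramid.append(coarsen_2x(pyramid[-1]))
    return pyramid


# A 4x4 base level holding the values 0..15.
base = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
pyramid = build_pyramid(base, levels=3)
shapes = [(len(level), len(level[0])) for level in pyramid]
```

In a Zarr store, each level would typically be written as its own array inside a group, with metadata describing how the levels relate; that metadata convention is exactly what this track aims to settle.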
Zarr Linked Hierarchy for HTTP-enabled Browsing
Focus and Outcomes
Our focus is on achieving the ability to explore nested Zarr groups over HTTP or other stores that do not provide a LIST-style operation. This will enable
More Context
Zarr is not a file format; it is a specification for how to organize a nested hierarchy of numerical arrays and metadata in storage. In order to explore the contents of a Zarr hierarchy, clients generally need the ability to list the contents of directories in the storage layer. For filesystems or s3-compatible object storage, this is straightforward. However, most cloud-native geospatial data formats provide first-class read-only support via a vanilla HTTP protocol. To address this need, Zarr V2 implemented a somewhat hacky “consolidated metadata” approach, in which all the metadata from a hierarchy are condensed into a single json file. This approach does not scale to very large, deeply nested Zarr hierarchies. Now that Zarr V3 has been ratified, there is an opportunity to develop an extension that supports this HTTP-browsing use case in a more scalable and robust way. Specifically, we imagine developing a STAC-like mechanism for explicit links between parent and child groups that allow an HTTP client to quickly traverse a Zarr hierarchy.
Requirements
Non-goals:
Implementation Plan and Skills Needed
We will try to implement this capability in zarr-python on the V3 branch. Contributors should be intermediate Python programmers (understand best practices around Python objects, typing, and code structure). Familiarity with the Zarr code base is not required but helpful. Participants should review the V3 roadmap and design document. I'm also open to implementing this first in a javascript library, rather than Python. For example, in the source.coop viewers package.
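To make the STAC-like linking idea in this proposal concrete, here is a minimal sketch under stated assumptions. The `links` attribute, its `rel`/`href` fields, and the `http_get` and `walk` helpers are all hypothetical; no such extension exists in the Zarr V3 specification today. An in-memory dict stands in for an HTTP server that supports GET but not LIST, and the array metadata is trimmed to the fields the traversal needs.

```python
import json

# Stand-in for a read-only HTTP store: keys can be fetched but not listed.
# ASSUMPTION: the "links" attribute is invented for this sketch only.
STORE = {
    "zarr.json": json.dumps({
        "zarr_format": 3,
        "node_type": "group",
        "attributes": {"links": [
            {"rel": "child", "href": "temperature"},
            {"rel": "child", "href": "ocean"},
        ]},
    }).encode(),
    "temperature/zarr.json": json.dumps({
        "zarr_format": 3, "node_type": "array",
    }).encode(),
    "ocean/zarr.json": json.dumps({
        "zarr_format": 3,
        "node_type": "group",
        "attributes": {"links": [{"rel": "child", "href": "salinity"}]},
    }).encode(),
    "ocean/salinity/zarr.json": json.dumps({
        "zarr_format": 3, "node_type": "array",
    }).encode(),
}


def http_get(key):
    """Hypothetical stand-in for an HTTP GET of a single object."""
    return STORE[key]


def walk(prefix=""):
    """Yield (path, node_type) by following child links, one GET per node."""
    key = f"{prefix}/zarr.json" if prefix else "zarr.json"
    meta = json.loads(http_get(key))
    yield (prefix or "/", meta["node_type"])
    for link in meta.get("attributes", {}).get("links", []):
        if link.get("rel") == "child":
            child = f"{prefix}/{link['href']}" if prefix else link["href"]
            yield from walk(child)


nodes = list(walk())
```

Unlike consolidated metadata, each request here fetches only one node's metadata, so a client can stop descending as soon as it finds the group it needs.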
Along the lines of my comment above, I have a concrete proposal that could be fun for someone (like @kylebarron 😉) to work on. Zarr Python's V3 store interface is being redesigned to provide an all-async interface. The idea we have been discussing is to write a store on top of the Rust Object-Store crate. There are already Python bindings for this project but they are not async ready. If this particular plan is successful, it is possible this could become the core store in the zarr-python project. |
@jhamman, been reading https://docs.rs/pyo3-asyncio/latest/pyo3_asyncio/index.html very carefully. Given I already did this once in rfsspec, I am prepared to give it a go on top of object-store. rfsspec showed marginal benefits, so while it may be worthwhile, do not expect a big return for the probably substantial effort. Note that using rust async (tokio) in python async (asyncio) requires two event loops on two threads; it isn't simple! We also want to enable dask-style access from multiple (python) threads, so... Also, python
(I would be interested in this, because a rust-only zarr and kerchunk solution is of general interest to those that need a C-level API; however, if we don't use numpy as the storage, and we don't have numcodecs directly, it may raise more questions than it answers; cf https://github.com/sci-rs/zarr ).
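The threading concern in this comment can be illustrated from the Python side alone. The sketch below assumes nothing about Rust or tokio: it runs one asyncio event loop on a dedicated thread and lets several worker threads submit coroutines to it, which is the dask-style access pattern mentioned above. `LoopThread` and `fetch` are hypothetical names, and `fetch` only simulates I/O.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


class LoopThread:
    """Own one asyncio event loop running on a dedicated background thread."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro):
        """Submit a coroutine from any thread and block until it finishes."""
        return asyncio.run_coroutine_threadsafe(coro, self.loop).result()

    def close(self):
        """Stop the loop, wait for its thread to exit, then release the loop."""
        self.loop.call_soon_threadsafe(self.loop.stop)
        self._thread.join()
        self.loop.close()


async def fetch(key):
    """Hypothetical async store read; the sleep stands in for network I/O."""
    await asyncio.sleep(0.01)
    return f"bytes-for-{key}".encode()


runner = LoopThread()
keys = ["c/0", "c/1", "c/2"]
# Several synchronous worker threads share the single background loop.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda k: runner.run(fetch(k)), keys))
runner.close()
```

A Rust-backed store would add a second runtime behind `fetch`, which is where the two-loops-on-two-threads complexity described in the comment comes in.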
FWIW the next version of pyo3 is likely to have big progress in async handling, and it sounds like it might no longer need two event loops? PyO3/pyo3#1632 (comment)
Why are they annoying? Is it because the memory is Python-allocated instead of Rust-allocated? |
Yes, but also the internal immutability guarantee makes it hard to hand memory to and from rust without copying. In rfsspec, I already wrote code around the python buffer protocol to cope with this, which appears to work but sidesteps rust's memory protections.
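The immutability point can be seen from pure Python, without any Rust involved. This small sketch uses only the standard `memoryview` type: a view over `bytes` is read-only, while a view over `bytearray` is writable and can be sliced without copying. The variable names are illustrative only.

```python
immutable = bytes(range(8))      # Python guarantees this never changes
mutable = bytearray(range(8))    # same contents, but writable in place

ro_view = memoryview(immutable)  # buffer-protocol view, flagged read-only
rw_view = memoryview(mutable)    # buffer-protocol view, writable

# Slicing a memoryview shares the underlying memory instead of copying it.
window = rw_view[2:6]
window[0] = 255                  # writes through to `mutable`

# Writing through the read-only view is rejected by the interpreter.
write_rejected = False
try:
    ro_view[0] = 1
except TypeError:
    write_rejected = True
```

Native code that receives the buffer behind `ro_view` gets a raw pointer, so nothing but convention stops it from writing into memory that Python treats as immutable.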
An additional v3 sprint topic idea, this one aimed at @TomNicholas. Manifest storage transformer. Specific goals for this sprint could be to:
In the Zarr pyramids breakout group, Thomas Maschler and I discussed the motivations for following the OGC TileMatrixSet 2.0 specification within the GeoZarr specification, which will be shared as a new issue to supersede #30. We also discussed reading those TMS into rio-tiler using Xarray and started a refactor of ndpyramid to support the TMS specification. |
Zarr-Python post-sprint update
Thanks all! |
I added two example scripts for interactive GeoZarr in QGIS.
In the "chunk manifest / virtual concatenation" group our main outcome was a long technical discussion, which I've written up in ZEP-like form here zarr-developers/zarr-specs#288 (comment) |
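As a rough illustration of the chunk manifest idea, the sketch below assumes a manifest that maps each chunk key to a file path, byte offset, and length, so that chunks can be read in place from an existing file. The layout and the `read_chunk` helper are assumptions made for this example and are not taken from the linked write-up.

```python
import os
import tempfile


def read_chunk(manifest, key):
    """Read one chunk's bytes from the location recorded in the manifest."""
    entry = manifest[key]
    with open(entry["path"], "rb") as f:
        f.seek(entry["offset"])
        return f.read(entry["length"])


with tempfile.TemporaryDirectory() as tmp:
    # A pre-existing file: a 6-byte header followed by two 4-byte chunks.
    path = os.path.join(tmp, "legacy.bin")
    with open(path, "wb") as f:
        f.write(b"HEADER" + b"AAAA" + b"BBBB")

    # ASSUMPTION: hypothetical manifest layout, chunk key -> byte range.
    manifest = {
        "0.0": {"path": path, "offset": 6, "length": 4},
        "0.1": {"path": path, "offset": 10, "length": 4},
    }
    chunks = {key: read_chunk(manifest, key) for key in manifest}
```

Virtual concatenation then amounts to merging manifests from several files under a shared chunk grid, without rewriting any of the underlying bytes.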
Per our discussions in the bi-weekly GeoZarr SWG meeting, we identified a few focus tracks for the Zarr sprint coming up on February 7/8th, 2024. In addition, I reviewed the original brainstorming ideas first discussed a year ago, documented at https://hackmd.io/t2DWpX1iQEWMKx1Fi4Px7A?both#Let’s-brainstorm. Many of these ideas are captured by the proposed list we discussed on January 24th. The topic of bidirectional interoperability with GDAL is another clear theme, although as we discussed at the last SWG meeting, this would be very difficult to tackle in a single sprint and, more importantly, we may not have someone to lead it. Nevertheless, I am listing it as an option to see if we can identify folks in the community to lead.
Here are the topics I have narrowed down to:
Here is the proposed template that I ask the folks who are tagged as leading the tracks above to complete and share below.
Topic leaders, if you can fill in the above template by Monday January 29th, then as a community we can provide ranked responses by Wednesday January 31st.