Per-chunk metadata (e.g., bbox) #4
Comments
It could also be defined at the shard level...
Maybe we can have @joshmoore's input here?
If images have different spatial extents, I think it would make more sense to store them as distinct arrays, rather than as chunks of the same array.
I want to close this as out of scope. Zarr does not allow per-chunk metadata, and we are not making any Zarr extensions here. Therefore, we need to find a different solution to this use case. The obvious one to me is to just store images with different bboxes in separate arrays.
I think that at this stage this is still a good place (or better, a new issue) to discuss if/how in general we can facilitate spatial indexing and/or partitioning of large datasets, even if that would require multiple Zarr arrays (groups?) or some kind of Zarr extension.
I might be missing something, but this should be possible today without the need for per-chunk metadata. As long as you have something like the geotransform, so that you know where the "origin" pixel is and the spacing between pixels, plus the size of each chunk, you should be able to get the bbox of each chunk with a bit of math. This is exactly how GDAL / COG handle reading a single block out of a larger COG, just using multiple files / chunks. Perhaps it isn't safe to assume that every chunk of this dataset is on the same grid / projection, but in that case I'd recommend storing them in separate arrays.
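To make that "bit of math" concrete, here is a minimal sketch of deriving a chunk's bbox from the grid origin, pixel size, and chunk layout. It assumes a north-up, axis-aligned grid and regular chunking; all names are illustrative rather than taken from any Zarr or GeoZarr spec.

```python
# Sketch: derive a chunk's bbox from a geotransform plus the chunk layout.
# Assumes a north-up grid: x increases with columns, y decreases with rows.

def chunk_bbox(origin_x, origin_y, pixel_size_x, pixel_size_y,
               chunk_shape, chunk_index, array_shape):
    """Return (min_x, min_y, max_x, max_y) for one chunk of a 2D array.

    origin_x / origin_y : world coordinates of the top-left corner of pixel (0, 0)
    pixel_size_x / _y   : pixel width and height (both positive here)
    chunk_shape         : (rows, cols) per chunk
    chunk_index         : (chunk_row, chunk_col)
    array_shape         : (rows, cols) of the full array, to clip edge chunks
    """
    rows_per_chunk, cols_per_chunk = chunk_shape
    chunk_row, chunk_col = chunk_index

    row_start = chunk_row * rows_per_chunk
    col_start = chunk_col * cols_per_chunk
    row_stop = min(row_start + rows_per_chunk, array_shape[0])
    col_stop = min(col_start + cols_per_chunk, array_shape[1])

    min_x = origin_x + col_start * pixel_size_x
    max_x = origin_x + col_stop * pixel_size_x
    max_y = origin_y - row_start * pixel_size_y
    min_y = origin_y - row_stop * pixel_size_y
    return (min_x, min_y, max_x, max_y)


# e.g. 30 m pixels, 512x512 chunks, chunk (2, 3) of a 10980x10980 array
print(chunk_bbox(600000.0, 5300000.0, 30.0, 30.0, (512, 512), (2, 3), (10980, 10980)))
```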
This is interesting. 🤔 So the idea is that you would have an array stack with dimensions …
and then coordinate variables like …
Then you could construct the geotransforms on the fly for the entire collection, create a geodataframe, etc. For the image collections we are talking about, it is safe to assume that the images all have the same …
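A hedged sketch of the geodataframe idea: since the exact dimensions and coordinate variables are elided above, this assumes an image stack with an "image" dimension and per-image bounding-coordinate variables named x_min / y_min / x_max / y_max, which are illustrative rather than from the thread.

```python
# Sketch: build a GeoDataFrame of per-image footprints from an image stack.
# Assumed layout (not from the thread): dims ("image", "y", "x") plus 1-D
# coordinate variables x_min, y_min, x_max, y_max along the "image" dimension.
import geopandas as gpd
import xarray as xr
from shapely.geometry import box


def footprints(ds: xr.Dataset, crs: str) -> gpd.GeoDataFrame:
    geoms = [
        box(xmin, ymin, xmax, ymax)
        for xmin, ymin, xmax, ymax in zip(
            ds["x_min"].values, ds["y_min"].values,
            ds["x_max"].values, ds["y_max"].values,
        )
    ]
    return gpd.GeoDataFrame({"image": ds["image"].values}, geometry=geoms, crs=crs)
```

A GeoDataFrame like this can then be used for spatial indexing of the collection (e.g., `gdf.sindex` queries) without any per-chunk metadata in the Zarr store.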
Sorry for the slow response. I don't have any definitive responses, but ...
Big 👍 for this strategy on this repo, with the caveat that the individual convention efforts (GeoZarr, NGFF, etc.) will likely identify things that need to make it to …
This reminds me somewhat of ome/ngff#138 (which also triggered a discussion in NGFF space about the use of CF conventions...)
@rabernat mentioned in https://twitter.com/rabernat/status/1617209410702696449 the idea of attaching a "GeoBox" (i.e., bbox + CRS + grid metadata) to a dataset, which is implemented in odc-geo and which is indeed useful for indexing.
Now I'm wondering if it would be possible to reconstruct such GeoBox for each chunk of a Zarr array or dataset? This would require storing a bbox per chunk. I'm not very familiar with Zarr specs, though. Is it possible/easy to store arbitrary metadata per chunk?
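One possible answer, sketched under assumptions: Zarr itself has no per-chunk metadata, but because all chunks of an array share one grid, a per-chunk GeoBox can be reconstructed by slicing the dataset-level GeoBox. The snippet below assumes odc-geo's GeoBox(shape, affine, crs) constructor and its pixel-plane slicing; check the odc-geo docs for the exact API in your version.

```python
# Sketch: reconstruct a per-chunk GeoBox by slicing the dataset-level GeoBox.
# Assumes odc-geo's GeoBox(shape, affine, crs) constructor and slice indexing.
from affine import Affine
from odc.geo.geobox import GeoBox

full = GeoBox(
    (10980, 10980),                                      # (rows, cols)
    Affine(30.0, 0.0, 600000.0, 0.0, -30.0, 5300000.0),  # geotransform
    "EPSG:32633",
)


def chunk_geobox(gbox: GeoBox, chunk_shape, chunk_index) -> GeoBox:
    rows, cols = chunk_shape
    i, j = chunk_index
    return gbox[i * rows:(i + 1) * rows, j * cols:(j + 1) * cols]


gbox_2_3 = chunk_geobox(full, (512, 512), (2, 3))
print(gbox_2_3.extent)  # footprint geometry of that chunk in the array's CRS
```

So the GeoBox for any chunk is computable on the fly; what Zarr cannot do is store arbitrary extra metadata alongside each chunk, which is why differing grids/projections point back to separate arrays.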
One potential use case would be scalable (e.g., dask-friendly) implementation of spatial regridding / resampling algorithms that would work with non-trivial datasets (e.g., curvilinear grids).
There is an interesting, somewhat related discussion in the geoarrow-specs repository: geoarrow/geoarrow#19. As far as I understand, geospatial vector datasets are currently partitioned using multiple Parquet files (dask-geopandas Parquet IO, dask-geopandas.read_parquet). For GeoZarr, however, I guess we don't want one Zarr dataset per spatial partition.