Design/Datamodel Decision: Very large datasets #143
Thinking out loud here: Even if we are able to solve this with a custom index, is there a way to encode this into a zarr store so that we can open it with a vanilla xarray `open_dataset` call, or do we need to provide a dggs-specific opener?
Also thinking out loud: maybe we could store lat-lon bounding boxes computed for each chunk or shard in a separate Zarr array. The decoding step would construct a custom index that loads only the bounding box array in memory and creates a lazy (dask-backed?) coordinate variable for cell ids. Data selection would first look at the bounding boxes and then only load the appropriate cell id chunks before doing the full lookup. This would only work for data selection with lat-lon coordinates or polygons, not selection with cell ids directly.
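The chunk-level bounding box lookup described above could be sketched roughly like this (a pure-numpy sketch; `cell_to_latlon` is a stand-in for a real DGGS cell-to-coordinate conversion, and all function names are hypothetical):

```python
import numpy as np

def cell_to_latlon(cell_ids):
    # Stand-in for a real DGGS cell -> (lat, lon) conversion.
    cell_ids = np.asarray(cell_ids)
    return cell_ids % 180 - 90.0, cell_ids % 360 - 180.0

def chunk_bboxes(cell_ids, chunk_size):
    # One (lat_min, lat_max, lon_min, lon_max) row per chunk; this
    # small array is what would be stored as a separate Zarr array.
    boxes = []
    for start in range(0, len(cell_ids), chunk_size):
        lat, lon = cell_to_latlon(cell_ids[start:start + chunk_size])
        boxes.append((lat.min(), lat.max(), lon.min(), lon.max()))
    return np.asarray(boxes)

def candidate_chunks(boxes, lat_range, lon_range):
    # Chunks whose bounding box intersects the query rectangle; only
    # these need their cell id chunks loaded for the full lookup.
    (lat0, lat1), (lon0, lon1) = lat_range, lon_range
    hit = (
        (boxes[:, 0] <= lat1) & (boxes[:, 1] >= lat0)
        & (boxes[:, 2] <= lon1) & (boxes[:, 3] >= lon0)
    )
    return np.flatnonzero(hit)

cell_ids = np.arange(1000)
boxes = chunk_bboxes(cell_ids, chunk_size=100)
hits = candidate_chunks(boxes, lat_range=(-10, 10), lon_range=(0, 20))
```

Only the tiny `boxes` array has to live in memory; the cell id chunks themselves stay lazy until a chunk's box overlaps the query.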
The reason why the cell ids are loaded into memory is that we create a standard `pandas.Index` for them. For serialization I'm less worried: if you keep the (lazy) `cell_ids` coordinate without an index, it should be possible to write it back without loading everything into memory.
There might be other ways of compressing cell id data, assuming that the data consists of a contiguous coverage of some region of the globe, thus probably represented as a contiguous range of DGGS cell ids at the same level, although the details may vary from one system to another.
For healpix, that's what MOCs do nowadays, apparently (specifically, these are sets of (merged) ranges of max-level cell ids), and it might be possible to do something similar for other DGGSs.
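A minimal sketch of this kind of range compression (MOC-style merged ranges over sorted cell ids; function names are made up for illustration):

```python
import numpy as np

def to_ranges(cell_ids):
    """Compress sorted cell ids into half-open (start, stop) ranges."""
    cell_ids = np.sort(np.asarray(cell_ids))
    # positions where the sequence of ids is not consecutive
    breaks = np.flatnonzero(np.diff(cell_ids) != 1)
    starts = np.concatenate([[cell_ids[0]], cell_ids[breaks + 1]])
    stops = np.concatenate([cell_ids[breaks], [cell_ids[-1]]]) + 1
    return np.stack([starts, stops], axis=1)

def from_ranges(ranges):
    """Expand (start, stop) ranges back to explicit cell ids."""
    return np.concatenate([np.arange(a, b) for a, b in ranges])

ids = np.array([3, 4, 5, 10, 11, 40])
ranges = to_ranges(ids)
```

For a contiguous regional coverage this collapses millions of ids into a handful of range pairs, which is what makes keeping them in memory cheap.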
I really think that should be a strong priority? And I apologize in advance if I'm trivializing this, but I think that figuring out this sort of thing would really boost this package's usefulness. I am thinking especially about eventually opening a full hierarchy of zooms in a single zarr (as a datatree, for instance)?
The bounding box approach is probably a naive & sub-optimal one, but it would work in the general case (i.e., making no assumption about the DGGS used or the kind of cell coverage). I think it should be reasonably easy to try implementing it. I might want to take a stab at putting together a proof-of-concept in the coming 2-3 weeks if I find the time (along with using Xarray CoordinateTransform for lat/lon coordinates). I'm not familiar with MOC; it is probably a better approach than the bounding box one when using Healpix. I'm not sure how it could be generalized to support other DGGSs in xdggs, though. IIUC it seems very dependent on how cell ids are spatially ordered.
One simple solution for global data we often use is to not have a coordinate at all. Of course, this won't be applicable to very highly resolved regional data, but maybe it's still simple enough to give it a thought.
As we go for healpix/zarr, the lazy-loading coordinate option feels more attractive to me. Then, when we do a selection, only the required coordinate chunks would need to be loaded.
What about using `dask.dataframe` instead of `pandas.Index`?
The problem with not having a coordinate is that we cannot perform "label-based" (thus here grid-aware or spatial-aware) selection reusing Xarray's API, so for the special global case we would need a separate function or method to do that, which is not ideal IMO. Having lazy coordinates with a custom Xarray index seems doable to me and would nicely scale up to the global case at fine refinement levels. At least at the dataset opening and decoding stages, as keeping those coordinates lazy through multiple operations (selection or other) might be trickier.
Here is what @d70-t just explained to me as a proposed workflow:
Our question is:
We already support most of this workflow in xdggs via a custom Xarray index.
The way we currently think of this at the healpix CF standardization process is as follows: there's some mechanism (in this case healpix, but it could be any kind of DGGS) which relates positions indicated by some lat/lon coordinates to cell index numbers. These cell index numbers are then conceptually located on a "cell" dimension.

This "cell" dimension may be an actual dimension in the dataset, meaning that every position from 0...N is densely filled with either some data or an indicator for missing values (i.e. it's a dense array). I think this should be possible with any kind of DGGS, but it might be quite inefficient if there are large gaps in the indexing space.

Another option is to not store that data along the "cell" dimension, but instead as a sparse array: we just don't reserve any space for the gaps, but we remember where the data would have been, using a 1d array carrying all the index positions. CF-Conventions formalize this as "compression by gathering", and to my understanding this is almost exactly what (the data values of) `cell_ids` are.

And the third option could be to not store all indices, but rather consecutive ranges. In CF language, this might be another way of compression, and in xdggs language, it might be what MOCs tend to do.

In my head, these three methods would correspond to three different kinds of "indices" in xarray language.
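The three storage options can be illustrated on a toy coverage (a numpy sketch, not any actual CF or xdggs encoding):

```python
import numpy as np

# Toy coverage: 6 occupied cells out of an indexing space of 12.
n_cells = 12
present = np.array([0, 1, 2, 7, 8, 11])   # occupied cell ids
values = np.array([10., 11., 12., 17., 18., 21.])

# Option 1 - dense: one slot per possible cell id, gaps as NaN.
dense = np.full(n_cells, np.nan)
dense[present] = values

# Option 2 - "compression by gathering": the data plus a 1d array
# of index positions (what an explicit cell_ids coordinate stores).
gathered_values, gathered_index = values, present

# Option 3 - ranges: consecutive runs as half-open (start, stop)
# pairs, the MOC-like representation.
ranges = np.array([[0, 3], [7, 9], [11, 12]])
expanded = np.concatenate([np.arange(a, b) for a, b in ranges])
```

Option 1 trades memory for direct positional lookup, option 2 is compact but needs the full index array for lookups, and option 3 is the smallest whenever the coverage is mostly contiguous.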
Thanks @d70-t for the explanation and for referencing the Healpix CF discussion, which I hadn't followed. Regarding your comment cf-convention/cf-conventions#433 (comment), I think that Xarray will simply ignore dimensions defined in a netCDF file that are not used by any of the variables (I'd need to check, though). Note that I'm not sure that other formats like Zarr support unused dimensions, so the "compression by gathering" formal representation for CF Healpix might not be easy to reuse in, e.g., other specs like geozarr (if the latter eventually supports DGGS).

Healpix MOC seems similar - if not equivalent - to H3's "compacted cells" and S2's (s2geometry) "cell union" or "region coverer". If we translate these three methods in terms of dimensions, we would have something like:
My understanding is that there are two kinds of compression (2 and 3) that are non-mutually exclusive. In a serialized format like netCDF or Zarr, cell ids would ideally be stored in one of these compressed forms.

For global datasets, it would still be nice to have a DGGSIndex, as it is used to propagate DGGS info through Xarray operations. Since the Xarray model requires that an index be associated with at least one coordinate, we'd need to create a coordinate for it.
I checked. Xarray and zarr will drop unused dimensions; netCDF supports them. I also think this is not a problem for a DGGS, as we don't learn a lot from the existence of the empty dimension. We know it should exist from the compression metadata, and we know its size from the DGGS metadata, and there's not really anything more an empty dimension would provide. So in practical terms, I'd guess we can live with the dimension being dropped.
Looks like CF's tie point index mapping could be a nice way to represent cell ids compressed with ranges, as suggested in cf-convention/cf-conventions#433 (comment). This may look like:
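This is not the exact CF tie point syntax, but the idea of storing only the cell ids at run edges plus the positions they map to can be sketched as follows (all array and function names hypothetical):

```python
import numpy as np

# Two contiguous runs of cell ids, described only by their edges:
# run 1: positions 0..49  hold cell ids 100..149
# run 2: positions 50..69 hold cell ids 300..319
tie_point_cell_ids = np.array([100, 149, 300, 319])
tie_point_index = np.array([0, 49, 50, 69])

def expand(tp_values, tp_index):
    """Reconstruct the full cell_ids coordinate, one run at a time."""
    out = np.empty(tp_index[-1] + 1, dtype=np.int64)
    runs = zip(
        zip(tp_index[::2], tp_index[1::2]),     # (first, last) positions
        zip(tp_values[::2], tp_values[1::2]),   # (first, last) cell ids
    )
    for (i0, i1), (v0, v1) in runs:
        out[i0:i1 + 1] = np.arange(v0, v1 + 1)
    return out

cell_ids = expand(tie_point_cell_ids, tie_point_index)
```

Four stored integers stand in for seventy cell ids here; the ratio only improves as the runs get longer.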
My proposal for
Yes, I think this is a good way to go.
Agreed, this is what is already proposed for global datasets. Actually, the three cases (global, small regional and large regional) would all yield the same user-facing dataset, including a `DGGSIndex`.
Documenting this here for permanence (already brought up during the km-scale-hackathon).
The current implementation basically faces a hard scaling stop determined by the client's machine, since we rely on a labelled xarray coordinate (`cell_ids`), which will by default be loaded into memory.
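To put rough numbers on this: for healpix the number of cells at refinement level `zoom` is `12 * 4**zoom`, so an eagerly loaded `uint64` `cell_ids` coordinate grows very quickly (back-of-the-envelope only):

```python
# Memory cost of an in-memory healpix cell_ids coordinate:
# npix = 12 * 4**zoom, at 8 bytes per uint64 cell id.
for zoom in (8, 10, 12, 15):
    npix = 12 * 4**zoom
    print(f"zoom={zoom}: {npix} cells, {npix * 8 / 2**30:.2f} GiB")
```

At zoom 15 this is already about 96 GiB for the coordinate alone, far beyond typical laptop RAM.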
Here is a small example with public data that will blow up my laptop at higher zoom levels, since the cell dimension itself (if labelled) will be larger than my system memory:
even if I assign a coordinate without an index (see @benbovy's comment in pydata/xarray#1650 (comment)):
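A minimal self-contained illustration of assigning a coordinate without an index (dummy data, not the original example): a coordinate whose name differs from its dimension gets no index by default in xarray.

```python
import numpy as np
import xarray as xr

# Dummy stand-in dataset; the real example used public healpix data.
ds = xr.Dataset({"temperature": ("cells", np.zeros(12))})

# "cell_ids" lives on dim "cells", so no index is built eagerly.
ds = ds.assign_coords(cell_ids=("cells", np.arange(12)))

print("cell_ids" in ds.indexes)
```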
upon applying `xdggs.decode(ds)`, the coordinates will be loaded into memory. I am not quite sure what the options are here, but we should probably treat this case as a general scenario, with the eventual goal of being able to do something like this without issue: