Skip to content

Design/Datamodel Decision: Very large datasets #143

Open

jbusecke opened this issue May 14, 2025 · 19 comments

Comments

@jbusecke
Contributor

Documenting this here for permanence (already brought up during the km-scale-hackathon).

The current implementation basically hits a hard scaling limit determined by the client's machine, since we rely on a labelled xarray coordinate ('cell_ids'), which is loaded into memory by default.

Here is a small example with public data that will blow up my laptop at higher zoom levels, since the labels for the cell dimension alone will be larger than my system memory:

import xdggs
import xarray as xr
import numpy as np
import dask.array as dsa
# zoom = 15
zoom = 5
path = f"https://nowake.nicam.jp/files/220m/data_healpix_{zoom}.zarr"
ds = xr.open_dataset(path, engine='zarr', chunks={})

cell_ids = xr.DataArray(dsa.arange(ds.cell.size, chunks=1048576), dims=['cell_ids'], name='cell_ids')
ds = ds.assign_coords({'cell_ids': cell_ids})

ds.cell_ids.attrs = {
    "grid_name": "healpix",
    "level": zoom,
    "indexing_scheme": "nested",
}
ds

Even if I assign a coordinate without an index (see @benbovy's comment in pydata/xarray#1650 (comment)):

import xdggs
import xarray as xr
import numpy as np
import dask.array as dsa
# zoom = 15
zoom = 5
path = f"https://nowake.nicam.jp/files/220m/data_healpix_{zoom}.zarr"
ds = xr.open_dataset(path, engine='zarr', chunks={})

# the coordinate labels are wayyy bigger than my memory! So xarray loading them by default is a non-starter.
# Trying https://github.com/pydata/xarray/issues/1650#issuecomment-1697282386
cell_ids = xr.Coordinates({"cell_ids": ("cell_ids", dsa.arange(ds.cell.size, chunks=(1048576)))}, indexes={})

ds = ds.assign_coords(cell_ids)
ds.cell_ids.attrs = {
    "grid_name": "healpix",
    "level": zoom,
    "indexing_scheme": "nested",
}
ds

Upon applying xdggs.decode(ds), the coordinates will be loaded into memory. I am not quite sure what the options are here, but we should probably treat this as a general scenario, with the eventual goal of being able to do something like this without issue:

  • Open a massively large dggs dataset
  • decode, subset, and plot
@jbusecke
Contributor Author

Thinking out loud here: even if we are able to solve this with a custom index, is there a way to encode this into a zarr store so that we can open it with a vanilla xarray open_dataset command, or do we need to provide a dggs-specific opener?

@benbovy
Member

benbovy commented May 14, 2025

Also thinking out loud: maybe we could store lat-lon bounding boxes computed for each chunk or shard in a separate Zarr array. The decoding step would construct a custom index that loads only the bounding box array into memory and creates a lazy (dask-backed?) coordinate variable for cell ids. Data selection would first look at the bounding boxes and then only load the appropriate cell id chunks before doing the full lookup. This would only work for data selection with lat-lon coordinates or polygons, not for selection with cell ids directly.
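
For illustration, a rough sketch of the per-chunk bounding box computation, assuming healpy for the nested cell-id to lon/lat conversion and a NumPy array of cell ids; the function name, chunking scheme, and the naive min/max boxes (which ignore antimeridian wrapping) are illustrative assumptions, not xdggs API:

import numpy as np
import healpy as hp

def chunk_bounding_boxes(cell_ids, chunk_size, level):
    """Compute one (lon_min, lon_max, lat_min, lat_max) box per chunk of cell ids."""
    nside = 2 ** level
    boxes = []
    for start in range(0, cell_ids.size, chunk_size):
        chunk = cell_ids[start : start + chunk_size]
        # cell centers only; a robust version would use cell vertices and handle
        # boxes that cross the antimeridian
        lon, lat = hp.pix2ang(nside, chunk, nest=True, lonlat=True)
        boxes.append((lon.min(), lon.max(), lat.min(), lat.max()))
    # small (n_chunks, 4) array that could be stored in a separate Zarr array
    return np.array(boxes)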

@keewis
Collaborator

keewis commented May 14, 2025

the reason why the cell ids are loaded into memory is that we create a standard pandas.Index object to allow indexing / alignment. If we want to support larger-than-memory coordinates we'd either need a lazy index (e.g. using dask) or a new index implementation based on compacted cell ids / MOCs. Both options appear to be quite a bit of work.

For serialization I'm less worried: if you keep the (lazy) cell_ids variable and the metadata in whatever form, you can just drop the index without losing information, and thus obtain a dataset that you can round-trip to any format.
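
As a sketch of that round-trip using xarray's public API, assuming ds is a decoded dataset whose cell_ids coordinate currently has an index (the Zarr path is a placeholder):

import xarray as xr

# drop the index but keep the (lazy) cell_ids coordinate variable and its attrs
encoded = ds.drop_indexes("cell_ids")
encoded.to_zarr("decoded_dataset.zarr")  # hypothetical output path

# re-opening gives cell_ids back as a plain (lazy) coordinate without an index;
# the DGGS index can then be rebuilt from it and the grid metadata
roundtripped = xr.open_dataset("decoded_dataset.zarr", engine="zarr", chunks={})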

@benbovy
Member

benbovy commented May 14, 2025

There might be other ways of compressing cell id data, assuming that data consists of a contiguous coverage of some region of the globe, thus probably represented as a contiguous range of dggs cell ids on the same level, although the details may vary from one system to another.

@keewis
Collaborator

keewis commented May 14, 2025

thus probably represented as a contiguous range of dggs cell ids on the same level

for healpix, that's what MOCs do nowadays, apparently (specifically, these are sets of (merged) ranges of max-level cell ids), and it might be possible to do something similar for other DGGSs
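
Not MOC itself, but a tiny sketch of the underlying idea, collapsing sorted same-level cell ids into merged [start, stop) ranges (the function name and layout are just for illustration):

import numpy as np

def to_ranges(cell_ids):
    """Collapse sorted, unique cell ids into merged [start, stop) ranges."""
    cell_ids = np.asarray(cell_ids)
    breaks = np.nonzero(np.diff(cell_ids) != 1)[0]
    starts = np.concatenate([cell_ids[:1], cell_ids[breaks + 1]])
    stops = np.concatenate([cell_ids[breaks] + 1, cell_ids[-1:] + 1])
    return np.stack([starts, stops], axis=1)

to_ranges([3, 4, 5, 9, 10, 42])  # -> [[3, 6], [9, 11], [42, 43]]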

@jbusecke
Contributor Author

If we want to support larger-than-memory coordinates

I really think that should be a strong priority. And I apologize in advance if I am trivializing this, but I think that figuring out this sort of thing would really push this package's usefulness. I am thinking especially about eventually opening a full hierarchy of zooms in a single zarr (as a datatree, for instance)?

@benbovy
Member

benbovy commented May 14, 2025

The bounding box approach is probably a naive & sub-optimal one, but it would work in the general case (i.e., making no assumptions about the DGGS used or the kind of cell coverage). I think it should be reasonably easy to try implementing it. I might take a stab at putting together a proof-of-concept in the coming 2-3 weeks if I find the time (along with using Xarray's CoordinateTransform for lat/lon coordinates).

I'm not familiar with MOCs; they are probably a better approach than the bounding box one when using Healpix. I'm not sure how it could be generalized to support other DGGSs in xdggs, though. IIUC, it seems very dependent on how cell ids are spatially ordered.

@d70-t

d70-t commented May 15, 2025

One simple solution we often use for global data is to not have a coordinate at all.
In those cases, where cell_ids would be equivalent to np.arange(n_cells), there's no real point in listing them individually. Further, on dimensions that don't have a coordinate, xarray's .sel() does the same thing as .isel(). So skipping the coordinate can be a simple option.

Of course, this won't be applicable to very highly resolved regional data, but maybe it's still simple enough to be worth a thought.
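
For concreteness, a sketch of what that looks like for global nested Healpix data, reusing ds and zoom from the examples above and assuming healpy for the lat/lon to cell id conversion (the query points are arbitrary):

import numpy as np
import healpy as hp

# example query points
lon = np.array([5.0, 120.0])
lat = np.array([48.0, -30.0])

# for global nested Healpix data the position along the 'cell' dimension equals
# the cell id, so positional indexing is equivalent to label-based selection
nside = 2 ** zoom
cell_idx = hp.ang2pix(nside, lon, lat, nest=True, lonlat=True)
subset = ds.isel(cell=cell_idx)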

@tinaok
Contributor

tinaok commented May 15, 2025

There might be other ways of compressing cell id data, assuming that data consists of a contiguous coverage of some region of the globe, thus probably represented as a contiguous range of dggs cell ids on the same level, although the details may vary from one system to another

Here you can try with level=15, global data.
I assume that even if you can 'guess' the ranges, loading all the data linked to the guessed cells into memory would still crash.

As we go for healpix/zarr, I find the lazily loaded coordinate option more attractive. Then xdggs.sel_latlon(longitude=full_lon, latitude=full_lat) could be done in dask, which would be good as well?

@tinaok
Contributor

tinaok commented May 15, 2025

the reason why the cell ids are loaded into memory is that we create a standard pandas.Index object to allow indexing / alignment.

What about using dask.dataframe instead of pandas.Index?

@benbovy
Member

benbovy commented May 15, 2025

The problem with not having a coordinate is that we cannot perform "label-based" (here, grid-aware or spatially-aware) selection reusing Xarray's API, so for the special global case we would need a separate function or method to do that, which is not ideal IMO.

Having lazy coordinates with a custom Xarray index seems doable to me and would nicely scale up to the global case at fine refinement levels. At least at the dataset opening and decoding stages, as keeping those coordinates lazy through multiple operations (selection or other) might be trickier.

@tinaok
Contributor

tinaok commented May 15, 2025

Here is what @d70-t just explained to me as a proposed workflow:

  • xdggs checks the dataset (or retrieves the necessary metadata manually) to determine the level and order (nested etc).
  • xdggs receives a user query as a (lat, lon) pair and computes the corresponding healpix_cell_id.
  • xdggs passes this query (via healpix_cell_id) to xarray.
  • In the dataset, the coordinate values might be omitted or lazily loaded, but the dataset still contains enough information to determine the actual data index, using one of three options:
  • Ideally, xarray (or cf-xarray) should be able to support this kind of query — even when the actual coordinate arrays do not exist or are only lazily loaded. Using the dimension structure and the compression-by-gathering metadata (e.g. attributes on dimensions), it should be possible to locate and retrieve the data for the queried healpix_cell_id.

Our questions are:

  • would this be possible?
  • would this be handled more by xdggs, or by cf-xarray? (Maybe 'decoding' the CF-convention part (2., compression by gathering) falls more under the responsibility of cf-xarray rather than xarray itself? And could cf-xarray handle this kind of compression-by-gathering in a memory-efficient way?)

@benbovy
Member

benbovy commented May 15, 2025

We already support most of this workflow in xdggs via an Xarray custom DGGSIndex.

  • for (a): array index == healpix_cell_id is true for Healpix but not necessarily for other DGGSs, I think? For Healpix global datasets, users are free to skip decoding the Dataset if they want and, for example, call xdggs.decode() after selecting a subset of the global data using, e.g., Xarray's isel(). However, if we can support a DGGSIndex with a lazy cell_id coordinate, it won't be more expensive to create than no coordinate at all. We could probably allow the input Dataset given to xdggs.decode() to have no cell id coordinate yet, in which case a lazy coordinate would be created and returned with the output dataset (using the DGGS level info passed with the grid parameters).

  • for (b): I'm not sure I understand how we currently support CF-like compression by gathering in xdggs?

@d70-t

d70-t commented May 15, 2025

I'm not sure I understand how we currently support CF-like compression by gathering in xdggs?

The way we currently think of this in the healpix CF standardization process is as follows: there's some mechanism (in this case healpix, but it could be any kind of DGGS) which relates positions indicated by some lat/lon coordinates to cell index numbers. These cell index numbers are then conceptually located on a "cell" dimension.

This "cell"-dimension may be an actual dimension in the dataset, meaning that every position from 0...N is densely filled with either some data or an indicator for missing values (i.e. it's a dense array). I think this should be possible with any kind of DGGS, but it might be quite inefficient if there are large gaps in the indexing space.

Another option is to not store the data along the "cell" dimension, but instead as a sparse array, i.e. we just don't reserve any space for the gaps, but we remember where the data would have been, using a 1d array carrying all the index positions. The CF conventions formalize this as "compression by gathering", and to my understanding, this is almost exactly what (the data values of) cell_ids in xdggs and the corresponding index are doing.

And the third option could be to not store all indices, but rather consecutive ranges. In CF language, this might be another way of compression, and in xdggs language, it might be what MOCs tend to do.

In my head, these three methods would correspond to three different kinds of "indices" in xarray language.
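
A small illustration of the three representations for the same hypothetical set of occupied cells on a level-1 Healpix grid (the values and cell ids are made up):

import numpy as np

n_cells = 12 * 4 ** 1  # 48 cells at level 1

# (1) dense: data on the full "cell" dimension, gaps filled with missing values
dense = np.full(n_cells, np.nan)
dense[[3, 4, 5, 9]] = [1.0, 2.0, 3.0, 4.0]

# (2) compression by gathering: values plus a 1d array of their cell ids
values = np.array([1.0, 2.0, 3.0, 4.0])
cell_ids = np.array([3, 4, 5, 9])

# (3) consecutive ranges (MOC-like): [start, stop) ranges instead of individual ids
cell_ranges = np.array([[3, 6], [9, 10]])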

@benbovy
Member

benbovy commented May 15, 2025

Thanks @d70-t for the explanation and for referencing the Healpix CF discussion, which I hadn't followed.

Regarding your comment cf-convention/cf-conventions#433 (comment), I think that Xarray will simply ignore dimensions defined in a netCDF file that are not used by any of the variables (I'd need to check, though). Note that I'm not sure that other formats like Zarr support unused dimensions, so the "compression by gathering" formal representation for CF Healpix might not be easy to reuse in, e.g., other specs like geozarr (if the latter eventually supports DGGS).

Healpix MOC seems similar - if not equivalent - to H3's "compacted cells" and S2's (s2geometry) "cell union" or "region coverer".

If we translate these three methods in terms of dimensions we would have something like:

  1. cell: dimension for the whole globe at a given refinement level, unused in the case of compressed non-global datasets
  2. data_cell: actual dimension used by the data variables of compressed non-global datasets
  3. compacted_cell: dimension used to store the compressed cell ids for non-global datasets (Healpix MOC, H3 compacted cells, S2 region coverer, etc.)

My understanding is that there are two kinds of compression (2 and 3), which are not mutually exclusive.

In a serialized format like netCDF or Zarr, cell ids would ideally be stored as a data_cell(compacted_cell) variable. For global datasets a cell(compacted_cell) variable is not necessary IIUC.

The data_cell(compacted_cell) variable is not very useful to have in an Xarray Dataset, though. Ideally the xdggs.decode() step would drop that variable and return the "uncompressed" coordinate data_cell(data_cell) instead, which is more user-friendly. data_cell(data_cell) is a lazy coordinate associated with a DGGSIndex, which internally holds the compressed cell id data and which will compute uncompressed cell ids on-demand (via Xarray API like .load(), .compute(), .sel(), .isel(), etc.).

For global datasets, it would still be nice to have a DGGSIndex as it is used to propagate DGGS info through Xarray operations. Since the Xarray model requires that an index be associated with at least one coordinate, we'd need to create a cell(cell) coordinate, which for Healpix could just be dask.array.arange(ncells) (cheap).

@d70-t

d70-t commented May 15, 2025

I checked: Xarray and Zarr will drop unused dimensions; netCDF supports them.

I also think this is not a problem for a DGGS, as we don't learn a lot from the existence of the empty dimension. We know it should exist from the compression metadata, we know its size from the DGGS metadata, and there's not really anything more an empty dimension would provide. So in practical terms, I'd guess we can live with the dimension being dropped.

@benbovy
Member

benbovy commented May 23, 2025

Looks like CF's tie point index mapping could be a nice way to represent cell ids compressed with ranges, as suggested in cf-convention/cf-conventions#433 (comment). This may look like:

   dimensions:
     time = 100;
     cell = 15234;
     tp_cell = 44;

   variables:
     float time(time) ;
       time.units = "days since 2024-09-04" ;
     float temp(time, cell) ;
       temp:standard_name = "air_temperature" ;
       temp:units = "K" ;
       temp:grid_mapping = "crs: cell" ;
     char l_interpolation ;
       l_interpolation:interpolation_name = "linear" ;
       l_interpolation:tie_point_mapping = "cell: cell_indices tp_cell"  ;
     int cell(tp_cell) ;
     int cell_indices(tp_cell) ;
     int crs ;
       crs:grid_mapping_name = "healpix" ;
       crs:refinement_level = 7 ;
       crs:healpix_order = "nested" ;

    data:
      cell_indices = 0, 20, 21, 25, 26, ... ;
      cell = 1114, 1134, 2564, 2568, 6581, ... ;
      ...

My proposal for xdggs.HealpixIndex (let's see later if we can apply that to other DGGSs):

  • global dataset

    • no pre-existing cell index coordinate
    • API: a factory (class) method HealpixIndex.full(level: int, order: str = "nested") combined with xarray.Coordinates.from_xindex()
    • HealpixIndex wraps an xarray.indexes.PandasIndex that itself wraps a pandas.RangeIndex, which is memory saving (see the sketch after this list)
  • small regional dataset (currently supported by xdggs)

    • pre-existing cell(cell) dimension coordinate
    • API: ds.set_xindex("cell", HealpixIndex, ...)
    • HealpixIndex wraps an xarray.indexes.PandasIndex that itself wraps a basic pandas.Index
  • large regional dataset

    • pre-existing tie-point coordinates cell(tp_cell) and cell_indices(tp_cell)
    • API: ds.set_xindex(["cell", "cell_indices"], HealpixIndex, ...)
    • Reusing pandas, one possible implementation would be to have a "top-level" pandas.IntervalIndex object and a list of pandas.RangeIndex objects (one for each interval). Or alternatively use some custom structure implemented in backend libraries...
    • For convenience, reuse Xarray's coordinate transforms to associate the new HealpixIndex object with a cell(cell) virtual / lazy dimension coordinate, instead of propagating the tie point coordinates, which are not very useful for users (we can drop these and, if needed, provide some API to re-generate them, e.g., from the wrapped pandas indexes)
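
To make the global case concrete, a minimal sketch using a plain xarray.indexes.PandasIndex in place of the proposed HealpixIndex.full() (which doesn't exist yet); the point is that a pandas.RangeIndex labels every cell without materializing the ids in memory:

import pandas as pd
import xarray as xr
from xarray.indexes import PandasIndex

level = 7
n_cells = 12 * 4 ** level  # number of Healpix cells at this refinement level

# a RangeIndex-backed index on the "cell" dimension; the RangeIndex itself is tiny
index = PandasIndex(pd.RangeIndex(n_cells), dim="cell")
coords = xr.Coordinates.from_xindex(index)
# ds = ds.assign_coords(coords)  # attach to an opened global dataset (hypothetical)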

@d70-t

d70-t commented May 23, 2025

Yes, I think this is a good way to go.
As far as CF standardization goes, I guess the two non-global options should be fine already (they rely on constructs that already exist in the conventions). I'm not yet sure if we'll have the first one (global), as this would currently require implying a cell coordinate of the form range(0, N) (which of course no one would ever materialize, but dataset usage should behave as if it existed). I think we should still promote this implicit coordinate, because it simplifies a common case quite a lot. But if it isn't accepted, we can still use the large regional dataset method to cover global data.

@benbovy
Member

benbovy commented May 23, 2025

I think we should still promote this implicit coordinate, because it simplifies a common case quite a lot.

Agreed, this is what is proposed already for global datasets.

Actually, the three global, small regional and large regional cases would all yield the same user-facing dataset including a cell(cell) dimension coordinate associated with a HealpixIndex. The difference between each of these cases is what's under the hood, which users don't really need to care about.
