Skip to content

Design/Datamodel Decision: Very large datasets #143

Open

jbusecke opened this issue May 14, 2025 · 19 comments

Comments

@jbusecke
Contributor

Documenting this here for permanence (already brought up during the km-scale-hackathon).

The current implementation basically hits a hard scaling limit determined by the client's machine, since we rely on a labelled xarray coordinate ('cell_ids'), which is loaded into memory by default.

Here is a small example with public data that will blow up my laptop at higher zoom levels, since the labels for the cell dimension alone will be larger than my system memory:

import xdggs
import xarray as xr
import numpy as np
import dask.array as dsa
# zoom = 15
zoom = 5
path = f"https://nowake.nicam.jp/files/220m/data_healpix_{zoom}.zarr"
ds = xr.open_dataset(path, engine='zarr', chunks={})

cell_ids = xr.DataArray(dsa.arange(ds.cell.size, chunks=1048576), dims=['cell_ids'], name='cell_ids')
ds = ds.assign_coords({'cell_ids': cell_ids})

ds.cell_ids.attrs = {
    "grid_name": "healpix",
    "level": zoom,
    "indexing_scheme": "nested",
}
ds

Even if I assign a coordinate without an index (see @benbovy's comment in pydata/xarray#1650 (comment)):

import xdggs
import xarray as xr
import numpy as np
import dask.array as dsa
# zoom = 15
zoom = 5
path = f"https://nowake.nicam.jp/files/220m/data_healpix_{zoom}.zarr"
ds = xr.open_dataset(path, engine='zarr', chunks={})

# the coordinate labels are wayyy bigger than my memory! So xarray loading them by default is a non-starter.
# Trying https://github.com/pydata/xarray/issues/1650#issuecomment-1697282386
cell_ids = xr.Coordinates({"cell_ids": ("cell_ids", dsa.arange(ds.cell.size, chunks=(1048576)))}, indexes={})

ds = ds.assign_coords(cell_ids)
ds.cell_ids.attrs = {
    "grid_name": "healpix",
    "level": zoom,
    "indexing_scheme": "nested",
}
ds

Upon applying xdggs.decode(ds), the coordinates will be loaded into memory. I am not quite sure what the options are here, but we should probably treat this as a general scenario, with the eventual goal of being able to do something like this without issue:

  • Open a massively large dggs dataset
  • decode, subset, and plot
@jbusecke
Contributor Author

Thinking out loud here: even if we are able to solve this with a custom index, is there a way to encode this into a zarr store so that we can open it with a vanilla xarray open_dataset command, or do we need to provide a dggs-specific opener?

@benbovy
Member

benbovy commented May 14, 2025

Also thinking out loud: maybe we could store lat-lon bounding boxes computed for each chunk or shard in a separate Zarr array. The decoding step would construct a custom index that loads only the bounding box array into memory and creates a lazy (dask-backed?) coordinate variable for cell ids. Data selection would first look at the bounding boxes and then only load the appropriate cell id chunks before doing the full lookup. This would only work for data selection with lat-lon coordinates or polygons, not for selection with cell ids directly.
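
For illustration, a rough sketch of the per-chunk bounding box computation, assuming healpy for the nested cell-id to lon/lat conversion and a NumPy array of cell ids; the function name, chunking scheme, and the naive min/max boxes (which ignore antimeridian wrapping) are illustrative assumptions, not xdggs API:

import numpy as np
import healpy as hp

def chunk_bounding_boxes(cell_ids, chunk_size, level):
    """Compute one (lon_min, lon_max, lat_min, lat_max) box per chunk of cell ids."""
    nside = 2 ** level
    boxes = []
    for start in range(0, cell_ids.size, chunk_size):
        chunk = cell_ids[start : start + chunk_size]
        # cell centers only; a robust version would use cell vertices and handle
        # boxes that cross the antimeridian
        lon, lat = hp.pix2ang(nside, chunk, nest=True, lonlat=True)
        boxes.append((lon.min(), lon.max(), lat.min(), lat.max()))
    # small (n_chunks, 4) array that could be stored in a separate Zarr array
    return np.array(boxes)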

@keewis
Collaborator

keewis commented May 14, 2025

the reason why the cell ids are loaded into memory is that we create a standard pandas.Index object to allow indexing / alignment. If we want to support larger-than-memory coordinates we'd either need a lazy index (e.g. using dask) or a new index implementation based on compacted cell ids / MOCs. Both options appear to be quite a bit of work.

For serialization I'm less worried: if you keep the (lazy) cell_ids variable and the metadata in whatever form, you can just drop the index without losing information, and thus obtain a dataset that you can round-trip to any format.
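
As a sketch of that round-trip using xarray's public API, assuming ds is a decoded dataset whose cell_ids coordinate currently has an index (the Zarr path is a placeholder):

import xarray as xr

# drop the index but keep the (lazy) cell_ids coordinate variable and its attrs
encoded = ds.drop_indexes("cell_ids")
encoded.to_zarr("decoded_dataset.zarr")  # hypothetical output path

# re-opening gives cell_ids back as a plain (lazy) coordinate without an index;
# the DGGS index can then be rebuilt from it and the grid metadata
roundtripped = xr.open_dataset("decoded_dataset.zarr", engine="zarr", chunks={})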

@benbovy
Member

benbovy commented May 14, 2025

There might be other ways of compressing cell id data, assuming that data consists of a contiguous coverage of some region of the globe, thus probably represented as a contiguous range of dggs cell ids on the same level, although the details may vary from one system to another.

@keewis
Collaborator

keewis commented May 14, 2025

thus probably represented as a contiguous range of dggs cell ids on the same level

for healpix, that's what MOCs do nowadays, apparently (specifically, these are sets of (merged) ranges of max-level cell ids), and it might be possible to do something similar for other DGGSs
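
Not MOC itself, but a tiny sketch of the underlying idea, collapsing sorted same-level cell ids into merged [start, stop) ranges (the function name and layout are just for illustration):

import numpy as np

def to_ranges(cell_ids):
    """Collapse sorted, unique cell ids into merged [start, stop) ranges."""
    cell_ids = np.asarray(cell_ids)
    breaks = np.nonzero(np.diff(cell_ids) != 1)[0]
    starts = np.concatenate([cell_ids[:1], cell_ids[breaks + 1]])
    stops = np.concatenate([cell_ids[breaks] + 1, cell_ids[-1:] + 1])
    return np.stack([starts, stops], axis=1)

to_ranges([3, 4, 5, 9, 10, 42])  # -> [[3, 6], [9, 11], [42, 43]]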

@jbusecke
Contributor Author

If we want to support larger-than-memory coordinates

I really think that should be a strong priority. And I apologize in advance if I am trivializing this, but I think that figuring out this sort of thing would really push this package's usefulness. I am thinking especially about eventually opening a full hierarchy of zooms in a single zarr (as a datatree, for instance)?

@benbovy
Member

benbovy commented May 14, 2025

The bounding box approach is probably a naive & sub-optimal one, but it would work in the general case (i.e., making no assumptions about the DGGS used or the kind of cell coverage). I think it should be reasonably easy to try implementing it. I might take a stab at putting together a proof-of-concept in the coming 2-3 weeks if I find the time (along with using Xarray's CoordinateTransform for lat/lon coordinates).

I'm not familiar with MOCs; they are probably a better approach than the bounding box one when using Healpix. I'm not sure how it could be generalized to support other DGGSs in xdggs, though. IIUC, it seems very dependent on how cell ids are spatially ordered.

@d70-t

d70-t commented May 15, 2025

One simple solution we often use for global data is to not have a coordinate at all.
In those cases, where cell_ids would be equivalent to np.arange(n_cells), there's no real point in listing them individually. Further, on dimensions that don't have a coordinate, xarray's .sel() does the same thing as .isel(). So skipping the coordinate can be a simple option.

Of course, this won't be applicable to very highly resolved regional data, but maybe it's still simple enough to be worth a thought.
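
For concreteness, a sketch of what that looks like for global nested Healpix data, reusing ds and zoom from the examples above and assuming healpy for the lat/lon to cell id conversion (the query points are arbitrary):

import numpy as np
import healpy as hp

# example query points
lon = np.array([5.0, 120.0])
lat = np.array([48.0, -30.0])

# for global nested Healpix data the position along the 'cell' dimension equals
# the cell id, so positional indexing is equivalent to label-based selection
nside = 2 ** zoom
cell_idx = hp.ang2pix(nside, lon, lat, nest=True, lonlat=True)
subset = ds.isel(cell=cell_idx)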

@tinaok
Contributor

tinaok commented May 15, 2025

There might be other ways of compressing cell id data, assuming that data consists of a contiguous coverage of some region of the globe, thus probably represented as a contiguous range of dggs cell ids on the same level, although the details may vary from one system to another

Here you can try with level=15, global data.
I assume that even if you can 'guess' the ranges, loading all the data linked to the guessed cells into memory would still crash.

As we go for healpix/zarr, I find the lazily loaded coordinate option more attractive. Then xdggs.sel_latlon(longitude=full_lon, latitude=full_lat) could be done in dask, which would be good as well?

@tinaok
Contributor

tinaok commented May 15, 2025

the reason why the cell ids are loaded into memory is that we create a standard pandas.Index object to allow indexing / alignment.

What about using dask.dataframe instead of pandas.Index?

@benbovy
Member

benbovy commented May 15, 2025

The problem with not having a coordinate is that we cannot perform "label-based" (here, grid-aware or spatially-aware) selection reusing Xarray's API, so for the special global case we would need a separate function or method to do that, which is not ideal IMO.

Having lazy coordinates with a custom Xarray index seems doable to me and would nicely scale up to the global case at fine refinement levels. At least at the dataset opening and decoding stages, as keeping those coordinates lazy through multiple operations (selection or other) might be trickier.

@tinaok
Contributor

tinaok commented May 15, 2025

Here is what @d70-t just explained to me as a proposed workflow:

  • xdggs checks the dataset (or retrieves the necessary metadata manually) to determine the level and order (nested etc).
  • xdggs receives a user query as a (lat, lon) pair and computes the corresponding healpix_cell_id.
  • xdggs passes this query (via healpix_cell_id) to xarray.
  • In the dataset, the coordinate values might be omitted or lazily loaded, but the dataset still contains enough information to determine the actual data index, using one of three options:
  • Ideally, xarray (or cf-xarray) should be able to support this kind of query — even when the actual coordinate arrays do not exist or are only lazily loaded. Using the dimension structure and the compression-by-gathering metadata (e.g. attributes on dimensions), it should be possible to locate and retrieve the data for the queried healpix_cell_id.

Our questions are:

  • would this be possible?
  • would this be handled more by xdggs, or by cf-xarray? (Maybe 'decoding' the CF-convention part (2., compression by gathering) falls more under the responsibility of cf-xarray rather than xarray itself? And could cf-xarray handle this kind of compression-by-gathering in a memory-efficient way?)

@benbovy
Member

benbovy commented May 15, 2025

We already support most of this workflow in xdggs via an Xarray custom DGGSIndex.

  • for (a): array index == healpix_cell_id is true for Healpix but not necessarily for other DGGSs, I think? For Healpix global datasets, users are free to skip decoding the Dataset if they want and, for example, call xdggs.decode() after selecting a subset of the global data using, e.g., Xarray's isel(). However, if we can support a DGGSIndex with a lazy cell_id coordinate, it won't be more expensive to create than no coordinate at all. We could probably allow the input Dataset given to xdggs.decode() to have no cell id coordinate yet, in which case a lazy coordinate would be created and returned with the output dataset (using the DGGS level info passed with the grid parameters).

  • for (b): I'm not sure I understand how we currently support CF-like compression by gathering in xdggs?

@d70-t

d70-t commented May 15, 2025

I'm not sure I understand how we currently support CF-like compression by gathering in xdggs?

The way we currently think of this in the healpix CF standardization process is as follows: there's some mechanism (in this case healpix, but it could be any kind of DGGS) which relates positions indicated by some lat/lon coordinates to cell index numbers. These cell index numbers are then conceptually located on a "cell" dimension.

This "cell"-dimension may be an actual dimension in the dataset, meaning that every position from 0...N is densely filled with either some data or an indicator for missing values (i.e. it's a dense array). I think this should be possible with any kind of DGGS, but it might be quite inefficient if there are large gaps in the indexing space.

Another option is to not store the data along the "cell" dimension, but instead as a sparse array, i.e. we just don't reserve any space for the gaps, but we remember where the data would have been, using a 1d array carrying all the index positions. The CF conventions formalize this as "compression by gathering", and to my understanding, this is almost exactly what (the data values of) cell_ids in xdggs and the corresponding index are doing.

And the third option could be to not store all indices, but rather consecutive ranges. In CF language, this might be another way of compression, and in xdggs language, it might be what MOCs tend to do.

In my head, these three methods would correspond to three different kinds of "indices" in xarray language.
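
A small illustration of the three representations for the same hypothetical set of occupied cells on a level-1 Healpix grid (the values and cell ids are made up):

import numpy as np

n_cells = 12 * 4 ** 1  # 48 cells at level 1

# (1) dense: data on the full "cell" dimension, gaps filled with missing values
dense = np.full(n_cells, np.nan)
dense[[3, 4, 5, 9]] = [1.0, 2.0, 3.0, 4.0]

# (2) compression by gathering: values plus a 1d array of their cell ids
values = np.array([1.0, 2.0, 3.0, 4.0])
cell_ids = np.array([3, 4, 5, 9])

# (3) consecutive ranges (MOC-like): [start, stop) ranges instead of individual ids
cell_ranges = np.array([[3, 6], [9, 10]])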

@benbovy
Member

benbovy commented May 15, 2025

Thanks @d70-t for the explanation and for referencing the Healpix CF discussion, which I hadn't followed.

Regarding your comment cf-convention/cf-conventions#433 (comment), I think that Xarray will simply ignore dimensions defined in a netCDF file that are not used by any of the variables (I'd need to check, though). Note that I'm not sure that other formats like Zarr support unused dimensions, so the "compression by gathering" formal representation for CF Healpix might not be easy to reuse in, e.g., other specs like geozarr (if the latter eventually supports DGGS).

Healpix MOC seems similar - if not equivalent - to H3's "compacted cells" and S2's (s2geometry) "cell union" or "region coverer".

If we translate these three methods in terms of dimensions we would have something like:

  1. cell: dimension for the whole globe at a given refinement level, unused in the case of compressed non-global datasets
  2. data_cell: actual dimension used by the data variables of compressed non-global datasets
  3. compacted_cell: dimension used to store the compressed cell ids for non-global datasets (Healpix MOC, H3 compacted cells, S2 region coverer, etc.)

My understanding is that there are two kinds of compression (2 and 3), which are not mutually exclusive.

In a serialized format like netCDF or Zarr, cell ids would ideally be stored as a data_cell(compacted_cell) variable. For global datasets a cell(compacted_cell) variable is not necessary IIUC.

The data_cell(compacted_cell) variable is not very useful to have in an Xarray Dataset, though. Ideally the xdggs.decode() step would drop that variable and return the "uncompressed" coordinate data_cell(data_cell) instead, which is more user-friendly. data_cell(data_cell) is a lazy coordinate associated with a DGGSIndex, which internally holds the compressed cell id data and which will compute uncompressed cell ids on-demand (via Xarray API like .load(), .compute(), .sel(), .isel(), etc.).

For global datasets, it would still be nice to have a DGGSIndex as it is used to propagate DGGS info through Xarray operations. Since the Xarray model requires that an index be associated with at least one coordinate, we'd need to create a cell(cell) coordinate, which for Healpix could just be dask.array.arange(ncells) (cheap).

@d70-t

d70-t commented May 15, 2025

I checked: Xarray and Zarr will drop unused dimensions; netCDF supports them.

I also think this is not a problem for a DGGS, as we don't learn a lot from the existence of the empty dimension. We know it should exist from the compression metadata, we know its size from the DGGS metadata, and there's not really anything more an empty dimension would provide. So in practical terms, I'd guess we can live with the dimension being dropped.

@benbovy
Member

benbovy commented May 23, 2025

Looks like CF's tie point index mapping could be a nice way to represent cell ids compressed with ranges, as suggested in cf-convention/cf-conventions#433 (comment). This may look like:

   dimensions:
     time = 100;
     cell = 15234;
     tp_cell = 44;

   variables:
     float time(time) ;
       time.units = "days since 2024-09-04" ;
     float temp(time, cell) ;
       temp:standard_name = "air_temperature" ;
       temp:units = "K" ;
       temp:grid_mapping = "crs: cell" ;
     char l_interpolation ;
       l_interpolation:interpolation_name = "linear" ;
       l_interpolation:tie_point_mapping = "cell: cell_indices tp_cell"  ;
     int cell(tp_cell) ;
     int cell_indices(tp_cell) ;
     int crs ;
       crs:grid_mapping_name = "healpix" ;
       crs:refinement_level = 7 ;
       crs:healpix_order = "nested" ;

    data:
      cell_indices = 0, 20, 21, 25, 26, ... ;
      cell = 1114, 1134, 2564, 2568, 6581, ... ;
      ...

My proposal for xdggs.HealpixIndex (let's see later if we can apply that to other DGGSs):

  • global dataset

    • no pre-existing cell index coordinate
    • API: a factory (class) method HealpixIndex.full(level: int, order: str = "nested") combined with xarray.Coordinates.from_xindex()
    • HealpixIndex wraps an xarray.indexes.PandasIndex that itself wraps a pandas.RangeIndex, which is memory saving (see the sketch after this list)
  • small regional dataset (currently supported by xdggs)

    • pre-existing cell(cell) dimension coordinate
    • API: ds.set_xindex("cell", HealpixIndex, ...)
    • HealpixIndex wraps an xarray.indexes.PandasIndex that itself wraps a basic pandas.Index
  • large regional dataset

    • pre-existing tie-point coordinates cell(tp_cell) and cell_indices(tp_cell)
    • API: ds.set_xindex(["cell", "cell_indices"], HealpixIndex, ...)
    • Reusing pandas, one possible implementation would be to have a "top-level" pandas.IntervalIndex object and a list of pandas.RangeIndex objects (one for each interval). Or alternatively use some custom structure implemented in backend libraries...
    • For convenience, reuse Xarray's coordinate transforms to associate the new HealpixIndex object with a cell(cell) virtual / lazy dimension coordinate, instead of propagating the tie point coordinates, which are not very useful for users (we can drop these and, if needed, provide some API to re-generate them, e.g., from the wrapped pandas indexes)
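
To make the global case concrete, a minimal sketch using a plain xarray.indexes.PandasIndex in place of the proposed HealpixIndex.full() (which doesn't exist yet); the point is that a pandas.RangeIndex labels every cell without materializing the ids in memory:

import pandas as pd
import xarray as xr
from xarray.indexes import PandasIndex

level = 7
n_cells = 12 * 4 ** level  # number of Healpix cells at this refinement level

# a RangeIndex-backed index on the "cell" dimension; the RangeIndex itself is tiny
index = PandasIndex(pd.RangeIndex(n_cells), dim="cell")
coords = xr.Coordinates.from_xindex(index)
# ds = ds.assign_coords(coords)  # attach to an opened global dataset (hypothetical)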

@d70-t

d70-t commented May 23, 2025

Yes, I think this is a good way to go.
As far as CF standardization goes, I guess the two non-global options should be fine already (they rely on constructs that already exist in the conventions). I'm not yet sure if we'll have the first one (global), as this would currently require implying a cell coordinate of the form range(0, N) (which of course no one would ever materialize, but dataset usage should behave as if it existed). I think we should still promote this implicit coordinate, because it simplifies a common case quite a lot. But if it isn't accepted, we can still use the large regional dataset method to cover global data.

@benbovy
Member

benbovy commented May 23, 2025

I think we should still promote this implicit coordinate, because it simplifies a common case quite a lot.

Agreed, this is what is proposed already for global datasets.

Actually, the three global, small regional and large regional cases would all yield the same user-facing dataset including a cell(cell) dimension coordinate associated with a HealpixIndex. The difference between each of these cases is what's under the hood, which users don't really need to care about.
