Computer simulations of the Earth’s climate and weather generate huge amounts of data. These data are often persisted on HPC systems or in the cloud, spread across multiple data assets in a variety of formats (netCDF, Zarr, etc.). Finding, investigating, and loading these data assets into compute-ready data containers costs time and effort. Before loading and analyzing a specific data set, the user needs to know which data sets are available and which attributes describe each one.
Loading these assets into data array containers such as xarray can be a daunting task because of the large number of files a user may be interested in. Intake-esm aims to address these issues by providing the functionality needed for search, discovery, and data access/loading.
`intake-esm` is a data cataloging utility built on top of intake, pandas, and xarray, and it's pretty awesome!
- Opening an ESM collection definition file: An ESM (Earth System Model) collection file is a JSON file that conforms to the ESM Collection Specification. When provided a link/path to an ESM collection file, `intake-esm` establishes a link to a database (CSV file) that contains data asset locations and associated metadata (i.e., which experiment and model they come from). The collection JSON file can be stored on a local filesystem or hosted on a remote server.

  ```python
  In [1]: import intake

  In [2]: col_url = "https://gist.githubusercontent.com/andersy005/7f416e57acd8319b20fc2b88d129d2b8/raw/987b4b336d1a8a4f9abec95c23eed3bd7c63c80e/pangeo-gcp-subset.json"

  In [3]: col = intake.open_esm_datastore(col_url)

  In [4]: col
  Out[4]: <pangeo-cmip6 catalog with 4287 dataset(s) from 282905 asset(s)>
  ```
- Search and Discovery: `intake-esm` provides functionality to execute queries against the catalog:

  ```python
  In [5]: col_subset = col.search(
     ...:     experiment_id=["historical", "ssp585"],
     ...:     table_id="Oyr",
     ...:     variable_id="o2",
     ...:     grid_label="gn",
     ...: )

  In [6]: col_subset
  Out[6]: <pangeo-cmip6 catalog with 18 dataset(s) from 138 asset(s)>
  ```
- Access: When the user is satisfied with the results of their query, they can ask `intake-esm` to load data assets (netCDF/HDF files and/or Zarr stores) into xarray datasets:

  ```python
  In [7]: dset_dict = col_subset.to_dataset_dict(zarr_kwargs={"consolidated": True})

  --> The keys in the returned dictionary of datasets are constructed as follows:
          'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
   |███████████████████████████████████████████████████████████████| 100.00% [18/18 00:10<00:00]
  ```
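To make the search and key-construction steps above more concrete, here is a minimal standard-library sketch of the idea. It is not how `intake-esm` is implemented: the CSV rows, institution and model names, paths, and the helper functions `load_catalog`, `search`, and `group_assets` are all hypothetical and exist only for illustration. The sketch filters a small in-memory catalog by column values, then groups the matching assets under keys built from the same six attributes shown in the output above.

```python
import csv
import io
from collections import defaultdict

# Toy catalog: hypothetical rows, NOT real entries from the pangeo-cmip6 catalog.
CATALOG_CSV = """\
activity_id,institution_id,source_id,experiment_id,table_id,variable_id,grid_label,path
CMIP,EXAMPLE-INST,MODEL-A,historical,Oyr,o2,gn,store/a/historical/o2
CMIP,EXAMPLE-INST,MODEL-A,historical,Oyr,no3,gn,store/a/historical/no3
ScenarioMIP,EXAMPLE-INST,MODEL-A,ssp585,Oyr,o2,gn,store/a/ssp585/o2
CMIP,EXAMPLE-INST,MODEL-A,piControl,Oyr,o2,gn,store/a/piControl/o2
CMIP,OTHER-INST,MODEL-B,historical,Amon,tas,gn,store/b/historical/tas
"""

# Attributes joined with "." to form a dataset key, as in the output above.
KEY_COLUMNS = [
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "table_id",
    "grid_label",
]


def load_catalog(text):
    """Parse the CSV text into a list of row dictionaries (hypothetical helper)."""
    return list(csv.DictReader(io.StringIO(text)))


def search(rows, **query):
    """Keep rows matching every query entry; a value may be a scalar or a list."""

    def matches(row):
        for column, wanted in query.items():
            allowed = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
            if row[column] not in allowed:
                return False
        return True

    return [row for row in rows if matches(row)]


def group_assets(rows, key_columns=KEY_COLUMNS):
    """Group asset paths under a dotted key built from the key columns."""
    groups = defaultdict(list)
    for row in rows:
        key = ".".join(row[column] for column in key_columns)
        groups[key].append(row["path"])
    return dict(groups)


rows = load_catalog(CATALOG_CSV)
subset = search(
    rows,
    experiment_id=["historical", "ssp585"],
    table_id="Oyr",
    variable_id="o2",
    grid_label="gn",
)
datasets = group_assets(subset)
for key, paths in sorted(datasets.items()):
    print(key, paths)
```

In this toy catalog the query keeps two assets, one per experiment, so two dataset keys result. Dropping the `variable_id` constraint would place both `historical` assets under a single key, which mirrors how several assets can map to one dataset.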
See documentation for more information.
Intake-esm can be installed from PyPI with pip:

```bash
python -m pip install intake-esm
```

It is also available from conda-forge for conda installations:

```bash
conda install -c conda-forge intake-esm
```
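After installing, one way to confirm that the package is visible to the current Python environment is a short standard-library check like the sketch below. The helper name `installed_version` is hypothetical. It assumes the distribution is named `intake-esm` and the importable module is `intake_esm`.

```python
import importlib.util
from importlib import metadata


def installed_version(distribution="intake-esm", module="intake_esm"):
    """Return the installed version string, or None if the package is missing."""
    # find_spec returns None when the top-level module cannot be imported.
    if importlib.util.find_spec(module) is None:
        return None
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


print(installed_version())
```

A result of `None` suggests the install went into a different environment than the one running the interpreter.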