
High-res zarr products - build tracking thread #38

Open
cisaacstern opened this issue Aug 18, 2023 · 7 comments

cisaacstern commented Aug 18, 2023

I am currently building zarr products for the high-res data. Opening this thread so we have a public place to track progress on these efforts. By way of background:

  • When complete, the output product will be two zarr data objects on the LEAP Google Cloud Storage, one for mli and one for mlo, totaling ~48 TB together. These datasets will be publicly available to everyone on the internet with no egress costs. If accessed from a cloud compute node (e.g., the LEAP JupyterHub), this will allow users of the data to access the full high-res data product directly, without downloading anything.
  • Here is the data ingestion + transformation code I am using to create these zarr stores. This code leverages the pangeo-forge-recipes Python package, which uses Apache Beam as its distributed parallel computation framework; here's something I wrote recently on Beam, for those interested. (A minimal sketch of what such a recipe looks like follows this list.)
  • Once these zarr stores are complete (currently, I've been debugging the long-running compute jobs), I'll devote some effort to contributing data loading code + examples to the GitHub repo that demonstrate how to access them.
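
For orientation, here is a minimal sketch of what a Beam-based pangeo-forge recipe looks like. This is not the actual ClimSim recipe (see the linked code for that); the URL template, time keys, store name, and chunking below are placeholders.

import apache_beam as beam
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

def make_url(time: str) -> str:
    # Placeholder URL template for the source NetCDF files.
    return f"https://example.com/climsim-highres/mlo/E3SM-MMF.mlo.{time}.nc"

# One key per source file, concatenated along the "time" dimension.
time_keys = ["0001-02-01-00000", "0001-02-01-01200"]  # ...and so on
pattern = FilePattern(make_url, ConcatDim("time", time_keys))

recipe = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()                          # open each source file via fsspec
    | OpenWithXarray(file_type=pattern.file_type)  # load each file as an xarray.Dataset
    | StoreToZarr(                                 # write everything into a single zarr store
        store_name="climsim-highres-mlo.zarr",     # placeholder; target_root is typically supplied by the deployment tooling
        combine_dims=pattern.combine_dim_keys,
        target_chunks={"time": 2},
    )
)

# The recipe is a composite PTransform; a runner executes it with, e.g.:
#   with beam.Pipeline() as p:
#       p | recipe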

My second full-scale attempt at running these jobs has now been running for a little over 2 days:

[screenshot: the two jobs, still running after ~2 days]

The first time I tried this, the jobs crashed after 3 days, and I think I've fixed the bug that caused that crash. So if this attempt just works, I'd guess they'll be done by early next week. If these jobs crash, I'll restart them early next week, and the next shot we'd have would be the end of next week (budgeting a couple of days per attempt).

cisaacstern self-assigned this Aug 18, 2023

cisaacstern commented Aug 21, 2023

Monday update: of the two jobs left running over the weekend, the mlo job apparently succeeded, whereas the mli job failed:

[screenshot: job statuses, mlo succeeded and mli failed]

Still working on debugging the cause of the mli failure. As for mlo, the output dataset can be opened as shown below. A few caveats:

  • 🙂 Please do not take this to be an official release of the Zarr dataset. This is an early preview; more validation work is required before we consider it canonical.
  • ⏳ Loading the dataset with xarray takes ~4 min (on my local laptop, maybe faster on a data-adjacent compute node, e.g. the LEAP hub). This is admittedly very fast compared to the alternative of downloading all ~13 TB, but not as fast as I'd like. I have some ideas as to why this is and will open/link related issues momentarily.

And a few notes on things that seem to have worked (please correct me if anything here seems inaccurate):

  • 📆 Time is parsed into an indexable coordinate (as opposed to a data variable, as it exists in the original NetCDF files), and all 210240 time steps are present, which is the expected number, as represented here.
  • 💾 The dataset totals ~13.4 TB (uncompressed), which is a plausible size for the aggregate mlo data: 210240 time steps x 61 MB per source file = ~12.8 TB on disk, giving a compression ratio of just under 1.05. This matches almost exactly the compression ratio calculated for a single file of the mlo source data.
  • 📝 Any attributes listed in this google sheet have been added to the variables.
  • 🔢 As shown in the Details section below, the chunk size is (2, 60, 21600) along the time, lev, and ncol dimensions respectively. This means ~120 MB per chunk for mlo (see the back-of-envelope check after this list).
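
As a back-of-envelope check on the numbers above (my own arithmetic, not output from the build; the 20-minute output frequency and float64 dtype are read off the dataset repr below):

# 8 simulated years on a NO_LEAP calendar at 20-minute output = 72 steps/day
n_steps = 8 * 365 * 72
print(n_steps)  # 210240, matching len(ds.time)

# Uncompressed size of one (2, 60, 21600) float64 chunk of a single 3D variable
chunk_bytes = 2 * 60 * 21600 * 8
print(chunk_bytes / 1e6)  # ~20.7 MB per variable per chunk

# Summed over all 16 data variables, one 2-timestep slab comes to roughly
print(13.37e12 / (n_steps / 2) / 1e6)  # ~127 MB, consistent with the ~120 MB/chunk figure if it refers to the per-slab total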

The mlo (prelim/preview only, no guarantees yet! 😄 ) dataset can be loaded as follows:

import xarray as xr
path = "gs://leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5882522942-1/climsim-highres-mlo.zarr"
ds = xr.open_dataset(path, engine="zarr", chunks={})  # requires `gcsfs`, takes ~4 mins on my laptop
ds.nbytes / 1e12  # -> 13.36924905984 TB
len(ds.time)  # -> 210240
ds.state_t.attrs  # -> {'long_name': 'Air temperature', 'units': 'K'}
ds
<xarray.Dataset>
Dimensions:         (time: 210240, ncol: 21600, lev: 60)
Coordinates:
  * time            (time) object 0001-02-01 00:00:00 ... 0009-01-31 23:40:00
Dimensions without coordinates: ncol, lev
Data variables: (12/16)
    cam_out_FLWDS   (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    cam_out_NETSW   (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    cam_out_PRECC   (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    cam_out_PRECSC  (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    cam_out_SOLL    (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    cam_out_SOLLD   (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    ...              ...
    state_q0003     (time, lev, ncol) float64 dask.array<chunksize=(2, 60, 21600), meta=np.ndarray>
    state_t         (time, lev, ncol) float64 dask.array<chunksize=(2, 60, 21600), meta=np.ndarray>
    state_u         (time, lev, ncol) float64 dask.array<chunksize=(2, 60, 21600), meta=np.ndarray>
    state_v         (time, lev, ncol) float64 dask.array<chunksize=(2, 60, 21600), meta=np.ndarray>
    tod             (time) int32 dask.array<chunksize=(2,), meta=np.ndarray>
    ymd             (time) int32 dask.array<chunksize=(2,), meta=np.ndarray>
Attributes:
    calendar:  NO_LEAP
    fv_nphys:  2
    ne:        30
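
As a quick smoke test (my addition, not part of the build), a small lazy selection can be pulled into memory without reading anywhere near the full ~13 TB; only the zarr chunks overlapping the selection (one ~20 MB chunk of state_t here) are fetched:

t0 = ds.state_t.isel(time=0, lev=0).load()  # one time step, one model level
print(t0.shape)          # (21600,) -- one value per ncol column
print(float(t0.mean()))  # mean air temperature at that level [K]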

@duncanwp

Hey @cisaacstern - I'd love to use this version of ClimSim so I can just grab a spatial slice of the data, but I can't seem to access the above URL. Did you resolve the issue in the end? Is there a new zarr URL I can use (hopefully for both mlo and mli)?

@cisaacstern (Contributor Author)

Hi @duncanwp! I haven't been keeping up with this particular issue lately; @jbusecke may have some insight!

@jbusecke (Collaborator)

Howdie @duncanwp. I have moved all climsim-related ingestion stuff to https://github.com/leap-stc/climsim_feedstock

As you can tell from leap-stc/climsim_feedstock#7, I am still struggling with ingesting the low-res data! I am hesitant to even try the high-res data until then.

There is some ClimSim data in gs://leap-persistent-ro/sungdukyu, but I am unsure whether it is the low-res or high-res data (maybe @sungdukyu or @SammyAgrawal can provide clarity).

Please let me know if this is urgent to you and I can shift priorities to try to get this to work.

I also opened a PR to add ClimSim to our catalog (https://catalog.leap.columbia.edu). We are not able to share links to specific datasets quite yet (tracking that in leap-stc/data-management#129), so for any future updates I recommend checking the catalog periodically!

@duncanwp

Brilliant, thanks @jbusecke. I'll keep an eye on that repo, but it's not urgent as I can work around it for now.

@SammyAgrawal

The gs://leap-persistent-ro/sungdukyu cloud bucket contains the low-resolution data, specifically the first 8 years.

@jbusecke (Collaborator)

I think (at least for the low-res data) we now have the opportunity to ingest the data as a virtual zarr reference directly from HF. See https://github.com/jbusecke/hugging_face_data_testing?tab=readme-ov-file for an example.
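
For context, here is a minimal sketch (my illustration, not the feedstock code) of how such a virtual zarr / kerchunk reference can be opened with xarray; the reference-file URL below is a placeholder:

import xarray as xr

# "reference://" tells fsspec to build a virtual filesystem from a kerchunk
# reference file; the underlying NetCDF bytes stay on the remote host (e.g. HF).
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    chunks={},
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "https://example.com/climsim-lowres-reference.json",  # placeholder reference file
            "remote_protocol": "https",
        },
    },
)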
