
High-res zarr products - build tracking thread #38

Open
cisaacstern opened this issue Aug 18, 2023 · 7 comments

cisaacstern commented Aug 18, 2023

I am currently building zarr products for the high-res data. Opening this thread so we have a public place to track progress on these efforts. By way of background:

  • When complete, the output product will be two zarr data objects on the LEAP Google Cloud Storage, one for mli and one for mlo, totaling ~48 TB together. These datasets will be publicly available to everyone on the internet with no egress costs. If accessed from a cloud compute node (e.g., the LEAP JupyterHub), this will allow users of the data to access the full high-res data product directly, without downloading anything.
  • Here is the data ingestion + transformation code I am using to create these zarr stores. This code leverages the pangeo-forge-recipes Python package, which uses Apache Beam as its distributed parallel computation framework; here's something I wrote recently on Beam, for those interested. (A minimal sketch of what such a recipe looks like follows this list.)
  • Once these zarr stores are complete (currently, I've been debugging the long-running compute jobs), I'll devote some effort to contributing data loading code + examples to the GitHub repo that demonstrate how to access them.
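
For orientation, here is a minimal sketch of what a Beam-based pangeo-forge recipe looks like. This is not the actual ClimSim recipe (see the linked code for that); the URL template, time keys, store name, and chunking below are placeholders.

import apache_beam as beam
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

def make_url(time: str) -> str:
    # Placeholder URL template for the source NetCDF files.
    return f"https://example.com/climsim-highres/mlo/E3SM-MMF.mlo.{time}.nc"

# One key per source file, concatenated along the "time" dimension.
time_keys = ["0001-02-01-00000", "0001-02-01-01200"]  # ...and so on
pattern = FilePattern(make_url, ConcatDim("time", time_keys))

recipe = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()                          # open each source file via fsspec
    | OpenWithXarray(file_type=pattern.file_type)  # load each file as an xarray.Dataset
    | StoreToZarr(                                 # write everything into a single zarr store
        store_name="climsim-highres-mlo.zarr",     # placeholder; target_root is typically supplied by the deployment tooling
        combine_dims=pattern.combine_dim_keys,
        target_chunks={"time": 2},
    )
)

# The recipe is a composite PTransform; a runner executes it with, e.g.:
#   with beam.Pipeline() as p:
#       p | recipe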

My second full-scale attempt at running these jobs has now been running for a little over 2 days:

[screenshot: the two jobs, still running after ~2 days]

The first time I tried this, the jobs crashed after 3 days, and I think I've fixed the bug that caused that crash. So if this attempt just works, I'd guess they'll be done by early next week. If these jobs crash, I'll restart them early next week, and the next shot we'd have would be the end of next week (budgeting a couple of days per attempt).

cisaacstern self-assigned this Aug 18, 2023

cisaacstern commented Aug 21, 2023

Monday update: of the two jobs left running over the weekend, the mlo job apparently succeeded, whereas the mli job failed:

[screenshot: job statuses, mlo succeeded and mli failed]

Still working on debugging the cause of the mli failure. As for mlo, the output dataset can be opened as shown below. A few caveats:

  • 🙂 Please do not take this to be an official release of the Zarr dataset. This is an early preview; more validation work is required before we consider it canonical.
  • ⏳ Loading the dataset with xarray takes ~4 min (on my local laptop, maybe faster on a data-adjacent compute node, e.g. the LEAP hub). This is admittedly very fast compared to the alternative of downloading all ~13 TB, but not as fast as I'd like. I have some ideas as to why this is and will open/link related issues momentarily.

And a few notes on things that seem to have worked (please correct me if anything here seems inaccurate):

  • 📆 Time is parsed into an indexable coordinate (as opposed to a data variable, as it exists in the original NetCDF files), and all 210240 time steps are present, which is the expected number, as represented here.
  • 💾 The dataset totals ~13.4 TB (uncompressed), which is a plausible size for the aggregate mlo data: 210240 time steps x 61 MB per source file = ~12.8 TB on disk, giving a compression ratio of just under 1.05. This matches almost exactly the compression ratio calculated for a single file of the mlo source data.
  • 📝 Any attributes listed in this google sheet have been added to the variables.
  • 🔢 As shown in the Details section below, the chunk size is (2, 60, 21600) along the time, lev, and ncol dimensions respectively. This means ~120 MB per chunk for mlo (see the back-of-envelope check after this list).
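
As a back-of-envelope check on the numbers above (my own arithmetic, not output from the build; the 20-minute output frequency and float64 dtype are read off the dataset repr below):

# 8 simulated years on a NO_LEAP calendar at 20-minute output = 72 steps/day
n_steps = 8 * 365 * 72
print(n_steps)  # 210240, matching len(ds.time)

# Uncompressed size of one (2, 60, 21600) float64 chunk of a single 3D variable
chunk_bytes = 2 * 60 * 21600 * 8
print(chunk_bytes / 1e6)  # ~20.7 MB per variable per chunk

# Summed over all 16 data variables, one 2-timestep slab comes to roughly
print(13.37e12 / (n_steps / 2) / 1e6)  # ~127 MB, consistent with the ~120 MB/chunk figure if it refers to the per-slab total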

The mlo (prelim/preview only, no guarantees yet! 😄 ) dataset can be loaded as follows:

import xarray as xr
path = "gs://leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5882522942-1/climsim-highres-mlo.zarr"
ds = xr.open_dataset(path, engine="zarr", chunks={})  # requires `gcsfs`, takes ~4 mins on my laptop
ds.nbytes / 1e12  # -> 13.36924905984 TB
len(ds.time)  # -> 210240
ds.state_t.attrs  # -> {'long_name': 'Air temperature', 'units': 'K'}
ds
<xarray.Dataset>
Dimensions:         (time: 210240, ncol: 21600, lev: 60)
Coordinates:
  * time            (time) object 0001-02-01 00:00:00 ... 0009-01-31 23:40:00
Dimensions without coordinates: ncol, lev
Data variables: (12/16)
    cam_out_FLWDS   (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    cam_out_NETSW   (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    cam_out_PRECC   (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    cam_out_PRECSC  (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    cam_out_SOLL    (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    cam_out_SOLLD   (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    ...              ...
    state_q0003     (time, lev, ncol) float64 dask.array<chunksize=(2, 60, 21600), meta=np.ndarray>
    state_t         (time, lev, ncol) float64 dask.array<chunksize=(2, 60, 21600), meta=np.ndarray>
    state_u         (time, lev, ncol) float64 dask.array<chunksize=(2, 60, 21600), meta=np.ndarray>
    state_v         (time, lev, ncol) float64 dask.array<chunksize=(2, 60, 21600), meta=np.ndarray>
    tod             (time) int32 dask.array<chunksize=(2,), meta=np.ndarray>
    ymd             (time) int32 dask.array<chunksize=(2,), meta=np.ndarray>
Attributes:
    calendar:  NO_LEAP
    fv_nphys:  2
    ne:        30
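
As a quick smoke test (my addition, not part of the build), a small lazy selection can be pulled into memory without reading anywhere near the full ~13 TB; only the zarr chunks overlapping the selection (one ~20 MB chunk of state_t here) are fetched:

t0 = ds.state_t.isel(time=0, lev=0).load()  # one time step, one model level
print(t0.shape)          # (21600,) -- one value per ncol column
print(float(t0.mean()))  # mean air temperature at that level [K]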

@duncanwp

Hey @cisaacstern - I'd love to use this version of ClimSim so I can just grab a spatial slice of the data, but I can't seem to access the above URL. Did you resolve the issue in the end? Is there a new zarr URL I can use (hopefully for both mlo and mli)?

@cisaacstern (Contributor Author)

Hi @duncanwp! I haven't been keeping up with this particular issue lately; @jbusecke may have some insight!

@jbusecke (Collaborator)

Howdie @duncanwp. I have moved all climsim-related ingestion stuff to https://github.com/leap-stc/climsim_feedstock

As you can tell from leap-stc/climsim_feedstock#7, I am still struggling with ingesting the low-res data! I am hesitant to even try the high-res data until then.

There is some ClimSim data in gs://leap-persistent-ro/sungdukyu, but I am unsure whether it is the low-res or high-res data (maybe @sungdukyu or @SammyAgrawal can provide clarity).

Please let me know if this is urgent to you and I can shift priorities to try to get this to work.

I also opened a PR to add ClimSim to our catalog (https://catalog.leap.columbia.edu). We are not able to share links to specific datasets quite yet (tracking that in leap-stc/data-management#129), so for any future updates I recommend checking the catalog periodically!

@duncanwp

Brilliant, thanks @jbusecke. I'll keep an eye on that repo, but it's not urgent as I can work around it for now.

@SammyAgrawal

The gs://leap-persistent-ro/sungdukyu cloud bucket contains the low-resolution data, specifically the first 8 years.

@jbusecke (Collaborator)

I think (at least for the low-res data) we now have the opportunity to ingest the data as a virtual zarr reference directly from HF. See https://github.com/jbusecke/hugging_face_data_testing?tab=readme-ov-file for an example.
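
For context, here is a minimal sketch (my illustration, not the feedstock code) of how such a virtual zarr / kerchunk reference can be opened with xarray; the reference-file URL below is a placeholder:

import xarray as xr

# "reference://" tells fsspec to build a virtual filesystem from a kerchunk
# reference file; the underlying NetCDF bytes stay on the remote host (e.g. HF).
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    chunks={},
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "https://example.com/climsim-lowres-reference.json",  # placeholder reference file
            "remote_protocol": "https",
        },
    },
)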
