High-res zarr products - build tracking thread #38
Monday update: of the two jobs left running over the weekend, the current status is shown below. [job status screenshot removed] Still working on debugging the cause of the crash.
And a few notes on things that seem to have worked (please correct me if anything here seems inaccurate):
The `mlo` store, at least, can now be opened lazily with xarray:

```python
import xarray as xr

path = "gs://leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5882522942-1/climsim-highres-mlo.zarr"
ds = xr.open_dataset(path, engine="zarr", chunks={})  # requires `gcsfs`; takes ~4 mins on my laptop
ds.nbytes / 1e12  # -> 13.36924905984 TB
len(ds.time)  # -> 210240
ds.state_t.attrs  # -> {'long_name': 'Air temperature', 'units': 'K'}
ds
```
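Since the store is opened lazily (`chunks={}`), small subsets can be read without touching the full ~13 TB. A rough sketch: `state_t` and the `time` dimension come from the output above, but any other dimension names are assumptions on my part:

```python
# Minimal sketch: read one timestep of one variable from the lazy dataset.
subset = ds.state_t.isel(time=0)  # lazy indexing, no bytes read yet
arr = subset.load()               # fetches only the chunks covering this slice
print(arr.shape, arr.attrs["units"])
```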
Hey @cisaacstern - I'd love to use this version of ClimSim so I can just grab a spatial slice of the data, but I can't seem to access the above URL. Did you resolve the issue in the end? Is there a new zarr URL I can use (hopefully for both `mli` and `mlo`)?
Howdie @duncanwp. I have moved all climsim related ingestion stuff to https://github.com/leap-stc/climsim_feedstock. As you can tell from leap-stc/climsim_feedstock#7, I am still struggling with ingesting the lowres data! I am hesitant to even try the highres data until then. There is some ClimSim data in […].

Please let me know if this is urgent to you and I can shift priorities to try to get this to work. I also opened a PR to add climsim into our catalog (https://catalog.leap.columbia.edu). We are not able to share links to specific datasets quite yet (tracking that in leap-stc/data-management#129), so for any future updates I recommend checking the catalog periodically!
Brilliant, thanks @jbusecke. I'll keep an eye on that repo, but it's not urgent as I can work around it for now.
I think (at least for the low res data) we now have the opportunity to ingest the data as a virtual zarr reference directly from HF. See https://github.com/jbusecke/hugging_face_data_testing?tab=readme-ov-file as an example.
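For context, here is a minimal sketch of the virtual-reference idea (not necessarily what the linked repo does): a kerchunk-style reference file maps zarr chunk keys to byte ranges in files hosted elsewhere, so xarray can open them as a single store without copying any data. The `references.json` filename is hypothetical and would have to be generated first:

```python
import xarray as xr

# Hypothetical sketch: "references.json" is a pre-generated kerchunk
# reference file pointing at netCDF files hosted remotely (e.g. on HF).
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": "references.json"},
    },
)
```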
I am currently building zarr products for the high-res data. Opening this thread so we have a public place to track progress on these efforts. By way of background:

- There will be two products, one for `mli` and one for `mlo`, totaling ~48 TB together.
- These datasets will be publicly available to everyone on the internet with no egress costs. If accessed from a cloud compute node (e.g., the LEAP JupyterHub), this will allow users of the data to access the full high res data product directly, without downloading anything.
- The builds use the `pangeo-forge-recipes` Python package, which uses Apache Beam as its distributed parallel computation framework (here's something I wrote recently on Beam, for those interested); a rough sketch of this kind of recipe appears at the end of this post.

My second full-scale attempt at running these jobs has now been running for a little over 2 days:
The first time I tried this they crashed after 3 days, and I think I fixed the bug that caused that crash. So if this attempt just works, I'd guess they'll be done by early next week. If these jobs crash, I'll restart them early next week, and then the next shot we'd have is for the end of next week (budgeting a couple of days per attempt).
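For those curious what a `pangeo-forge-recipes` build looks like, here is a minimal, hypothetical sketch of this kind of Beam pipeline. It is not the actual ClimSim recipe; the source URLs, store name, and output location are all placeholders:

```python
import apache_beam as beam
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.transforms import (
    OpenURLWithFSSpec,
    OpenWithXarray,
    StoreToZarr,
)

# Placeholder source files: one netCDF file per timestep.
urls = [f"https://example.com/climsim/step_{i:04d}.nc" for i in range(10)]
pattern = pattern_from_file_sequence(urls, concat_dim="time")

recipe = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()  # stream bytes from each source URL
    | OpenWithXarray()     # decode each file into an xarray dataset
    | StoreToZarr(         # write all pieces into one aligned zarr store
        store_name="climsim-example.zarr",
        target_root="output",  # local directory, for this toy example only
        combine_dims=pattern.combine_dim_keys,
    )
)

with beam.Pipeline() as p:  # the real jobs run on a distributed Beam runner
    p | recipe
```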