[REQUEST]: MPI-ESM1-2-HR historical #116

kareed1 · 2024-03-28T23:11:31Z

List of requested iids

'CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710',

Description

Hello,
On both Google and AWS, the above-noted dataset appears to contain only the years 1915-1959. I'm not sure whether this is intentional. I'd like to request that data for Jan 1985-Dec 2014 be added to the repositories. Thank you for making CMIP6 data easier to access!

jbusecke · 2024-03-29T15:12:56Z

Hi @kareed1,

thanks for raising an issue here!

I assume you are still using the 'old' catalog file here. Can you provide some more information (a small code snippet) on how you are currently accessing the data?

The current catalog (more info on how to access it) does not seem to contain that iid:

import intake


def zstore_to_iid(zstore: str) -> str:
    # this is a bit wacky, to account for the different ways old and new stores are laid out
    return '.'.join(zstore.replace('gs://','').replace('.zarr','').replace('.','/').split('/')[-11:-1])


iids_requested = [
    'CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710',
]

# uncomment/comment lines to swap catalogs
url = "https://storage.googleapis.com/cmip6/cmip6-pgf-ingestion-test/catalog/catalog.json"
col = intake.open_esm_datastore(url)

iids_all = [zstore_to_iid(z) for z in col.df['zstore'].tolist()]
iids_uploaded = [iid for iid in iids_all if iid in iids_requested]
iids_uploaded

gives an empty list.

I will add this to the ingestion and see what we get.
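For readers following along: the `[-11:-1]` slice in the helper above discards the last path component, so it only returns the ten iid facets when the store path ends with a slash. Below is a minimal sketch of a variant that strips a trailing slash first. It is an illustration, not the code used by the ingestion pipeline, and both example paths are assumptions made up for the demonstration (in particular `example-bucket` and its prefix are hypothetical).

```python
def zstore_to_iid_tolerant(zstore: str) -> str:
    """Recover a CMIP6 instance id from a store path, with or without a trailing slash.

    Hypothetical helper for illustration only.
    """
    parts = (
        zstore.replace("gs://", "")
        .replace(".zarr", "")
        .rstrip("/")          # tolerate paths with and without a trailing slash
        .replace(".", "/")    # new-style stores keep the iid as one dotted name
        .split("/")
    )
    # An iid has ten dot-separated facets, from mip_era through version.
    return ".".join(parts[-10:])


# Illustrative paths (assumptions, not taken from the catalog):
old_style = "gs://cmip6/CMIP6/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/Amon/tas/gn/v20190710/"
new_style = (
    "gs://example-bucket/some/prefix/"
    "CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710.zarr"
)

iid_old = zstore_to_iid_tolerant(old_style)
iid_new = zstore_to_iid_tolerant(new_style)
print(iid_old)
print(iid_new)
```

Both calls should print the requested iid, regardless of which layout the path uses.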

jbusecke · 2024-03-29T15:27:54Z

See my comments in #119: unfortunately, this seems to require some deeper debugging. We'll get to the bottom of this eventually!

kareed1 · 2024-03-29T17:31:36Z

Hi @jbusecke ,

Thank you for the updates and your assistance on this. It probably is from the old catalog. I had found some code online to get started, so I'm not sure how old that code was. Below is an example of the Python code I'm using.

import numpy as np
import pandas as pd
import xarray as xr
import zarr
import gcsfs

# Available datasets on Google Cloud
df = pd.read_csv('https://storage.googleapis.com/cmip6/cmip6-zarr-consolidated-stores.csv')

# Anonymous access to the Google Cloud datasets
gcs = gcsfs.GCSFileSystem(token='anon')

# Query the table
df_atm = df.query(
    "table_id == 'Amon' & "
    "source_id == 'MPI-ESM1-2-HR' & "
    "variable_id == 'tas' & "
    "experiment_id == 'historical' & "
    "member_id == 'r1i1p1f1'"
)

# Retrieve data from Google Cloud
var_path = df_atm.zstore.values[0]  # path to the dataset on Google Cloud
mapper = gcs.get_mapper(var_path)   # key-value mapper for the store
dat = xr.open_zarr(mapper)          # open the dataset

jbusecke · 2024-03-29T18:53:47Z

Cool, thanks for that info. That all looks good, but going forward I recommend using https://storage.googleapis.com/cmip6/cmip6-pgf-ingestion-test/catalog/pangeo_esgf_zarr_qc.csv (https://storage.googleapis.com/cmip6/cmip6-pgf-ingestion-test/catalog/catalog.json points to that!).
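As a rough illustration of what the facet query does, here is a standard-library sketch that filters catalog rows by facet values. It assumes the recommended CSV exposes the same column names as the query above (`table_id`, `source_id`, `variable_id`, `experiment_id`, `member_id`, `zstore`); the rows and store paths in `SAMPLE_CSV` are invented stand-ins, not real catalog entries.

```python
import csv
import io

# Made-up rows standing in for the catalog CSV (hypothetical store paths).
SAMPLE_CSV = """source_id,experiment_id,member_id,table_id,variable_id,zstore
MPI-ESM1-2-HR,historical,r1i1p1f1,Amon,tas,gs://example-bucket/a.zarr
MPI-ESM1-2-HR,historical,r1i1p1f1,Amon,pr,gs://example-bucket/b.zarr
MPI-ESM1-2-LR,historical,r1i1p1f1,Amon,tas,gs://example-bucket/c.zarr
"""


def find_stores(csv_file, **facets):
    """Return the zstore of every row whose columns match all given facets."""
    reader = csv.DictReader(csv_file)
    return [
        row["zstore"]
        for row in reader
        if all(row[key] == value for key, value in facets.items())
    ]


stores = find_stores(
    io.StringIO(SAMPLE_CSV),
    table_id="Amon",
    source_id="MPI-ESM1-2-HR",
    variable_id="tas",
    experiment_id="historical",
    member_id="r1i1p1f1",
)
print(stores)
```

If the column names do match, the pandas script above should only need its `pd.read_csv` URL swapped for the recommended CSV.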

jbusecke · 2024-05-01T02:08:27Z

I have high hopes that a solution to jbusecke/pangeo-forge-esgf#42 will address this issue too.


jbusecke · 2024-05-08T22:21:28Z

Ok the dataset was ingested, but ended up in our non-qc catalog.

I did some digging:

# You need specific versions of the following libraries to reproduce this on the LEAP-Pangeo hub
# (the quotes keep the shell from interpreting the square brackets)
pip install "leap-data-management-utils[pangeo-forge]" "git+https://github.com/jbusecke/pangeo-forge-esgf.git@new-request-scheme"

Let's load the store and run our tests (failing these causes a store to be put in our non-qc catalog):

import zarr
from pangeo_forge_esgf.utils import facets_from_iid
from leap_data_management_utils.cmip_testing import test_all
import intake
# uncomment/comment lines to swap catalogs
url = "https://storage.googleapis.com/cmip6/cmip6-pgf-ingestion-test/catalog/catalog_noqc.json" # Only stores that fail current
col = intake.open_esm_datastore(url)
iid = 'CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710'
facets = facets_from_iid(iid)
del facets['mip_era']
cat = col.search(**facets)
store = zarr.storage.FSStore(cat.df['zstore'].tolist()[0])
test_all(store, iid)

gives

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[24], line 15
     12 store = zarr.storage.FSStore(cat.df['zstore'].tolist()[0])
     13 # ds = xr.open_dataset(store, engine='zarr')
     14 # ds
---> 15 test_all(store, iid)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/leap_data_management_utils/cmip_testing.py:72, in test_all(store, iid, verbose)
     70 def test_all(store: zarr.storage.FSStore, iid: str, verbose=True) -> zarr.storage.FSStore:
     71     ds = test_open_store(store, verbose=verbose)
---> 72     test_time(ds, verbose=verbose)
     73     test_attributes(ds, iid, verbose=verbose)
     74     return store

File /srv/conda/envs/notebook/lib/python3.11/site-packages/leap_data_management_utils/cmip_testing.py:49, in test_time(ds, verbose)
     47 if verbose:
     48     print(time_diff)
---> 49 assert (time_diff > 0).all()
     51 # assert that there are no large time gaps
     52 mean_time_diff = time_diff.mean()

AssertionError:

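So the time is not continuous! More precisely, the failing assertion requires every time step to be later than the previous one, so the time axis is not monotonically increasing.

We can confirm that: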

import matplotlib.pyplot as plt
import xarray as xr
ds = xr.open_dataset(store, engine='zarr')
# Note: do not use the built-in xarray plot here. It plots the time against itself
# rather than against the array index, so the time would look continuous.
plt.plot(ds.time)

Yeah, that's not great... but it's fixable!

plt.plot(ds.sortby('time').time)

So @kareed1, you can use the above to work with the dataset for now.
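The failing check and the sorting workaround can be illustrated without xarray. The sketch below uses only the standard library and made-up dates; `non_increasing_steps` is a hypothetical helper that mirrors the idea behind `assert (time_diff > 0).all()`, not the code in `leap_data_management_utils`.

```python
from datetime import date


def non_increasing_steps(times):
    """Return every index i where times[i] is not later than times[i - 1]."""
    return [i for i in range(1, len(times)) if times[i] <= times[i - 1]]


# Made-up monthly axis in which a later block was written before an earlier one.
times = [date(1980, month, 16) for month in (1, 2, 3)]
times += [date(1975, month, 16) for month in (1, 2, 3)]

bad_steps = non_increasing_steps(times)
print(bad_steps)  # positions where the axis jumps backwards in time

# Sorting restores a strictly increasing axis.
sorted_times = sorted(times)
print(non_increasing_steps(sorted_times))
```

Note that sorting only repairs the order. It cannot restore time steps that were never written to the store, so the gap between the two blocks remains.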

I want to understand how this happened though...

from pangeo_forge_esgf.client import ESGFClient
iid = 'CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710'
client = ESGFClient()
dataset_id = client.get_instance_id_input([iid])[iid]['id']
file_dict = client.get_recipe_inputs_from_dataset_ids([dataset_id])
list(file_dict[iid].keys())

This seems fine:

'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_197501-197912.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_198001-198412.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_198501-198912.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_199001-199412.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_199501-199912.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_200001-200412.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_200501-200912.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_201001-201412.nc'

My first suspicion was that the files are not correctly concatenated, but that might not be it.
Will dig some more and follow up.

jbusecke · 2024-05-08T22:28:19Z

Oh wait, this is not a complete set of files! How strange.
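One way to make the incompleteness explicit is to parse the date range from each filename and compare the coverage against the expected span. The sketch below assumes the eight filenames pasted above are the entire listing and that the CMIP6 historical experiment should span 1850-01 to 2014-12; `missing_periods` is a hypothetical helper written for this illustration.

```python
import re

# Filenames copied from the listing above.
files = [
    "tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_197501-197912.nc",
    "tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_198001-198412.nc",
    "tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_198501-198912.nc",
    "tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_199001-199412.nc",
    "tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_199501-199912.nc",
    "tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_200001-200412.nc",
    "tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_200501-200912.nc",
    "tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_201001-201412.nc",
]

DATE_RANGE = re.compile(r"_(\d{4})(\d{2})-(\d{4})(\d{2})\.nc$")


def month_index(year, month):
    """Map a (year, month) pair to a running month counter."""
    return year * 12 + (month - 1)


def fmt(index):
    """Format a running month counter as YYYY-MM."""
    return f"{index // 12:04d}-{index % 12 + 1:02d}"


def missing_periods(filenames, expected_start, expected_end):
    """Return (first, last) month pairs in the expected span that no file covers."""
    ranges = []
    for name in filenames:
        match = DATE_RANGE.search(name)
        if match is None:
            raise ValueError(f"no date range found in {name!r}")
        y0, m0, y1, m1 = map(int, match.groups())
        ranges.append((month_index(y0, m0), month_index(y1, m1)))

    gaps = []
    cursor = month_index(*expected_start)  # first month not yet covered
    for start, stop in sorted(ranges):
        if start > cursor:
            gaps.append((fmt(cursor), fmt(start - 1)))
        cursor = max(cursor, stop + 1)
    if cursor <= month_index(*expected_end):
        gaps.append((fmt(cursor), fmt(month_index(*expected_end))))
    return gaps


gaps_full = missing_periods(files, (1850, 1), (2014, 12))
gaps_requested = missing_periods(files, (1985, 1), (2014, 12))
print(gaps_full)
print(gaps_requested)
```

Under those assumptions, the listing leaves everything before 1975 uncovered, while the originally requested window (Jan 1985-Dec 2014) is covered by the files shown.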

jbusecke · 2024-05-08T22:34:51Z

I'll move the discussion over to jbusecke/pangeo-forge-esgf#46, and will close this for now. Feel free to use the non-qc data in the meantime, but proceed with caution @kareed1

kareed1 added the request label (Requests for new data to be ingested to the cloud) Mar 28, 2024

kareed1 changed the title ~~[REQUEST]: MPI-ESM1-2-HR historial~~ [REQUEST]: MPI-ESM1-2-HR historical Mar 29, 2024

jbusecke added a commit that referenced this issue Mar 29, 2024

Add requested data for #116

d76f0a5

jbusecke mentioned this issue Mar 29, 2024

Add requested data for #116 #119

Merged

jbusecke added the cant find urls and blocked labels Mar 29, 2024

jbusecke mentioned this issue May 8, 2024

Incomplete file listings jbusecke/pangeo-forge-esgf#46

Open

1 task

jbusecke closed this as completed May 8, 2024
