[REQUEST]: MPI-ESM1-2-HR historical #116

Closed
kareed1 opened this issue Mar 28, 2024 · 8 comments
Labels
blocked cant find urls request Requests for new data to be ingested to the cloud

Comments

@kareed1

kareed1 commented Mar 28, 2024

List of requested iids

'CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710',

Description

Hello,
On both Google and AWS, the dataset listed above appears to contain only the years 1915-1959. I'm not sure if this was on purpose. I'd like to request that data for Jan 1985-Dec 2014 be added to the repositories. Thank you for making CMIP6 data easier to access!

@kareed1 kareed1 added the request Requests for new data to be ingested to the cloud label Mar 28, 2024
@kareed1 kareed1 changed the title [REQUEST]: MPI-ESM1-2-HR historial [REQUEST]: MPI-ESM1-2-HR historical Mar 29, 2024
@jbusecke
Collaborator

jbusecke commented Mar 29, 2024

Hi @kareed1,

thanks for raising an issue here!

I assume you are still using the 'old' catalog file here. Can you provide some more information (a small code snippet) on how you are currently accessing the data?

The current catalog (more info on how to access it) does not seem to have that iid:

import intake

def zstore_to_iid(zstore: str) -> str:
    # this is a bit wacky, to account for the different ways old/new stores are stored:
    # drop the 'gs://' prefix and '.zarr' suffix, turn '.' into '/', and keep the
    # 10 path elements before the last one
    return '.'.join(zstore.replace('gs://', '').replace('.zarr', '').replace('.', '/').split('/')[-11:-1])

iids_requested = [
    'CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710',
]

# uncomment/comment lines to swap catalogs
url = "https://storage.googleapis.com/cmip6/cmip6-pgf-ingestion-test/catalog/catalog.json"
col = intake.open_esm_datastore(url)

# iids of all stores in the catalog, then the requested ones that are already there
iids_all = [zstore_to_iid(z) for z in col.df['zstore'].tolist()]
iids_uploaded = [iid for iid in iids_all if iid in iids_requested]
iids_uploaded

gives an empty list.

I will add this to the ingestion and see what we get.
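
As a rough illustration of what the `zstore_to_iid` helper above does, the sketch below applies the same string manipulation to a single store path. The path is an assumption made up for illustration (a '/'-separated layout with a trailing slash); it is not taken from either catalog.

```python
def zstore_to_iid(zstore: str) -> str:
    # Same logic as the helper above: normalize separators to '/',
    # then keep the 10 path elements before the last one.
    parts = zstore.replace('gs://', '').replace('.zarr', '').replace('.', '/').split('/')
    return '.'.join(parts[-11:-1])

# Hypothetical store path, for illustration only (assumed layout, trailing slash).
example_zstore = (
    'gs://cmip6/CMIP6/CMIP/MPI-M/MPI-ESM1-2-HR/historical/'
    'r1i1p1f1/Amon/tas/gn/v20190710/'
)
example_iid = zstore_to_iid(example_zstore)
print(example_iid)
```

Because the slice drops the last path element, the result depends on whether the path ends in a separator, so it may be worth spot-checking a few real `zstore` values from whichever catalog is in use.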

@jbusecke
Collaborator

See my comments in #119: unfortunately, this seems to require some deeper debugging. We'll get to the bottom of this eventually!

@kareed1
Author

kareed1 commented Mar 29, 2024

Hi @jbusecke ,

Thank you for the updates and your assistance on this. It probably is from the old catalog. I had found some code online to get started, so I'm not sure how old that code was. Below is an example of the Python code I'm using.

import numpy as np
import pandas as pd
import xarray as xr
import zarr
import gcsfs

#available datasets on Google Cloud
df = pd.read_csv('https://storage.googleapis.com/cmip6/cmip6-zarr-consolidated-stores.csv')

#access to GC data sets
gcs = gcsfs.GCSFileSystem(token='anon')

#query the table
df_atm = df.query("table_id      == 'Amon' & \
                   source_id     ==  'MPI-ESM1-2-HR' & \
                   variable_id   == 'tas'  & \
                   experiment_id == 'historical' & \
                   member_id     == 'r1i1p1f1'")

#retrieve data from Google cloud
var_path = df_atm.zstore.values[0]             #pathway dataset on Google Cloud
mapper = gcs.get_mapper(var_path)              #dataset object
dat = xr.open_zarr(mapper)                     #open the dataset

@jbusecke
Collaborator

Cool, thanks for that info. That all looks good, but going forward I recommend using https://storage.googleapis.com/cmip6/cmip6-pgf-ingestion-test/catalog/pangeo_esgf_zarr_qc.csv (https://storage.googleapis.com/cmip6/cmip6-pgf-ingestion-test/catalog/catalog.json points to that!).

@jbusecke
Collaborator

jbusecke commented May 1, 2024

I have high hopes that a solution to jbusecke/pangeo-forge-esgf#42 will address this issue too.

jbusecke added a commit that referenced this issue May 7, 2024
* Add requested data for #116

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update iids_pr.yaml

* Update iids_pr.yaml

* Update iids.yaml

* Update iids_pr.yaml

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@jbusecke
Collaborator

jbusecke commented May 8, 2024

OK, the dataset was ingested, but it ended up in our non-qc catalog.

I did some digging:

# you need specific versions of the following libraries to reproduce the following on the LEAP-Pangeo hub
pip install leap-data-management-utils[pangeo-forge] git+https://github.com/jbusecke/pangeo-forge-esgf.git@new-request-scheme

Let's load the store and run our tests (failing these causes a store to be put in our non-qc catalog):

import zarr
from pangeo_forge_esgf.utils import facets_from_iid
from leap_data_management_utils.cmip_testing import test_all
import intake
# uncomment/comment lines to swap catalogs
url = "https://storage.googleapis.com/cmip6/cmip6-pgf-ingestion-test/catalog/catalog_noqc.json" # Only stores that fail current
col = intake.open_esm_datastore(url)
iid = 'CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710'
facets = facets_from_iid(iid)
del facets['mip_era']
cat = col.search(**facets)
store = zarr.storage.FSStore(cat.df['zstore'].tolist()[0])
test_all(store, iid)

gives

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[24], line 15
     12 store = zarr.storage.FSStore(cat.df['zstore'].tolist()[0])
     13 # ds = xr.open_dataset(store, engine='zarr')
     14 # ds
---> 15 test_all(store, iid)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/leap_data_management_utils/cmip_testing.py:72, in test_all(store, iid, verbose)
     70 def test_all(store: zarr.storage.FSStore, iid: str, verbose=True) -> zarr.storage.FSStore:
     71     ds = test_open_store(store, verbose=verbose)
---> 72     test_time(ds, verbose=verbose)
     73     test_attributes(ds, iid, verbose=verbose)
     74     return store

File /srv/conda/envs/notebook/lib/python3.11/site-packages/leap_data_management_utils/cmip_testing.py:49, in test_time(ds, verbose)
     47 if verbose:
     48     print(time_diff)
---> 49 assert (time_diff > 0).all()
     51 # assert that there are no large time gaps
     52 mean_time_diff = time_diff.mean()

AssertionError:

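So the time is not continuous! The failing assertion means that at least one time step is zero or negative, i.e. the time axis is not strictly increasing.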
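
For intuition, here is a minimal sketch of the kind of check that failed. It is not the `leap_data_management_utils` implementation: plain integers (months since an arbitrary origin) stand in for `ds.time`, the helper name is hypothetical, and the out-of-order layout is an assumption chosen purely for illustration.

```python
def non_increasing_steps(times):
    """Return every index i where times[i + 1] <= times[i]."""
    return [i for i in range(len(times) - 1) if times[i + 1] <= times[i]]

# Hypothetical time axis: two contiguous chunks written in the wrong order.
times = list(range(780, 1320)) + list(range(0, 780))

bad_steps = non_increasing_steps(times)
is_strictly_increasing = not bad_steps

print(bad_steps)               # positions where the axis jumps backwards
print(is_strictly_increasing)
```

Reporting the offending indices, rather than only a pass/fail result, makes it easier to see where chunks were stitched together in the wrong order.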

we can confirm that

import matplotlib.pyplot as plt
import xarray as xr
ds = xr.open_dataset(store, engine='zarr')
# Note: do not use the built-in xarray plot here. It plots the time against itself
# rather than against the array index, so the time would look continuous.
plt.plot(ds.time)
[image: plot of ds.time against the array index]

Yeah, that's not great... but it's fixable!

plt.plot(ds.sortby('time').time)

So @kareed1, you can use the above (sorting by time) to work with the dataset for now.
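
Sorting restores the order but cannot add missing time steps or remove repeated ones, so a quick check after sorting may be worthwhile. A minimal sketch, with plain integers standing in for monthly time stamps and made-up values (none of this comes from the actual store):

```python
# Hypothetical, unsorted monthly time axis with one repeated value and one gap.
times = [5, 6, 7, 0, 1, 2, 2, 3]

sorted_times = sorted(times)
pairs = list(zip(sorted_times, sorted_times[1:]))

# A step of 0 means a repeated time stamp; a step above 1 means missing months.
duplicate_count = sum(1 for a, b in pairs if b == a)
gaps = [(a, b) for a, b in pairs if b - a > 1]

print(sorted_times)
print(duplicate_count)
print(gaps)
```

With real data the same idea applies to the differences of `ds.sortby('time').time`, where the expected step depends on the calendar and output frequency.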

I want to understand how this happened though...

from pangeo_forge_esgf.client import ESGFClient
iid = 'CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710'
client = ESGFClient()
dataset_id = client.get_instance_id_input([iid])[iid]['id']
file_dict = client.get_recipe_inputs_from_dataset_ids([dataset_id])
list(file_dict[iid].keys())

this seems fine

'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_197501-197912.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_198001-198412.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_198501-198912.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_199001-199412.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_199501-199912.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_200001-200412.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_200501-200912.nc',
 'tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_201001-201412.nc'

My first suspicion was that the files are not correctly concatenated, but that might not be it.
Will dig some more and follow up.

@jbusecke
Collaborator

jbusecke commented May 8, 2024

Oh wait, this is not a complete set of files! How strange.
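
One way to spot this kind of gap is to compare the date ranges encoded in the filenames against the expected period. The sketch below does that for the eight filenames shown above. The helper names are hypothetical, and the expected period of 1850-01 to 2014-12 is an assumption based on the usual length of the CMIP6 historical experiment.

```python
import re

def month_index(yyyymm: str) -> int:
    """Convert 'YYYYMM' to a running month count."""
    return int(yyyymm[:4]) * 12 + int(yyyymm[4:]) - 1

def to_yyyymm(index: int) -> str:
    """Inverse of month_index."""
    return f"{index // 12:04d}{index % 12 + 1:02d}"

def missing_month_ranges(filenames, expected_start, expected_end):
    """Return (start, end) 'YYYYMM' pairs not covered by any filename."""
    pattern = re.compile(r"_(\d{6})-(\d{6})\.nc$")
    spans = []
    for name in filenames:
        match = pattern.search(name)
        if match is None:
            raise ValueError(f"no date range found in {name!r}")
        spans.append((month_index(match.group(1)), month_index(match.group(2))))
    spans.sort()

    gaps = []
    cursor = month_index(expected_start)  # first month not yet covered
    last = month_index(expected_end)
    for start, end in spans:
        if start > cursor:
            gaps.append((to_yyyymm(cursor), to_yyyymm(min(start - 1, last))))
        cursor = max(cursor, end + 1)
    if cursor <= last:
        gaps.append((to_yyyymm(cursor), to_yyyymm(last)))
    return gaps

# The filenames listed in the comment above.
files = [
    f"tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_{year}01-{year + 4}12.nc"
    for year in range(1975, 2015, 5)
]

# Assumed expected coverage of the historical experiment.
missing = missing_month_ranges(files, "185001", "201412")
print(missing)
```

A non-empty result means the file list does not cover the whole expected period.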

@jbusecke
Collaborator

jbusecke commented May 8, 2024

I'll move the discussion over to jbusecke/pangeo-forge-esgf#46 and close this for now. Feel free to use the non-qc data in the meantime, but proceed with caution, @kareed1.

@jbusecke jbusecke closed this as completed May 8, 2024