Incomplete file listings #46
Yup, I just confirmed that the distributed search does not work properly 😡
This means I will now have to query every index node separately and combine the results, what a pain. I'll stop here for now, but list the options I have going forward.
😩
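Querying each index node separately and merging the results could look roughly like the sketch below. It is only an illustration: `search_node` is a hypothetical callable (not part of intake-esgf or pangeo-forge-esgf) that returns a list of `{"filename": ..., "url": ...}` records for one node, and de-duplicating by filename is an assumption about what "combine" should mean here.

```python
from typing import Callable, Dict, Iterable, List


def combine_node_results(
    nodes: Iterable[str],
    search_node: Callable[[str], List[Dict[str, str]]],
) -> List[Dict[str, str]]:
    """Query each index node on its own and merge the file records.

    `search_node` is a hypothetical helper supplied by the caller; it is
    not an existing ESGF API. Records are de-duplicated by filename, and
    the first node that offers a file wins.
    """
    combined: Dict[str, Dict[str, str]] = {}
    for node in nodes:
        try:
            records = search_node(node)
        except Exception as err:  # one unreachable node should not abort the loop
            print(f"skipping {node}: {err}")
            continue
        for record in records:
            # setdefault keeps the record from the earliest node in `nodes`.
            combined.setdefault(record["filename"], record)
    # Sort by filename so the output order is deterministic.
    return [combined[name] for name in sorted(combined)]
```

The union across nodes is the point: a file missing from one node's listing can still be picked up from another, which a single distributed query apparently does not guarantee.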
Example of how intake-esgf might be used:
!pip install git+https://github.com/jbusecke/intake-esgf.git@http-links
import intake_esgf
from intake_esgf import ESGFCatalog
from intake_esgf.base import NoSearchResults
from pangeo_forge_esgf.utils import facets_from_iid

# Enable each index node explicitly instead of relying on distributed search
intake_esgf.conf.set(indices={
    "esgf-node.llnl.gov": True,
    "esg-dn1.nsc.liu.se": True,
    "esgf-data.dkrz.de": True,
    "esgf-node.ipsl.upmc.fr": True,
    "esgf-node.ornl.gov": True,
    "esgf.ceda.ac.uk": True,
    # "esgf.nci.org.au": True,
})
cat = ESGFCatalog()

def get_urls_from_intake_esgf(iid: str, cat: ESGFCatalog):
    print(iid)
    facets = facets_from_iid(iid)
    # shouldn't be necessary once https://github.com/jbusecke/pangeo-forge-esgf/pull/41 is merged
    facets['version'] = facets['version'].replace('v', '')
    try:
        res = cat.search(**facets)
        return res.to_http_link_dict()
    except NoSearchResults:
        return None

iid = 'CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710'
a = get_urls_from_intake_esgf(iid, cat)
[i['path'] for i in a]
Ah, here is a way to fail on these instances of incomplete file listings:
from pangeo_forge_esgf.client import ESGFClient
import json

iid = 'CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710'
client = ESGFClient()
d = client.get_instance_id_input([iid])
print(json.dumps(d, indent=4))
This produces:
{
    "CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710": {
        "id": "CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710|esgf.nci.org.au",
        "version": "20190710",
        "access": [
            "HTTPServer",
            "GridFTP",
            "OPENDAP",
            "Globus"
        ],
        "activity_drs": [
            "CMIP"
        ],
        "activity_id": [
            "CMIP"
        ],
        "cf_standard_name": [
            "air_temperature"
        ],
        "citation_url": [
            "http://cera-www.dkrz.de/WDCC/meta/CMIP6/CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710.json"
        ],
        "data_node": "esgf.nci.org.au",
        "data_specs_version": [
            "01.00.30"
        ],
        "dataset_id_template_": [
            "%(mip_era)s.%(activity_drs)s.%(institution_id)s.%(source_id)s.%(experiment_id)s.%(member_id)s.%(table_id)s.%(variable_id)s.%(grid_label)s"
        ],
        "datetime_start": "1975-01-16T12:00:00Z",
        "datetime_stop": "2014-12-16T12:00:00Z",
        "directory_format_template_": [
            "%(root)s/%(mip_era)s/%(activity_drs)s/%(institution_id)s/%(source_id)s/%(experiment_id)s/%(member_id)s/%(table_id)s/%(variable_id)s/%(grid_label)s/%(version)s"
        ],
        "east_degrees": 359.0625,
        "experiment_id": [
            "historical"
        ],
        "experiment_title": [
            "all-forcing simulation of the recent past"
        ],
        "frequency": [
            "mon"
        ],
        "further_info_url": [
            "https://furtherinfo.es-doc.org/CMIP6.MPI-M.MPI-ESM1-2-HR.historical.none.r1i1p1f1"
        ],
        "geo": [
            "ENVELOPE(-180.0, -0.9375, 89.284225, -89.284225)",
            "ENVELOPE(0.0, 180.0, 89.284225, -89.284225)"
        ],
        "geo_units": [
            "degrees_east"
        ],
        "grid": [
            "gn"
        ],
        "grid_label": [
            "gn"
        ],
        "index_node": "esgf.nci.org.au",
        "instance_id": "CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710",
        "institution_id": [
            "MPI-M"
        ],
        "latest": true,
        "master_id": "CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn",
        "member_id": [
            "r1i1p1f1"
        ],
        "mip_era": [
            "CMIP6"
        ],
        "model_cohort": [
            "Registered"
        ],
        "nominal_resolution": [
            "100 km"
        ],
        "north_degrees": 89.284225,
        "number_of_aggregations": 1,
        "number_of_files": 8,
        "pid": [
            "hdl:21.14100/e7de3c1e-2c48-3470-ba5e-f97a62a1878c"
        ],
        "product": [
            "model-output"
        ],
        "project": [
            "CMIP6"
        ],
        "realm": [
            "atmos"
        ],
        "replica": true,
        "size": 56793078,
        "source_id": [
            "MPI-ESM1-2-HR"
        ],
        "source_type": [
            "AOGCM"
        ],
        "south_degrees": -89.284225,
        "sub_experiment_id": [
            "none"
        ],
        "table_id": [
            "Amon"
        ],
        "title": "CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn",
        "type": "Dataset",
        "url": [
            "http://esgf.nci.org.au/thredds/catalog/esgcet/CMIP6/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/Amon/tas/gn/CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710.xml#CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710|application/xml+thredds|THREDDS"
        ],
        "variable": [
            "tas"
        ],
        "variable_id": [
            "tas"
        ],
        "variable_long_name": [
            "Near-Surface Air Temperature"
        ],
        "variable_units": [
            "K"
        ],
        "variant_label": [
            "r1i1p1f1"
        ],
        "west_degrees": 0.0,
        "xlink": [
            "http://cera-www.dkrz.de/WDCC/meta/CMIP6/CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710.json|Citation|citation",
            "http://hdl.handle.net/hdl:21.14100/e7de3c1e-2c48-3470-ba5e-f97a62a1878c|PID|pid"
        ],
        "_version_": 1689449850470400000,
        "retracted": false,
        "_timestamp": "2021-01-20T23:22:11.250Z",
        "score": 1.0
    }
}
My idea is to inject the datetime_start and datetime_stop fields as dataset attributes, and then run a check against the dataset's actual time data to confirm that the dataset covers this range (or at least comes close to it).
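A minimal sketch of such a coverage check, using only the standard library. The function names and the 31-day default tolerance are assumptions made for illustration, and the time values are taken to be timezone-aware `datetime` objects; real CMIP6 datasets often use cftime calendars, so a conversion step would be needed first.

```python
from datetime import datetime, timedelta
from typing import Sequence


def parse_esgf_time(value: str) -> datetime:
    """Parse an ESGF-style timestamp such as '1975-01-16T12:00:00Z'."""
    # fromisoformat() only accepts a trailing 'Z' from Python 3.11 on,
    # so rewrite it as an explicit UTC offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def covers_expected_range(
    times: Sequence[datetime],
    expected_start: str,
    expected_stop: str,
    tolerance: timedelta = timedelta(days=31),  # assumed slack, about one month
) -> bool:
    """Return True if `times` spans the expected range within `tolerance`."""
    if not times:
        return False
    start_ok = min(times) <= parse_esgf_time(expected_start) + tolerance
    stop_ok = max(times) >= parse_esgf_time(expected_stop) - tolerance
    return start_ok and stop_ok
```

Note that this only compares the end points. A dataset with the right first and last time step but missing files in the middle would still pass, so a check on the number or spacing of time steps would have to complement it.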
leap-stc/cmip6-leap-feedstock#116 (comment) describes a case where I get a nice list of files back, but they are not complete!
How do we detect this case before ingesting?
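One possible pre-ingestion check, sketched under assumptions: CMIP6 filenames for time-dependent variables end in a time range (for monthly data, `YYYYMM-YYYYMM.nc`), so a listing can be tested for contiguity without opening any file. The helper name `find_monthly_gaps` is made up for this example, it only handles monthly ranges, and the filenames in the usage note are illustrative rather than copied from a real search result.

```python
import re
from typing import List, Tuple

# Matches a trailing monthly time range such as '_185001-185412.nc'.
_RANGE = re.compile(r"_(\d{4})(\d{2})-(\d{4})(\d{2})\.nc$")


def find_monthly_gaps(filenames: List[str]) -> List[Tuple[str, str]]:
    """Return (previous_file, next_file) pairs that are not contiguous in time."""
    spans = []
    for name in filenames:
        match = _RANGE.search(name)
        if match is None:
            raise ValueError(f"no YYYYMM-YYYYMM range in {name!r}")
        y0, m0, y1, m1 = map(int, match.groups())
        # Count months since year 0 so consecutive months differ by exactly 1.
        spans.append((y0 * 12 + m0, y1 * 12 + m1, name))
    spans.sort()
    breaks = []
    for (_, prev_end, prev_name), (next_start, _, next_name) in zip(spans, spans[1:]):
        # Anything other than "starts the month after" is a gap or an overlap.
        if next_start != prev_end + 1:
            breaks.append((prev_name, next_name))
    return breaks
```

For a listing with files covering 185001-185412, 185501-185912 and 196501-196912, this would flag the pair formed by the second and third file. It cannot see files missing from the very start or end of the listing, so it would need to be paired with a comparison against the expected start and stop dates.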