-
Notifications
You must be signed in to change notification settings - Fork 101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Atlite ESGF interface for downloading and preparing CMIP6 data #180
base: master
Are you sure you want to change the base?
Conversation
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @Ovewh this is great! I just had a brief look through without testing the code. You should also write a guide on how to introduce new datasets ;)
Here are only some general comments. I have to go through it all when I have time (which still might happen this week). Everything looks good. The only thing I do not like too much is having a yaml for the model config. In contrast to the windturbine/solar config, these are parameters which should be easily adjustable and not that hidden. I would rather prefer a slim version where only esgf_params are supported (no model argument and no yaml file) and in the documentation (or example) we provide some reasonable sets of params. Sounds sensible?
features = { | ||
"wind": ["wnd10m"], | ||
"influx": ["influx", "outflux"], | ||
"temperature": ["temperature"], | ||
"runoff": ["runoff"], | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these features are all the features provided by cmip? Just wondering because they look very similar to era5 (Idea is that a feature is a umbrella term for the xarray variable which will be retrieved)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's all the features. Though I'm not sure I get your question, isn't the point that the features should have the same name across the different datasets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, totally correct. I just wanted to be sure this isn't a copy/paste artefact
if v in datavars: | ||
ds[v].attrs["feature"] = [k for k, l in fd if v in l].pop() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you give me a hint why this is necessary? It is hard to comprehend without the context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CMIP data have longitude bins, latitude bins and time bins included as variables and these aren't defined as variables in the features dict. If I don't include if statement it will try to pop an empty list. Suggestion on a better way to do this?
Data variables:
time_bnds (time, bnds) datetime64[ns] 2021-01-01 ... 2021-02-01
lat_bnds (y, bnds) float64 31.65 32.35 32.65 33.35 ... 82.35 82.65 83.35
lon_bnds (x, bnds) float64 346.6 347.4 347.6 348.4 ... 44.35 44.65 45.35
wnd10m (time, y, x) float32 dask.array<chunksize=(100, 52, 59), meta=np.ndarray>
influx (time, y, x) float32 dask.array<chunksize=(100, 52, 59), meta=np.ndarray>
outflux (time, y, x) float32 dask.array<chunksize=(100, 52, 59), meta=np.ndarray>
temperature (time, y, x) float32 dask.array<chunksize=(100, 52, 59), meta=np.ndarray>
runoff (time, y, x) float32 dask.array<chunksize=(100, 52, 59), meta=np.ndarray>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather question whether we want to carry these variables along? If not needed, I'd argue to delete them before (in get_data) since they not requested from the feature list anyway
atlite/datasets/cmip.py
Outdated
CMIP sometimes specify the time in the center of the output intervall this shifted | ||
to the beginning. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather shift it to the end of the period, see https://atlite.readthedocs.io/en/master/conventions.html#time-points
atlite/datasets/cmip.py
Outdated
cutout : atlite.Cutout | ||
feature : str | ||
Name of the feature data to retrieve. Must be in | ||
`atlite.datasets.era5.features` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be cmip features
@@ -126,6 +126,14 @@ def __init__(self, path, **cutoutparams): | |||
gebco_path: str | |||
Path to find the gebco netcdf file. Only necessary when including | |||
the gebco module. | |||
esgf_params: dict | |||
Parameters to be used in search on the ESGF database. | |||
model: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we keep that we should rename this keyword argument into something more specific, like esgf_model
Reuse compliance requires a comment with the license at the beginning of each new file (can just be copied from the other .py/.yaml files) |
@FabianHofmann Something I would like your thoughts on. As I mentioned the surface roughness isn't available from CMIP, I have yet to address this issue in my code. My idea is to just have a keyword argument path with some external roughness dataset, however I could also make atlite prepare a static roughness dataset from ERA5? Atleast requiring the path to roughness dataset to be provided during the creation of the cutout would avoid any confusion of the data source. |
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
@Ovewh for retrieving features from other sources one has to add the module to the cutout. |
No, only a few models provide wind speed at 100m, most models provide only the surface wind speed. So the windspeed has to be extrapolated. |
Some models have the output stored in files containing 10-15 years of data rather than one year. In addition not the time index isn't either always the same this has been fixed by converting dataset following cftime to numpy datetime64
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
Okay then the roughness data has to come from the era5 dataset. atlite allows to mix datasources. So the best way would be to retrieve all variable from ESGF and fill up with era5 data which is principally done with cutout = atlite.Cutout('my_cutout', module=['esgf', 'era5'], time=....) Then it will retrieve all availabe features from esgf and the fill up missing variables (in that case the roughness data) from era5. Could you try that out? |
@FabianHofmann Yes ,so the issue is that CMIP contains future climate projections, and ERA5 is a reanalysis. It only makes sense to take the averaged roughness from ERA5, either based on one year or a single month. I did some sensitivity tests calculating capacity factors using constant and forecasted roughness for ERA5. There where only a slight difference in the offshore capacities. |
for more information, see https://pre-commit.ci
Let's also have a look at https://py-cordex.readthedocs.io/en/stable/index.html |
@FabianHofmann It doesn't look like py-cordex have an interface for downloading data, but I did not work with the CORDEX data. The first attempt on creating a CMIP interface I made turned out to be bit of a dead end. Integrating downloading of the CMIP data directly in atlite did not work out that well, since the CMIP datafiles are formated slightly different from model to model (e.g. some models provide yearly files or 10 years in one file, and then the models also use different calendars). It is probably simpler and more robust to make a very general interface for sideloading locally stored climate and weather data into atlite. Then it would be up to the user to preprocess the data into a format that atlite can understand. |
Interesting! This would be similar to what we have for the (Just a comment) Outsider question (I'm not familiar with CMIP/COREDEX datasets): Is there like a central repository from which one can manually downloaded the data? |
Yes, that's my idea, though perhaps even more general, instead of path, it would be a xarray.Dataset.
Yes, all the CMIP6/CORDEX data is stored at ESGF data nodes. It provides different ways of downloading the data e..g. OPeNDAP and wget scripts. |
Change proposed in this Pull Request
Add a interface in atlite to the ESGF CMIP database for downloading and preparing CORDEX and CMIP6 data.
Description
An interface in atlite for working with Climate model output have been developed. There is an example on how to use this interface available in
examples/cmip_interface_example.ipynb
. The search parameters for the ESGF database can be specified as either as dictionary when setting up the cutout or in theatlite/datasets/cmip.ymal
file. Determining the search parameters have to be done manually by searching the ESGF database.Variables required by atlite are:
This similar to the variables required by the old CORDEX interface, however surface roughness doesn't seem to be available from CMIP.
Based on the provided search parameters, it uses the pyesf-search python api to find matching results, if there are more than one result it will take the most resent result. Then the OPeNDAP urls for that result are obtained and which can be loaded lazily using xarray. This means that the data can be subset according to the cutout and the download and computation will be triggered by cutout.prepare(). Be aware that some models and dataservers doesn't provide OPeNDAP urls, which means that you might have to try different ensamble to find a model that has. The current example uses data from the EC-Earth3 model. There is also a possibility to download netCDF files with 1 year of data individually, however this haven't been implemented in atlite. That might be more robust, but atleast so far the OPeNDAP interface in xarray have been working flawlessly.
Caveats:
The highest temporal resolution that are available CMIP is 3hr, however some models only have some of variables required by atlite at 3hr resolution while others are at 6hr resolution. CMIP6 also have quite coarse resolution ~ 100km, CORDEX has higher resolution. The surface roughness is not available in CMIP, currently averaged roughness is taken from ERA5.
Motivation and Context
Explore influence of future climate change on energy systems. Related issue #59
How Has This Been Tested?
The functionally have been tested for calculating wind and pv capacities. Tested with python > 3.9.
Type of change
Checklist
pytest
inside the repository and no unexpected problems came up.doc/
.environment.yaml
file.doc/release_notes.rst
.pre-commit run --all
to lint/format/check my contribution