Add option to load selected or all years available in an experiment #1120

sloosvel · 2021-05-11T09:09:43Z

Is your feature request related to a problem? Please describe.
In #771 we tried to add the functionality to load all DCPP data without having to specify the start_year and the end_year, as the current time range handling is not ideal (#345). To do so, a new tag to load all the data available was introduced. @zklaus pointed out that it could work for all experiments.

Would you be able to help out?
Yes

zklaus · 2021-05-18T08:32:39Z

This issue is not about a general wildcard mechanism, though. @sloosvel, perhaps you can elaborate on the use case a little bit? Why is it impossible or very difficult to use the normal start_year, end_year mechanism in this case?

sloosvel · 2021-05-18T09:56:32Z

Thank you @zklaus. As you mention, it's not so much about the readability of recipes with wildcards. It's just that the current way of specifying start and end years, while clipping the dates from January 1st to December 31st, is not very convenient for certain datasets.

Below you can see what files for a DCPP experiment can look like. These are files from the same experiment (dcppA-hindcast). They have an extra sub-experiment tag to indicate the initialization of the run (s1960, s1961, ..., s2018), and each of these sub-experiments spans a different time period (November 1960 - October 1971 for s1960, November 1961 - October 1972 for s1961, and so on for all sub-experiments until s2018).

tas_Amon_EC-Earth3_dcppA-hindcast_s1960-r1i1p1f1_gr_196011-196110.nc 
tas_Amon_EC-Earth3_dcppA-hindcast_s1960-r1i1p1f1_gr_196111-196210.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1960-r1i1p1f1_gr_196211-196310.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1960-r1i1p1f1_gr_196311-196410.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1960-r1i1p1f1_gr_196411-196510.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1960-r1i1p1f1_gr_196511-196610.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1960-r1i1p1f1_gr_196611-196710.nc 
tas_Amon_EC-Earth3_dcppA-hindcast_s1960-r1i1p1f1_gr_196711-196810.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1960-r1i1p1f1_gr_196811-196910.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1960-r1i1p1f1_gr_196911-197010.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1960-r1i1p1f1_gr_197011-197110.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1961-r1i1p1f1_gr_196111-196210.nc 
tas_Amon_EC-Earth3_dcppA-hindcast_s1961-r1i1p1f1_gr_196211-196310.nc 
tas_Amon_EC-Earth3_dcppA-hindcast_s1961-r1i1p1f1_gr_196311-196410.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1961-r1i1p1f1_gr_196411-196510.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1961-r1i1p1f1_gr_196511-196610.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1961-r1i1p1f1_gr_196611-196710.nc 
tas_Amon_EC-Earth3_dcppA-hindcast_s1961-r1i1p1f1_gr_196711-196810.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1961-r1i1p1f1_gr_196811-196910.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1961-r1i1p1f1_gr_196911-197010.nc 
tas_Amon_EC-Earth3_dcppA-hindcast_s1961-r1i1p1f1_gr_197011-197110.nc  
tas_Amon_EC-Earth3_dcppA-hindcast_s1961-r1i1p1f1_gr_197111-197210.nc

So what would happen when someone tries to work with the full dataset with ESMValTool? First they would have to define 59 entries in the recipe to call all the sub-experiments (s1960 to s2018):

- {sub_experiment: 's1960', start_year: 1960, end_year: 1971}
- {sub_experiment: 's1961', start_year: 1961, end_year: 1972}

... 57 lines later we are done loading one dataset (in a tool that is meant to compare multiple datasets).

I think it would be much more user friendly to have an interface like this:

- {sub_experiment: 's(1960:2018)', select_years: all}
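
To make the idea concrete, here is a minimal sketch of how such a range tag could be expanded into one entry per sub-experiment. The `s(FIRST:LAST)` pattern is the syntax suggested above; the function name `expand_sub_experiments` and the 11-year offset (read off the file listing, where s1960 ends in 1971) are assumptions for illustration and not existing ESMValCore code.

```python
import re


def expand_sub_experiments(tag, span_years=11):
    """Expand a hypothetical 's(FIRST:LAST)' tag into per-year entries.

    span_years=11 is an assumption taken from the file listing
    (s1960 covers 1960 to 1971).
    """
    match = re.fullmatch(r"s\((\d{4}):(\d{4})\)", tag)
    if match is None:
        raise ValueError(f"not a sub-experiment range: {tag!r}")
    first, last = int(match.group(1)), int(match.group(2))
    return [
        {
            "sub_experiment": f"s{year}",
            "start_year": year,
            "end_year": year + span_years,
        }
        for year in range(first, last + 1)
    ]


entries = expand_sub_experiments("s(1960:2018)")
print(len(entries))  # one entry per initialization year
print(entries[0])
```

Writing the expanded entries back into the recipe that is stored with the results would keep the run fully specified while the recipe the user edits stays short.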

And what happens if we only want to take into account, let's say, a subset of years from each sub_experiment? (I don't know if this is a real-life application)

- {sub_experiment: 's1960', start_year: 1960, end_year: 1963}
- {sub_experiment: 's1961', start_year: 1961, end_year: 1964}

Wouldn't it be more user friendly to write this instead?

- {sub_experiment: 's(1960:2018)', select_years:  first 3}

But the problem here is another one. The clip_start_end_year function makes the cube lose points because it automatically sets the range to a period that does not align with the period of the dataset:

cube after loading: <iris 'Cube' of air_temperature / (K) (time: 48; latitude: 256; longitude: 512)>
cube after `clip_start_end_year`:  <iris 'Cube' of air_temperature / (K) (time: 38; latitude: 256; longitude: 512)>

So now it's not only a problem of user-friendliness, but rather that a preprocessing function is returning wrong values.
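
The loss of points can be reproduced with plain date arithmetic. The sketch below assumes monthly data running from November 1960 to October 1964 (which matches the 48 time points reported above) and a clipping window of 1 January 1960 up to, but not including, 1 January 1964. It uses only the standard library and is not the actual `clip_start_end_year` code.

```python
from datetime import date


def month_starts(start, end):
    """Return the first day of every month from start to end (inclusive)."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append(date(year, month, 1))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


# Assumed time axis of the loaded cube: November 1960 to October 1964.
loaded = month_starts(date(1960, 11, 1), date(1964, 10, 1))

# Clipping for start_year: 1960, end_year: 1963 keeps January 1st 1960
# up to (not including) January 1st 1964.
clipped = [t for t in loaded if date(1960, 1, 1) <= t < date(1964, 1, 1)]

print(len(loaded), len(clipped))
print(len(loaded) - len(clipped), "months dropped")
```

The ten dropped points are January to October 1964, which belong to the third forecast year but fall outside the calendar-year window.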

In summary, what I was trying to address here is finding a compact way of loading these types of datasets, as well as fixing the clipping of the time ranges. Whether this should be done with wildcards, by compiling recipes, or by not accepting this kind of general recipe in the repository because of readability, I'm all for doing whatever is most convenient for everyone. But that is not the main issue.

bouweandela · 2021-05-19T10:26:22Z

Thanks for explaining the feature in more detail @sloosvel, I did not get this from the issue text at the top. I created a new issue about the use of wildcards in #1138 and hid the discussion on that topic here.

sloosvel · 2021-05-19T10:30:44Z

I did not get this from the issue text at the top

My bad! I was not explaining things very well.

bouweandela · 2021-05-20T15:11:06Z

Wouldn't it be more user friendly to do so?
- {sub_experiment: 's(1960:2018)', select_years:  first 3}

I'm not sure if it's a good idea to move preprocessor functionality to the dataset section. Wouldn't it make more sense to create a preprocessor that selects the first or last data from all datasets, if that's something that's required?

But the problem here is another one. The clip_start_end_year function makes the cube lose points because it automatically sets the range to a period that does not align with the period of the dataset:

Would it be possible to set the range to the correct values when expanding the sub_experiment: 's(1960:2018)'? Maybe you could just omit specifying the start and end year?

sloosvel · 2021-05-25T07:00:30Z

I'm not sure if it's a good idea to move preprocessor functionality to the dataset section. Wouldn't it make more sense to create a preprocessor that selects the first or last data from all datasets, if that's something that's required?

The start_year and end_year tags work in the same way as this tag would work. They load the data by looking at the filenames, but the clipping does not happen until the corresponding preprocessing step. So in a way the start_year and end_year tags are also preprocessor functionality specified in the dataset section.
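
As a rough illustration of the filename-based selection described here, the sketch below keeps every file whose year range overlaps the requested years. The helper names `file_years` and `select_files` are hypothetical and the logic is simplified to whole years; it is not the actual ESMValCore implementation.

```python
import re


def file_years(filename):
    """Return (start_year, end_year) parsed from a '_YYYYMM-YYYYMM.nc' suffix."""
    match = re.search(r"_(\d{4})\d{2}-(\d{4})\d{2}\.nc$", filename)
    if match is None:
        raise ValueError(f"no time range found in {filename!r}")
    return int(match.group(1)), int(match.group(2))


def select_files(filenames, start_year, end_year):
    """Keep files whose year range overlaps [start_year, end_year]."""
    selected = []
    for name in filenames:
        first, last = file_years(name)
        if first <= end_year and last >= start_year:
            selected.append(name)
    return selected


prefix = "tas_Amon_EC-Earth3_dcppA-hindcast_s1960-r1i1p1f1_gr_"
files = [
    prefix + "196011-196110.nc",
    prefix + "196111-196210.nc",
    prefix + "196211-196310.nc",
    prefix + "196311-196410.nc",
    prefix + "196411-196510.nc",
]

selected = select_files(files, start_year=1960, end_year=1963)
print(len(selected))
```

With start_year 1960 and end_year 1963 the first four files are selected, so data through October 1964 is loaded and only the later clipping step cuts it back to December 1963.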

Would it be possible to set the range to the correct values when expanding the sub_experiment: 's(1960:2018)'? Maybe you could just omit specifying the start and end year?

@zklaus did not let me do that in #771, because we should allow people to work with a subset of years for each sub-experiment. And that means that the clipping should be generalised for these types of cases.

zklaus · 2021-05-25T08:35:02Z

My point was that there is nothing special about the sub-experiment. I think of every sub-experiment as its own experiment. So if we want to say that leaving out the start year means "take everything from the beginning of the experiment", and leaving out the end year means "take everything to the end of the experiment", that could work, but there is no reason to tie this to the sub-experiment. Indeed, this functionality would be useful also to compare, for example, spin-ups of varying lengths, etc.

So I like the feature very much. I am a bit concerned about using just the absence of the tag as a marker since that seems to invite accidentally voluminous analysis. Hence, my suggestion to have a different syntax in the recipe.

tl;dr
I like the feature. Three questions:

Do you agree that it isn't really tied to sub-experiments?
Do we want the feature in general?
Do we need more expressive syntax?

sloosvel · 2021-05-26T13:57:43Z

Maybe renaming the tag to timerange as proposed in #345 would be more clear?

bouweandela · 2021-05-27T11:58:28Z

So in a way the start_year and end_year tags are also a preprocessor functionality specified in the dataset section.

I know, I think it's not a very good design and not something we should encourage.

Do we need more expressive syntax?

Could we use a wildcard '*' for specifying any start year (see also #1138)? And make a separate preprocessor function for cutting out certain time ranges?

sloosvel · 2021-05-31T07:33:37Z

So all recipes would need to be modified?

jvegreg · 2021-05-31T11:18:57Z

Could we use a wildcard '*' for specifying any start year (see also #1138)? And make a separate preprocessor function for cutting out certain time ranges?

I think @ledm asked years ago for a feature to allow users to easily load all available years, but also the first/last X years and such. I would prefer a syntax that covers all those cases at once.

Regarding the preprocessor function: it will usually go after the checker, so users can potentially be affected by issues in files that are really not required. Also, bear in mind that one use case for this will be inspecting the last years (30, 50 or so) of spin-up experiments that can run for 200 years: that may imply a lot of files to load that we do not really require.

My suggestion:

Deprecate start_year and end_year but do not remove them right now: give users a couple of releases at least
Create a new timespan tag or similar, based on ISO 8601, that will allow us to be more expressive when asking for data:

Some examples (no need to support all of them from the start)

# Our current case
timespan: 1980/2020

# More granular options
timespan: 198012/202011

# Start and duration
timespan: 1980/P3Y
timespan: 198005/P3M

# The next ones are really important for us in the decadal / seasonal applications

# The full period
timespan: *

# Period at the start / end of data availability
 
timespan: P10Y # or P24M P300D later on if we reach seasonal timescales
timespan: P-10y # The standard has no way to represent "from the end", so I used a Python-style -

# Relative periods 
timespan: P10Y/P3M
timespan: P0y/-P5M

We may also support replacing the standard / with a space for readability

sloosvel · 2021-06-04T10:11:21Z

So which option would you go for? The timespan thing can fit what is already half started in #1133

bouweandela · 2021-06-04T13:54:34Z

We would need to make sure that the proposed solution works with #345 (comment), though of course there is no need to exactly match the specification from the CMIP filenames with what we write in a recipe.

ledm · 2021-06-07T11:54:36Z

I really like this idea and I am all for it! Despite the fact that I hate it when we change the recipe interface and it breaks everything - usually with no notice. Please keep the previous input functional or at least add an error message which provides a command to switch from the old standard to the new standard. If not, we'd be needlessly frustrating our users - again.

zklaus · 2021-06-08T12:14:22Z

Let's not hijack this issue. @ledm, I have created a new issue for your use-case at #1161, copying the part of your comment that I have removed here.

bouweandela · 2021-06-15T10:09:00Z

Thanks for a fruitful discussion everyone! @sloosvel Could you please add a short summary to this issue, so we don't forget?

If only a duration or wildcard is specified, it would be best if this could be expanded to a start_time/end_time and written to the resulting recipe as proposed in #1138. That way it is completely specified to users outside your own institute what data needs to be obtained to run a recipe.

sloosvel · 2021-06-15T13:50:12Z

A new timerange tag will be introduced, and at some point the start_year and end_year tags will be deprecated in its favour.
The tag will follow the ISO 8601 format, which can be parsed with pyiso8601. A range is specified by separating its start and end with /. Possible values will be:

start/end : Specifying both dates
start/duration: Specifying a start date and getting the end date from a duration period (such as P3Y)
duration/end: Specifying an end date and getting the start date from a duration period
/duration: Finding first year available and getting the end date from a duration period
duration/: Finding last year available and getting the start date from a duration period
*: All years available
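
The six forms could be resolved roughly as in the sketch below. It is restricted to whole years and `PnY` durations, the function name `resolve_timerange` is hypothetical, and it assumes that a duration of N years covers N calendar years including the given start or end year. That last point is a choice made for this example and was not settled in the discussion.

```python
import re

DURATION = re.compile(r"P(\d+)Y")  # year durations only, for brevity


def resolve_timerange(timerange, first_available, last_available):
    """Resolve a timerange string to (start_year, end_year), both inclusive.

    first_available and last_available are the first and last years
    for which data exists.
    """
    if timerange == "*":
        return first_available, last_available
    start, sep, end = timerange.partition("/")
    if not sep:
        raise ValueError(f"missing '/' in {timerange!r}")
    start_dur = DURATION.fullmatch(start)
    end_dur = DURATION.fullmatch(end)
    if start_dur and end_dur:
        raise ValueError("only one side may be a duration")
    if end_dur:
        # 'start/duration' or '/duration'
        start_year = int(start) if start else first_available
        return start_year, start_year + int(end_dur.group(1)) - 1
    if start_dur:
        # 'duration/end' or 'duration/'
        end_year = int(end) if end else last_available
        return end_year - int(start_dur.group(1)) + 1, end_year
    return int(start), int(end)


for example in ["1980/2020", "1980/P3Y", "P3Y/2020", "/P10Y", "P10Y/", "*"]:
    print(example, resolve_timerange(example, 1850, 2100))
```

Resolving the open-ended forms to explicit years at this stage also yields the values that would be written to the resulting recipe.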

Furthermore:

If only a duration or wildcard is specified, it would be best if this could be expanded to a start_time/end_time and written to the resulting recipe as proposed in #1138. That way it is completely specified to users outside your own institute what data needs to be obtained to run a recipe.

And finally, these changes will also affect the clip_start_end_year function, which currently clips from January 1st to December 31st. It will be generalised to clip to the dates that have been specified in the timerange tag. If the timerange only specifies years, the clipping will remain from Jan 1st to Dec 31st.

bouweandela · 2021-06-15T14:11:56Z

Regarding the domain that is selected, we would like to keep the current behaviour, e.g. if in the recipe

timerange: 1990/2000, this will select data with time coordinate >=1990-01-01 00:00:00 and <2001-01-01 00:00:00
timerange: 199004/200010, this will select data with time coordinate >=1990-04-01 00:00:00 and <2000-11-01 00:00:00
etc

so the end time is inclusive, i.e. 1 is added to the last unit given (year, month, day, hour, minute or second, whichever is the precision of the end time) and then all data earlier than that is selected.
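
The rule above can be sketched as follows for year and month precision. The function name `timerange_bounds` is hypothetical and finer precisions are left out; the returned pair is meant to be used as start <= time < end.

```python
from datetime import datetime


def _parse(stamp):
    """Parse 'YYYY' or 'YYYYMM' into (year, month, precision)."""
    if len(stamp) == 4:
        return int(stamp), 1, "year"
    if len(stamp) == 6:
        return int(stamp[:4]), int(stamp[4:]), "month"
    raise ValueError(f"unsupported precision: {stamp!r}")


def timerange_bounds(timerange):
    """Return (start, end) so that start <= time < end is selected."""
    start_str, end_str = timerange.split("/")
    s_year, s_month, _ = _parse(start_str)
    e_year, e_month, precision = _parse(end_str)
    start = datetime(s_year, s_month, 1)
    # Add one unit at the precision of the end stamp.
    if precision == "year" or e_month == 12:
        end = datetime(e_year + 1, 1, 1)
    else:
        end = datetime(e_year, e_month + 1, 1)
    return start, end


print(timerange_bounds("1990/2000"))
print(timerange_bounds("199004/200010"))
```

Using a half-open interval avoids having to know the last time step of the final year or month, which depends on the calendar and the data frequency.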

sloosvel added the enhancement New feature or request label May 11, 2021

sloosvel mentioned this issue May 11, 2021

Add tag all_years: True #1122

Closed

10 tasks

sloosvel changed the title ~~Add option to load all years available in an experiment~~ Add option to load selected or all years available in an experiment May 17, 2021

sloosvel mentioned this issue May 17, 2021

Allow to load all files, first X years or last X years in an experiment #1133

Merged

10 tasks

bouweandela mentioned this issue Jun 7, 2021

Monthly ESMValTool meeting June ESMValGroup/ESMValTool#2173

Closed

zklaus mentioned this issue Jun 8, 2021

Link times of related experiments (parent - child) #1161

Open

zklaus added this to the v2.4.0 milestone Jun 8, 2021

Peter9192 mentioned this issue Jun 15, 2021

Switch to using time ranges eWaterCycle/ewatercycle#105

Closed

sloosvel mentioned this issue Jul 6, 2021

Generalise clip_timerange to custom time periods. #1214

Closed

10 tasks

jvegreg added the ISENES label Jul 22, 2021

jvegreg assigned sloosvel Jul 22, 2021

zklaus modified the milestones: v2.4.0, v2.5.0 Oct 15, 2021

schlunma closed this as completed in #1133 Dec 21, 2021

sloosvel mentioned this issue Jan 13, 2022

Adapt annual_statistics to DCPP data #1422

Closed
