Add option to load selected or all years available in an experiment #1120

Closed
sloosvel opened this issue May 11, 2021 · 22 comments · Fixed by #1133
Assignees
Labels
enhancement New feature or request ISENES

Comments

@sloosvel
Contributor

sloosvel commented May 11, 2021

Is your feature request related to a problem? Please describe.
In #771 we tried to add the functionality to load all DCPP data without having to specify the start_year and the end_year, as the current time range handling is not ideal (#345). To do so, a new tag to load all the available data was introduced. @zklaus pointed out that it could work for all experiments.

Would you be able to help out?
Yes

@sloosvel sloosvel added the enhancement New feature or request label May 11, 2021
@sloosvel sloosvel changed the title Add option to load all years available in an experiment Add option to load selected or all years available in an experiment May 17, 2021
@zklaus

zklaus commented May 18, 2021

This issue is not about a general wildcard mechanism, though. @sloosvel, perhaps you can elaborate on the use case a little bit? Why is it impossible or very difficult to use the normal start_year, end_year mechanism in this case?

@sloosvel
Contributor Author

Thank you @zklaus. As you mention, it's not so much about the readability of recipes with wildcards. It's just that the current way of specifying start and end years, while clipping the dates from January 1st to December 31st, is not very convenient for certain datasets.

Below you can see what files for a DCPP experiment can look like. These are files from the same experiment (dcppA-hindcast). They have an extra sub-experiment tag to indicate the initialization of the run (s1960, s1961, ..., s2018), and each of these sub-experiments spans a different time period (November 1960 - October 1971 for s1960, November 1961 - October 1972 for s1961, and so on for all sub-experiments until s2018).

tas_Amon_EC-Earth3_dcppA-hindcast_s1960-r1i1p1f1_gr_196011-196110.nc 
tas_Amon_EC-Earth3_dcppA-hindcast_s1960-r1i1p1f1_gr_196111-196210.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1960-r1i1p1f1_gr_196211-196310.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1960-r1i1p1f1_gr_196311-196410.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1960-r1i1p1f1_gr_196411-196510.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1960-r1i1p1f1_gr_196511-196610.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1960-r1i1p1f1_gr_196611-196710.nc 
tas_Amon_EC-Earth3_dcppA-hindcast_s1960-r1i1p1f1_gr_196711-196810.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1960-r1i1p1f1_gr_196811-196910.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1960-r1i1p1f1_gr_196911-197010.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1960-r1i1p1f1_gr_197011-197110.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1961-r1i1p1f1_gr_196111-196210.nc 
tas_Amon_EC-Earth3_dcppA-hindcast_s1961-r1i1p1f1_gr_196211-196310.nc 
tas_Amon_EC-Earth3_dcppA-hindcast_s1961-r1i1p1f1_gr_196311-196410.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1961-r1i1p1f1_gr_196411-196510.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1961-r1i1p1f1_gr_196511-196610.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1961-r1i1p1f1_gr_196611-196710.nc 
tas_Amon_EC-Earth3_dcppA-hindcast_s1961-r1i1p1f1_gr_196711-196810.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1961-r1i1p1f1_gr_196811-196910.nc
tas_Amon_EC-Earth3_dcppA-hindcast_s1961-r1i1p1f1_gr_196911-197010.nc 
tas_Amon_EC-Earth3_dcppA-hindcast_s1961-r1i1p1f1_gr_197011-197110.nc  
tas_Amon_EC-Earth3_dcppA-hindcast_s1961-r1i1p1f1_gr_197111-197210.nc  

So what would happen when someone tries to work with the full dataset with ESMValTool? First they would have to define 59 entries in the recipe to call all the sub-experiments (s1960 to s2018 inclusive):

- {sub_experiment: 's1960', start_year: 1960, end_year: 1971}
- {sub_experiment: 's1961', start_year: 1961, end_year: 1972}

... 57 lines later we are done loading one dataset (in a tool that is meant to compare multiple datasets).

I think it would be much more user friendly to have an interface like this:

- {sub_experiment: 's(1960:2018)', select_years: all}

And what happens if we only want to take into account, let's say, a subset of years from each sub_experiment? (I don't know if this is a real-life application.)

- {sub_experiment: 's1960', start_year: 1960, end_year: 1963}
- {sub_experiment: 's1961', start_year: 1961, end_year: 1964}

Wouldn't it be more user friendly to do so?

- {sub_experiment: 's(1960:2018)', select_years:  first 3}
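
A minimal sketch of how a shorthand like the one proposed above could be expanded into one entry per sub-experiment. The helper `expand_sub_experiments` and the regular expression are hypothetical illustrations, not part of ESMValTool, and the `s(first:last)` syntax is only the proposal from this comment.

```python
import re

# Hypothetical pattern for the proposed 's(1960:2018)' shorthand.
# This is an illustration only, not existing ESMValTool recipe syntax.
RANGE_PATTERN = re.compile(r"^(?P<prefix>\w*)\((?P<first>\d+):(?P<last>\d+)\)$")


def expand_sub_experiments(entry):
    """Expand one dataset entry into one entry per sub-experiment."""
    match = RANGE_PATTERN.match(entry["sub_experiment"])
    if match is None:
        # No range shorthand: return a copy of the entry unchanged.
        return [dict(entry)]
    first, last = int(match["first"]), int(match["last"])
    return [
        {**entry, "sub_experiment": f"{match['prefix']}{year}"}
        for year in range(first, last + 1)  # the last year is included
    ]


entries = expand_sub_experiments(
    {"sub_experiment": "s(1960:2018)", "select_years": "all"}
)
print(entries[0]["sub_experiment"], entries[-1]["sub_experiment"])
```

All other keys of the entry (here `select_years`) are copied unchanged to every expanded entry, so the time selection would still have to be resolved per sub-experiment afterwards.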

But the problem here is another one. The clip_start_end_year function makes the cube lose points because it automatically sets the range to a period that does not align with the period of the dataset:

cube after loading: <iris 'Cube' of air_temperature / (K) (time: 48; latitude: 256; longitude: 512)>
cube after `clip_start_end_year`:  <iris 'Cube' of air_temperature / (K) (time: 38; latitude: 256; longitude: 512)>

So now it's not only a problem of user-friendliness, but rather that a preprocessing function is returning wrong values.
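
The loss of points can be reproduced with plain month counting. The sketch below assumes monthly data starting in November 1960 and a recipe entry with start_year 1960 and end_year 1963; that pairing is an assumption, since the comment does not say which entry produced the cubes shown. The function `clip_calendar_years` only mimics the calendar-year clipping described here and is not the ESMValCore implementation.

```python
from datetime import date


def month_starts(first, n_months):
    """Return n_months consecutive month-start dates beginning at `first`."""
    months = []
    year, month = first.year, first.month
    for _ in range(n_months):
        months.append(date(year, month, 1))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


def clip_calendar_years(times, start_year, end_year):
    """Keep points from 1 January of start_year to 31 December of end_year."""
    first, last = date(start_year, 1, 1), date(end_year, 12, 31)
    return [t for t in times if first <= t <= last]


# Four yearly files of a hindcast initialised in November 1960:
# November 1960 to October 1964, i.e. 48 monthly points.
loaded = month_starts(date(1960, 11, 1), 48)
clipped = clip_calendar_years(loaded, 1960, 1963)
print(len(loaded), len(clipped))  # 48 38
```

The ten points from January to October 1964 are dropped because the clipping window ends on 31 December 1963, even though they belong to the four forecast years that were requested.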

In summary, what I was trying to address here is finding a compact way of loading these types of datasets, as well as fixing the clipping of the time ranges. Whether this should be done with wildcards, by compiling recipes, or by not accepting this kind of general recipe in the repository for readability reasons, I'm all for doing whatever is most convenient for everyone. But that is not the main issue.

@bouweandela
Member

bouweandela commented May 19, 2021

Thanks for explaining the feature in more detail @sloosvel, I did not get this from the issue text at the top. I created a new issue about the use of wildcards in #1138 and hid the discussion on that topic here.

@sloosvel
Contributor Author

I did not get this from the issue text at the top

My bad! I was not explaining things very well.

@bouweandela
Member

bouweandela commented May 20, 2021

Wouldn't it be more user friendly to do so?

- {sub_experiment: 's(1960:2018)', select_years:  first 3}

I'm not sure if it's a good idea to move preprocessor functionality to the dataset section. Wouldn't it make more sense to create a preprocessor that selects the first or last data from all datasets, if that's something that's required?

But the problem here is another one. The clip_start_end_year function makes the cube lose points because it automatically sets the range to a period that does not align with the period of the dataset:

Would it be possible to set the range to the correct values when expanding the sub_experiment: 's(1960:2018)'? Maybe you could just omit specifying the start and end year?

@sloosvel
Contributor Author

I'm not sure if it's a good idea to move preprocessor functionality to the dataset section. Wouldn't it make more sense to create a preprocessor that selects the first or last data from all datasets, if that's something that's required?

The start_year and end_year tags work in the same way as this tag would work. They load the data by looking at the filenames, but the clipping does not happen until the corresponding preprocessing step. So in a way the start_year and end_year tags are also a preprocessor functionality specified in the dataset section.

Would it be possible to set the range to the correct values when expanding the sub_experiment: 's(1960:2018)'? Maybe you could just omit specifying the start and end year?

@zklaus did not let me do that in #771, because we should allow people to work with a subset of years for each sub-experiment. And that means that the clipping should be generalised for these types of cases.

@zklaus

zklaus commented May 25, 2021

My point was that there is nothing special about the sub-experiment. I think of every sub-experiment as its own experiment. So if we want to say that leaving out the start year means "take everything from the beginning of the experiment", and leaving out the end year means "take everything to the end of the experiment", that could work, but there is no reason to tie this to the sub-experiment. Indeed, this functionality would be useful also to compare, for example, spin-ups of varying lengths, etc.

So I like the feature very much. I am a bit concerned about using just the absence of the tag as a marker since that seems to invite accidentally voluminous analysis. Hence, my suggestion to have a different syntax in the recipe.

tl;dr
I like the feature. Three questions:

  • Do you agree that it isn't really tied to sub-experiments?
  • Do we want the feature in general?
  • Do we need more expressive syntax?

@sloosvel
Contributor Author

Maybe renaming the tag to timerange as proposed in #345 would be more clear?

@bouweandela
Member

So in a way the start_year and end_year tags are also a preprocessor functionality specified in the dataset section.

I know, I think it's not a very good design and not something we should encourage.

Do we need more expressive syntax?

Could we use a wildcard '*' for specifying any start year (see also #1138)? And make a separate preprocessor function for cutting out certain time ranges?

@sloosvel
Contributor Author

So all recipes would need to be modified?

@jvegreg
Contributor

jvegreg commented May 31, 2021

Could we use a wildcard '*' for specifying any start year (see also #1138)? And make a separate preprocessor function for cutting out certain time ranges?

I think @ledm asked years ago for a feature to allow users to easily load all available years, but also the first/last X years and such. I would prefer a syntax that covers all those cases at once.

Regarding the preprocessor function: it will usually go after the checker, so users can potentially be affected by issues in files that are really not required. Also, bear in mind that one use case for this will be inspecting the last years (30, 50 or so) of spin-up experiments that can run for 200 years: that may imply a lot of files to load that we do not really require.

My suggestion:

  • Deprecate start_year and end_year but do not remove them right now: give users at least a couple of releases
  • Create a new timespan tag (or similar) that lets us be more expressive when asking for data, based on ISO 8601:

Some examples (no need to support all of them from the start)

# Our current case
timespan: 1980/2020

# More granular options
timespan: 198012/202011

# Start and duration
timespan: 1980/P3Y
timespan: 198005/P3M

# The next ones are really important for us in the decadal / seasonal applications

# The full period
timespan: *

# Period at the start / end of data availability
 
timespan: P10Y # or P24M P300D later on if we reach seasonal timescales
timespan: P-10y # The standard has no way to represent "from the end", so I used the Python-style minus sign

# Relative periods 
timespan: P10Y/P3M
timespan: P0y/-P5M

We may also support replacing the standard / with a space for readability.

@sloosvel
Contributor Author

sloosvel commented Jun 4, 2021

So which option would you go for? The timespan thing can fit what is already half started in #1133

@bouweandela
Member

We would need to make sure that the proposed solution works with #345 (comment), though of course there is no need to exactly match the specification from the CMIP filenames with what we write in a recipe.

@ledm
Contributor

ledm commented Jun 7, 2021

I really like this idea and I am all for it! Despite the fact that I hate it when we change the recipe interface and it breaks everything, usually with no notice. Please keep the previous input functional, or at least add an error message which provides a command to switch from the old standard to the new standard. If not, we'd be needlessly frustrating our users, again.

@zklaus

zklaus commented Jun 8, 2021

Let's not hijack this issue. @ledm, I have created a new issue for your use-case at #1161, copying the part of your comment that I have removed here.

@bouweandela
Member

Thanks for a fruitful discussion everyone! @sloosvel Could you please add a short summary to this issue, so we don't forget?

If only a duration or wildcard is specified, it would be best if this could be expanded to a start_time/end_time and written to the resulting recipe as proposed in #1138. That way it is completely specified to users outside your own institute what data needs to be obtained to run a recipe.

@sloosvel
Contributor Author

A new timerange tag will be introduced, and will at some point deprecate the use of the start_year and end_year tags.
The tag will follow the ISO 8601 format, which can be parsed with pyiso8601. The start and end of the range will be separated by /. Possible values will be:

  • start/end : Specifying both dates
  • start/duration: Specifying a start date and getting the end date from a duration period (such as P3Y)
  • duration/end: Specifying an end date and getting the start date from a duration period
  • /duration: Finding the first year available and getting the end date from a duration period
  • duration/: Finding the last year available and getting the start date from a duration period
  • *: All years available
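
As a rough illustration of the forms listed above, the sketch below classifies a timerange string using only the standard library. The function `classify_timerange` is hypothetical, and the duration pattern is a simplification that accepts years, months and days only; a real implementation would more likely rely on an ISO 8601 parsing library.

```python
import re

# Dates written as YYYY, YYYYMM or YYYYMMDD (digits only).
DATE = re.compile(r"^\d{4}(?:\d{2}){0,2}$")
# Simplified ISO 8601 duration: years, months and days only. This is an
# assumption made to keep the sketch short; the standard allows more.
DURATION = re.compile(r"^P(?:\d+Y)?(?:\d+M)?(?:\d+D)?$")

FORMS = {
    ("date", "date"): "start/end",
    ("date", "duration"): "start/duration",
    ("duration", "date"): "duration/end",
    ("open", "duration"): "/duration",
    ("duration", "open"): "duration/",
}


def part_kind(part):
    """Classify one side of the '/' as open, duration or date."""
    if not part:
        return "open"
    if part != "P" and DURATION.match(part):
        return "duration"
    if DATE.match(part):
        return "date"
    raise ValueError(f"cannot interpret {part!r}")


def classify_timerange(timerange):
    """Return which of the listed forms a timerange string uses."""
    if timerange == "*":
        return "all"
    start, sep, end = timerange.partition("/")
    if not sep:
        raise ValueError(f"expected '*' or two parts separated by '/': {timerange!r}")
    kinds = (part_kind(start), part_kind(end))
    if kinds not in FORMS:
        raise ValueError(f"unsupported combination in {timerange!r}")
    return FORMS[kinds]


for example in ["1980/2020", "1980/P3Y", "P3Y/2020", "/P10Y", "P10Y/", "*"]:
    print(example, "->", classify_timerange(example))
```

Keeping the classification separate from the date arithmetic makes it straightforward to expand the open-ended forms into explicit start and end times once the available data is known.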

Furthermore:

If only a duration or wildcard is specified, it would be best if this could be expanded to a start_time/end_time and written to the resulting recipe as proposed in #1138. That way it is completely specified to users outside your own institute what data needs to be obtained to run a recipe.

And finally, these changes will also affect the clip_start_end_year function, which currently clips from January 1st to December 31st. It will be generalised to clip to the dates specified in the timerange tag. If the timerange only specifies years, the clipping will remain from Jan 1st to Dec 31st.

@bouweandela
Member

bouweandela commented Jun 15, 2021

Regarding the domain that is selected, we would like to keep the current behaviour, e.g. if in the recipe

  • timerange: 1990/2000, this will select data with time coordinate >=1990-01-01 00:00:00 and <2001-01-01 00:00:00
  • timerange: 199004/200010, this will select data with time coordinate >=1990-04-01 00:00:00 and <2000-11-01 00:00:00
  • etc

So the end time is inclusive: 1 is added to the end time at its stated precision (year, month, day, hour, minute or second), and all data strictly before the resulting time is selected.
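
The two examples above can be checked with a short sketch. The helper `timerange_bounds` is hypothetical and handles only year and month precision; it returns the half-open interval [start, end) that corresponds to an inclusive end time.

```python
from datetime import datetime


def parse_point(text):
    """Parse 'YYYY' or 'YYYYMM' into (year, month, precision)."""
    if len(text) == 4:
        return int(text), 1, "year"
    if len(text) == 6:
        return int(text[:4]), int(text[4:]), "month"
    raise ValueError(f"only YYYY and YYYYMM are handled in this sketch: {text!r}")


def timerange_bounds(timerange):
    """Return (start, end) such that start <= time < end is selected."""
    start_text, end_text = timerange.split("/")
    year, month, _ = parse_point(start_text)
    start = datetime(year, month, 1)

    # Add 1 at the precision of the end time to make it inclusive.
    year, month, precision = parse_point(end_text)
    if precision == "year":
        year += 1
    elif month == 12:
        year, month = year + 1, 1
    else:
        month += 1
    return start, datetime(year, month, 1)


print(timerange_bounds("1990/2000"))
print(timerange_bounds("199004/200010"))
```

The first call gives 1990-01-01 00:00:00 up to (but excluding) 2001-01-01 00:00:00, and the second gives 1990-04-01 00:00:00 up to 2000-11-01 00:00:00, matching the two bullets above.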
