Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make indices and indicators easier to maintain and parse #364

Closed
aulemahal opened this issue Feb 12, 2020 · 18 comments · Fixed by #435
Closed

Make indices and indicators easier to maintain and parse #364

aulemahal opened this issue Feb 12, 2020 · 18 comments · Fixed by #435
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@aulemahal
Copy link
Collaborator

Description

I think the structure of xclim needs a major refactoring to better organise the indices and indicators and make it easier for automated tools to parse the available methods.

Right now we define indices that take care of the computation and of the units. Then we create an indicator that adds checks and metadata. A few of those indicators called wrapped versions of the indices, but mostly the relationships are 1-to-1.

The problem:

  • There is a lot of duplicated code for very similar algorithms.
  • Both xclim.indices and xclim.atmos are crowded and hard to maintain.

I suggest the following structure:
Indices: Base algorithms, with some options and general names and doc
Indicator: Wrapped partial of indices, setting some options and defaults (+ metadata)
Aliases : For creating indicator that have different metadata and defaults for the same computation.

Aliases would simply update an existing Indicator with new metadata and new default values for parameters.

Ex:
Indice : degree_days(temp, ref_temp, max_temp, min_temp, sign)
Indicator : heating_degree_days -> degree_days(tas, 17 degC, 17 degC, None, -1)
Aliases : freezing_index -> heating_degree_days (ref_temp = 0 degC)

Caveats:
The parsed docstring would be less relevent to the actual indicator than right now, but there could be an easy way to modify this as we modify the metadata.
This will take time.

@aulemahal aulemahal added the enhancement New feature or request label Feb 12, 2020
@aulemahal aulemahal added this to the v1.0 milestone Feb 12, 2020
@aulemahal aulemahal self-assigned this Feb 12, 2020
@huard
Copy link
Collaborator

huard commented Feb 12, 2020

One thing we probably want to keep are the custom docstrings. I agree that computationally, heating and cooling degree days are similar, but ideally the docstrings would be different to explain what each function is doing, provide examples and references.

So totally agree we could replace similar code with generic functions, but I would preserve the individual indices to keep those docstrings.

@aulemahal
Copy link
Collaborator Author

Makes sense!

Decorator idea

We could implement this idea by having a Indicator decorator?

Example:

@create_indicator(
    identifier="blabla",
   **other metadata)
def heating_degree_days(tas, thresh='17 degC'):
    """Specific doc-string"""
    return degree_days(tas, ref_temp=thresh, max_temp=thresh, min_temp=None, sign=-1)

Pros:

  • Easy to implement
  • Signature is correctly seen by interactive editors (ipython, pycharm)
  • Doc-string easily overrided

Cons:

  • Contact with the compute function is lost
  • Unit handling has to be done again

Third level idea

Instead, there could be a xclim/indices/core.py file with the base algos. Then, in xclim/indices/threshold.py, we would have the multiple *_degree_days indices. This adds a 3rd level so we get:

  1. Algorithms : only computation
  2. Indices : docstring, unit handling
  3. Indicator and Aliases : CF metadata, NaN- and incomplete period handling

Pros:

  • More structure

Cons:

  • Too much structure?

@huard
Copy link
Collaborator

huard commented Feb 13, 2020

I think we want to maintain access to the "pure computation" layer for users, so -1 for me for the decorator option.

We're already in some ways in the 3rd level state, where generic.py has a few functions (e.g. select_resample_op) that are re-used by indices.

I'm +1 on renaming generic.py to core.py and adding to it.

In terms of refactoring, I think we could also split utils.py and create a units.py for all the units handing stuff, and a base.py for the Indicator class related stuff.

@tlogan2000
Copy link
Collaborator

I think this is a good idea as well (not sure I can give an opinion on exactly how to implement) but the idea is sound especially for a v1.x release

One thing we need to consider is that accessing xclim.indices directly is currently the only option for bypassing the very strict missing data checks

I think we would need to minimally add a 'skipna'=True option to the indicator classes.
Moving forward we may want to implement options for the various flavours WMO Data completeness checks : see section 4 https://library.wmo.int/doc_num.php?explnum_id=4166

Note - These are mainly wrt to calculating valid months. I would assume a valid year requires 12 valid months ?

We could then set a default check method depending on the indicator calculation (sum, mean, extreme, count,) but users could still mix and match depending on their need.

@huard
Copy link
Collaborator

huard commented Feb 13, 2020

Agree that we need more flexibility for missing data checks. For example, a typical use case is to say I'll accept results as long as there is less than 5% of missing values.

We need to think of an API for indicators for this. My suggestion would be to modify missing_any to support a % threshold, and then add an argument to the __call__ function to specify the threshold of missing data. That can be part of another issue though.

@tlogan2000
Copy link
Collaborator

Agreed percent is a good / felxible compromise . However, is a little dangerous for annual in my opinion ... i.e. even 5% mssing could be ~19 days in a single month. We probably could apply the % check monthly then infer yearly based on having 12 acceptable months?

@huard
Copy link
Collaborator

huard commented Feb 13, 2020

This probably needs more thinking (do we interpolate missing values before averaging for example), but it's out of scope for this issue.

@tlogan2000
Copy link
Collaborator

This probably needs more thinking (do we interpolate missing values before averaging for example), but it's out of scope for this issue.

Will create a separate issue for this

@Zeitsperre
Copy link
Collaborator

Just wanted to chime in here with a suggestion for dealing with docstrings and potentially call signatures (this also flows into the capabilities we want for a command-line tool).

Within Miranda, I've had a few instances where multi-level calls to functions necessitated wrapping many functions and in order to deal with the call signatures and docstrings, I've used a few libraries, I have a few examples within the GIS tools of the library that work effectively (https://github.com/Ouranosinc/miranda/blob/master/miranda/gis/vector.py), that depend on inspect and another library called custom_inherit (https://github.com/rsokl/custom_inherit).

@huard huard modified the milestones: v1.0, v0.18 Apr 16, 2020
@huard
Copy link
Collaborator

huard commented Apr 22, 2020

Another issue to consider is that some indicator arguments cannot take negative values, and this restriction should ideally be propagated to Finch and the climatedata.ca web frontend. Thoughts on how to solve this ?

@aulemahal
Copy link
Collaborator Author

aulemahal commented Apr 22, 2020

I'm not sure what you mean here.
Is it: 1) We should add some input argument checks so that indicators fail on invalid input (e.g: negative Kelvin temps)
or 2) Right now some indicator fail on value that should be valid. If 2, any examples?

edit:
Thoughts on 1 : It would be a bit unpythonic to do that in the indices/indicator themselves. AFAI understand it, duck-typing implies that validity of argument is the caller's responsability. However, we could implement our own typing objects so that this information is made explicitly available and the WPS processes could use it. Or may be the indicators, but I would argue for keeping the ability to switch this off.

@huard
Copy link
Collaborator

huard commented Apr 22, 2020

And not only do a check, but encode the information that this argument requires positive values down to Finch, so it can enforce checking itself.

Yes, using Typing is something I've considered.

@aulemahal
Copy link
Collaborator Author

What would people think of this:
We implement our own type hints, like "PositiveInt", "NumberInRange[0, 10]", "Unit['temperature']", etc. Those would be new classes with at least a check(arg) function, returning True or False.
As we already check the units on indices, we would need to change declare_units to read the annotations instead of arguments. (And maybe rename it check_units).
Further argument checks could be done by another function in xclim.core.typing, or by the process/indicator themselves as the annotation itself already provides the checking function. This would limit us to testing only arguments annotated with our own types, which I think is ok.

@huard
Copy link
Collaborator

huard commented Apr 22, 2020

Is this an abuse of type hints ?
Otherwise I think this would work.

@aulemahal
Copy link
Collaborator Author

Python does not enforce any dynamic use of type hints, but we would need to look into common packages (ex: mypy) to see if this approach is compatible!

@Zeitsperre
Copy link
Collaborator

I don't see mypy being useful to us as it requires that all dependencies be using it (or have defined types in call signatures throughout) as well. Perhaps once the GIS-based libraries are moved from subset and the number of dependencies is reduced (significantly), mypy might be a good option.

Something similar would be worth looking into.

@huard
Copy link
Collaborator

huard commented Apr 22, 2020

Just to be clear, here is what I understand from this proposal:

@declare_units
def my_computation(tas: X[T], thresh: NonNeg[T]) -> X[T]:
...

And then check does something like:

  1. inspect the signature,
  2. get the type of each argument
  3. map the type to an Input class
  4. run the parameter through the class, which checks an input is valid

We'd probably want to remove the part that handles output unit conversion, or at least think about it.

@aulemahal
Copy link
Collaborator Author

aulemahal commented Apr 22, 2020

That's what I am proposing!
Minor clarifications:
0. Would be the argument binding, like what is already done in the indicator class.
3. -> The annotation attribute of the signature object is the input class, so the check method would do a isinstance(annotation, BaseXclimType) before checking. Also, important point : I do not want to check the type but the value of the input argument. Let's keep xclim duck-typed.
5. We could have an option to either warn or raise on invalid inputs.

@huard huard added this to the v0.18 milestone May 13, 2020
@huard huard linked a pull request Jun 25, 2020 that will close this issue
6 tasks
@huard huard modified the milestones: v0.18, v0.19 Jun 26, 2020
@huard huard closed this as completed in #435 Jul 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants