Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamic time subset selection #1987

Open
2 tasks done
baptistehamon opened this issue Nov 7, 2024 · 9 comments
Open
2 tasks done

Dynamic time subset selection #1987

baptistehamon opened this issue Nov 7, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@baptistehamon
Copy link

baptistehamon commented Nov 7, 2024

Addressing a Problem?

Currently, the xclim.indices.generic.select_time() supports several indexers types. However, all of them are statics and compute indicators using the same bounds over space while computing them between "dynamic" bounds can be powerful. For example, in agroclimatology is often useful to compute indicators between two phenological stages that change across space (and time).

I think it would be really interesting to have this option implemented.

Potential Solution

I'm thinking of providing doy_bounds as a tuple of np.ndarray or xr.DataArray but I'm not sure about the feasibility and the complexity of such implementation.

Contribution

  • I would be willing/able to open a Pull Request to contribute this feature.

Code of Conduct

  • I agree to follow this project's Code of Conduct
@baptistehamon baptistehamon added the enhancement New feature or request label Nov 7, 2024
@tlogan2000
Copy link
Collaborator

tlogan2000 commented Nov 7, 2024

Hi @baptistehamon thanks for the question: This seems similar but not identical to xclim.indices.generic.aggregate_between_dates https://xclim.readthedocs.io/en/stable/indices.html#xclim.indices.generic.aggregate_between_dates

In this case there is an obligatory aggregation operation applied to the date ranges but the dynamic indexing ideas seem close. From memory this was implemented to sum values (e.g.degrees days or precipitation) within a growing season which can have different start end dates every year as well as varying spatially

@aulemahal
Copy link
Collaborator

That would be cool!

My intuition tells me that a basic implementation of this wouldn't be to difficult, but that it would start to get tough when the full 3D mask (spatial x time) begins to be large and requires dask.

A non-dask implementation could be a good first shot.

doy_bounds as a tuple of np.ndarray or xr.DataArray

I think xclim-style would be xr.DataArray objects, so we can use xarray automated alignment (non-temporal dimensions would be the shared between the bounds and the input).

@baptistehamon
Copy link
Author

Indeed, xclim.indices.generic.aggregate_between_dates used to compute xclim.indices.effective_growing_degree_days is very similar and could be a good starting point. However, I've seen one weakness, the function doesn't support a start later than end (returning np.nan) necessary when working on phenology in the Southern Hemisphere. I think this makes the implementation much harder but maybe I'm wrong.

@tlogan2000
Copy link
Collaborator

Hmm, yes we do tend to have a northern hemisphere bias... It might be possible to have this work with a more flexible use doy_to_days_since here:

start = doy_to_days_since(start)

Right now there is no user parameter for the doy_to_days_since call so it simply uses the default of the beginning of the time axis (I think this is how it works at least). A more explicit ability to have a use provided date string "MM-DD" could be helpful

@aulemahal do you see any major flaws in my thinking? Right now with the hard coded default doy_to_days_since would it be possible to make this work in southern hemisphere context simply by ensuring the the input dataset starts say july or august 01?

@aulemahal
Copy link
Collaborator

it simply uses the default of the beginning of the time axis

Not exactly, in the case of yearly data, it uses the timestamp.

The idea of the default is that start and end are annual timeseries, and the doy gets converted to a count of days since the beginning of "the year". Emphasis here on "the year", as it does not mean the calendar year, but rater the YS-MM decided when computing the bounds.

So. If you compute start and end dates based on xclim indicators with freq YS-JUL, the aggregate_between_dates should work for southern hemisphere stuff.

@baptistehamon
Copy link
Author

I had a quick look and aggregate_between_dates works for the SH with YS-JUL. I'll see how I can use it to improve the select time method in the coming days!

@baptistehamon
Copy link
Author

I implemented this option based on aggregate_between_dates code and made some changes to support doy_bounds as a tuple of int and xr.DataArray. I've tested that with some data and it seems to work well even with dask (except with drop=True). However, I had a complexity error with pre-commit but I think I could fix this by splitting the code into another function (e.g., mask_between_doy) and this will also allow to avoid redundancy in code with aggregate_between_dates. What do you think about this?

@aulemahal
Copy link
Collaborator

Seems good to me!

@baptistehamon
Copy link
Author

I'm ready to do the PR but haven't done one before so I'm not familiar with this, especially the tests mentioned in the contributing documentation. Should I do that before the PR ? I've only done the pre-commit check so far

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants