Skip to content

first try at a function to break 1D-daily into 2D for seasons #3

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Sep 13, 2021

Conversation

remicousin
Copy link
Contributor

Also includes seasonal_sum to illustrate how could be used to make "seasonal" functions.

The original plan was to make a generic function that will split 1D daily data into 2D daily data by days and years and select a subset of days defining a season. Thus allowing to perform analysis on the days dimension to obtain yearly series of a seasonal quantity and thus be able to reproduce a lot of popular Ingrid functions (e.g. and including onset date).

It turns out that xarray offers a grouping function and more specifically by bins that then allow operations by group. This seems nicer to me as there is then no need to create a new dimension, worry about indexing and bins can have different sizes (making leap year management easier). The idea being to give the edges of the season of interest, group, then drop every other group and then that can be fed to any other operator.

So I went with that. Because bins are defined by their edges, I made the inputs for the season be start and end rather than start and length. Now that fits better the groupby_bins function and it's typically easier for a user to think of seasons in these terms. That may mean changing onset date function inputs accordingly, or maybe cover the 2 cases (ie also offer inputs to be start and length).

Now it also turns out that it's not easy to drop groups from a grouped object. Or it was but someone complained about automatically dropped empty bins so this PR brings them back automatically and recommends using dropna to actually drop. But one can do dropna only after applying an operator to the grouped data, so that is something to be aware of when creating something like seasonal_sum.

That makes me doubt whether this is a good route or not. The grouping thing is seducing but the dropping business not so much. Other options could be:

  • use the pandas versions of those things (groupby, cut) and use filter for dropping;
  • go a totally different route, TBD

There is a run_test_season_stuff() to help with testing and one can run calc.py with
CONFIG=config-sample.yaml python calc.py

Includes seasonal_sum to illustrate how could be used to make "seasonal" functions
@remicousin
Copy link
Contributor Author

And of course there could be other problems that the ones I laid down in introduction

@ikhomyakov
Copy link
Contributor

That makes me doubt whether this is a good route or not. The grouping thing is seducing but the dropping business not so much. Other options could be:

@remicousin, I also doubt that the grouping is a good approach. Let me look into this more and I will try to update by Monday.

@remicousin
Copy link
Contributor Author

That makes me doubt whether this is a good route or not. The grouping thing is seducing but the dropping business not so much. Other options could be:

@remicousin, I also doubt that the grouping is a good approach. Let me look into this more and I will try to update by Monday.

ok!

That label-grouping concept seems to be all over the place though... Maybe I am not using it right... or maybe we need a more sophisticated version... or maybe it's not the way to go at all...

@aaron-kaplan
Copy link
Collaborator

My first thought for extracting a single season would be to use where with drop=True instead of groupby_bins. But if you're planning to generalize beyond the current use case to support multiple seasons, groupby_bins might be the way to go. Did you consider and reject the option of using where?

@remicousin
Copy link
Contributor Author

My first thought for extracting a single season would be to use where with drop=True instead of groupby_bins. But if you're planning to generalize beyond the current use case to support multiple seasons, groupby_bins might be the way to go. Did you consider and reject the option of using where?

I didn't think of using where. I am definitely thinking of using this as a generalization to do seasonal calculations from daily data. It turns out that this is how Ingrid does it too (~groups according to edges and drops every other group so I figured that was probably a good way to do things). Regardless, how would you do it with where?

  1. find the edges like done here
  2. use where to keep days within the right edges
  3. then how do you make it "2D." Using some selection and then concatenating on a new dimension? One advantage of groupby_bins is that the bins need not be the same size, which is bound to happen with leap years, since the grouping doesn't create another dimension, but rather labels the grouped elements of interest.

@aaron-kaplan
Copy link
Collaborator

Rémi and I had a convo and decided to try using where to drop the dates that are out of season, followed by groupby_bins to do the aggregation over each season. There will then be one group per year rather than two, eliminating the complicated and not entirely correct concatenation and dropna steps.

First drops data not within seasons of interest
Then uses groupby
@remicousin
Copy link
Contributor Author

remicousin commented Sep 1, 2021

Latest commits improves on this. However, can't use groupby_bins, even after dropping days outside seasons, because the empty interval for the wrong seasons is still considered. Thus used groupby according to a group description. If that is the right route, then need to figure out how to give "group" better metadata (which is a bit sad because groupby_bins was creating nice labels -- not that I am sure they could be reused in a meaningful manner, though...).

Or could use groupby_bins withe edges as the first day of each season, but then the "nice" groupby_bins labels would have lost the end point of the seasons. I figured it might be easier for us to define "group" in a better fashion than the crude one I have now.

Maybe diff or differentiate applied to the edges of the seasons and keeping every other one could help build a time dimension to replace group. I haven't tried. And I don't know if it works on time. But that would be easy to figure out. But let's validate the last commit first.

…aningless integers

Turns out I can't put a GroupBy object in a Dataset (or can I?)
so instead, the function returns what groupby needs to be run
then the functions that rely on it like seasonal_sum actually run the groupby
@remicousin
Copy link
Contributor Author

In this last commit I also attempted to clean up: code formatting; naming conventions... but there might be more to do.

Copy link
Collaborator

@aaron-kaplan aaron-kaplan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're doing some weird shit with array broadcasting that I find difficult to follow. I made it most of the way through but gave up near the end. If you think the result is right, let's move ahead for now. I would like to get the onset date function working, write a comprehensive set of automated tests for it (I will help with that), and then we can refine the implementation with confidence because the tests will tell us if we break anything.

Previously, the case where the input data ended in 28 or 29 Feb
was dropping the last complete year (because was lookign for March 1st)
The change also makes end edges always right and not some times March 1st
aaron-kaplan
aaron-kaplan previously approved these changes Sep 9, 2021
Copy link
Collaborator

@aaron-kaplan aaron-kaplan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

At some point I would like to see if we can express the season starts and ends using the CF conventions for cell boundaries. Xarray doesn't understand cell bounds, so this isn't relevant to our immediate goals, but other existing tools (e.g. ingrid) do understand cell bounds, and xarray might in the future, so I think we should generate CF-compliant metadata if we can. If you want to look into that now, feel free; if you want to put it off for another time I'm ok with that too.

@remicousin
Copy link
Contributor Author

See that I made a last commit as I realized my dealing of picking 29 Feb as end of season was not quite right. I explain in the commit comment. Can explain further if need be. I think now it is square.

For the CF conventions business, shall we address the other PR first to wrap up onset_date function? Once this is merged, I'll rebase the other branch on master and I'll try use what we created here to apply the onset_index function with it to make an onset_date function (for multiple years basically). Then if I have time, I am happy to check back out at CF stuff.

and no need for end_day2/_month2
I hadnot understood where that extra dim was coming from...
@remicousin remicousin merged commit 36f42c0 into master Sep 13, 2021
@remicousin remicousin deleted the remic/onset3 branch September 13, 2021 15:36
@Agro-Climate Agro-Climate mentioned this pull request Mar 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants