Change climatologies to use xtime_start and xtime_end #165
Conversation
Here are some preliminary comments to get you started.
My biggest concern is that I think we should be able to compute climatologies on datasets and dataarrays. I'm concerned this functionality is being designed away.
```python
dsObs = monthlyClimatology.transpose('month', 'lat', 'lon')
dsObs.rename({'time': 'Time'}, inplace=True)
dsObs = dsObs.transpose('Time', 'lat', 'lon')
```
@xylar, why is a transpose used here?
@pwolfram, the transpose is because the observational data is not in the same order as MPAS output.
@pwolfram, what's your concern about the transpose?
Transposes are notorious for being a bad idea, e.g., they typically indicate things weren't set up correctly, and they are expensive to execute. It doesn't sound like it is a huge problem here, but I just wanted to double-check.
These are pretty small data sets and we've been transposing them the whole time. I just revised the code a bit.
```diff
@@ -303,7 +303,8 @@ def _compute_moc_climo_postprocess(config, runStreams, variableMap, calendar,
     dictClimo['endYearClimo'] = endYear

     # Compute annual climatology
-    annualClimatology = ds.mean('Time')
+    annualClimatology = \
+        climatology.compute_climatology(ds, np.arange(1, 13), calendar)
```
It would be better to store `np.arange(1, 13)` as a constant and not hard-code it here -- maybe something like `months_in_a_year`.
Okay, I can take it from `constants`. I just figured it was obvious...
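A minimal sketch of the suggestion above: keep the month values in a shared constants module rather than hard-coding `np.arange(1, 13)` at each call site. The module and attribute names here are illustrative, not necessarily the project's actual ones, and a plain list stands in for `np.arange` to keep the sketch dependency-free.

```python
# Illustrative sketch only -- names are hypothetical, not MPAS-Analysis API.

class constants:
    """Hypothetical shared constants module."""
    monthsInYear = 12
    # all month values, equivalent to np.arange(1, 13)
    allMonths = list(range(1, 13))

# At a call site, instead of hard-coding the month range:
# annualClimatology = climatology.compute_climatology(
#     ds, constants.allMonths, calendar)
print(constants.allMonths)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```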
mpas_analysis/ocean/nino34_index.py
Outdated
```diff
@@ -122,15 +122,15 @@ def compute_nino34_index(regionSST, calendar):  # {{{

     Parameters
     ----------
-    regionSST : xarray.dataArray object
+    ds : xarray.DataArray object
```
`Dataset` object, right?
Oops, correct.
mpas_analysis/ocean/nino34_index.py
Outdated
```python
if not isinstance(ds, xr.core.dataset.Dataset):
    raise ValueError('ds should be an xarray Dataset')

ds = climatology.add_months_and_days_in_month(ds, calendar)
```
I think this should be hidden from the user and not in the analysis-level code.
So the problem here is that we explicitly do a `groupby('month')` below. It is not obvious where `month` got added to the data set if this is not made explicit with an exposed call. In fact, I changed the behavior so `month` doesn't get added to `ds` behind the scenes in `compute*_climatology` because I realized this behavior would be unexpected. As a result, any code that needs `month` to be added to `ds` for use outside of `climatology` really needs to add it explicitly.
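The design point above -- the caller explicitly adds the grouping key before grouping, rather than a helper adding it behind the scenes -- can be sketched abstractly. These plain-Python stand-ins are purely illustrative; `climatology.add_months_and_days_in_month` and xarray's `groupby` are the real counterparts.

```python
# Hypothetical stand-ins for the explicit "add month, then group" pattern.
from collections import defaultdict

def add_month(records):
    """Explicitly add a 'month' key, so callers can see where it comes from."""
    for rec in records:
        rec['month'] = rec['date'][1]  # date is a (year, month, day) tuple
    return records

def group_by_month(records):
    """Group values by the previously added 'month' key."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec['month']].append(rec['value'])
    return dict(groups)

data = [{'date': (2000, 1, 15), 'value': 1.0},
        {'date': (2001, 1, 15), 'value': 3.0},
        {'date': (2000, 2, 15), 'value': 5.0}]
print(group_by_month(add_month(data)))  # {1: [1.0, 3.0], 2: [5.0]}
```

If `add_month` were hidden inside `group_by_month`, a reader of the calling code would have no way to know the records had been modified, which is the surprise the explicit call avoids.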
mpas_analysis/ocean/nino34_index.py
Outdated
```python
anomalySST = regionSST.groupby('month') - monthlyClimatology
monthlyClimatology = climatology.compute_monthly_climatology(ds)

print monthlyClimatology
```
Are you sure you want to print this out? If so, shouldn't it go to a file or have some type of comment around it?
Oops, that was for debugging. I've removed it.
mpas_analysis/ocean/nino34_index.py
Outdated
```python
return _running_mean(anomalySST.to_pandas())  # }}}
return _running_mean(anomaly.avgSurfaceTemperature.to_pandas())  # }}}
```
If this should only be done for `avgSurfaceTemperature`, I would recommend making this selection the first step; e.g., if there weren't lazy evaluation, you would compute a nino34 index on all the variables.
`avgSurfaceTemperature` is explicitly selected as the only variable in `ds` when `ds` is constructed, so I don't think that's a major concern. I agree that it would be clearer if we simply operated on a data array (`regionSST`) as before. I will expound on why I didn't have any success with that approach below and maybe we can find a solution.
```python
if 'startTime' in ds.data_vars and 'endTime' in ds.data_vars:
    ds['daysInMonth'] = ds.endTime - ds.startTime
else:
    # TODO: support calendar
```
What does this comment mean exactly?
I have elaborated. I do not feel like writing special code for handling this case when the calendar is `gregorian`. We would need to figure out whether each year is a leap year and whether the month is February, and change the number of days to 29 if so. We don't know of any runs that will need this capability. The `else` case here is to support backward compatibility with ACME alpha runs where we used `daysSinceStartOfSim` to compute the time coordinate (none of these runs use the `gregorian` calendar). So I have added a warning, but it doesn't seem worth the effort of handling this case. If you want to, feel free...
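For reference, the special case described above can be sketched with the standard library; the project itself would use its own calendar utilities, and `days_in_month` here is a hypothetical helper, not MPAS-Analysis code.

```python
# Illustrative sketch: Gregorian February needs leap-year handling,
# while a 365-day ('noleap') calendar always uses 28 days.
import calendar

def days_in_month(year, month, cal='gregorian'):
    """Days in a month for the given calendar (hypothetical helper)."""
    if cal == 'gregorian':
        # monthrange handles leap years, including the 100/400-year rules
        return calendar.monthrange(year, month)[1]
    # 365-day ('noleap') calendar
    return [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31][month - 1]

print(days_in_month(2000, 2))            # 29: 2000 is a leap year
print(days_in_month(1900, 2))            # 28: 1900 is not (100-year rule)
print(days_in_month(2000, 2, 'noleap'))  # 28
```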
```python
def _compute_masked_mean(ds):
    '''
    Compute the time average of data set, masked out where the variables in ds
```
The means in xarray are always computed with `nanmean` unless otherwise specified. I'm a little confused as to the purpose of this function based on this descriptor for the function.
So there are two things xarray's mean calculations don't seem to do that this is meant to handle. First, there's not an obvious way to weight a mean by another variable (`daysInMonth` in this case). Second, we want the normalization to be based on the number of valid (non-null) entries, so we need to also sum up how many valid entries there were and multiply them by the weights to determine how to do our weighted average. For observational data sets, the mask changes with time, so it is very important to only average over those dates that are valid. If one were to compute an annual mean using valid values from only Jan. and Feb. but were to divide by 365 days to get the weighted average, the resulting value would be off by a factor of about 6. This function attempts to properly take that into account.

If you would like to rewrite the function to work in a more xarray-like way, feel free. I have written many iterations of it and still haven't found a perfect solution.
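The weighted, mask-aware normalization described above can be sketched in plain Python. This is an illustration of the idea only, not the project's `_compute_masked_mean`: weights (days in month) contribute to the normalization only where the data value is valid.

```python
# Illustrative sketch: None stands in for a masked (NaN) observation.

def masked_weighted_mean(values, weights):
    """Average `values` weighted by `weights`, skipping None entries.

    The normalization sums only the weights of valid entries, so a point
    observed in only two months is averaged over those two months' days,
    not over the whole year.
    """
    total = 0.0
    norm = 0.0
    for v, w in zip(values, weights):
        if v is None:
            continue
        total += v * w
        norm += w
    if norm == 0.0:
        return None  # no valid entries at all
    return total / norm

days = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
# only Jan. and Feb. are valid: the mean is over 59 days, not 365
sst = [10.0, 12.0] + [None] * 10
print(masked_weighted_mean(sst, days))  # (10*31 + 12*28) / 59, about 10.95
```

Dividing by 365 days instead would give roughly 646/365, about 1.8 -- the factor-of-six error described above.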
This goes back to the problems with accounting for time properly. There probably aren't too many "perfect" solutions until that bigger, more fundamental issue gets fixed (pydata/xarray#789). Thanks for slogging through these complexities in the meantime.
@pwolfram, I don't think a proper accounting of time would necessarily help with this problem because xarray doesn't have a sense of the data corresponding to a time span as opposed to a single point in time. Their example of computing seasonal climatologies (without masking) is not all that different from my approach:
http://xarray.pydata.org/en/stable/examples/monthly-means.html
It seems like there should be an easier way to handle this on the xarray side-- I'm going to keep this in mind. It looks like this type of functionality isn't quite yet available (pydata/xarray#422) and the discussion has gone somewhat stale.
@pwolfram, I was unaware of pydata/xarray#422 but I think my approach is rather similar. Let's keep an eye on it and update if/when `average` or `weighted` is added to xarray.
@xylar, from what I can tell this doesn't change the core functionality of the nino or ocean model-vs-obs routines, so I'm fine here if your testing has shown consistent behavior with develop.
```python
dsObs = monthlyClimatology.transpose('month', 'lat', 'lon')
dsObs.rename({'time': 'Time'}, inplace=True)
dsObs = dsObs.transpose('Time', 'lat', 'lon')
```

> @pwolfram, the transpose is because the observational data is not in the same order as MPAS output.
So I agree with this concern. The simple problem is that we need a way of storing
@pwolfram, looking further,
Force-pushed fefb7aa to d61a180.
@pwolfram, I have done a pretty significant rewrite based on your comments; the relevant functions have been rewritten. Thank you for the excellent review so far. Let me know if you have further concerns.
Force-pushed d61a180 to 8af611d.
@pwolfram, I just rebased. Should be ready for a re-review. I am aware that it would be good to add some CI to test the climatology computations. I don't have time to do that now, and I would like this fix to be part of v0.2, but I will put it on my to-do list for a follow-up PR.
@xylar, did you want me to do another review in preparation for merging, with the understanding that you will add CI testing for climatology later?
@pwolfram, given that it doesn't look like this will get merged today, hold off a little bit and I'll add CI to it tomorrow.
@pwolfram, CI has been added. I think you'll be pleased, but let me know if you'd like any more changes, either to the CI or to the rest of the PR.
Force-pushed fff2f9f to fa1cd35.
If the time coordinate is defined based on the average of a start time and an end time (e.g. xtime_start and xtime_end), also store the start and end time for later use in accurately performing time averages (e.g. computing climatologies).
There is no longer an advantage to computing monthly climatologies first and then seasonal climatologies (as was done in several analysis tasks), so seasonal climatologies are now computed directly. The MOC time average is now computed using `compute_climatology` (rather than just `.mean('Time')`) so that the number of days in a month is properly accounted for.
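As a toy illustration of why day-weighting matters (the numbers below are made up, not from the project): an unweighted mean of monthly climatologies differs from a seasonal mean weighted by the number of days in each month.

```python
# Illustrative JFM seasonal mean: unweighted vs. day-weighted.
monthly = {'Jan': 10.0, 'Feb': 14.0, 'Mar': 11.0}
days = {'Jan': 31, 'Feb': 28, 'Mar': 31}

# naive: treat each month equally
unweighted = sum(monthly.values()) / len(monthly)

# weighted: each month contributes in proportion to its length
weighted = (sum(monthly[m] * days[m] for m in monthly)
            / sum(days.values()))

print(unweighted)  # 35/3, about 11.67
print(weighted)    # 1043/90, about 11.59
```

The short month (February) is over-represented in the unweighted version, which is the slight inaccuracy the direct day-weighted computation avoids.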
Force-pushed fa1cd35 to dbbf1d2.
```python
    name='mask')
seasonalClimatology = (monthlyClimatology * daysInMonth).sum(
    dim='month', keep_attrs=True)
climatologyMonths = ds.where(mask, drop=True)
```
Nice! This should help below when `_compute_masked_mean` is used for the computation. 👍
```python
# test compute_climatology on a data array
mldClimatology = climatology.compute_climatology(ds.mld, monthValues,
                                                 calendar)
```
Do you want to make sure `mldClimatology` and `dsClimatology` produce similar results? This just tests that Dataset vs DataArray computations work, which this test function implies anyway.

Is there some way that we can test the actually computed values, at least in an approximate way? This may be overboard, but a climatology is a key computation, so this is something to consider.
> Do you want to make sure mldClimatology and dsClimatology produce similar results? This just tests that Dataset vs DataArray computations work, which this test function implies anyway.

I have updated the test to do this.

> Is there some way that we can test the actually computed values, at least in an approximate way? This may be overboard but a climatology is a key computation so this is something to consider.

I went ahead and addressed this, too. Should be good to go.
```python
            'JFM.nc'.format(self.test_dir)
self.assertEqual(regriddedFileName, expectedRegriddedFileName)

def open_test_ds(self, config, calendar):
```
This is good because it will verify that the time selection will produce the appropriate dataset in terms of length-- great!
```python
Unit test infrastructure for the generalized_reader.

Xylar Asay-Davis
02/16/2017
```
A nit -- does the date need to be updated?
@xylar, I'm happy to merge unless you or @milenaveneziani have objections. I just want to give @milenaveneziani a chance to review and comment if desired.
Force-pushed dbbf1d2 to b8a5b6b.
Force-pushed b8a5b6b to d65c74e.
@pwolfram, I updated the CI testing according to your requests. This included adding
```python
refClimatology = xarray.open_dataset(climFileName)

self.assertArrayApproxEqual(monthlyClimatology.mld.values,
                            refClimatology.mld.values)
```
Thanks @xylar, this will be really helpful to make sure we (or others!) don't accidentally break the key computation in MPAS-Analysis!
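The reference-data check above relies on an approximate array comparison (the project's `assertArrayApproxEqual`). A minimal stand-in for that kind of check, with illustrative names only, looks like this:

```python
# Hypothetical sketch of an elementwise approximate-equality assertion,
# as used when comparing a computed climatology to stored reference data.

def assert_array_approx_equal(actual, expected, tol=1e-6):
    """Fail if the two sequences differ in length or by more than tol."""
    assert len(actual) == len(expected), 'length mismatch'
    for a, b in zip(actual, expected):
        assert abs(a - b) <= tol, 'values differ: {} vs {}'.format(a, b)

# small floating-point noise within the tolerance passes
assert_array_approx_equal([1.0, 2.0, 3.0], [1.0 + 1e-9, 2.0, 3.0])
print('reference comparison passed')
```

Comparing against stored reference values, rather than only checking shapes, is what guards the key computation against silent regressions.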
@xylar, @pwolfram: I am back and slowly catching up on things. I read the very first comment of this PR and only have a quick question: will we support xtime_start and xtime_end with this, and no daysSinceStart? Sorry if I didn't go through all your comments; perhaps my question is answered in one of them.

@milenaveneziani, this PR stores

@milenaveneziani and @pwolfram, do we want to include this PR in v0.2? I'm fine with not including it and having it be the first merge after that tag.

@xylar: great. Thanks for clarifying.

Last call before a merge then...

I'll merge in about an hour or so, @xylar and @milenaveneziani.
Climatologies were being weighted by the number of days in a month without regard to the calendar. With this merge, they are weighted by the actual number of days associated with each time entry, as determined from `xtime_start` and `xtime_end` (or equivalent for monthly averages).

Note: if only `daysSinceStartOfSim` is available as a time variable, climatologies will still be computed, but the number of days in February might be inaccurate on the Gregorian calendar. In practice, we don't anticipate having simulations without `xtime_start` and `xtime_end` but using the Gregorian calendar.

The MOC climatology was previously being computed with a straight-up time average (`.mean('Time')`), which was a change I had made to try to speed up the code. In this merge, it is computed as a weighted average based on the number of days associated with each time index.

The seasonal climatologies in several analysis tasks were being computed inefficiently, and slightly inaccurately for the `gregorian` calendar, by first computing a monthly climatology and then averaging again to get a seasonal or annual climatology.

To support these changes, `mpas_xarray` has been modified to store the `startTime` and `endTime` (if available) of each time index as coordinates. This is only possible if the time coordinate is defined based on the average of a start time and an end time (e.g. `xtime_start` and `xtime_end`).