Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change climatologies to use xtime_start and xtime_end #165

Merged
merged 4 commits into from
Apr 3, 2017

Conversation

xylar
Copy link
Collaborator

@xylar xylar commented Mar 29, 2017

Climatologies were being weighted by the number of days in a month without regard to which calendar. With this merge, they are being weighted by the actual number of days associated with each time entry, as determined from xtime_start and xtime_end (or equivalent for monthly averages).

Note: if only daysSinceStartOfSim is available as a time variable, climatologies will still be computed but the number of days in February might be inaccurate on the Gregorian calendar. In practice, we don't anticipate having simulations without xtime_start and xtime_end but using the Gregorian calendar.

The MOC climatology was previously being computed with a straight-up time average (.mean('Time')), which was a change I had made to try to speed up the code. In this merge, it is now being computed as a weighted average based on the number of days associated with each time index.

The seasonal climatologies in several analysis tasks were being computed inefficiently and slightly inaccurately for the gregorian calendar by first computing a monthly climatology and then averaging again to get a seasonal or annual climatology.

To support these changes, mpas_xarray has been modified to store the startTime and endTime (if available) of each time index as coordinates. This is only possible if the time coordinate is defined based on the average of a start time and and end time (e.g xtime_start and xtime_end).

Copy link
Contributor

@pwolfram pwolfram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some preliminary comments to get you started.

My biggest concern is that I think we should be able to compute climatologies on datasets and dataarrays. I'm concerned this functionality is being designed away.


dsObs = monthlyClimatology.transpose('month', 'lat', 'lon')
dsObs.rename({'time': 'Time'}, inplace=True)
dsObs = dsObs.transpose('Time', 'lat', 'lon')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xylar, why is a transpose used here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pwolfram, the transpose is because the observational data is not in the same order as MPAS output.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pwolfram, what's your concern about the transpose?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Transpose are notorious for being a bad idea, e.g., they typically indicate things weren't set up correctly and they are expensive to execute. It doesn't sound like it is a huge problem here but I just wanted to double-check.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are pretty small data sets and we've been transposing them the whole time. I just revised the code a bit.

@@ -303,7 +303,8 @@ def _compute_moc_climo_postprocess(config, runStreams, variableMap, calendar,
dictClimo['endYearClimo'] = endYear

# Compute annual climatology
annualClimatology = ds.mean('Time')
annualClimatology = \
climatology.compute_climatology(ds, np.arange(1,13), calendar)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to store np.arange(1,13) as a constant and not hard code here-- maybe something like months_in_a_year.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I can take it from constants. I just figured it was obvious...

@@ -122,15 +122,15 @@ def compute_nino34_index(regionSST, calendar): # {{{

Parameters
----------
regionSST : xarray.dataArray object
ds : xarray.DataArray object
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dataset object, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, correct.

if not isinstance(ds, xr.core.dataset.Dataset):
raise ValueError('ds should be an xarray Dataset')

ds = climatology.add_months_and_days_in_month(ds, calendar)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be hidden from the user and not in the analysis-level code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the problem here is that we explicitly do a groupby('month') below. It is not obvious where month got added to the data set if this is not made explicit with an exposed call. In fact, I changed the behavior so month doesn't get added to ds behind the scenes in compute*_climatology because I realized this behavior would be unexpected. As a result, any code that needs month to be added to ds for use outside of climatology really needs to do it manually themselves.

anomalySST = regionSST.groupby('month') - monthlyClimatology
monthlyClimatology = climatology.compute_monthly_climatology(ds)

print monthlyClimatology
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure you want to print this out? If so, shouldn't it go to a file or have some type of comment around it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, that was for debugging. I've removed it.


return _running_mean(anomalySST.to_pandas()) # }}}
return _running_mean(anomaly.avgSurfaceTemperature.to_pandas()) # }}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this should only be done for avgSurfaceTemperature I would recommend making this segmentation the first step, e.g., if there wasn't lazy evaluation you would compute a nino34 index on all the variables.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avgSurfaceTemperature is explicitly selected as the only variable in ds when ds is constructed, so I don't think that's a major concern. I agree that it would be clearer if we simply operated on a data array ('regionSST`) as before. I will expound on why I didn't have any success with that approach below and maybe we can find a solution.

if 'startTime' in ds.data_vars and 'endTime' in ds.data_vars:
ds['daysInMonth'] = ds.endTime - ds.startTime
else:
# TODO: support calendar
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this comment mean exactly?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have elaborated. I do not feel like writing special code for handling this case when the calendar is gregorian. We would need to figure out whether each year is a leap year and whether the month is February, and change the number of days to 29 if so. We don't know of any runs that will need this capability. The else case here is to support backward compatibility to ACME alpha runs where we used daysSinceStartOfSim to compute the time coordinate (none of these runs use the gregorian calendar). So I have added a warning but it doesn't seem like it's worth the effort of handling this case. If you want to, feel free...


def _compute_masked_mean(ds):
'''
Compute the time average of data set, masked out where the variables in ds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The means in xarray are always computed with nanmean unless otherwise specified. I'm a little confused as to the purpose of this function based on this descriptor for the function.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there are 2 things xarray's mean calculations don't seem to do that this is meant to handle. First, there's not an obvious way to weight a mean by another variable (daysInMonth in this case). Second, we want the normalization to be based on the number of valid (non-null) entries, so we need to also sum up how many valid entries there were and multiply them by the weights to determine how to do our weighted average. For observational data sets, the mask changes with time, so it is very important to only average over those dates that are valid. If one were to compute an annual mean using valid values from only Jan. and Feb. but were to divide by 365 days to get the weighted average, the resulting value would be off by a factor of about 6. This function attempts to properly take that into account.

If you would like to rewrite the function to work in a more xarray-like way, feel free. I have written many iterations of it and still haven't found a perfect solution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This goes back to the problems with accounting for time properly. There probably aren't too many "perfect" solutions until that bigger, more fundamental issue gets fixed (pydata/xarray#789). Thanks for slogging through these complexities in the meantime.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pwolfram, I don't think a proper accounting of time would necessarily help with this problem because xarray doesn't have a sense of the data corresponding to a time span as opposed to a single point in time. Their example of computing seasonal climatologies (without masking) is not all that different from my approach:

http://xarray.pydata.org/en/stable/examples/monthly-means.html

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like there should be an easier way to handle this on the xarray side-- I'm going to keep this in mind. It looks like this type of functionality isn't quite yet available (pydata/xarray#422) and the discussion has gone somewhat stale.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pwolfram, I was unaware of pydata/xarray#422 but I think my approach is rather similar. Let's keep an eye on it and update if/when average or weighted is added to xarray.

Copy link
Collaborator

@vanroekel vanroekel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xylar from what I can tell this doesn't change the core functionality of the nino or ocean modelvobs routines, so I'm fine here if your testing has shown consistent behavior with develop.


dsObs = monthlyClimatology.transpose('month', 'lat', 'lon')
dsObs.rename({'time': 'Time'}, inplace=True)
dsObs = dsObs.transpose('Time', 'lat', 'lon')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pwolfram, the transpose is because the observational data is not in the same order as MPAS output.

@xylar
Copy link
Collaborator Author

xylar commented Mar 30, 2017

My biggest concern is that I think we should be able to compute climatologies on datasets and dataarrays. I'm concerned this functionality is being designed away.

So I agree with this concern. The simple problem is that we need a way of storing month and daysInMonth arrays next to the data array in question. This is most logically done by adding data arrays to a data set. One could add them as coordinates to the data array instead of as separate data arrays. I tried this before and got strange behavior I didn't fully understand. But I am willing to give it another try.

@xylar
Copy link
Collaborator Author

xylar commented Mar 30, 2017

@pwolfram, looking further, startTime and endTime would also need to be coordinates. I will see if that even works...

@xylar xylar force-pushed the fix_climatology_dates branch from fefb7aa to d61a180 Compare March 30, 2017 14:53
@xylar
Copy link
Collaborator Author

xylar commented Mar 30, 2017

@pwolfram, I have done a pretty significant rewrite based on your comments. The relevant functions in climatology now work on both Dataset and DataArray objects (made possible by having month and daysInMonth as coordinates, rather than data arrays).

I have rewritten _compute_masked_mean in a way that I hope better utilizes xarray's capabilities and also better demonstrates what it hopes to accomplish that existing xarray functionality can't do.

Thank you for the excellent review so far. Let me know if you have further concerns.

@xylar xylar force-pushed the fix_climatology_dates branch from d61a180 to 8af611d Compare March 30, 2017 16:01
@xylar
Copy link
Collaborator Author

xylar commented Mar 30, 2017

@pwolfram, I just rebased. Should be ready for a re-review. I am aware that it would be good to add some CI to test the climatology computations. I don't have time to do that now and I would like this fix to but part of v0.2 but will put it on by to-do list for a follow-up PR.

@pwolfram
Copy link
Contributor

@xylar, did you want me to do another review in preparation for merging with the understanding that you will add CI testing for climatology later?

@xylar
Copy link
Collaborator Author

xylar commented Mar 30, 2017

@xylar, did you want me to do another review in preparation for merging with the understanding that you will add CI testing for climatology later?

@pwolfram, yep, sorry if I wasn't clear. That's it exactly

@xylar
Copy link
Collaborator Author

xylar commented Mar 30, 2017

@pwolfram, given that it doesn't look like this will get merged today, hold off a little bit and I'll add CI to it tomorrow.

@xylar
Copy link
Collaborator Author

xylar commented Mar 31, 2017

@pwolfram, CI has been added. I think you'll be pleased, but let me know if you'd like anymore changes, both to the CI and to the rest of the PR.

@xylar xylar force-pushed the fix_climatology_dates branch from fff2f9f to fa1cd35 Compare March 31, 2017 08:48
xylar added 2 commits April 1, 2017 20:54
If the time coordinate is defined based on the average of a
start time and and end time (e.g xtime_start and xtime_end),
also store the start and end time for later use in accurately
performing time averages (e.g. computing climatologies).
There is no longer an advantage to computing monthly climatologies
first, then seasonal climatologies (as was done in several analysis
tasks), so seasonal climatologies are now computed directly.

The MOC time average is now computed using compute_climatology
(rather than just .mean('Time')) so that the number of days in a
month is properly accounted for.
@xylar xylar force-pushed the fix_climatology_dates branch from fa1cd35 to dbbf1d2 Compare April 1, 2017 18:54
name='mask')
seasonalClimatology = (monthlyClimatology * daysInMonth).sum(
dim='month', keep_attrs=True)
climatologyMonths = ds.where(mask, drop=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This should help below when _compute_masked_mean is used for the computation. 👍


# test compute_climatology on a data array
mldClimatology = climatology.compute_climatology(ds.mld, monthValues,
calendar)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to make sure mldClimatology and dsClimatology produce similar results? This just tests that Dataset vs DataArray computations work, which this test function implies anyway.

Is there some way that we can test the actually computed values, at least in an approximate way? This may be overboard but a climatology is a key computation so this is something to consider.

Copy link
Collaborator Author

@xylar xylar Apr 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to make sure mldClimatology and dsClimatology produce similar results? This just tests that Dataset vs DataArray computations work, which this test function implies anyway.

I have updated the test to do this

Is there some way that we can test the actually computed values, at least in an approximate way? This may be overboard but a climatology is a key computation so this is something to consider.

I went ahead and addressed this, too. Should be good to go.

'JFM.nc'.format(self.test_dir)
self.assertEqual(regriddedFileName, expectedRegriddedFileName)

def open_test_ds(self, config, calendar):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good because it will verify that the time selection will produce the appropriate dataset in terms of length-- great!

Unit test infrastructure for the generalized_reader.

Xylar Asay-Davis
02/16/2017
Copy link
Contributor

@pwolfram pwolfram Apr 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nit-- Does the date need updated?

@pwolfram
Copy link
Contributor

pwolfram commented Apr 2, 2017

@xylar, I'm happy to merge unless you or @milenaveneziani have objections. I just want to give @milenaveneziani a chance to review and comment if desired.

@xylar xylar force-pushed the fix_climatology_dates branch from dbbf1d2 to b8a5b6b Compare April 3, 2017 06:48
@xylar xylar force-pushed the fix_climatology_dates branch from b8a5b6b to d65c74e Compare April 3, 2017 07:19
@xylar
Copy link
Collaborator Author

xylar commented Apr 3, 2017

@pwolfram, I updated the CI testing according to your requests. This included adding assertArrayApproxEqual for testing if 2 numpy arrays are close.

refClimatology = xarray.open_dataset(climFileName)

self.assertArrayApproxEqual(monthlyClimatology.mld.values,
refClimatology.mld.values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xylar, this will be really helpful to make sure we (or others!) don't accidentally break the key computation in MPAS-Analysis!

@milenaveneziani
Copy link
Collaborator

@xylar, @pwolfram: I am back and slowly catching up on things. I read the very first comment of this PR and only have a quick question: will we support xtime_start and xtime_end with this, no daysSinceStart? Sorry if I didn't go through all your comments, perhaps my question is anwered in one of them..

@xylar
Copy link
Collaborator Author

xylar commented Apr 3, 2017

@milenaveneziani, this PR stores xtime_start and xtime_end if they're available for later use. If they're available, it uses them to compute the number of days in a month (meaning it doesn't have to be smart about the calendar when computing climatologies, it just takes MPAS's word for how long the month was). If a file has daysSinceStartOfSim but not xtime_start and xtime_end (e.g. alpha7 and maybe alpha8?) it will use that and will resort to looking up how many days are in a given month (but will ignore leap years). Since we didn't run any alpha8 runs with the Gregorian calendar, I don't think this will be a problem. Even if we did, the error would be quite small.

@xylar
Copy link
Collaborator Author

xylar commented Apr 3, 2017

@milenaveneziani and @pwolfram, do we want to include this PR in v0.2? I'm fine with not including it and having it be the first merge after that tag.

@milenaveneziani
Copy link
Collaborator

milenaveneziani commented Apr 3, 2017

@xylar: great. Thanks for clarifying.
I am OK with including this in v0.2 (since it seems it is ready to merge).

@pwolfram
Copy link
Contributor

pwolfram commented Apr 3, 2017

Last call before a merge then...

@pwolfram
Copy link
Contributor

pwolfram commented Apr 3, 2017

I'll merge in about an hour or so @xylar and @milenaveneziani.

@pwolfram pwolfram merged commit a38db9d into MPAS-Dev:develop Apr 3, 2017
@xylar xylar deleted the fix_climatology_dates branch April 3, 2017 20:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants