Change climatologies to use xtime_start and xtime_end #165

xylar · 2017-03-29T18:56:07Z

Climatologies were being weighted by the number of days in a month without regard to which calendar. With this merge, they are being weighted by the actual number of days associated with each time entry, as determined from xtime_start and xtime_end (or equivalent for monthly averages).

Note: if only daysSinceStartOfSim is available as a time variable, climatologies will still be computed but the number of days in February might be inaccurate on the Gregorian calendar. In practice, we don't anticipate having simulations without xtime_start and xtime_end but using the Gregorian calendar.

The MOC climatology was previously being computed with a straight-up time average (.mean('Time')), which was a change I had made to try to speed up the code. In this merge, it is now being computed as a weighted average based on the number of days associated with each time index.

The seasonal climatologies in several analysis tasks were being computed inefficiently and slightly inaccurately for the gregorian calendar by first computing a monthly climatology and then averaging again to get a seasonal or annual climatology.

To support these changes, mpas_xarray has been modified to store the startTime and endTime (if available) of each time index as coordinates. This is only possible if the time coordinate is defined based on the average of a start time and and end time (e.g xtime_start and xtime_end).

pwolfram

Here are some preliminary comments to get you started.

My biggest concern is that I think we should be able to compute climatologies on datasets and dataarrays. I'm concerned this functionality is being designed away.

pwolfram · 2017-03-29T18:59:42Z

mpas_analysis/ocean/ocean_modelvsobs.py

-
-            dsObs = monthlyClimatology.transpose('month', 'lat', 'lon')
+            dsObs.rename({'time': 'Time'}, inplace=True)
+            dsObs = dsObs.transpose('Time', 'lat', 'lon')


@xylar, why is a transpose used here?

@pwolfram, the transpose is because the observational data is not in the same order as MPAS output.

@pwolfram, what's your concern about the transpose?

Transpose are notorious for being a bad idea, e.g., they typically indicate things weren't set up correctly and they are expensive to execute. It doesn't sound like it is a huge problem here but I just wanted to double-check.

These are pretty small data sets and we've been transposing them the whole time. I just revised the code a bit.

pwolfram · 2017-03-29T19:27:54Z

mpas_analysis/ocean/meridional_overturning_circulation.py

@@ -303,7 +303,8 @@ def _compute_moc_climo_postprocess(config, runStreams, variableMap, calendar,
            dictClimo['endYearClimo'] = endYear

        # Compute annual climatology
-        annualClimatology = ds.mean('Time')
+        annualClimatology = \
+            climatology.compute_climatology(ds, np.arange(1,13), calendar)


It would be better to store np.arange(1,13) as a constant and not hard code here-- maybe something like months_in_a_year.

Okay I can take it from constants. I just figured it was obvious...

pwolfram · 2017-03-29T19:28:51Z

mpas_analysis/ocean/nino34_index.py

@@ -122,15 +122,15 @@ def compute_nino34_index(regionSST, calendar):  # {{{

    Parameters
    ----------
-    regionSST : xarray.dataArray object
+    ds : xarray.DataArray object


Dataset object, right?

Oops, correct.

pwolfram · 2017-03-29T19:30:25Z

mpas_analysis/ocean/nino34_index.py

+    if not isinstance(ds, xr.core.dataset.Dataset):
+        raise ValueError('ds should be an xarray Dataset')
+
+    ds = climatology.add_months_and_days_in_month(ds, calendar)


I think this should be hidden from the user and not in the analysis-level code.

So the problem here is that we explicitly do a groupby('month') below. It is not obvious where month got added to the data set if this is not made explicit with an exposed call. In fact, I changed the behavior so month doesn't get added to ds behind the scenes in compute*_climatology because I realized this behavior would be unexpected. As a result, any code that needs month to be added to ds for use outside of climatology really needs to do it manually themselves.

pwolfram · 2017-03-29T19:31:04Z

mpas_analysis/ocean/nino34_index.py

-    anomalySST = regionSST.groupby('month') - monthlyClimatology
+    monthlyClimatology = climatology.compute_monthly_climatology(ds)
+
+    print monthlyClimatology


Are you sure you want to print this out? If so, shouldn't it go to a file or have some type of comment around it?

Oops, that was for debugging. I've removed it.

pwolfram · 2017-03-29T19:32:35Z

mpas_analysis/ocean/nino34_index.py


-    return _running_mean(anomalySST.to_pandas())  # }}}
+    return _running_mean(anomaly.avgSurfaceTemperature.to_pandas())  # }}}


If this should only be done for avgSurfaceTemperature I would recommend making this segmentation the first step, e.g., if there wasn't lazy evaluation you would compute a nino34 index on all the variables.

avgSurfaceTemperature is explicitly selected as the only variable in ds when ds is constructed, so I don't think that's a major concern. I agree that it would be clearer if we simply operated on a data array ('regionSST`) as before. I will expound on why I didn't have any success with that approach below and maybe we can find a solution.

pwolfram · 2017-03-29T23:23:54Z

mpas_analysis/shared/climatology/climatology.py

+        if 'startTime' in ds.data_vars and 'endTime' in ds.data_vars:
+            ds['daysInMonth'] = ds.endTime - ds.startTime
+        else:
+            # TODO: support calendar


What does this comment mean exactly?

I have elaborated. I do not feel like writing special code for handling this case when the calendar is gregorian. We would need to figure out whether each year is a leap year and whether the month is February, and change the number of days to 29 if so. We don't know of any runs that will need this capability. The else case here is to support backward compatibility to ACME alpha runs where we used daysSinceStartOfSim to compute the time coordinate (none of these runs use the gregorian calendar). So I have added a warning but it doesn't seem like it's worth the effort of handling this case. If you want to, feel free...

pwolfram · 2017-03-29T23:25:17Z

mpas_analysis/shared/climatology/climatology.py

+
+def _compute_masked_mean(ds):
+    '''
+    Compute the time average of data set, masked out where the variables in ds


The means in xarray are always computed with nanmean unless otherwise specified. I'm a little confused as to the purpose of this function based on this descriptor for the function.

So there are 2 things xarray's mean calculations don't seem to do that this is meant to handle. First, there's not an obvious way to weight a mean by another variable (daysInMonth in this case). Second, we want the normalization to be based on the number of valid (non-null) entries, so we need to also sum up how many valid entries there were and multiply them by the weights to determine how to do our weighted average. For observational data sets, the mask changes with time, so it is very important to only average over those dates that are valid. If one were to compute an annual mean using valid values from only Jan. and Feb. but were to divide by 365 days to get the weighted average, the resulting value would be off by a factor of about 6. This function attempts to properly take that into account.

If you would like to rewrite the function to work in a more xarray-like way, feel free. I have written many iterations of it and still haven't found a perfect solution.

This goes back to the problems with accounting for time properly. There probably aren't too many "perfect" solutions until that bigger, more fundamental issue gets fixed (pydata/xarray#789). Thanks for slogging through these complexities in the meantime.

@pwolfram, I don't think a proper accounting of time would necessarily help with this problem because xarray doesn't have a sense of the data corresponding to a time span as opposed to a single point in time. Their example of computing seasonal climatologies (without masking) is not all that different from my approach:

http://xarray.pydata.org/en/stable/examples/monthly-means.html

It seems like there should be an easier way to handle this on the xarray side-- I'm going to keep this in mind. It looks like this type of functionality isn't quite yet available (pydata/xarray#422) and the discussion has gone somewhat stale.

@pwolfram, I was unaware of pydata/xarray#422 but I think my approach is rather similar. Let's keep an eye on it and update if/when average or weighted is added to xarray.

vanroekel

@xylar from what I can tell this doesn't change the core functionality of the nino or ocean modelvobs routines, so I'm fine here if your testing has shown consistent behavior with develop.

vanroekel · 2017-03-30T03:28:56Z

mpas_analysis/ocean/ocean_modelvsobs.py

-
-            dsObs = monthlyClimatology.transpose('month', 'lat', 'lon')
+            dsObs.rename({'time': 'Time'}, inplace=True)
+            dsObs = dsObs.transpose('Time', 'lat', 'lon')


@pwolfram, the transpose is because the observational data is not in the same order as MPAS output.

xylar · 2017-03-30T08:45:36Z

My biggest concern is that I think we should be able to compute climatologies on datasets and dataarrays. I'm concerned this functionality is being designed away.

So I agree with this concern. The simple problem is that we need a way of storing month and daysInMonth arrays next to the data array in question. This is most logically done by adding data arrays to a data set. One could add them as coordinates to the data array instead of as separate data arrays. I tried this before and got strange behavior I didn't fully understand. But I am willing to give it another try.

xylar · 2017-03-30T09:53:01Z

@pwolfram, looking further, startTime and endTime would also need to be coordinates. I will see if that even works...

xylar · 2017-03-30T14:59:24Z

@pwolfram, I have done a pretty significant rewrite based on your comments. The relevant functions in climatology now work on both Dataset and DataArray objects (made possible by having month and daysInMonth as coordinates, rather than data arrays).

I have rewritten _compute_masked_mean in a way that I hope better utilizes xarray's capabilities and also better demonstrates what it hopes to accomplish that existing xarray functionality can't do.

Thank you for the excellent review so far. Let me know if you have further concerns.

xylar · 2017-03-30T16:05:15Z

@pwolfram, I just rebased. Should be ready for a re-review. I am aware that it would be good to add some CI to test the climatology computations. I don't have time to do that now and I would like this fix to but part of v0.2 but will put it on by to-do list for a follow-up PR.

pwolfram · 2017-03-30T18:19:37Z

@xylar, did you want me to do another review in preparation for merging with the understanding that you will add CI testing for climatology later?

xylar · 2017-03-30T19:04:35Z

@xylar, did you want me to do another review in preparation for merging with the understanding that you will add CI testing for climatology later?

@pwolfram, yep, sorry if I wasn't clear. That's it exactly

xylar · 2017-03-30T23:09:50Z

@pwolfram, given that it doesn't look like this will get merged today, hold off a little bit and I'll add CI to it tomorrow.

xylar · 2017-03-31T08:40:41Z

@pwolfram, CI has been added. I think you'll be pleased, but let me know if you'd like anymore changes, both to the CI and to the rest of the PR.

If the time coordinate is defined based on the average of a start time and and end time (e.g xtime_start and xtime_end), also store the start and end time for later use in accurately performing time averages (e.g. computing climatologies).

There is no longer an advantage to computing monthly climatologies first, then seasonal climatologies (as was done in several analysis tasks), so seasonal climatologies are now computed directly. The MOC time average is now computed using compute_climatology (rather than just .mean('Time')) so that the number of days in a month is properly accounted for.

pwolfram · 2017-04-02T22:16:47Z

mpas_analysis/shared/climatology/climatology.py

-                        name='mask')
-    seasonalClimatology = (monthlyClimatology * daysInMonth).sum(
-        dim='month', keep_attrs=True)
+    climatologyMonths = ds.where(mask, drop=True)


Nice! This should help below when _compute_masked_mean is used for the computation. 👍

pwolfram · 2017-04-02T22:23:57Z

mpas_analysis/test/test_climatology.py

+
+        # test compute_climatology on a data array
+        mldClimatology = climatology.compute_climatology(ds.mld, monthValues,
+                                                         calendar)


Do you want to make sure mldClimatology and dsClimatology produce similar results? This just tests that Dataset vs DataArray computations work, which this test function implies anyway.

Is there some way that we can test the actually computed values, at least in an approximate way? This may be overboard but a climatology is a key computation so this is something to consider.

Do you want to make sure mldClimatology and dsClimatology produce similar results? This just tests that Dataset vs DataArray computations work, which this test function implies anyway.

I have updated the test to do this

Is there some way that we can test the actually computed values, at least in an approximate way? This may be overboard but a climatology is a key computation so this is something to consider.

I went ahead and addressed this, too. Should be good to go.

pwolfram · 2017-04-02T22:24:41Z

mpas_analysis/test/test_climatology.py

+                                    'JFM.nc'.format(self.test_dir)
+        self.assertEqual(regriddedFileName, expectedRegriddedFileName)
+
+    def open_test_ds(self, config, calendar):


This is good because it will verify that the time selection will produce the appropriate dataset in terms of length-- great!

pwolfram · 2017-04-02T22:25:33Z

mpas_analysis/test/test_climatology.py

+Unit test infrastructure for the generalized_reader.
+
+Xylar Asay-Davis
+02/16/2017


A nit-- Does the date need updated?

pwolfram · 2017-04-02T22:27:40Z

@xylar, I'm happy to merge unless you or @milenaveneziani have objections. I just want to give @milenaveneziani a chance to review and comment if desired.

xylar · 2017-04-03T07:41:11Z

@pwolfram, I updated the CI testing according to your requests. This included adding assertArrayApproxEqual for testing if 2 numpy arrays are close.

pwolfram · 2017-04-03T14:02:39Z

mpas_analysis/test/test_climatology.py

+        refClimatology = xarray.open_dataset(climFileName)
+
+        self.assertArrayApproxEqual(monthlyClimatology.mld.values,
+                                    refClimatology.mld.values)


Thanks @xylar, this will be really helpful to make sure we (or others!) don't accidentally break the key computation in MPAS-Analysis!

milenaveneziani · 2017-04-03T17:29:47Z

@xylar, @pwolfram: I am back and slowly catching up on things. I read the very first comment of this PR and only have a quick question: will we support xtime_start and xtime_end with this, no daysSinceStart? Sorry if I didn't go through all your comments, perhaps my question is anwered in one of them..

xylar · 2017-04-03T17:48:34Z

@milenaveneziani, this PR stores xtime_start and xtime_end if they're available for later use. If they're available, it uses them to compute the number of days in a month (meaning it doesn't have to be smart about the calendar when computing climatologies, it just takes MPAS's word for how long the month was). If a file has daysSinceStartOfSim but not xtime_start and xtime_end (e.g. alpha7 and maybe alpha8?) it will use that and will resort to looking up how many days are in a given month (but will ignore leap years). Since we didn't run any alpha8 runs with the Gregorian calendar, I don't think this will be a problem. Even if we did, the error would be quite small.

xylar · 2017-04-03T17:52:56Z

@milenaveneziani and @pwolfram, do we want to include this PR in v0.2? I'm fine with not including it and having it be the first merge after that tag.

milenaveneziani · 2017-04-03T18:02:24Z

@xylar: great. Thanks for clarifying.
I am OK with including this in v0.2 (since it seems it is ready to merge).

pwolfram · 2017-04-03T18:08:42Z

Last call before a merge then...

pwolfram · 2017-04-03T18:17:47Z

I'll merge in about an hour or so @xylar and @milenaveneziani.

xylar added bug mpas_xarray priority labels Mar 29, 2017

xylar assigned pwolfram Mar 29, 2017

xylar requested review from pwolfram, milenaveneziani and vanroekel March 29, 2017 18:56

pwolfram suggested changes Mar 29, 2017

View reviewed changes

vanroekel approved these changes Mar 30, 2017

View reviewed changes

xylar force-pushed the fix_climatology_dates branch from fefb7aa to d61a180 Compare March 30, 2017 14:53

xylar force-pushed the fix_climatology_dates branch from d61a180 to 8af611d Compare March 30, 2017 16:01

xylar added in progress and removed in progress labels Mar 30, 2017

xylar force-pushed the fix_climatology_dates branch from fff2f9f to fa1cd35 Compare March 31, 2017 08:48

xylar added 2 commits April 1, 2017 20:54

xylar force-pushed the fix_climatology_dates branch from fa1cd35 to dbbf1d2 Compare April 1, 2017 18:54

pwolfram reviewed Apr 2, 2017

View reviewed changes

pwolfram approved these changes Apr 2, 2017

View reviewed changes

xylar force-pushed the fix_climatology_dates branch from dbbf1d2 to b8a5b6b Compare April 3, 2017 06:48

xylar added 2 commits April 3, 2017 09:19

Add CI testing for climatology

c8b0d13

White space

d65c74e

xylar force-pushed the fix_climatology_dates branch from b8a5b6b to d65c74e Compare April 3, 2017 07:19

pwolfram reviewed Apr 3, 2017

View reviewed changes

pwolfram merged commit a38db9d into MPAS-Dev:develop Apr 3, 2017

xylar deleted the fix_climatology_dates branch April 3, 2017 20:21

pwolfram mentioned this pull request Apr 7, 2017

Improve chunking capability #175

Merged


		return _running_mean(anomalySST.to_pandas()) # }}}
		return _running_mean(anomaly.avgSurfaceTemperature.to_pandas()) # }}}

Change climatologies to use xtime_start and xtime_end #165

Change climatologies to use xtime_start and xtime_end #165

Conversation

xylar commented Mar 29, 2017 • edited Loading

pwolfram left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vanroekel left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xylar commented Mar 30, 2017

xylar commented Mar 30, 2017

xylar commented Mar 30, 2017

xylar commented Mar 30, 2017

pwolfram commented Mar 30, 2017

xylar commented Mar 30, 2017

xylar commented Mar 30, 2017

xylar commented Mar 31, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xylar Apr 3, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pwolfram Apr 2, 2017 • edited Loading

Choose a reason for hiding this comment

pwolfram commented Apr 2, 2017

xylar commented Apr 3, 2017

Choose a reason for hiding this comment

milenaveneziani commented Apr 3, 2017

xylar commented Apr 3, 2017

xylar commented Apr 3, 2017

milenaveneziani commented Apr 3, 2017 • edited Loading

pwolfram commented Apr 3, 2017

pwolfram commented Apr 3, 2017

xylar commented Mar 29, 2017 •

edited

Loading

xylar Apr 3, 2017 •

edited

Loading

pwolfram Apr 2, 2017 •

edited

Loading

milenaveneziani commented Apr 3, 2017 •

edited

Loading