
Add functions to support conversion of a pandas DataFrame into an iris cube #1582

Merged

Conversation

gavinevans
Contributor

Addresses part of #1538

Dependent on #1572. Please note that this currently contains the same commit as in #1572 to facilitate GitHub Actions.

Description
This PR adds functionality for converting a pandas DataFrame into an iris cube, to provide the forecast and truth cubes needed for EMOS. The forecast and truth tables are expected to contain the following columns:

Forecast: forecast, blend_time, forecast_period, forecast_reference_time, time, wmo_id, percentile, diagnostic, latitude, longitude, period, height, cf_name, units.

Truth: ob_value, time, wmo_id, diagnostic, latitude, longitude, altitude.

Further information about table formatting is in https://github.com/MetOffice/improver_suite/issues/961.

Testing:

  • Ran tests and they passed OK
  • Added new tests for the new feature(s)

@gavinevans gavinevans added this to the 1.0.0 milestone Oct 11, 2021
@gavinevans gavinevans self-assigned this Oct 11, 2021
@gavinevans gavinevans changed the title Improver1538 tabular ingestion functions Add functions to support conversion of a pandas DataFrame into an iris cube Oct 11, 2021
@gavinevans gavinevans removed their assignment Oct 11, 2021
@codecov

codecov bot commented Oct 11, 2021

Codecov Report

Merging #1582 (ed31c06) into master (f63415a) will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #1582      +/-   ##
==========================================
+ Coverage   98.03%   98.05%   +0.01%     
==========================================
  Files         109      109              
  Lines        9817     9914      +97     
==========================================
+ Hits         9624     9721      +97     
  Misses        193      193              
Impacted Files                      Coverage Δ
improver/calibration/__init__.py    100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f63415a...ed31c06.

@gavinevans gavinevans marked this pull request as draft October 11, 2021 16:25
@gavinevans gavinevans force-pushed the improver1538_tabular_ingestion_functions branch 2 times, most recently from 0177a77 to 25005bb Compare October 12, 2021 07:31
@gavinevans gavinevans marked this pull request as ready for review October 12, 2021 10:40
ValueError: Only one unique value within the specified column
is expected.
"""
if len(table[column].unique()) > 1:
Contributor

This could be simplified slightly with `if table[column].nunique() > 1`

Contributor

Actually, it seems these give different results if there are NaNs. E.g. if we have df = pd.DataFrame({'a': [np.nan, 2, 2]}) then len(df["a"].unique()) is 2, but df["a"].nunique() is 1. I'm not sure which is more appropriate here.

Contributor Author

Thanks for the suggestion. I wasn't aware of nunique. I think df["a"].nunique(dropna=False) would give a value of 2 following your example, which I think would be what I expect here, so I could implement that?

Contributor Author

I've used df[column].nunique(dropna=False) as suggested.
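
For reference, a minimal sketch of the behaviour discussed in this thread (pandas and numpy only; values taken from the example above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.nan, 2, 2]})

# unique() keeps NaN as a distinct value; nunique() drops NaN by default.
print(len(df["a"].unique()))          # 2 (NaN and 2)
print(df["a"].nunique())              # 1 (NaN dropped)
print(df["a"].nunique(dropna=False))  # 2, matching len(df["a"].unique())
```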

units="m",
)

for percentile in table["percentile"].unique():
Contributor

Do we need to sort the result of table["percentile"].unique(), to ensure the percentiles in the final cube will be ordered?

Contributor Author

OK, I hadn't thought about this because the percentile column was sorted in the table that I was using. For robustness, it might make sense to sort the percentiles here. I'll take a look at it.

Contributor Author

I've added sorting of the percentiles to ensure the order is as expected.
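
A minimal sketch of the sorting described above; the variable names are illustrative rather than the exact code in the PR:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"percentile": [50.0, 25.0, 75.0, 25.0]})

# Sorting the unique percentiles guarantees an ascending, monotonic
# percentile coordinate regardless of the row order in the table.
for percentile in np.sort(df["percentile"].unique()):
    print(percentile)  # 25.0, 50.0, 75.0
```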


Args:
forecast_table:
DataFrame expected to contain the following columns: .
Contributor

Is list of columns missing here?

Contributor Author

Yes, I should correct this.

Contributor Author

I've corrected this docstring and added some extended documentation related to Lucy's comment.

forecast_table:
DataFrame expected to contain the following columns: .
truth_table:
DataFrame expected to contain the following columns: .
Contributor

As above.

Contributor Author

Corrected as for forecast_table.

Contributor

@lucyleeow lucyleeow left a comment

Thanks for this @gavinevans ! I will pull your branch and check how it works with our site forecast and observation data tomorrow. I did have some questions too.

There seems to be a variety of dtypes used for dates: here we have used pandas (via pd.date_range), numpy (np.timedelta64 and np.datetime64) and integers (np.int64). Elsewhere in IMPROVER, the datetime package is commonly used. Should we try to standardise the data type used for dates?

Just a question (I have no idea) - is it faster/more efficient to build a cube for each time & percentile and merge, rather than to shape the data and create a single cube for the date range?

I noticed that in #1581 'df' is used and 'table' is used here to refer to the dataframes. Consistency might be good. I have some preference towards 'df' because Python also has a datatable package.



def _unique_check(table: DataFrame, column: str) -> Any:
"""Check whether the value in the column is unique.
Contributor

nitpick

Suggested change
"""Check whether the value in the column is unique.
"""Check whether the values in the column is unique.

Contributor Author

Corrected.

Comment on lines 192 to 195
DataFrame expected to contain the following columns: forecast,
blend_time, forecast_period, forecast_reference_time, time,
wmo_id, percentile, diagnostic, latitude, longitude, period,
height, cf_name, units.
Contributor

Are other, optional columns allowed? If so could this be documented somewhere?

Also it might be nice to have an explanation of the columns/example table documented somewhere.

Contributor Author

Yes, I'll try to add some extended documentation for this. In the meantime, if you have any thoughts related to this, that would be good to know.

Contributor Author

I've amended this docstring and added some extended documentation to provide information about the forecast and truth tables.

Comment on lines 204 to 482
for coord in ["time", "forecast_reference_time"]:
table[coord] = table[coord].dt.tz_localize(None)

Contributor

Just a question, why do we want to drop the time zone information from forecasts? Also would they all be UTC?

Contributor Author

This was done initially for ease but I think these lines should be removable.

Contributor Author

I've removed this. All times are UTC.

Contributor

Yes, all our forecasts will be UTC; the tzinfo is just there in case someone wants to do a join on the obs table (which will also be explicit).
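
A sketch of the kind of explicit-UTC join described here; the column names are assumptions for illustration:

```python
import pandas as pd

forecasts = pd.DataFrame(
    {"time": pd.to_datetime(["2021-10-11 06:00"]).tz_localize("UTC"),
     "forecast": [281.5]}
)
obs = pd.DataFrame(
    {"time": pd.to_datetime(["2021-10-11 06:00"]).tz_localize("UTC"),
     "ob_value": [280.9]}
)

# With both tables carrying explicit UTC tzinfo, a join on time is
# unambiguous and cannot silently mix naive and aware timestamps.
merged = forecasts.merge(obs, on="time")
```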

Comment on lines 224 to 226
frt_point = np.datetime64(
time_table["forecast_reference_time"].values[0], "s"
).astype(np.int64)
Contributor

Question, why do we want to use int64 here?

Contributor

A follow-up question to this: if we do want this value to be an integer, should time_point also be an integer?

Contributor Author

All datetimes now use pandas datetime functionality where possible, I think. Times are only converted to integers as part of the coordinate creation.
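
A sketch of converting a pandas timestamp to an integer only at the point of coordinate creation; the coordinate name and units string follow CF conventions and are not quoted from the PR:

```python
import numpy as np
import pandas as pd
from iris.coords import AuxCoord

frt = pd.Timestamp("2017-01-09 00:00", tz="UTC")

# Keep pandas datetimes throughout; convert to integer seconds since
# the epoch only when the iris coordinate is actually built.
frt_point = np.int64(frt.timestamp())
frt_coord = AuxCoord(
    frt_point,
    standard_name="forecast_reference_time",
    units="seconds since 1970-01-01 00:00:00 UTC",
)
```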

table[coord] = table[coord].dt.tz_localize(None)

cubelist = CubeList()
for adate in date_range:
Contributor

I wonder if it is a good idea to give a variable the same name as a function (pd.date_range)?

Contributor Author

Good idea. I've updated this to training_dates.

return RebadgePercentilesAsRealizations()(cube)


def truth_table_to_cube(
Contributor

It seems like some of the actions of truth_table_to_cube and forecast_table_to_cube are common. In the interest of DRY (don't repeat yourself), I wonder if we could use some functions to avoid repetition?

Contributor Author

Yes, once I've done some modifications related to all the other comments, I'll review the commonality between these functions.

Contributor Author

I've tried to factor out a few functions, where possible.


Returns:
Cube containing the forecasts from the training period.
"""
Contributor

I wonder if some checking to ensure that the input dataframe has the correct columns before we do anything else would be a good idea?

Contributor Author

Yes, I'll add this.

Contributor Author

I've added a function to check this upfront for the forecasts and truths.
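
A hedged sketch of this kind of upfront check; the function name and message are illustrative, not the PR's actual implementation:

```python
import pandas as pd

def check_columns(df: pd.DataFrame, expected: list, label: str) -> None:
    """Raise early if any expected column is absent from the DataFrame."""
    missing = [col for col in expected if col not in df.columns]
    if missing:
        raise ValueError(f"The {label} DataFrame is missing columns: {missing}")

# Example usage with a subset of the truth columns from the PR description.
truth_df = pd.DataFrame(columns=["ob_value", "time", "wmo_id"])
check_columns(truth_df, ["ob_value", "time", "wmo_id"], "truth")  # passes
```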

Comment on lines 349 to 351
time_table[col] = table.groupby(by="wmo_id", sort=False)[col].agg(
lambda x: pd.Series.mode(x, dropna=not table[col].isna().all())
)
Contributor

As above, not sure how this would work for sites that don't have a wmo id. Also is it common for alt/lat/lon to have different values, at a single time?

Contributor Author

Regarding different values for alt/lat/lon, note that I'm computing the mode from the whole table, rather than the time_table subset. Over the course of a training dataset, the alt/lat/lon for a site can change, which would prevent the cubes merging. Maybe that helps?

Contributor

Thanks @gavinevans, but I'm curious as to how/why alt/lat/lon can change for a site? What kind of observation sources are you using here?

Contributor Author

@LaurenceBeard has suggested that this is just small corrections to the recorded alt/lat/lon of sites following re-assessment.

Contributor

We are only using synoptic land observations (which shouldn't move!) but sometimes our master station list does get updated (usually to correct, sometimes adjust). As a result we should prefer the forecast table's position data - for a trial this should not change - but for a long-term time series over releases, this may have differences.

Nonetheless, these values are just being added to the netCDF for consistency; they are not being used.

Contributor

So I would be tempted to drop this calculation and just use the forecast table's values (i.e. ignore these columns from the truth table).

Contributor Author

I've had a go at doing some refactoring to avoid using the alt/lat/lon from the truths at all and replace with those from the forecasts in this commit: 6b2ab0f. However, I've realised that I currently still need this calculation to handle missing observations. I'll look into this more tomorrow.

)
# Replace empty arrays generated by the mode with NaNs.
time_table[col] = time_table[col].apply(
lambda x: np.nan if isinstance(x, np.ndarray) else x
Contributor

If x is an array, will it always be empty?

Contributor Author

I've actually removed this line now. Previously I was anticipating columns that were potentially all NaNs (e.g. a period column for an instantaneous diagnostic), but as we now haven't put a period column on the truth table, this line isn't required: I'm no longer computing the mode of a column full of NaNs, which was what previously resulted in an empty array.
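
For context, a small demonstration of the empty result that the mode produces for an all-NaN column (pandas and numpy only):

```python
import numpy as np
import pandas as pd

all_nan = pd.Series([np.nan, np.nan])

# With the default dropna=True, mode() on an all-NaN series returns an
# empty Series (which surfaced as an empty ndarray via the groupby agg);
# with dropna=False, NaN itself is reported as the mode.
print(all_nan.mode())              # empty Series
print(all_nan.mode(dropna=False))  # 0   NaN
```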

Comment on lines 411 to 412
Forecasts and truths for the training period that have been filtered
only include sites that are present in both the forecasts and truths
Contributor

Suggested change
Forecasts and truths for the training period that have been filtered
only include sites that are present in both the forecasts and truths
Forecasts and truths for the training period that have been filtered to
only include sites that are present in both the forecasts and truths.

nitpick

Contributor Author

Corrected.

Contributor

@fionaRust fionaRust left a comment

A couple of comments.
The other thing we discussed last night is that we are likely to pass in the same list of lead times for all cycles, although some lead times will not find valid data in the table. In this case we probably want to return nothing, and the Estimate EMOS CLI will not produce a coefficients file.

@gavinevans
Contributor Author

@lucyleeow Regarding the following points:

There seems to be a variety of dtypes used for dates, here we have used pandas (via pd.date_range), numpy (np.timedelta64 and datetime64) and int64 (np.int64). Elsewhere in IMPROVER, the datetime package is commonly used. Should we try to standardise the data type used for dates?

Yes, I've had a go at primarily using the datetime package. This is related.

I noticed that in #1581 'df' is used and 'table' is used here to refer to the dataframes. Consistency might be good. I have some preference towards 'df' because python also has a datatable package.

Good idea. I'm going to use df for consistency.

Contributor

@benowen-bom benowen-bom left a comment

Hi @gavinevans, I've had a look and have some additional comments. I've also put up a comment on https://github.com/MetOffice/improver_suite/issues/961 regarding forecast/truth columns.

@gavinevans gavinevans force-pushed the improver1538_tabular_ingestion_functions branch from 684ce01 to 8974924 Compare October 14, 2021 14:57
Contributor

@fionaRust fionaRust left a comment

Some initial comments on documentation, before I stopped reviewing for the night. I'll take another look tomorrow morning.

Contributor

@lucyleeow lucyleeow left a comment

Thanks for addressing all our comments. This looks fine and workable for us. Happy to take a look again on Monday, especially at the tests as they are failing now, or if you have time constraints feel free to merge before then.

The only other comment I would make is that I note that 'site' information (e.g., lat/lon/height) is stored in the forecast df, whereas we may be more interested in observation site information, e.g., the height of the observation site. I may not fully grasp what data is stored in the forecast df though.

A DataFrame without numpy datetime dtypes.
"""
for col in [c for c in df.columns if df[c].dtype == "datetime64[ns]"]:
df[col] = df[col].dt.tz_localize("UTC").astype("O")
Contributor

Question - why are we wanting dates to be 'object' data type?

Contributor Author

I've extended the docstring for clarity. Converting the columns to "object" dtype results in the values within the columns being pandas datetime objects (Timestamp and Timedelta), rather than numpy datetime objects. Later functions then only need to handle pandas datetime objects.
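
A minimal sketch of the conversion being described:

```python
import pandas as pd

df = pd.DataFrame({"time": pd.to_datetime(["2021-10-11 06:00"])})

# datetime64[ns] column -> tz-aware UTC -> "object" dtype, so each
# element becomes a pandas Timestamp rather than a numpy datetime64.
df["time"] = df["time"].dt.tz_localize("UTC").astype("O")
print(type(df["time"].iloc[0]))  # <class '...Timestamp'>
```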

Comment on lines +278 to +365
at least one day prior to the cycletime. The final validity time
within the training dataset is additionally offset by the number
of days within the forecast period to ensure that the dates defined
by the training dataset are in the past relative to the cycletime.
Contributor

Suggestion only, feel free to ignore. Could we maybe give an example here?

Contributor Author

I've added an example here.

Comment on lines +284 to +380
cycletime:
Cycletime of a format similar to 20170109T0000Z.
Contributor

Is this the last cycle time of the training period? If so, could we note it down?

Contributor Author

This is the cycletime of the current cycle, rather than the final validity time within the training dataset, which is what this function calculates. I've expanded the docstring to try to make it clearer.
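
An illustrative reading of the offsets described above; the arithmetic here is my interpretation for the example, not a quote from the PR:

```python
import datetime

cycletime = datetime.datetime(2017, 1, 9, 0, 0)  # 20170109T0000Z
forecast_period = datetime.timedelta(hours=30)   # spans into a second day
training_length = 5                              # days

# The final validity time sits at least one day before the cycletime,
# pushed back further by the whole days spanned by the forecast period,
# so every training date is fully in the past relative to the cycletime.
offset_days = 1 + forecast_period.days
final_validity = cycletime - datetime.timedelta(days=offset_days)
training_dates = [final_validity - datetime.timedelta(days=d)
                  for d in range(training_length)]
print(training_dates[0])  # 2017-01-07 00:00:00
```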

forecast_period: int,
training_length: int,
) -> Tuple[Cube, Cube]:
"""Convert a truth DataFrame into an iris Cube.
Contributor

I think this docstring needs updating?

Contributor Author

Yes, I've updated this docstring.

Contributor

@BelligerG BelligerG left a comment

Just a few queries/comments from me, it's looking good though!

@gavinevans gavinevans force-pushed the improver1538_tabular_ingestion_functions branch 2 times, most recently from e97bd1c to e1a762c Compare October 15, 2021 10:21
fionaRust previously approved these changes Oct 18, 2021
Contributor

@fionaRust fionaRust left a comment

I'm now happy with these changes. Thanks Gavin!

LaurenceBeard previously approved these changes Oct 19, 2021
Contributor

@LaurenceBeard LaurenceBeard left a comment

Happy for this to go in, testing related issues can be addressed at a later date.

self.forecast_period,
self.training_length,
)
self.assertEqual(len(result), 2)
Contributor

This is good functionality to ensure (though station_id is one we may end up using), maybe a 'blah' column or something that would never exist?

return
cube = cubelist.merge_cube()

return RebadgePercentilesAsRealizations()(cube)
Contributor

Is a check for equally spaced percentiles done within this Plugin?
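
For reference, a sketch of the kind of spacing check being asked about; this is not asserting what RebadgePercentilesAsRealizations itself does:

```python
import numpy as np

def equally_spaced(percentiles) -> bool:
    """Return True if the percentile values are evenly spaced."""
    diffs = np.diff(np.sort(np.asarray(percentiles, dtype=float)))
    return bool(np.allclose(diffs, diffs[0]))

print(equally_spaced([25, 50, 75]))  # True
print(equally_spaced([10, 50, 75]))  # False
```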

fionaRust previously approved these changes Oct 19, 2021
@gavinevans gavinevans force-pushed the improver1538_tabular_ingestion_functions branch from 8b134fa to ed31c06 Compare October 21, 2021 17:35
@fionaRust fionaRust self-assigned this Oct 22, 2021
@bayliffe bayliffe merged commit f6f90ce into metoppv:master Oct 22, 2021
MoseleyS pushed a commit to MoseleyS/improver that referenced this pull request Aug 22, 2024
Add functions to support conversion of a pandas DataFrame into an iris cube (metoppv#1582)

* Modifications to functions required to support converting tabular data into iris cubes.

* Modifications to the columns expected within the truth table.

* Modify environments in preparation for changes required for ingestion of forecast and observation tables.

* Edits to expect the period column to be a timedelta64 dtype.

* Corrections to __init__.py

* Modifications to use assertCubeEqual to ensure the full cubes are compared.

* Modifications following review comments.

* Minor updates to docstrings.

* Add missing unit tests.

* Correction to csv files.

* Sort lists.

* Minor docstring amendment.

* NOT WORKING: Working commit to try to avoid using the truth altitude, latitude and longitude at all and replace with those from the forecast.

* Extend docstrings.

* Extended documentation updates.

* Refinement and addition of tests for column name checking.

* Modifications to tidy up dataframe preparation.

* Minor extended documentation edits.

* Further minor docstring edit.

* Correct test class naming.

* Fix isort.
8 participants