Add functions to support conversion of a pandas DataFrame into an iris cube #1582
Conversation
Codecov Report
            master    #1582     +/-
Coverage    98.03%    98.05%    +0.01%
Files       109       109
Lines       9817      9914      +97
Hits        9624      9721      +97
Misses      193       193
Force-pushed from 0177a77 to 25005bb.
improver/calibration/__init__.py
Outdated
        ValueError: Only one unique value within the specified column
            is expected.
    """
    if len(table[column].unique()) > 1:
This could be simplified slightly with `if table[column].nunique() > 1`.
Actually, it seems these give different results if there are NaNs. E.g. if we have `df = pd.DataFrame({'a': [np.nan, 2, 2]})`, then `len(df["a"].unique())` is 2, but `df["a"].nunique()` is 1. I'm not sure which is more appropriate here.
Thanks for the suggestion. I wasn't aware of `nunique`. I think `df["a"].nunique(dropna=False)` would give a value of 2 following your example, which I think is what I expect here, so I could implement that?
I've used `df[column].nunique(dropna=False)` as suggested.
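For reference, a minimal snippet (using only pandas and numpy) illustrating the behaviour discussed above:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.nan, 2, 2]})

# unique() keeps NaN as a distinct value.
print(len(df["a"].unique()))          # 2
# nunique() drops NaN by default.
print(df["a"].nunique())              # 1
# dropna=False counts NaN, matching len(unique()).
print(df["a"].nunique(dropna=False))  # 2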
improver/calibration/__init__.py
Outdated
units="m", | ||
) | ||
|
||
for percentile in table["percentile"].unique(): |
Do we need to sort the result of `table["percentile"].unique()`, to ensure the percentiles in the final cube will be ordered?
OK, I hadn't thought about this because the percentile column was sorted in the table that I was using. For robustness, it might make sense to sort the percentiles here. I'll take a look at it.
I've sorted the percentiles to ensure the order is as expected.
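A minimal sketch of the kind of sort being described (the exact implementation in the PR may differ):

import numpy as np
import pandas as pd

table = pd.DataFrame({"percentile": [50.0, 25.0, 75.0, 25.0, 50.0]})

# Sorting the unique percentiles guarantees ascending order in the
# resulting cube, regardless of row order in the input table.
for percentile in np.sort(table["percentile"].unique()):
    print(percentile)  # 25.0, 50.0, 75.0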
improver/calibration/__init__.py
Outdated

    Args:
        forecast_table:
            DataFrame expected to contain the following columns: .
Is the list of columns missing here?
Yes, I should correct this.
I've corrected this docstring and added some extended documentation related to Lucy's comment.
improver/calibration/__init__.py
Outdated
        forecast_table:
            DataFrame expected to contain the following columns: .
        truth_table:
            DataFrame expected to contain the following columns: .
As above.
Corrected as for forecast_table.
Thanks for this @gavinevans! I will pull your branch and check how it works with our site forecast and observation data tomorrow. I did have some questions too.
There seems to be a variety of dtypes used for dates: here we have used pandas (via `pd.date_range`), numpy (`np.timedelta64` and `datetime64`) and int64 (`np.int64`). Elsewhere in IMPROVER, the datetime package is commonly used. Should we try to standardise the data type used for dates?
Just a question (I have no idea) - is it faster/more efficient to build a cube for each time & percentile and merge, rather than shape the data and create a single cube for the date range?
I noticed that in #1581 'df' is used and 'table' is used here to refer to the dataframes. Consistency might be good. I have some preference towards 'df' because python also has a datatable package.
improver/calibration/__init__.py
Outdated


def _unique_check(table: DataFrame, column: str) -> Any:
    """Check whether the value in the column is unique.
nitpick
"""Check whether the value in the column is unique. | |
"""Check whether the values in the column is unique. |
Corrected.
improver/calibration/__init__.py
Outdated
            DataFrame expected to contain the following columns: forecast,
            blend_time, forecast_period, forecast_reference_time, time,
            wmo_id, percentile, diagnostic, latitude, longitude, period,
            height, cf_name, units.
Are other, optional columns allowed? If so could this be documented somewhere?
Also it might be nice to have an explanation of the columns/example table documented somewhere.
Yes, I'll try to add some extended documentation for this. In the meantime, if you have any comments related to this, that would be good to know.
I've amended this docstring and added some extended documentation to provide information about the forecast and truth tables.
improver/calibration/__init__.py
Outdated
    for coord in ["time", "forecast_reference_time"]:
        table[coord] = table[coord].dt.tz_localize(None)
Just a question, why do we want to drop the time zone information from forecasts? Also would they all be UTC?
This was done initially for ease but I think these lines should be removable.
I've removed this. All times are UTC.
Yes, all our forecasts will be UTC; the tzinfo is just there in case someone wants to do a join on the obs table (which will also be explicitly UTC).
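For reference, a small illustration of what the since-removed `dt.tz_localize(None)` lines did to a tz-aware column:

import pandas as pd

times = pd.Series(pd.date_range("2017-01-09", periods=2, freq="6H", tz="UTC"))
print(times.dtype)  # datetime64[ns, UTC]

# tz_localize(None) strips the timezone, leaving naive timestamps
# that still represent the same UTC wall-clock times.
naive = times.dt.tz_localize(None)
print(naive.dtype)  # datetime64[ns]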
improver/calibration/__init__.py
Outdated
    frt_point = np.datetime64(
        time_table["forecast_reference_time"].values[0], "s"
    ).astype(np.int64)
Question, why do we want to use int64 here?
A follow-up question to this: if we do want this value to be an integer, should `time_point` also be an integer?
All datetimes are now using pandas datetime functionality, where possible, I think. Times are only converted to integers as part of the coordinate creation.
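For reference, the conversion in the snippet above turns a timestamp into integer seconds since the epoch, e.g.:

import numpy as np
import pandas as pd

frt = pd.Timestamp("2017-01-09 00:00:00")

# Truncate to second precision, then reinterpret as int64 seconds
# since 1970-01-01, as used when building the time coordinates.
frt_point = np.datetime64(frt, "s").astype(np.int64)
print(frt_point)  # 1483920000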
improver/calibration/__init__.py
Outdated
        table[coord] = table[coord].dt.tz_localize(None)

    cubelist = CubeList()
    for adate in date_range:
I wonder if it is a good idea to name a variable the same as a function (`pd.date_range`)?
Good idea. I've updated this to `training_dates`.
improver/calibration/__init__.py
Outdated
    return RebadgePercentilesAsRealizations()(cube)


def truth_table_to_cube(
It seems like some of the actions of `truth_table_to_cube` and `forecast_table_to_cube` are common. In the interest of DRY (don't repeat yourself), I wonder if we could use some functions to avoid repetition?
Yes, once I've done some modifications related to all the other comments, I'll review the commonality between these functions.
I've tried to factor out a few functions, where possible.

    Returns:
        Cube containing the forecasts from the training period.
    """
I wonder if some checking to ensure that the input dataframe has the correct columns before we do anything else would be a good idea?
Yes, I'll add this.
I've added a function to check this upfront for the forecasts and truths.
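A hypothetical sketch of such an upfront check (the actual helper name and error wording in the PR may differ):

from typing import List

from pandas import DataFrame


def _dataframe_column_check(df: DataFrame, expected: List[str]) -> None:
    """Raise an error if the DataFrame is missing any expected column."""
    missing = [col for col in expected if col not in df.columns]
    if missing:
        raise ValueError(f"The DataFrame is missing these columns: {missing}")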
improver/calibration/__init__.py
Outdated
    time_table[col] = table.groupby(by="wmo_id", sort=False)[col].agg(
        lambda x: pd.Series.mode(x, dropna=not table[col].isna().all())
    )
As above, not sure how this would work for sites that don't have a WMO ID. Also, is it common for alt/lat/lon to have different values at a single time?
Regarding different values for alt/lat/lon, note that I'm computing the mode from the whole table, rather than the `time_table` subset. Over the course of a training dataset, the alt/lat/lon for a site can change, which would prevent the cubes merging. Maybe that helps?
Thanks @gavinevans, but I'm curious as to how/why alt/lat/lon can change for a site? What kind of observation sources are you using here?
@LaurenceBeard has suggested that this is just small corrections to the recorded alt/lat/lon of sites following re-assessment.
We are only using synoptic land observations (which shouldn't move!) but sometimes our master station list does get updated (usually to correct, sometimes to adjust). As a result, we should prefer the forecast table's position data: for a trial this should not change, but for a long-term time series over releases, this may have differences.
Nonetheless, these values are just being added to the netCDF for consistency; they are not being used.
So I would be tempted to drop this calculation, and just use the forecast table's values (ignoring these columns from the truth table).
I've had a go at doing some refactoring to avoid using the alt/lat/lon from the truths at all and replace with those from the forecasts in this commit: 6b2ab0f. However, I've realised that I currently still need this calculation to handle missing observations. I'll look into this more tomorrow.
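For illustration, a minimal reproduction (with made-up site data) of the mode aggregation in the snippet above:

import pandas as pd

table = pd.DataFrame(
    {
        "wmo_id": ["03772", "03772", "03772", "03866"],
        "altitude": [25.0, 27.0, 27.0, 10.0],  # one site re-assessed mid-period
    }
)

# The per-site mode picks the most frequently recorded value, keeping
# the site coordinates consistent so that the cubes can merge.
modes = table.groupby(by="wmo_id", sort=False)["altitude"].agg(
    lambda x: pd.Series.mode(x, dropna=True)
)
print(modes)  # 03772 -> 27.0, 03866 -> 10.0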
improver/calibration/__init__.py
Outdated
    )
    # Replace empty arrays generated by the mode with NaNs.
    time_table[col] = time_table[col].apply(
        lambda x: np.nan if isinstance(x, np.ndarray) else x
If x is an array, will it always be empty?
I've actually removed this line now. Previously I was anticipating columns that were potentially all NaNs, e.g. a `period` column for an instantaneous diagnostic. As we now haven't put a `period` column on the truth table, this line isn't required, because I'm not anticipating computing the mode of a column full of NaNs, which was previously resulting in an empty array.
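For context, a short demonstration of why an all-NaN column produced empty arrays:

import numpy as np
import pandas as pd

# With dropna=True (the default), the mode of an all-NaN column is an
# empty Series, which is what produced the empty arrays handled above.
print(pd.Series.mode(pd.Series([np.nan, np.nan])))
# Series([], dtype: float64)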
improver/calibration/__init__.py
Outdated
-        Forecasts and truths for the training period that have been filtered
-        only include sites that are present in both the forecasts and truths
+        Forecasts and truths for the training period that have been filtered to
+        only include sites that are present in both the forecasts and truths.
nitpick
Corrected.
A couple of comments.
The other thing we discussed last night is that we are likely to pass in the same list of lead times for all cycles, although some lead times will not find valid data in the table. In this case we probably want to return nothing, and the Estimate EMOS CLI will not produce a coefficients file.
@lucyleeow Regarding the following points:
Yes, I've had a go at primarily using the datetime package. This is related.
Good idea. I'm going to use `df`.
Hi @gavinevans, I've had a look and have some additional comments. I've also put up a comment on https://github.com/MetOffice/improver_suite/issues/961 regarding forecast/truth columns.
Force-pushed from 684ce01 to 8974924.
Some initial comments on documentation, before I stopped reviewing for the night. I'll take another look tomorrow morning
Resolved review threads on:
- doc/source/extended_documentation/calibration/forecast_dataframe_metadata_info.csv
- doc/source/extended_documentation/calibration/calibration_data_ingestion.rst
- doc/source/extended_documentation/calibration/forecast_dataframe_example.csv
Thanks for addressing all our comments. This looks fine and workable for us. Happy to take a look again on Monday, especially at the tests as they are failing now, or if you have time constraints feel free to merge before then.
The only other comment I would make is that I note that 'site' information (e.g., lat/lon/height) is stored in the forecast df, whereas we may be more interested in observation site information, e.g., the height of the observation site. I may not fully grasp what data is stored in the forecast df though.
        A DataFrame without numpy datetime dtypes.
    """
    for col in [c for c in df.columns if df[c].dtype == "datetime64[ns]"]:
        df[col] = df[col].dt.tz_localize("UTC").astype("O")
Question - why are we wanting dates to be 'object' data type?
I've extended the docstring for clarity. Converting the columns to "object" dtype results in the values within the columns being pandas datetime objects (Timestamp and Timedelta), rather than numpy datetime objects. Later functions then only need to handle pandas datetime objects.
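A small illustration of the effect (assuming a plain datetime64 column):

import pandas as pd

df = pd.DataFrame({"time": pd.date_range("2017-01-09", periods=2, freq="6H")})
print(type(df["time"].values[0]))  # <class 'numpy.datetime64'>

# Localising to UTC and casting to object dtype leaves tz-aware
# pandas Timestamps in the column instead of numpy datetime64 values.
df["time"] = df["time"].dt.tz_localize("UTC").astype("O")
print(df["time"].dtype)            # object
print(type(df["time"].values[0]))  # <class 'pandas._libs.tslibs.timestamps.Timestamp'>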
        at least one day prior to the cycletime. The final validity time
        within the training dataset is additionally offset by the number
        of days within the forecast period to ensure that the dates defined
        by the training dataset are in the past relative to the cycletime.
Suggestion only, feel free to ignore. Could we maybe give an example here?
I've added an example here.
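The docstring example itself is not shown in this diff; as a hypothetical illustration of the arithmetic described above (the numbers and names here are made up):

import pandas as pd

# Hypothetical numbers: a 20170109T0000Z cycletime, a 2-day forecast
# period and a 5-day training length.
cycletime = pd.Timestamp("2017-01-09 00:00")
forecast_period = pd.Timedelta(days=2)
training_length = 5

# Final validity time: one day before the cycletime, offset by the
# number of whole days within the forecast period.
final_validity = cycletime - pd.Timedelta(days=1) - forecast_period
training_dates = pd.date_range(end=final_validity, periods=training_length, freq="D")
print(training_dates[0], training_dates[-1])
# 2017-01-02 00:00:00 2017-01-06 00:00:00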
        cycletime:
            Cycletime of a format similar to 20170109T0000Z.
Is this the last cycle time of the training period? If so, could we note it down?
This is the cycletime of the current cycle, rather than the final validity time within the training dataset, which is what this function calculates. I've expanded the docstring to try to make it clearer.
improver/calibration/__init__.py
Outdated
    forecast_period: int,
    training_length: int,
) -> Tuple[Cube, Cube]:
    """Convert a truth DataFrame into an iris Cube.
I think this docstring needs updating?
Yes, I've updated this docstring.
Just a few queries/comments from me, it's looking good though!
Resolved review thread on: doc/source/extended_documentation/calibration/forecast_dataframe_example.csv
Force-pushed from e97bd1c to e1a762c.
I'm now happy with these changes. Thanks Gavin!
Happy for this to go in, testing related issues can be addressed at a later date.
        self.forecast_period,
        self.training_length,
    )
    self.assertEqual(len(result), 2)
This is good functionality to ensure (though station_id is one we may end up using); maybe use a 'blah' column or something that would never exist?
        return
    cube = cubelist.merge_cube()

    return RebadgePercentilesAsRealizations()(cube)
Is a check for equally spaced percentiles done within this plugin?
Force-pushed from 8b134fa to ed31c06.
Merged commit: Add functions to support conversion of a pandas DataFrame into an iris cube (metoppv#1582)
* Modifications to functions required to support converting tabular data into iris cubes.
* Modifications to the columns expected within the truth table.
* Modify environments in preparation for changes required for ingestion of forecast and observation tables.
* Edits to expect the period column to be a timedelta64 dtype.
* Corrections to __init__.py
* Modifications to use assertCubeEqual to ensure the full cubes are compared.
* Modifications following review comments.
* Minor updates to docstrings.
* Add missing unit tests.
* Correction to csv files.
* Sort lists.
* Minor docstring amendment.
* NOT WORKING: Working commit to try to avoid using the truth altitude, latitude and longitude at all and replace with those from the forecast.
* Extend docstrings.
* Extended documentation updates.
* Refinement and addition of tests for column name checking.
* Modifications to tidy up dataframe preparation.
* Minor extended documentation edits.
* Further minor docstring edit.
* Correct test class naming.
* Fix isort.
Addresses part of #1538
Dependent on #1572. Please note that this currently contains the same commit as in #1572 to facilitate GitHub Actions.
Description
This PR adds functionality for converting a pandas DataFrame into an iris cube for the purposes of providing forecast and truth cubes for EMOS. The forecast and truth tables are expected to contain the following columns:
Forecast: forecast, blend_time, forecast_period, forecast_reference_time, time, wmo_id, percentile, diagnostic, latitude, longitude, period, height, cf_name, units.
Truth: ob_value, time, wmo_id, diagnostic, latitude, longitude, altitude.
Further information about table formatting is in https://github.com/MetOffice/improver_suite/issues/961.
Testing: