
Add functions to support conversion of a pandas DataFrame into an iris cube #1582

Merged

Conversation

gavinevans
Contributor

Addresses part of #1538

Dependent on #1572. Please note that this currently contains the same commit as in #1572 to facilitate GitHub Actions.

Description
This PR adds functionality for converting a pandas DataFrame into an iris cube, to provide the forecast and truth cubes needed for EMOS. The forecast and truth tables are expected to contain the following columns:

Forecast: forecast, blend_time, forecast_period, forecast_reference_time, time, wmo_id, percentile, diagnostic, latitude, longitude, period, height, cf_name, units.

Truth: ob_value, time, wmo_id, diagnostic, latitude, longitude, altitude.

Further information about table formatting is in https://github.com/MetOffice/improver_suite/issues/961.

Testing:

  • Ran tests and they passed OK
  • Added new tests for the new feature(s)

@gavinevans gavinevans added this to the 1.0.0 milestone Oct 11, 2021
@gavinevans gavinevans self-assigned this Oct 11, 2021
@gavinevans gavinevans changed the title Improver1538 tabular ingestion functions Add functions to support conversion of a pandas DataFrame into an iris cube Oct 11, 2021
@gavinevans gavinevans removed their assignment Oct 11, 2021
@codecov

codecov bot commented Oct 11, 2021

Codecov Report

Merging #1582 (ed31c06) into master (f63415a) will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #1582      +/-   ##
==========================================
+ Coverage   98.03%   98.05%   +0.01%     
==========================================
  Files         109      109              
  Lines        9817     9914      +97     
==========================================
+ Hits         9624     9721      +97     
  Misses        193      193              
Impacted Files                      Coverage Δ
improver/calibration/__init__.py    100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f63415a...ed31c06.

@gavinevans gavinevans marked this pull request as draft October 11, 2021 16:25
@gavinevans gavinevans force-pushed the improver1538_tabular_ingestion_functions branch 2 times, most recently from 0177a77 to 25005bb Compare October 12, 2021 07:31
@gavinevans gavinevans marked this pull request as ready for review October 12, 2021 10:40
ValueError: Only one unique value within the specified column
is expected.
"""
if len(table[column].unique()) > 1:
Contributor

This could be simplified slightly with `if table[column].nunique() > 1`

Contributor

Actually, it seems these give different results if there are NaNs. E.g. if we have df = pd.DataFrame({'a': [np.nan, 2, 2]}) then len(df["a"].unique()) is 2, but df["a"].nunique() is 1. I'm not sure which is more appropriate here.

Contributor Author

Thanks for the suggestion. I wasn't aware of nunique. I think df["a"].nunique(dropna=False) would give a value of 2 following your example, which I think would be what I expect here, so I could implement that?

Contributor Author

I've used df[column].nunique(dropna=False) as suggested.
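
For reference, a minimal sketch of the behaviour discussed in this thread (pandas and numpy only; values taken from the example above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.nan, 2, 2]})

# unique() keeps NaN as a distinct value; nunique() drops NaN by default.
print(len(df["a"].unique()))          # 2 (NaN and 2)
print(df["a"].nunique())              # 1 (NaN dropped)
print(df["a"].nunique(dropna=False))  # 2, matching len(df["a"].unique())
```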

units="m",
)

for percentile in table["percentile"].unique():
Contributor

Do we need to sort the result of table["percentile"].unique(), to ensure the percentiles in the final cube will be ordered?

Contributor Author

OK, I hadn't thought about this because the percentile column was sorted in the table that I was using. For robustness, it might make sense to sort the percentiles here. I'll take a look at it.

Contributor Author

I've added sorting of the percentiles to ensure the order is as expected.
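
A minimal sketch of the sorting described above; the variable names are illustrative rather than the exact code in the PR:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"percentile": [50.0, 25.0, 75.0, 25.0]})

# Sorting the unique percentiles guarantees an ascending, monotonic
# percentile coordinate regardless of the row order in the table.
for percentile in np.sort(df["percentile"].unique()):
    print(percentile)  # 25.0, 50.0, 75.0
```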


Args:
forecast_table:
DataFrame expected to contain the following columns: .
Contributor

Is list of columns missing here?

Contributor Author

Yes, I should correct this.

Contributor Author

I've corrected this docstring and added some extended documentation related to Lucy's comment.

forecast_table:
DataFrame expected to contain the following columns: .
truth_table:
DataFrame expected to contain the following columns: .
Contributor

As above.

Contributor Author

Corrected as for forecast_table.

Contributor

@lucyleeow lucyleeow left a comment

Thanks for this @gavinevans ! I will pull your branch and check how it works with our site forecast and observation data tomorrow. I did have some questions too.

There seems to be a variety of dtypes used for dates: here we have used pandas (via pd.date_range), numpy (np.timedelta64 and np.datetime64) and integers (np.int64). Elsewhere in IMPROVER, the datetime package is commonly used. Should we try to standardise the data type used for dates?

Just a question (I have no idea) - is it faster/more efficient to build a cube for each time & percentile and merge, rather than to shape the data and create a single cube for the date range?

I noticed that in #1581 'df' is used and 'table' is used here to refer to the dataframes. Consistency might be good. I have some preference towards 'df' because Python also has a datatable package.



def _unique_check(table: DataFrame, column: str) -> Any:
"""Check whether the value in the column is unique.
Contributor

nitpick

Suggested change
"""Check whether the value in the column is unique.
"""Check whether the values in the column is unique.

Contributor Author

Corrected.

Comment on lines 192 to 195
DataFrame expected to contain the following columns: forecast,
blend_time, forecast_period, forecast_reference_time, time,
wmo_id, percentile, diagnostic, latitude, longitude, period,
height, cf_name, units.
Contributor

Are other, optional columns allowed? If so could this be documented somewhere?

Also it might be nice to have an explanation of the columns/example table documented somewhere.

Contributor Author

Yes, I'll try to add some extended documentation for this. In the meantime, if you have any thoughts related to this, that would be good to know.

Contributor Author

I've amended this docstring and added some extended documentation to provide information about the forecast and truth tables.

Comment on lines 204 to 482
for coord in ["time", "forecast_reference_time"]:
table[coord] = table[coord].dt.tz_localize(None)

Contributor

Just a question, why do we want to drop the time zone information from forecasts? Also would they all be UTC?

Contributor Author

This was done initially for ease but I think these lines should be removable.

Contributor Author

I've removed this. All times are UTC.

Contributor

Yes, all our forecasts will be UTC; the tzinfo is just there in case someone wants to do a join on the obs table (which will also be explicit).
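
A sketch of the kind of explicit-UTC join described here; the column names are assumptions for illustration:

```python
import pandas as pd

forecasts = pd.DataFrame(
    {"time": pd.to_datetime(["2021-10-11 06:00"]).tz_localize("UTC"),
     "forecast": [281.5]}
)
obs = pd.DataFrame(
    {"time": pd.to_datetime(["2021-10-11 06:00"]).tz_localize("UTC"),
     "ob_value": [280.9]}
)

# With both tables carrying explicit UTC tzinfo, a join on time is
# unambiguous and cannot silently mix naive and aware timestamps.
merged = forecasts.merge(obs, on="time")
```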

Comment on lines 224 to 226
frt_point = np.datetime64(
time_table["forecast_reference_time"].values[0], "s"
).astype(np.int64)
Contributor

Question, why do we want to use int64 here?

Contributor

A follow-up question to this: if we do want this value to be an integer, should time_point also be an integer?

Contributor Author

All datetimes now use pandas datetime functionality where possible, I think. Times are only converted to integers as part of the coordinate creation.
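
A sketch of converting a pandas timestamp to an integer only at the point of coordinate creation; the coordinate name and units string follow CF conventions and are not quoted from the PR:

```python
import numpy as np
import pandas as pd
from iris.coords import AuxCoord

frt = pd.Timestamp("2017-01-09 00:00", tz="UTC")

# Keep pandas datetimes throughout; convert to integer seconds since
# the epoch only when the iris coordinate is actually built.
frt_point = np.int64(frt.timestamp())
frt_coord = AuxCoord(
    frt_point,
    standard_name="forecast_reference_time",
    units="seconds since 1970-01-01 00:00:00 UTC",
)
```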

table[coord] = table[coord].dt.tz_localize(None)

cubelist = CubeList()
for adate in date_range:
Contributor

I wonder if it is a good idea to give a variable the same name as a function (pd.date_range)?

Contributor Author

Good idea. I've updated this to training_dates.

return RebadgePercentilesAsRealizations()(cube)


def truth_table_to_cube(
Contributor

It seems like some of the actions of truth_table_to_cube and forecast_table_to_cube are common. In the interest of DRY (don't repeat yourself), I wonder if we could use some functions to avoid repetition?

Contributor Author

Yes, once I've done some modifications related to all the other comments, I'll review the commonality between these functions.

Contributor Author

I've tried to factor out a few functions, where possible.


Returns:
Cube containing the forecasts from the training period.
"""
Contributor

I wonder if some checking to ensure that the input dataframe has the correct columns before we do anything else would be a good idea?

Contributor Author

Yes, I'll add this.

Contributor Author

I've added a function to check this upfront for the forecasts and truths.
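
A hedged sketch of this kind of upfront check; the function name and message are illustrative, not the PR's actual implementation:

```python
import pandas as pd

def check_columns(df: pd.DataFrame, expected: list, label: str) -> None:
    """Raise early if any expected column is absent from the DataFrame."""
    missing = [col for col in expected if col not in df.columns]
    if missing:
        raise ValueError(f"The {label} DataFrame is missing columns: {missing}")

# Example usage with a subset of the truth columns from the PR description.
truth_df = pd.DataFrame(columns=["ob_value", "time", "wmo_id"])
check_columns(truth_df, ["ob_value", "time", "wmo_id"], "truth")  # passes
```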

Comment on lines 349 to 351
time_table[col] = table.groupby(by="wmo_id", sort=False)[col].agg(
lambda x: pd.Series.mode(x, dropna=not table[col].isna().all())
)
Contributor

As above, not sure how this would work for sites that don't have a wmo id. Also is it common for alt/lat/lon to have different values, at a single time?

Contributor Author

Regarding different values for alt/lat/lon, note that I'm computing the mode from the whole table, rather than the time_table subset. Over the course of a training dataset, the alt/lat/lon for a site can change, which would prevent the cubes merging. Maybe that helps?

Contributor

Thanks @gavinevans, but I'm curious as to how/why alt/lat/lon can change for a site? What kind of observation sources are you using here?

Contributor Author

@LaurenceBeard has suggested that this is just small corrections to the recorded alt/lat/lon of sites following re-assessment.

Contributor

We are only using synoptic land observations (which shouldn't move!) but sometimes our master station list does get updated (usually to correct, sometimes adjust). As a result we should prefer the forecast table's position data - for a trial this should not change - but for a long-term time series over releases, this may have differences.

Nonetheless, these values are just being added to the netCDF for consistency; they are not being used.

Contributor

So I would be tempted to drop this calculation and just use the forecast table's values (i.e. ignore these columns from the truth table).

Contributor Author

I've had a go at doing some refactoring to avoid using the alt/lat/lon from the truths at all and replace with those from the forecasts in this commit: 6b2ab0f. However, I've realised that I currently still need this calculation to handle missing observations. I'll look into this more tomorrow.

)
# Replace empty arrays generated by the mode with NaNs.
time_table[col] = time_table[col].apply(
lambda x: np.nan if isinstance(x, np.ndarray) else x
Contributor

If x is an array, will it always be empty?

Contributor Author

I've actually removed this line now. Previously I was anticipating columns that were potentially all NaNs (e.g. a period column for an instantaneous diagnostic), but as we now haven't put a period column on the truth table, this line isn't required: I'm no longer computing the mode of a column full of NaNs, which was what previously resulted in an empty array.
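
For context, a small demonstration of the empty result that the mode produces for an all-NaN column (pandas and numpy only):

```python
import numpy as np
import pandas as pd

all_nan = pd.Series([np.nan, np.nan])

# With the default dropna=True, mode() on an all-NaN series returns an
# empty Series (which surfaced as an empty ndarray via the groupby agg);
# with dropna=False, NaN itself is reported as the mode.
print(all_nan.mode())              # empty Series
print(all_nan.mode(dropna=False))  # 0   NaN
```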

Comment on lines 411 to 412
Forecasts and truths for the training period that have been filtered
only include sites that are present in both the forecasts and truths
Contributor

Suggested change
Forecasts and truths for the training period that have been filtered
only include sites that are present in both the forecasts and truths
Forecasts and truths for the training period that have been filtered to
only include sites that are present in both the forecasts and truths.

nitpick

Contributor Author

Corrected.

Contributor

@fionaRust fionaRust left a comment

A couple of comments.
The other thing we discussed last night is that we are likely to pass in the same list of lead times for all cycles, although some lead times will not find valid data in the table. In this case we probably want to return nothing, and the Estimate EMOS CLI will not produce a coefficients file.

@gavinevans
Contributor Author

@lucyleeow Regarding the following points:

There seems to be a variety of dtypes used for dates, here we have used pandas (via pd.date_range), numpy (np.timedelta64 and datetime64) and int64 (np.int64). Elsewhere in IMPROVER, the datetime package is commonly used. Should we try to standardise the data type used for dates?

Yes, I've had a go at primarily using the datetime package. This is related.

I noticed that in #1581 'df' is used and 'table' is used here to refer to the dataframes. Consistency might be good. I have some preference towards 'df' because python also has a datatable package.

Good idea. I'm going to use df for consistency.

Contributor

@benowen-bom benowen-bom left a comment

Hi @gavinevans, I've had a look and have some additional comments. I've also put up a comment on https://github.com/MetOffice/improver_suite/issues/961 regarding forecast/truth columns.

@gavinevans gavinevans force-pushed the improver1538_tabular_ingestion_functions branch from 684ce01 to 8974924 Compare October 14, 2021 14:57
Contributor

@fionaRust fionaRust left a comment

Some initial comments on documentation, before I stopped reviewing for the night. I'll take another look tomorrow morning.

Contributor

@lucyleeow lucyleeow left a comment

Thanks for addressing all our comments. This looks fine and workable for us. Happy to take a look again on Monday, especially at the tests as they are failing now, or if you have time constraints feel free to merge before then.

The only other comment I would make is that I note that 'site' information (e.g., lat/lon/height) is stored in the forecast df, whereas we may be more interested in observation site information, e.g., the height of the observation site. I may not fully grasp what data is stored in the forecast df though.

A DataFrame without numpy datetime dtypes.
"""
for col in [c for c in df.columns if df[c].dtype == "datetime64[ns]"]:
df[col] = df[col].dt.tz_localize("UTC").astype("O")
Contributor

Question - why are we wanting dates to be 'object' data type?

Contributor Author

I've extended the docstring for clarity. Converting the columns to "object" dtype results in the values within the columns being pandas datetime objects (Timestamp and Timedelta), rather than numpy datetime objects. Later functions then only need to handle pandas datetime objects.
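
A minimal sketch of the conversion being described:

```python
import pandas as pd

df = pd.DataFrame({"time": pd.to_datetime(["2021-10-11 06:00"])})

# datetime64[ns] column -> tz-aware UTC -> "object" dtype, so each
# element becomes a pandas Timestamp rather than a numpy datetime64.
df["time"] = df["time"].dt.tz_localize("UTC").astype("O")
print(type(df["time"].iloc[0]))  # <class '...Timestamp'>
```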

Comment on lines +278 to +365
at least one day prior to the cycletime. The final validity time
within the training dataset is additionally offset by the number
of days within the forecast period to ensure that the dates defined
by the training dataset are in the past relative to the cycletime.
Contributor

Suggestion only, feel free to ignore. Could we maybe give an example here?

Contributor Author

I've added an example here.

Comment on lines +284 to +380
cycletime:
Cycletime of a format similar to 20170109T0000Z.
Contributor

Is this the last cycle time of the training period? If so, could we note it down?

Contributor Author

This is the cycletime of the current cycle, rather than the final validity time within the training dataset, which is what this function calculates. I've expanded the docstring to try to make it clearer.
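
An illustrative reading of the offsets described above; the arithmetic here is my interpretation for the example, not a quote from the PR:

```python
import datetime

cycletime = datetime.datetime(2017, 1, 9, 0, 0)  # 20170109T0000Z
forecast_period = datetime.timedelta(hours=30)   # spans into a second day
training_length = 5                              # days

# The final validity time sits at least one day before the cycletime,
# pushed back further by the whole days spanned by the forecast period,
# so every training date is fully in the past relative to the cycletime.
offset_days = 1 + forecast_period.days
final_validity = cycletime - datetime.timedelta(days=offset_days)
training_dates = [final_validity - datetime.timedelta(days=d)
                  for d in range(training_length)]
print(training_dates[0])  # 2017-01-07 00:00:00
```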

forecast_period: int,
training_length: int,
) -> Tuple[Cube, Cube]:
"""Convert a truth DataFrame into an iris Cube.
Contributor

I think this docstring needs updating?

Contributor Author

Yes, I've updated this docstring.

Contributor

@BelligerG BelligerG left a comment

Just a few queries/comments from me, it's looking good though!

@gavinevans gavinevans force-pushed the improver1538_tabular_ingestion_functions branch 2 times, most recently from e97bd1c to e1a762c Compare October 15, 2021 10:21
fionaRust previously approved these changes Oct 18, 2021
Contributor

@fionaRust fionaRust left a comment

I'm now happy with these changes. Thanks Gavin!

LaurenceBeard previously approved these changes Oct 19, 2021
Contributor

@LaurenceBeard LaurenceBeard left a comment

Happy for this to go in, testing related issues can be addressed at a later date.

self.forecast_period,
self.training_length,
)
self.assertEqual(len(result), 2)
Contributor

This is good functionality to ensure (though station_id is one we may end up using), maybe a 'blah' column or something that would never exist?

return
cube = cubelist.merge_cube()

return RebadgePercentilesAsRealizations()(cube)
Contributor

Is a check for equally spaced percentiles done within this Plugin?
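
For reference, a sketch of the kind of spacing check being asked about; this is not asserting what RebadgePercentilesAsRealizations itself does:

```python
import numpy as np

def equally_spaced(percentiles) -> bool:
    """Return True if the percentile values are evenly spaced."""
    diffs = np.diff(np.sort(np.asarray(percentiles, dtype=float)))
    return bool(np.allclose(diffs, diffs[0]))

print(equally_spaced([25, 50, 75]))  # True
print(equally_spaced([10, 50, 75]))  # False
```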

fionaRust previously approved these changes Oct 19, 2021
@gavinevans gavinevans force-pushed the improver1538_tabular_ingestion_functions branch from 8b134fa to ed31c06 Compare October 21, 2021 17:35
@fionaRust fionaRust self-assigned this Oct 22, 2021
@bayliffe bayliffe merged commit f6f90ce into metoppv:master Oct 22, 2021
MoseleyS pushed a commit to MoseleyS/improver that referenced this pull request Aug 22, 2024
Add functions to support conversion of a pandas DataFrame into an iris cube (metoppv#1582)

* Modifications to functions required to support converting tabular data into iris cubes.

* Modifications to the columns expected within the truth table.

* Modify environments in preparation for changes required for ingestion of forecast and observation tables.

* Edits to expect the period column to be a timedelta64 dtype.

* Corrections to __init__.py

* Modifications to use assertCubeEqual to ensure the full cubes are compared.

* Modifications following review comments.

* Minor updates to docstrings.

* Add missing unit tests.

* Correction to csv files.

* Sort lists.

* Minor docstring amendment.

* NOT WORKING: Working commit to try to avoid using the truth altitude, latitude and longitude at all and replace with those from the forecast.

* Extend docstrings.

* Extended documentation updates.

* Refinement and addition of tests for column name checking.

* Modifications to tidy up dataframe preparation.

* Minor extended documentation edits.

* Further minor docstring edit.

* Correct test class naming.

* Fix isort.
8 participants