Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/tc forecast hdf5 #33

Merged
merged 8 commits into from
Mar 7, 2022
Merged

Feature/tc forecast hdf5 #33

merged 8 commits into from
Mar 7, 2022

Conversation

ThomasRoosli
Copy link
Collaborator

Hdf5 input and output have been added to the TCTracks baseclass in this pull request in climada_python: CLIMADA-project/climada_python#349
This pull request makes sure this added funtionality does not fail in TCForecast which inherits from TCTracks.
(1) I have overloaded the methods write_hdf5 and from_hdf5
(2) I have changed the 'basin' attribute to dtype '<U2', (this decision was made in the baseclass TCTracks for details see the pull request mentioned above)
(3) I added the forecast_time attribute to the docstring of the TCForecast class

I took over the decisions on code structure/docstring etc from the pullrequest mentioned above to be on the same line as the baseclass.

Copy link
Member

@chahank chahank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, although I have some doubts about the inter-dependent variables.

@@ -92,7 +92,11 @@ class TCForecast(TCTracks):
data : list of xarray.Dataset
Same as in parent class, adding the following attributes
- ensemble_member (int)
- is_ensemble (bool; if False, the simulation is a high resolution deterministic run
- is_ensemble (bool; if False, the simulation is a high resolution
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the meaning of ensemble_member is is_ensemble=False ? It is in general not easy to handle when parameters are inter-dependent, and if possible should be avoided.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if i remember right, this is consistent with how ECMWF forecasts are provided (and this is the main use case for the class). the forecast ensemble is a set of tracks, one of which is also a deterministic forecast. it therefore has both an ensemble_member number, but needs to be distinguished as the deterministic run as well.

@manniepmkam - is that right?

Copy link
Collaborator Author

@ThomasRoosli ThomasRoosli Feb 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @ChrisFairless that this decision was made to deal with ECMWF forecasts. If we want to change those two parameters, we would have to change how we read in ECMWF data and I suggest we do that outside of this pull request.

- is_ensemble (bool; if False, the simulation is a high resolution deterministic run
- is_ensemble (bool; if False, the simulation is a high resolution
deterministic run)
- forecast_time (numpy.datetime64): timepoint of the initialisation of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you have an ensemble, must all forecast_time be identical?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question. the answer is probably yes. i'm not sure when you'd want to mix different forecast initialisation times in a single Hazard object.

Copy link
Collaborator Author

@ThomasRoosli ThomasRoosli Feb 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue that a TCForecast object does not require that all forecast_time are identical. But if they are part of the same ensemble they will be identical. Also if the resulting hazard is used in the Forecast class, the forecast_time needs to be identical. That is ensured by the parameter interface chosen in the forecast class.

I assume no change is needed based on this comment.

Path to a new HDF5 file. If it exists already, the file is overwritten.
complevel : int
Specifies a compression level (0-9) for the zlib compression of the data.
A value of 0 or None disables compression. Default: 5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to use 5?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's consistent with the parent class. we could make it a class attribute of the parent class so that it's inherited, but it's probably not that important.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we leave it as it is. If we want to link it to the default of the parent class we would also have to think about a smart way to handle the docstring, so that the default is stated there as well. I thought about something like that for a while, but found no easy and readable solution.

def test_hdf5_io(self):
"""Test writting and reading hdf5 TCTracks instances"""
tc_track = TCForecast()
tc_track.fetch_ecmwf(files=TEST_BUFR_FILES)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these test files already on the API? If not, it might be a good moment to do so and adapt the test accordingly... What do you think @emanuel-schmid ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two files are about 6 KB in size. I suggest we leave them in the repository for now. If we want to move them to the API I would appreciate some guidance or help.

- is_ensemble (bool; if False, the simulation is a high resolution
deterministic run)
- forecast_time (numpy.datetime64): timepoint of the initialisation of
the numerical weather prediction run
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure to understand. Is this the day on which the prediction was computed, the day on which the computation was started, or the day for which the prediction is?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i think the parameter name could be better here. forecast_initialisation or initialisation_time would be clearer (or 'initialization' depending on our spelling policy).

usually this reflects the time that the forecast calculations were started. so a forecast time of 26/02/2022 00:00 will mean it was started at 00:00, published a few hours later, and the forecast itself will start at 00:00 and run for 240 hours in 6 hour steps (though these vary by provider).

again, this is off the top of my head so correct me if i've misremembered @ThomasRoosli @manniepmkam

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the name of the attribute confusing or the description of the attribute? The only thing I am proposing in this pull request is the description of the previously existing attribute, because it was missing. I tried to be as precise as possible in the description. I think if the user is not familiar with those terms (they should be), then a search in the relevant documentation of e.g. ECMWF will reveal the meaning.
If we want to change the name of that attribute we could also consider run_datetime to be consistent with the Forecast class. What do you think @manniepmkam @chahank?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChrisFairless reply is correct, forecast_time is the time when the forecast simulation is initiated. For me both initialisation_time and run_datetime are good to me if forecast_time is a confusing name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me the issue is that both the naming can be confusing, and the docstring is not clear enough. No big deal, but it can easily be improved.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consistency with other classes is seldom a bad idea - should we change to run_datetime?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with changing the name of the attribute to run_datetime.
I suggest to keep the docstring as it is. It mentions "initialization" and "numerical weather prediction". A user that is not familiar with the topic might do a internet search and hopefully ends up at the combination of those two terms on wikipedia: "https://en.wikipedia.org/wiki/Numerical_weather_prediction#Initialization". I assume it is not the idea to link to the wikipedia articles in our docstrings. I am very happy for further suggestions to improve the docstring.

@ChrisFairless
Copy link
Collaborator

looking at the ncdump of the saved HDF5 file, most variables seem to have a 'coordinates' attribute which is set to "lat lon". this doesn't look right.
image

so i have a question: my understanding of track data says that 'lat' and 'lon' are time-dependent variables and that no other variables are indexed by them. so that would mean they are neither dimensions nor coordinates - just regular variables.

so i think when we're we're writing to netcdf we shouldn't designate them as coordinates at all?

if i've understood this right, then it also affects #349 (paging @tovogt), though it looks like the coordinates are implemented in different ways in the TCTracks from_ibtracs_netcdf and the TCForecast _subset_to_track methods.

@ChrisFairless
Copy link
Collaborator

could you also check that the test file's id_no variables are ok? i'm seeing what look like rounding errors in some ids, e.g. 2.2800000000000002 (this is nothing to do with the new HDF5 functionality). is this harmless?

@ChrisFairless
Copy link
Collaborator

other than the points above, it's looking good and working well!

@tovogt
Copy link
Collaborator

tovogt commented Feb 28, 2022

so i have a question: my understanding of track data says that 'lat' and 'lon' are time-dependent variables and that no other variables are indexed by them. so that would mean they are neither dimensions nor coordinates - just regular variables.

so i think when we're we're writing to netcdf we shouldn't designate them as coordinates at all?

That's an interesting point: According to the CF conventions for NetCDF files, each TC track is a "Discrete Sampling Geometry", namely "a series of data points along a path through space with monotonically increasing times" called a "trajectory" (see https://cfconventions.org/cf-conventions/cf-conventions.html#discrete-sampling-geometries). For that kind of data, the CF conventions tells us:

Every element of every feature must be unambiguously associated with its space and time coordinates and with the feature that contains it. The coordinates attribute must be attached to every data variable to indicate the spatiotemporal coordinate variables that are needed to geo-locate the data.
[from https://cfconventions.org/cf-conventions/cf-conventions.html#coordinates-metadata]

There is also an example given: https://cfconventions.org/cf-conventions/cf-conventions.html#_single_trajectory

So, the lat and lon variables should be listed in coords in the TCForecast implementation (just as in the TCTracks implementation):

track = xr.Dataset(
data_vars={
'max_sustained_wind': ('time', np.squeeze(wnd)),
'central_pressure': ('time', np.squeeze(pre)/100),
'ts_int': ('time', timestep_int),
'lat': ('time', lat),
'lon': ('time', lon),
},
coords={
'time': timestamp,
},
attrs={
'max_sustained_wind_unit': 'm/s',
'central_pressure_unit': 'mb',
'name': name,
'sid': sid,
'orig_event_flag': False,
'data_provider': provider,
'id_no': (int(id_no) + index / 100),
'ensemble_number': msg['ens_number'][index],
'is_ensemble': ens_bool,
'forecast_time': timestamp_origin,
}
)

Now, when it comes to writing this information to a NetCDF file, we don't deal with this convention so far, because xarray is supposed to do it. However, it seems like xarray doesn't do it right. The coordinates attribute in our case should actually contain "time lat lon" but xarray omits the time coordinate and only lists "lat lon". As far as I understand the convention, this is wrong. I opened an issue for that here: pydata/xarray#6310

@ThomasRoosli
Copy link
Collaborator Author

ThomasRoosli commented Feb 28, 2022

So, the lat and lon variables should be listed in coords in the TCForecast implementation (just as in the TCTracks implementation):

Thanks for the question and @ChrisFairless and for the indepth answer @tovogt.
To resolve this for now I will define the lat and lon variables as coords. Then we will wait for xarray to deal with the convention based on the issue opened by @tovogt.

@manniepmkam
Copy link
Collaborator

manniepmkam commented Feb 28, 2022

could you also check that the test file's id_no variables are ok? i'm seeing what look like rounding errors in some ids, e.g. 2.2800000000000002 (this is nothing to do with the new HDF5 functionality). is this harmless?

This inherits the original code from @mmyrte. My understanding is that user can assign a unique ID for each tracks, but if leave None then it assign a random number. @mmyrte is that correct?

@tovogt
Copy link
Collaborator

tovogt commented Feb 28, 2022

@ThomasRoosli I wouldn't literally "wait" for xarray to deal with the coordinates attribute issue. From my experience, I would expect that it might take very long until they merge a fix, then it needs to be released in a new xarray version, and finally this xarray version needs to be used by CLIMADA. I wouldn't be surprised if this process takes one or even several years. In my opinion, this is a very minor issue and we can safely ignore it. But if you think this is relevant for someone, we can discuss how to implement a workaround on our side until there is an official fix available in xarray. I don't know why @ChrisFairless even mentioned this point, I can't think of any situation where a user would be adversely affected by this.

@mmyrte
Copy link
Collaborator

mmyrte commented Mar 1, 2022

could you also check that the test file's id_no variables are ok? i'm seeing what look like rounding errors in some ids, e.g. 2.2800000000000002 (this is nothing to do with the new HDF5 functionality). is this harmless?

This inherits the original code from @mmyrte. My understanding is that user can assign a unique ID for each tracks, but if leave None then it assign a random number. @mmyrte is that correct?

I've no clue why I handled that as a string; I remember being very much behind schedule and just implementing whatever worked at the time. Normally, it would construct a float when reading all forecasts at a given point in time, adding i/100 with i out of n tracks per storm. That being said, I never saw the reasoning behind the id_no, which AFAIK TCTracks simply inherits from the base hazard class, which in turn simply emulated the original MATLAB struct - but it really doesn't have any semantic meaning, it's just there to satisfy the base class conditions. So generating a random number, or even just setting it to 1.0 as default would do IMO.

@ChrisFairless
Copy link
Collaborator

thanks @tovogt for the in-depth response! you're completely right that this doesn't affect functionality, (and, yeah, maybe wasn't worth bringing up). thanks!

@tovogt
Copy link
Collaborator

tovogt commented Mar 1, 2022

@ChrisFairless Don't worry, keep asking when in doubt. 👍 It was very well-spotted from your side. After all, only after a fun read of the CF conventions and some further consideration, we were able to settle this as harmless.

@ThomasRoosli
Copy link
Collaborator Author

I implemented the suggested changes on renaming the attribute and defining lat and lon as coords not data_vars.
I suggest we handle the definition of the attributes id_no, ensemble_member and is_ensemble outside this pull request.

@chahank is it a priorty to move the test files to the API? they are only 6 KB.
@chahank and @ChrisFairless are the points solved to your satisfaction?

@ChrisFairless
Copy link
Collaborator

ChrisFairless commented Mar 1, 2022

After all, only after a fun read of the CF conventions

"fun"

@ChrisFairless
Copy link
Collaborator

i'm happy for this to be merged

@ThomasRoosli ThomasRoosli merged commit e1242f1 into develop Mar 7, 2022
@ThomasRoosli ThomasRoosli deleted the feature/tc_forecast_hdf5 branch March 7, 2022 12:10
},
coords={
'time': timestamp,
'lat': ('time', lat),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ThomasRoosli I'm only seeing this since I'm incidentally also working on this file - this change is wrong, since apparently, you cannot perform dropna() if lat/lon are already coordinates. See https://github.com/CLIMADA-project/climada_petals/pull/33/files#diff-f38c71c49b698e73f5039c46ceb6fe77921909d7754faa380df102339250a25cR486 for the relevant line. Do you mind if I undo this change in my current PR #36?

@@ -417,7 +468,7 @@ def _subset_to_track(msg, index, provider, timestamp_origin, name, id_no):
'id_no': (int(id_no) + index / 100),
'ensemble_number': msg['ens_number'][index],
'is_ensemble': ens_bool,
'forecast_time': timestamp_origin,
'run_datetime': timestamp_origin,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ThomasRoosli Thanks for this rename, there's almost always a mess around this nomenclature. Much better like this. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants