-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/tc forecast hdf5 #33
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, although I have some doubts about the inter-dependent variables.
@@ -92,7 +92,11 @@ class TCForecast(TCTracks): | |||
data : list of xarray.Dataset | |||
Same as in parent class, adding the following attributes | |||
- ensemble_member (int) | |||
- is_ensemble (bool; if False, the simulation is a high resolution deterministic run | |||
- is_ensemble (bool; if False, the simulation is a high resolution |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the meaning of ensemble_member
is is_ensemble=False
? It is in general not easy to handle when parameters are inter-dependent, and if possible should be avoided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if i remember right, this is consistent with how ECMWF forecasts are provided (and this is the main use case for the class). the forecast ensemble is a set of tracks, one of which is also a deterministic forecast. it therefore has both an ensemble_member
number, but needs to be distinguished as the deterministic run as well.
@manniepmkam - is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @ChrisFairless that this decision was made to deal with ECMWF forecasts. If we want to change those two parameters, we would have to change how we read in ECMWF data and I suggest we do that outside of this pull request.
- is_ensemble (bool; if False, the simulation is a high resolution deterministic run | ||
- is_ensemble (bool; if False, the simulation is a high resolution | ||
deterministic run) | ||
- forecast_time (numpy.datetime64): timepoint of the initialisation of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you have an ensemble, must all forecast_time be identical?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good question. the answer is probably yes. i'm not sure when you'd want to mix different forecast initialisation times in a single Hazard object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would argue that a TCForecast object does not require that all forecast_time are identical. But if they are part of the same ensemble they will be identical. Also if the resulting hazard is used in the Forecast class, the forecast_time needs to be identical. That is ensured by the parameter interface chosen in the forecast class.
I assume no change is needed based on this comment.
Path to a new HDF5 file. If it exists already, the file is overwritten. | ||
complevel : int | ||
Specifies a compression level (0-9) for the zlib compression of the data. | ||
A value of 0 or None disables compression. Default: 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to use 5?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's consistent with the parent class. we could make it a class attribute of the parent class so that it's inherited, but it's probably not that important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we leave it as it is. If we want to link it to the default of the parent class we would also have to think about a smart way to handle the docstring, so that the default is stated there as well. I thought about something like that for a while, but found no easy and readable solution.
def test_hdf5_io(self): | ||
"""Test writting and reading hdf5 TCTracks instances""" | ||
tc_track = TCForecast() | ||
tc_track.fetch_ecmwf(files=TEST_BUFR_FILES) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these test files already on the API? If not, it might be a good moment to do so and adapt the test accordingly... What do you think @emanuel-schmid ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two files are about 6 KB in size. I suggest we leave them in the repository for now. If we want to move them to the API I would appreciate some guidance or help.
- is_ensemble (bool; if False, the simulation is a high resolution | ||
deterministic run) | ||
- forecast_time (numpy.datetime64): timepoint of the initialisation of | ||
the numerical weather prediction run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure to understand. Is this the day on which the prediction was computed, the day on which the computation was started, or the day for which the prediction is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i think the parameter name could be better here. forecast_initialisation
or initialisation_time
would be clearer (or 'initialization' depending on our spelling policy).
usually this reflects the time that the forecast calculations were started. so a forecast time of 26/02/2022 00:00 will mean it was started at 00:00, published a few hours later, and the forecast itself will start at 00:00 and run for 240 hours in 6 hour steps (though these vary by provider).
again, this is off the top of my head so correct me if i've misremembered @ThomasRoosli @manniepmkam
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the name of the attribute confusing or the description of the attribute? The only thing I am proposing in this pull request is the description of the previously existing attribute, because it was missing. I tried to be as precise as possible in the description. I think if the user is not familiar with those terms (they should be), then a search in the relevant documentation of e.g. ECMWF will reveal the meaning.
If we want to change the name of that attribute we could also consider run_datetime
to be consistent with the Forecast class. What do you think @manniepmkam @chahank?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ChrisFairless reply is correct, forecast_time
is the time when the forecast simulation is initiated. For me both initialisation_time
and run_datetime
are good to me if forecast_time
is a confusing name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me the issue is that both the naming can be confusing, and the docstring is not clear enough. No big deal, but it can easily be improved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consistency with other classes is seldom a bad idea - should we change to run_datetime
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with changing the name of the attribute to run_datetime
.
I suggest to keep the docstring as it is. It mentions "initialization" and "numerical weather prediction". A user that is not familiar with the topic might do a internet search and hopefully ends up at the combination of those two terms on wikipedia: "https://en.wikipedia.org/wiki/Numerical_weather_prediction#Initialization". I assume it is not the idea to link to the wikipedia articles in our docstrings. I am very happy for further suggestions to improve the docstring.
looking at the ncdump of the saved HDF5 file, most variables seem to have a 'coordinates' attribute which is set to "lat lon". this doesn't look right. so i have a question: my understanding of track data says that 'lat' and 'lon' are time-dependent variables and that no other variables are indexed by them. so that would mean they are neither dimensions nor coordinates - just regular variables. so i think when we're we're writing to netcdf we shouldn't designate them as coordinates at all? if i've understood this right, then it also affects #349 (paging @tovogt), though it looks like the coordinates are implemented in different ways in the TCTracks |
could you also check that the test file's |
other than the points above, it's looking good and working well! |
That's an interesting point: According to the CF conventions for NetCDF files, each TC track is a "Discrete Sampling Geometry", namely "a series of data points along a path through space with monotonically increasing times" called a "trajectory" (see https://cfconventions.org/cf-conventions/cf-conventions.html#discrete-sampling-geometries). For that kind of data, the CF conventions tells us:
There is also an example given: https://cfconventions.org/cf-conventions/cf-conventions.html#_single_trajectory So, the climada_petals/climada_petals/hazard/tc_tracks_forecast.py Lines 450 to 473 in e355ac1
Now, when it comes to writing this information to a NetCDF file, we don't deal with this convention so far, because xarray is supposed to do it. However, it seems like xarray doesn't do it right. The |
Thanks for the question and @ChrisFairless and for the indepth answer @tovogt. |
This inherits the original code from @mmyrte. My understanding is that user can assign a unique ID for each tracks, but if leave |
@ThomasRoosli I wouldn't literally "wait" for xarray to deal with the |
I've no clue why I handled that as a string; I remember being very much behind schedule and just implementing whatever worked at the time. Normally, it would construct a float when reading all forecasts at a given point in time, adding i/100 with i out of n tracks per storm. That being said, I never saw the reasoning behind the id_no, which AFAIK TCTracks simply inherits from the base hazard class, which in turn simply emulated the original MATLAB struct - but it really doesn't have any semantic meaning, it's just there to satisfy the base class conditions. So generating a random number, or even just setting it to 1.0 as default would do IMO. |
thanks @tovogt for the in-depth response! you're completely right that this doesn't affect functionality, (and, yeah, maybe wasn't worth bringing up). thanks! |
@ChrisFairless Don't worry, keep asking when in doubt. 👍 It was very well-spotted from your side. After all, only after a fun read of the CF conventions and some further consideration, we were able to settle this as harmless. |
I implemented the suggested changes on renaming the attribute and defining lat and lon as coords not data_vars. @chahank is it a priorty to move the test files to the API? they are only 6 KB. |
"fun" |
i'm happy for this to be merged |
}, | ||
coords={ | ||
'time': timestamp, | ||
'lat': ('time', lat), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ThomasRoosli I'm only seeing this since I'm incidentally also working on this file - this change is wrong, since apparently, you cannot perform dropna()
if lat/lon are already coordinates. See https://github.com/CLIMADA-project/climada_petals/pull/33/files#diff-f38c71c49b698e73f5039c46ceb6fe77921909d7754faa380df102339250a25cR486 for the relevant line. Do you mind if I undo this change in my current PR #36?
@@ -417,7 +468,7 @@ def _subset_to_track(msg, index, provider, timestamp_origin, name, id_no): | |||
'id_no': (int(id_no) + index / 100), | |||
'ensemble_number': msg['ens_number'][index], | |||
'is_ensemble': ens_bool, | |||
'forecast_time': timestamp_origin, | |||
'run_datetime': timestamp_origin, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ThomasRoosli Thanks for this rename, there's almost always a mess around this nomenclature. Much better like this. :)
Hdf5 input and output have been added to the TCTracks baseclass in this pull request in climada_python: CLIMADA-project/climada_python#349
This pull request makes sure this added funtionality does not fail in TCForecast which inherits from TCTracks.
(1) I have overloaded the methods write_hdf5 and from_hdf5
(2) I have changed the 'basin' attribute to dtype '<U2', (this decision was made in the baseclass TCTracks for details see the pull request mentioned above)
(3) I added the forecast_time attribute to the docstring of the TCForecast class
I took over the decisions on code structure/docstring etc from the pullrequest mentioned above to be on the same line as the baseclass.