Feature/tc forecast hdf5 #33

ThomasRoosli · 2022-02-17T10:56:03Z

Hdf5 input and output have been added to the TCTracks baseclass in this pull request in climada_python: CLIMADA-project/climada_python#349
This pull request makes sure this added funtionality does not fail in TCForecast which inherits from TCTracks.
(1) I have overloaded the methods write_hdf5 and from_hdf5
(2) I have changed the 'basin' attribute to dtype '<U2', (this decision was made in the baseclass TCTracks for details see the pull request mentioned above)
(3) I added the forecast_time attribute to the docstring of the TCForecast class

I took over the decisions on code structure/docstring etc from the pullrequest mentioned above to be on the same line as the baseclass.

chahank

Looks good to me, although I have some doubts about the inter-dependent variables.

chahank · 2022-02-24T13:30:25Z

climada_petals/hazard/tc_tracks_forecast.py

@@ -92,7 +92,11 @@ class TCForecast(TCTracks):
    data : list of xarray.Dataset
        Same as in parent class, adding the following attributes
        - ensemble_member (int)
-        - is_ensemble (bool; if False, the simulation is a high resolution deterministic run
+        - is_ensemble (bool; if False, the simulation is a high resolution


What is the meaning of ensemble_member is is_ensemble=False ? It is in general not easy to handle when parameters are inter-dependent, and if possible should be avoided.

if i remember right, this is consistent with how ECMWF forecasts are provided (and this is the main use case for the class). the forecast ensemble is a set of tracks, one of which is also a deterministic forecast. it therefore has both an ensemble_member number, but needs to be distinguished as the deterministic run as well.

@manniepmkam - is that right?

I agree with @ChrisFairless that this decision was made to deal with ECMWF forecasts. If we want to change those two parameters, we would have to change how we read in ECMWF data and I suggest we do that outside of this pull request.

chahank · 2022-02-24T13:30:55Z

climada_petals/hazard/tc_tracks_forecast.py

-        - is_ensemble (bool; if False, the simulation is a high resolution deterministic run
+        - is_ensemble (bool; if False, the simulation is a high resolution
+                       deterministic run)
+        - forecast_time (numpy.datetime64): timepoint of the initialisation of


When you have an ensemble, must all forecast_time be identical?

good question. the answer is probably yes. i'm not sure when you'd want to mix different forecast initialisation times in a single Hazard object.

I would argue that a TCForecast object does not require that all forecast_time are identical. But if they are part of the same ensemble they will be identical. Also if the resulting hazard is used in the Forecast class, the forecast_time needs to be identical. That is ensured by the parameter interface chosen in the forecast class.

I assume no change is needed based on this comment.

chahank · 2022-02-24T13:31:27Z

climada_petals/hazard/tc_tracks_forecast.py

+            Path to a new HDF5 file. If it exists already, the file is overwritten.
+        complevel : int
+            Specifies a compression level (0-9) for the zlib compression of the data.
+            A value of 0 or None disables compression. Default: 5


Any reason to use 5?

it's consistent with the parent class. we could make it a class attribute of the parent class so that it's inherited, but it's probably not that important.

I suggest we leave it as it is. If we want to link it to the default of the parent class we would also have to think about a smart way to handle the docstring, so that the default is stated there as well. I thought about something like that for a while, but found no easy and readable solution.

chahank · 2022-02-24T13:35:01Z

climada_petals/hazard/test/test_tc_tracks_forecast.py

+    def test_hdf5_io(self):
+        """Test writting and reading hdf5 TCTracks instances"""
+        tc_track = TCForecast()
+        tc_track.fetch_ecmwf(files=TEST_BUFR_FILES)


Are these test files already on the API? If not, it might be a good moment to do so and adapt the test accordingly... What do you think @emanuel-schmid ?

These two files are about 6 KB in size. I suggest we leave them in the repository for now. If we want to move them to the API I would appreciate some guidance or help.

chahank · 2022-02-24T13:44:50Z

climada_petals/hazard/tc_tracks_forecast.py

+        - is_ensemble (bool; if False, the simulation is a high resolution
+                       deterministic run)
+        - forecast_time (numpy.datetime64): timepoint of the initialisation of
+            the numerical weather prediction run


I am not sure to understand. Is this the day on which the prediction was computed, the day on which the computation was started, or the day for which the prediction is?

yeah i think the parameter name could be better here. forecast_initialisation or initialisation_time would be clearer (or 'initialization' depending on our spelling policy).

usually this reflects the time that the forecast calculations were started. so a forecast time of 26/02/2022 00:00 will mean it was started at 00:00, published a few hours later, and the forecast itself will start at 00:00 and run for 240 hours in 6 hour steps (though these vary by provider).

again, this is off the top of my head so correct me if i've misremembered @ThomasRoosli @manniepmkam

Is the name of the attribute confusing or the description of the attribute? The only thing I am proposing in this pull request is the description of the previously existing attribute, because it was missing. I tried to be as precise as possible in the description. I think if the user is not familiar with those terms (they should be), then a search in the relevant documentation of e.g. ECMWF will reveal the meaning.
If we want to change the name of that attribute we could also consider run_datetime to be consistent with the Forecast class. What do you think @manniepmkam @chahank?

@ChrisFairless reply is correct, forecast_time is the time when the forecast simulation is initiated. For me both initialisation_time and run_datetime are good to me if forecast_time is a confusing name.

For me the issue is that both the naming can be confusing, and the docstring is not clear enough. No big deal, but it can easily be improved.

consistency with other classes is seldom a bad idea - should we change to run_datetime?

I agree with changing the name of the attribute to run_datetime.
I suggest to keep the docstring as it is. It mentions "initialization" and "numerical weather prediction". A user that is not familiar with the topic might do a internet search and hopefully ends up at the combination of those two terms on wikipedia: "https://en.wikipedia.org/wiki/Numerical_weather_prediction#Initialization". I assume it is not the idea to link to the wikipedia articles in our docstrings. I am very happy for further suggestions to improve the docstring.

ChrisFairless · 2022-02-26T00:33:23Z

looking at the ncdump of the saved HDF5 file, most variables seem to have a 'coordinates' attribute which is set to "lat lon". this doesn't look right.

so i have a question: my understanding of track data says that 'lat' and 'lon' are time-dependent variables and that no other variables are indexed by them. so that would mean they are neither dimensions nor coordinates - just regular variables.

so i think when we're we're writing to netcdf we shouldn't designate them as coordinates at all?

if i've understood this right, then it also affects #349 (paging @tovogt), though it looks like the coordinates are implemented in different ways in the TCTracks from_ibtracs_netcdf and the TCForecast _subset_to_track methods.

ChrisFairless · 2022-02-26T00:44:23Z

could you also check that the test file's id_no variables are ok? i'm seeing what look like rounding errors in some ids, e.g. 2.2800000000000002 (this is nothing to do with the new HDF5 functionality). is this harmless?

ChrisFairless · 2022-02-26T00:47:53Z

other than the points above, it's looking good and working well!

tovogt · 2022-02-28T12:17:08Z

so i have a question: my understanding of track data says that 'lat' and 'lon' are time-dependent variables and that no other variables are indexed by them. so that would mean they are neither dimensions nor coordinates - just regular variables.

so i think when we're we're writing to netcdf we shouldn't designate them as coordinates at all?

That's an interesting point: According to the CF conventions for NetCDF files, each TC track is a "Discrete Sampling Geometry", namely "a series of data points along a path through space with monotonically increasing times" called a "trajectory" (see https://cfconventions.org/cf-conventions/cf-conventions.html#discrete-sampling-geometries). For that kind of data, the CF conventions tells us:

Every element of every feature must be unambiguously associated with its space and time coordinates and with the feature that contains it. The coordinates attribute must be attached to every data variable to indicate the spatiotemporal coordinate variables that are needed to geo-locate the data.
[from https://cfconventions.org/cf-conventions/cf-conventions.html#coordinates-metadata]

There is also an example given: https://cfconventions.org/cf-conventions/cf-conventions.html#_single_trajectory

So, the lat and lon variables should be listed in coords in the TCForecast implementation (just as in the TCTracks implementation):

climada_petals/climada_petals/hazard/tc_tracks_forecast.py

Lines 450 to 473 in e355ac1

    
           track = xr.Dataset( 
        
               data_vars={ 
        
                   'max_sustained_wind': ('time', np.squeeze(wnd)), 
        
                   'central_pressure': ('time', np.squeeze(pre)/100), 
        
                   'ts_int': ('time', timestep_int), 
        
                   'lat': ('time', lat), 
        
                   'lon': ('time', lon), 
        
               }, 
        
               coords={ 
        
                   'time': timestamp, 
        
               }, 
        
               attrs={ 
        
                   'max_sustained_wind_unit': 'm/s', 
        
                   'central_pressure_unit': 'mb', 
        
                   'name': name, 
        
                   'sid': sid, 
        
                   'orig_event_flag': False, 
        
                   'data_provider': provider, 
        
                   'id_no': (int(id_no) + index / 100), 
        
                   'ensemble_number': msg['ens_number'][index], 
        
                   'is_ensemble': ens_bool, 
        
                   'forecast_time': timestamp_origin, 
        
               } 
        
           )

Now, when it comes to writing this information to a NetCDF file, we don't deal with this convention so far, because xarray is supposed to do it. However, it seems like xarray doesn't do it right. The coordinates attribute in our case should actually contain "time lat lon" but xarray omits the time coordinate and only lists "lat lon". As far as I understand the convention, this is wrong. I opened an issue for that here: pydata/xarray#6310

ThomasRoosli · 2022-02-28T16:13:18Z

So, the lat and lon variables should be listed in coords in the TCForecast implementation (just as in the TCTracks implementation):

Thanks for the question and @ChrisFairless and for the indepth answer @tovogt.
To resolve this for now I will define the lat and lon variables as coords. Then we will wait for xarray to deal with the convention based on the issue opened by @tovogt.

manniepmkam · 2022-02-28T16:36:00Z

could you also check that the test file's id_no variables are ok? i'm seeing what look like rounding errors in some ids, e.g. 2.2800000000000002 (this is nothing to do with the new HDF5 functionality). is this harmless?

This inherits the original code from @mmyrte. My understanding is that user can assign a unique ID for each tracks, but if leave None then it assign a random number. @mmyrte is that correct?

tovogt · 2022-02-28T16:38:19Z

@ThomasRoosli I wouldn't literally "wait" for xarray to deal with the coordinates attribute issue. From my experience, I would expect that it might take very long until they merge a fix, then it needs to be released in a new xarray version, and finally this xarray version needs to be used by CLIMADA. I wouldn't be surprised if this process takes one or even several years. In my opinion, this is a very minor issue and we can safely ignore it. But if you think this is relevant for someone, we can discuss how to implement a workaround on our side until there is an official fix available in xarray. I don't know why @ChrisFairless even mentioned this point, I can't think of any situation where a user would be adversely affected by this.

mmyrte · 2022-03-01T09:27:02Z

could you also check that the test file's id_no variables are ok? i'm seeing what look like rounding errors in some ids, e.g. 2.2800000000000002 (this is nothing to do with the new HDF5 functionality). is this harmless?

This inherits the original code from @mmyrte. My understanding is that user can assign a unique ID for each tracks, but if leave None then it assign a random number. @mmyrte is that correct?

I've no clue why I handled that as a string; I remember being very much behind schedule and just implementing whatever worked at the time. Normally, it would construct a float when reading all forecasts at a given point in time, adding i/100 with i out of n tracks per storm. That being said, I never saw the reasoning behind the id_no, which AFAIK TCTracks simply inherits from the base hazard class, which in turn simply emulated the original MATLAB struct - but it really doesn't have any semantic meaning, it's just there to satisfy the base class conditions. So generating a random number, or even just setting it to 1.0 as default would do IMO.

ChrisFairless · 2022-03-01T10:00:30Z

thanks @tovogt for the in-depth response! you're completely right that this doesn't affect functionality, (and, yeah, maybe wasn't worth bringing up). thanks!

tovogt · 2022-03-01T11:24:39Z

@ChrisFairless Don't worry, keep asking when in doubt. 👍 It was very well-spotted from your side. After all, only after a fun read of the CF conventions and some further consideration, we were able to settle this as harmless.

ThomasRoosli · 2022-03-01T11:58:48Z

I implemented the suggested changes on renaming the attribute and defining lat and lon as coords not data_vars.
I suggest we handle the definition of the attributes id_no, ensemble_member and is_ensemble outside this pull request.

@chahank is it a priorty to move the test files to the API? they are only 6 KB.
@chahank and @ChrisFairless are the points solved to your satisfaction?

ChrisFairless · 2022-03-01T16:51:34Z

After all, only after a fun read of the CF conventions

"fun"

ChrisFairless · 2022-03-01T17:57:57Z

i'm happy for this to be merged

mmyrte · 2022-03-11T14:38:50Z

climada_petals/hazard/tc_tracks_forecast.py

                },
                coords={
                    'time': timestamp,
+                    'lat': ('time', lat),


@ThomasRoosli I'm only seeing this since I'm incidentally also working on this file - this change is wrong, since apparently, you cannot perform dropna() if lat/lon are already coordinates. See https://github.com/CLIMADA-project/climada_petals/pull/33/files#diff-f38c71c49b698e73f5039c46ceb6fe77921909d7754faa380df102339250a25cR486 for the relevant line. Do you mind if I undo this change in my current PR #36?

mmyrte · 2022-03-11T14:41:10Z

climada_petals/hazard/tc_tracks_forecast.py

@@ -417,7 +468,7 @@ def _subset_to_track(msg, index, provider, timestamp_origin, name, id_no):
                    'id_no': (int(id_no) + index / 100),
                    'ensemble_number': msg['ens_number'][index],
                    'is_ensemble': ens_bool,
-                    'forecast_time': timestamp_origin,
+                    'run_datetime': timestamp_origin,


@ThomasRoosli Thanks for this rename, there's almost always a mess around this nomenclature. Much better like this. :)

roo added 5 commits February 17, 2022 10:23

Change TCForecast basin dtype to <U2

c5fe84d

Add forecast_time to TCForecast class doc string

d94b22a

Add write and read hdf5 functionality to TCForecast

2ceb072

remove lint warning

5bb66ac

remove lint warning

e355ac1

ThomasRoosli requested review from ChrisFairless and manniepmkam February 17, 2022 10:56

chahank reviewed Feb 24, 2022

View reviewed changes

Define lat and lon as coords not data_vars in TCForecast

9723148

Rename attribute forecast_time to run_datetime in TCForecast

e5dc1ab

Merge branch 'develop' into feature/tc_forecast_hdf5

6c754a7

ThomasRoosli merged commit e1242f1 into develop Mar 7, 2022

ThomasRoosli deleted the feature/tc_forecast_hdf5 branch March 7, 2022 12:10

mmyrte reviewed Mar 11, 2022

View reviewed changes

tovogt mentioned this pull request Jun 6, 2023

TCTracks: improve hdf5 I/O CLIMADA-project/climada_python#735

Merged

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/tc forecast hdf5 #33

Feature/tc forecast hdf5 #33

ThomasRoosli commented Feb 17, 2022

chahank left a comment

chahank Feb 24, 2022

ChrisFairless Feb 25, 2022

ThomasRoosli Feb 28, 2022 •

edited

Loading

chahank Feb 24, 2022

ChrisFairless Feb 25, 2022

ThomasRoosli Feb 28, 2022 •

edited

Loading

chahank Feb 24, 2022

ChrisFairless Feb 25, 2022

ThomasRoosli Feb 28, 2022

chahank Feb 24, 2022

ThomasRoosli Mar 1, 2022

chahank Feb 24, 2022

ChrisFairless Feb 25, 2022

ThomasRoosli Feb 28, 2022

manniepmkam Feb 28, 2022

chahank Feb 28, 2022

ChrisFairless Mar 1, 2022

ThomasRoosli Mar 1, 2022

ChrisFairless commented Feb 26, 2022

ChrisFairless commented Feb 26, 2022

ChrisFairless commented Feb 26, 2022

tovogt commented Feb 28, 2022

ThomasRoosli commented Feb 28, 2022 •

edited by tovogt

Loading

manniepmkam commented Feb 28, 2022 •

edited

Loading

tovogt commented Feb 28, 2022 •

edited

Loading

mmyrte commented Mar 1, 2022

ChrisFairless commented Mar 1, 2022

tovogt commented Mar 1, 2022

ThomasRoosli commented Mar 1, 2022

ChrisFairless commented Mar 1, 2022 •

edited

Loading

ChrisFairless commented Mar 1, 2022

mmyrte Mar 11, 2022

mmyrte Mar 11, 2022

Feature/tc forecast hdf5 #33

Feature/tc forecast hdf5 #33

Conversation

ThomasRoosli commented Feb 17, 2022

chahank left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ThomasRoosli Feb 28, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ThomasRoosli Feb 28, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ChrisFairless commented Feb 26, 2022

ChrisFairless commented Feb 26, 2022

ChrisFairless commented Feb 26, 2022

tovogt commented Feb 28, 2022

ThomasRoosli commented Feb 28, 2022 • edited by tovogt Loading

manniepmkam commented Feb 28, 2022 • edited Loading

tovogt commented Feb 28, 2022 • edited Loading

mmyrte commented Mar 1, 2022

ChrisFairless commented Mar 1, 2022

tovogt commented Mar 1, 2022

ThomasRoosli commented Mar 1, 2022

ChrisFairless commented Mar 1, 2022 • edited Loading

ChrisFairless commented Mar 1, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ThomasRoosli Feb 28, 2022 •

edited

Loading

ThomasRoosli Feb 28, 2022 •

edited

Loading

ThomasRoosli commented Feb 28, 2022 •

edited by tovogt

Loading

manniepmkam commented Feb 28, 2022 •

edited

Loading

tovogt commented Feb 28, 2022 •

edited

Loading

ChrisFairless commented Mar 1, 2022 •

edited

Loading