Flat table format for VPTS #25

peterdesmet · 2022-03-09T19:46:08Z

@adokter and I discussed the VPTS format today and we suggest a flat table format that contains all necessary data for analysis. I'll describe the format in more detail later, but here is how you could reproduce it. The written file is this one: example_vpts.csv

library(bioRad)
library(dplyr)
library(readr)

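# Flatten the vpts object and prefix every original column with "orig_", so the
# new fields can be defined explicitly and the originals dropped at the end.
df <- as.data.frame(example_vpts) %>%
  rename_with(~ paste0("orig_", .x)) %>%
  mutate(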
    radar = example_vpts$radar,
    datetime = format(orig_datetime, "%Y-%m-%dT%H:%M:%SZ"),
    height = orig_height,
    u = orig_u,
    v = orig_v,
    w = orig_w,
    ff = orig_ff,
    dd = orig_dd,
    sd_vvp = orig_sd_vvp,
    gap = orig_gap,
    eta = orig_eta,
    dens = orig_dens,
    dbz = orig_dbz,
    dbz_all = orig_DBZH,
    n = orig_n,
    n_dbz = orig_n_dbz,
    n_all = orig_n_all,
    n_dbz_all = orig_n_dbz_all,
    rcs = example_vpts$attributes$how$rcs_bird,
    sd_vvp_threshold = example_vpts$attributes$how$sd_vvp_thres,
    vcp = NA_real_,
    radar_lon = example_vpts$attributes$where$lon,
    radar_lat = example_vpts$attributes$where$lat,
    radar_height = example_vpts$attributes$where$height,
    radar_wavelength = example_vpts$attributes$how$wavelength
  ) %>%
  select(-starts_with("orig_"))

write_csv(df, "example_vpts.csv", na = "")

Created on 2022-03-09 by the reprex package (v2.0.1)
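
For reading such a file back, a minimal base R sketch could look like the following. The two rows are made-up illustration values, only a subset of the proposed columns is shown, and read.csv(text = ...) stands in for reading example_vpts.csv from disk.

```r
# Two illustrative rows in the proposed layout (values are made up).
csv_text <- "radar,datetime,height,u,v,dens
KBGM,2016-09-02T02:40:00Z,0,1.5,-2.0,12.3
KBGM,2016-09-02T02:40:00Z,200,,,"

flat <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Empty fields in numeric columns come back as NA, mirroring write_csv(na = "").
# The datetime column is ISO 8601 text in UTC and needs explicit parsing.
flat$datetime <- as.POSIXct(
  flat$datetime,
  format = "%Y-%m-%dT%H:%M:%SZ",
  tz = "UTC"
)

str(flat)
```

Because every row repeats the radar and datetime, each record can be interpreted without the rest of the file.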

bart1 · 2022-03-09T23:03:57Z

@adokter recently we noticed in Amsterdam that vol2bird changes the name of the reflectivity column depending on the input you select, e.g. th instead of dbzh. Would it make sense for such a new format to always use the same column name, for example reflectivity factor? The attributes should then maybe specify the original attribute used.

adokter · 2022-03-15T14:51:20Z

Hi @bart1, good catch, I hadn't realized that. This only applies to the original DBZH column: it pastes in the DBZTYPE user option name that specifies which reflectivity quantity to use:
https://github.com/adokter/vol2bird/blob/eb41d8db8c0ec743f9566d84d54862c6279349fb/lib/libvol2bird.c#L2847

I agree the easiest fix would be to change the name of that column to something constant, that way we can get rid of the final capitalized column name as well. We can make it dbz_all
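
Until that is fixed upstream, the renaming could be sketched on the reading side as below. This is only an illustration: normalize_reflectivity(), the list of candidate column names and the dbz_all_source attribute are hypothetical and not part of vol2bird or bioRad.

```r
# Hypothetical helper: rename whichever reflectivity column is present to
# dbz_all and remember the original name as an attribute.
normalize_reflectivity <- function(df, candidates = c("DBZH", "TH", "dbzh", "th")) {
  hit <- intersect(candidates, names(df))
  if (length(hit) > 1) {
    stop("More than one reflectivity column found: ", paste(hit, collapse = ", "))
  }
  if (length(hit) == 1) {
    names(df)[names(df) == hit] <- "dbz_all"
    attr(df, "dbz_all_source") <- hit
  }
  df
}

# Made-up profile that was processed with TH as the reflectivity quantity.
profile <- data.frame(height = c(0, 200), TH = c(-5.2, 3.1))
out <- normalize_reflectivity(profile)
names(out)
attr(out, "dbz_all_source")
```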

adokter · 2022-03-15T14:55:06Z

Added issue adokter/vol2bird#188

bart1 · 2022-03-15T15:03:29Z

Thanks @adokter! I will then also try to use dbz_all for the naming in other code.

BerendWijers · 2022-03-15T15:30:18Z

Hi all,

I'm communicating this with Johannes De Groeve to ensure we are using identical naming in the VP DB. For now, we had chosen to rename the column to reflectivity. Is there any particular reason why dbz_all was chosen? Technically it doesn't make a difference, as we can rather easily set it to anything; I'm just curious.

peterdesmet · 2022-03-15T16:29:11Z

@BerendWijers the idea to call it dbz_all is because a) in contrast with eta, dens and dbz it is based on all reflectivity (not only birds), b) it has the same unit as dbz and c) there was already a column n_dbz_all that corresponds with it. @adokter correct me if wrong!

BerendWijers · 2022-03-15T16:36:01Z

@peterdesmet Thank you!

peterdesmet · 2022-04-08T13:20:00Z

Any more feedback on this format or can we start describing this format and implementing it in the ENRAM data repository?

bart1 · 2022-04-08T15:15:26Z

@peterdesmet I think most data is contained. Personally I try to avoid too much duplication (last few columns), but I see it is also part of the elegance of using CSVs. There are maybe three things to think about that I see directly:

* Currently the height of the bin is not coded. If you want each record/row to be independently interpretable, the information on the height of the bins is missing. As the start height of the bin is coded and not the mean, it is a bit strange to miss the end height.
* Somewhat similarly, there is a little bit of ambiguity about what the timestamps mean. They are generally start times of the pvol, but no information about their regularity/interval is contained.
* There is some duplicate information (e.g. u and v vs ff and dd). I think one version can be omitted.

I'm not sure if it is worth changing this, but that is what I could see directly.

adokter · 2022-04-08T15:40:25Z

I think we should rename rcs_bird to rcs, because the data is more general than just birds

adokter · 2022-04-08T15:45:28Z

Some duplicate info is ok from my perspective, especially with direction, because it helps to enforce one convention for defining the angle. Also, for height-integrated vpi quantities it's no longer the case that ff and dd can be derived directly from u and v, so there we have to keep both. So for similarity we might keep them also in vpts
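
The per-layer relation between the two pairs can be sketched as follows. The convention used here is an assumption stated for illustration: u is the eastward and v the northward speed component, and dd is the direction of movement in degrees clockwise from north. uv_to_ffdd() is a hypothetical helper, not a bioRad function.

```r
# Hypothetical helper: derive speed (ff) and direction (dd) from u and v.
# Assumes u = eastward component, v = northward component, and dd measured
# clockwise from north in degrees, in the range [0, 360).
uv_to_ffdd <- function(u, v) {
  ff <- sqrt(u^2 + v^2)
  dd <- (atan2(u, v) * 180 / pi) %% 360
  data.frame(ff = ff, dd = dd)
}

# Movement due east, due south and due west at 1 m/s.
res <- uv_to_ffdd(u = c(1, 0, -1), v = c(0, -1, 0))
res
```

Storing dd explicitly fixes this angle convention for every user. It also hints at why the relation need not hold after averaging over layers: for two layers with u = 1 and u = -1 (and v = 0), the mean speed is 1 while the speed of the mean vector is 0.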

adokter · 2022-04-08T15:48:09Z

height is currently defined by the bottom of the altitude bin, and that's not very intuitive. I could change vol2bird to output the center of the altitude bin instead, if we have a mechanism to track the versions of old/new data. What do you think @peterdesmet? It's a bit of a hassle to change in bioRad, as there we add half of the height bin size everywhere in height calculations.

peterdesmet · 2022-04-11T07:45:55Z

I have renamed rcs_bird to rcs.

@adokter regarding middle or bottom of altitude bins:

* How is that done for NEXRAD or IRIS data?
* How is that done for the datetime? Is that the start or the middle timestamp of the scan window?

Personally, I found the bottom (start) of the altitude bin rather intuitive, so I would not change it if we can avoid it.

@bart1 your suggestion to indicate the bottom and top of the height bin would indeed make it explicit that we're talking about a bottom and top and allow people to get the height of a bin from a single row. Same for timestamp. I'm just curious if seeing the increasing height and timestamp over multiple rows was ever a source of confusion? Does it warrant adding one or both columns?

bart1 · 2022-04-11T08:06:58Z

> @bart1 your suggestion to indicate the bottom and top of the height bin would indeed make it explicit that we're talking about a bottom and top and allow people to get the height of a bin from a single row. Same for timestamp. I'm just curious if seeing the increasing height and timestamp over multiple rows was ever a source of confusion? Does it warrant adding one or both columns?

I have not really been confused, although I do not know if height integration accounts for half bins to the surface of the earth (I don't think so). I brought it up more because I feel there is room for improvement that does not necessarily need to happen, but would make things work better in the future. I think more confusion can occur about the timestamps. bioRad is also not always consistent here: for plotting, for example, it takes the time as the center timestamp (I think this is how it is often done where the time is treated as a point measurement):

require(bioRad)
#> Loading required package: bioRad
#> Welcome to bioRad version 0.5.2.9499
#> Docker daemon running, Docker functionality enabled (vol2bird version 0.5.0)
ts <- example_vpts[300:302]
# plot density of individuals for three consecutive profiles, in the altitude
# layer 0-3000 m.
plot(ts, ylim = c(0, 3000))
#> Warning in plot.vpts(ts, ylim = c(0, 3000)): Irregular time-series: missing
#> profiles will not be visible. Use 'regularize_vpts' to make time series regular.

ts$datetime
#> [1] "2016-09-02 02:40:00 UTC" "2016-09-02 02:50:00 UTC"
#> [3] "2016-09-02 02:59:00 UTC"

This 5-minute shift is generally not a big deal, but it would be good to get it right. Note that if not all pvols are analyzed, the original duration of a scan/pvol also can't be reconstructed from vp data. Retaining every second pvol is not uncommon, either when different scanning patterns occur or when people want to save data volume (e.g. at UvA we have 2 years of German data with only every third pvol, so the duration of a pvol is about 5 minutes but we only have a measurement every 15 minutes).

bart1 · 2022-04-11T08:26:41Z

PS: for this example data, I would not be able to work out backwards (from the vpts) whether the full pvol is scanned every 5 minutes, so that a 2.5-minute shift would place it in the center of the pvol, or whether a full pvol takes 10 minutes and a 5-minute shift would be correct.

peterdesmet · 2022-04-11T08:28:15Z

@bart1 Thanks! So, would your use case be solved if start and end timestamp were included in the tabular data?

bart1 · 2022-04-11T08:30:24Z

I think so; then at least the information for finding the median timestamp is available. @adokter what do you think: if you are treating the vps as point measurements in time, is the median timestamp the best?

peterdesmet · 2022-04-11T08:32:46Z

You would calculate the median timestamp, but we are suggesting to add both columns (start and end), right? Not to provide a single column with the median timestamp?

adokter · 2022-04-11T22:33:17Z

Met offices typically assign a nominal time to radar polar volumes, and I think we should stick to that convention for simplicity, even though different countries might have different conventions. That nominal time is typically also in the filename, and recalculating a time ourselves gets rather complicated, I feel.

adokter · 2022-04-11T22:36:22Z

> @adokter regarding middle or bottom of altitude bins:
> * How is that done for NEXRAD or IRIS data?

The same, it's defined by vol2bird, not by the data.

> * How is that done for the datetime? Is that the start or the middle timestamp of the scan window?

See above, I think it might vary from met office to met office.

> Personally, I found the bottom (start) of the altitude bin rather intuitive, so I would not change it if we can avoid it.

OK, I have no problem with that.

peterdesmet · 2022-04-12T11:34:31Z

@adokter, ok, so if I understand:

* A single nominal time. I think it might be good to clarify what is meant by nominal; I wasn't very familiar with the term.
* Would vol2bird have any notion of the length/interval at which data are collected, and could it provide that, so that the issue described by @bart1 in Flat table format for VPTS #25 (comment) can be resolved?

@adokter @bart1 Is there consensus on adding the upper bin height? I have no preference. If we add one, how do we name the fields? height_min and height_max?

adokter · 2022-04-14T20:04:14Z

@peterdesmet vol2bird isn't aware of the sampling interval, it can only be determined after the fact when you have a time series of profiles.

Here is how the ODIM format defines nominal time in https://www.eumetnet.eu/wp-content/uploads/2019/01/ODIM_H5_v23.pdf:

> Note that all date and time information is for the nominal time of the data, ie. the time for which the data are valid. (The nominal time is not the exact acquisition time which is found elsewhere in the file.)

We could stick to that with a reference to ODIM, although admittedly it isn't very clear, I suspect because of subtle differences in how countries define it. So @bart1 is right that we could improve. We might be able to extract the start and end acquisition time from sweep-specific metadata, but the problem is that these metadata aren't mandatory like the nominal time.

> I think so then at least the information for finding the median timestamp is available. @adokter what do you think, if you are treating the vp's as point measurements in time is the median timestamp the best?

In the bioRad function regularize_vpts() I do use the median time interval as a best guess. But this approach sometimes breaks down, because radars change their interval depending on inclement weather, and data gaps further complicate it. There must be existing techniques to estimate repeating intervals and switchpoints in noisy data; something to look into.
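
As an illustration of the median-interval idea only (this is not the regularize_vpts() implementation), the three nominal timestamps from the example earlier in this thread can be used in base R:

```r
# Nominal times of the three profiles shown above (example_vpts[300:302]).
datetime <- as.POSIXct(
  c("2016-09-02 02:40:00", "2016-09-02 02:50:00", "2016-09-02 02:59:00"),
  tz = "UTC"
)

# Intervals between consecutive profiles, in minutes.
intervals <- as.numeric(diff(datetime), units = "mins")

# Best guess for the sampling interval of this (irregular) series.
median_interval <- median(intervals)

intervals        # 10 and 9 minutes
median_interval  # 9.5 minutes
```

With only a few profiles, or with a radar that switches scan strategy, the median is a rough guess at best, which is the limitation described above.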

To keep it simple I would stick to nominal time and to 'irregular time series of vertical profiles'; the regularization onto a time grid, and the interpretation of what nominal time is, can then be left to the user.

I would vote for height_lower and height_upper if we really want two height columns. It's mostly useful if we would ever want to support changing bin sizes with altitudes. For now I feel a single height would do, but I don't have a strong preference
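
If two columns were added, they could be derived from the current bottom-of-bin height as sketched below. The heights are illustrative values, height_center is an extra hypothetical column, and the sketch assumes a constant bin size inferred from the spacing of the heights.

```r
# Bottom of each altitude bin in metres (illustrative values).
height <- c(0, 200, 400, 600)

# Infer the bin size from the spacing between consecutive bin bottoms.
bin_size <- unique(diff(sort(unique(height))))
if (length(bin_size) != 1) {
  stop("This sketch only handles a constant bin size.")
}

bins <- data.frame(
  height_lower  = height,
  height_upper  = height + bin_size,
  height_center = height + bin_size / 2
)
bins
```

With explicit lower and upper bounds, a single row carries the bin size, and bin sizes that change with altitude would no longer need to be inferred from neighbouring rows.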

peterdesmet · 2022-04-15T10:06:57Z

Thanks for the clarification @adokter. @bart1, how do you feel about one vs two height columns?

peterdesmet · 2022-05-05T15:15:29Z

Question: should any VPTS data contain all the columns presented above, or is it ok if some columns are omitted?


peterdesmet · 2022-05-06T07:43:25Z

The format is now described at https://enram.github.io/vpts/format/. Please review and create issues for items that are unclear or need change. Input is also welcome on the issues labelled help wanted.

peterdesmet mentioned this issue Apr 8, 2022

Add download_pvolfiles adokter/bioRad#487

Merged

peterdesmet mentioned this issue Apr 15, 2022

Suggestion for directory structure aloftdata/data-repository#65

Closed

peterdesmet added a commit that referenced this issue May 6, 2022

Define fields in correct order + use KGBM example

4030e13

Order as suggested in #25

peterdesmet closed this as completed May 6, 2022

peterdesmet mentioned this issue May 6, 2022

Define table schema for processing settings #2

Closed
