Flat table format for VPTS #25

Closed
peterdesmet opened this issue Mar 9, 2022 · 25 comments

@peterdesmet
Member

peterdesmet commented Mar 9, 2022

@adokter and I discussed the VPTS format today and we suggest a flat table format that contains all necessary data for analysis. I'll describe the format in more detail later, but here is how you could reproduce it. The written file is this one: example_vpts.csv

library(bioRad)
library(dplyr)
library(readr)

df <- as.data.frame(example_vpts) %>%
  rename_with(~ paste0("orig_", .x)) %>%
  mutate(
    radar = example_vpts$radar,
    datetime = format(orig_datetime, "%Y-%m-%dT%H:%M:%SZ"),
    height = orig_height,
    u = orig_u,
    v = orig_v,
    w = orig_w,
    ff = orig_ff,
    dd = orig_dd,
    sd_vvp = orig_sd_vvp,
    gap = orig_gap,
    eta = orig_eta,
    dens = orig_dens,
    dbz = orig_dbz,
    dbz_all = orig_DBZH,
    n = orig_n,
    n_dbz = orig_n_dbz,
    n_all = orig_n_all,
    n_dbz_all = orig_n_dbz_all,
    rcs = example_vpts$attributes$how$rcs_bird,
    sd_vvp_threshold = example_vpts$attributes$how$sd_vvp_thres,
    vcp = NA_real_,
    radar_lon = example_vpts$attributes$where$lon,
    radar_lat = example_vpts$attributes$where$lat,
    radar_height = example_vpts$attributes$where$height,
    radar_wavelength = example_vpts$attributes$how$wavelength
  ) %>%
  select(-starts_with("orig_"))

write_csv(df, "example_vpts.csv", na = "")

Created on 2022-03-09 by the reprex package (v2.0.1)
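
A minimal sketch of how a reader of such a file could check that a table has the columns written by the code above. The helper name `missing_vpts_columns()` and the toy values are hypothetical (not part of bioRad); the column list is copied from the `mutate()` call.

```r
# Column names of the proposed flat VPTS table, in the order written above.
vpts_columns <- c(
  "radar", "datetime", "height", "u", "v", "w", "ff", "dd", "sd_vvp", "gap",
  "eta", "dens", "dbz", "dbz_all", "n", "n_dbz", "n_all", "n_dbz_all", "rcs",
  "sd_vvp_threshold", "vcp", "radar_lon", "radar_lat", "radar_height",
  "radar_wavelength"
)

# Hypothetical helper: return the expected columns that are absent from `df`.
missing_vpts_columns <- function(df, expected = vpts_columns) {
  setdiff(expected, names(df))
}

# Toy table (made-up values) with only three of the expected columns.
toy <- data.frame(radar = "demo", datetime = "2016-09-02T02:40:00Z", height = 0)
missing_vpts_columns(toy)
```

An empty result would mean that every column of the proposed format is present.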

@bart1

bart1 commented Mar 9, 2022

@adokter recently we noticed in Amsterdam that vol2bird changes the name of the reflectivity column if you change the input quantity you select, e.g. th instead of dbzh. For such a new format, would it make sense to always use the same column name, for example reflectivity factor? The attributes could then specify the original quantity used.

@adokter
Contributor

adokter commented Mar 15, 2022

Hi @bart1 good catch, hadn't realized that. This only applies to the original DBZH column, it pastes in the DBZTYPE user option name that specifies which reflectivity quantity to use:
https://github.com/adokter/vol2bird/blob/eb41d8db8c0ec743f9566d84d54862c6279349fb/lib/libvol2bird.c#L2847

I agree the easiest fix would be to change the name of that column to something constant, that way we can get rid of the final capitalized column name as well. We can make it dbz_all
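
Until such a change lands in vol2bird, the constant name could be enforced when reading a profile. The following is only a sketch: the helper `rename_reflectivity()` is hypothetical, and the candidate names `"DBZH"` and `"TH"` are assumptions based on the two quantities mentioned in this thread.

```r
# Hypothetical helper: rename whichever reflectivity column is present
# (assumed here to be "DBZH" or "TH") to the constant name "dbz_all".
rename_reflectivity <- function(df, candidates = c("DBZH", "TH"),
                                target = "dbz_all") {
  found <- intersect(candidates, names(df))
  if (length(found) == 0) return(df)  # nothing to rename
  if (length(found) > 1) stop("More than one reflectivity column found.")
  names(df)[names(df) == found] <- target
  df
}

# Toy profile (made-up values) processed with TH as reflectivity quantity.
profile <- data.frame(height = c(0, 200), TH = c(-5.2, 3.1))
names(rename_reflectivity(profile))
```

The values are untouched; only the column name changes, so downstream code can always refer to `dbz_all`.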

@adokter
Contributor

adokter commented Mar 15, 2022

Added issue adokter/vol2bird#188

@bart1

bart1 commented Mar 15, 2022

Thanks @adokter! I will then also try to use dbz_all for the naming in other code.

@BerendWijers

Hi all,

I'm communicating this with Johannes De Groeve to ensure we are using identical naming in the VP DB. For now, we had chosen to rename the column to reflectivity. Is there any particular reason why dbz_all was chosen? Technically it doesn't make a difference, as we can rather easily set it to anything; I'm just curious.

@peterdesmet
Member Author

@BerendWijers the idea to call it dbz_all is because a) in contrast with eta, dens and dbz it is based on all reflectivity (not only birds), b) it has the same unit as dbz and c) there was already a column n_dbz_all that corresponds with it. @adokter correct me if wrong!

@BerendWijers

@peterdesmet Thank you!

@peterdesmet
Member Author

Any more feedback on this format or can we start describing this format and implementing it in the ENRAM data repository?

@bart1

bart1 commented Apr 8, 2022

@peterdesmet I think most data is contained. Personally I try to avoid too much duplication (last few columns), but I see that it is also part of the elegance of using CSVs. There are maybe three things to think about that I see directly:

  • Currently the height (thickness) of the bin is not coded. If you want each record/row to be independently interpretable, that information is missing. As the start height of the bin is coded and not the mean, it is a bit strange to miss the end height.
  • Somewhat similarly, there is a little bit of ambiguity about what the timestamps mean. They are generally start times of the pvol, but no information about their regularity/interval is contained.
  • There is some duplicate information (e.g. u and v vs ff and dd). I think one version can be omitted.

I'm not sure if it is worth changing this, but that is what I could see directly.

@adokter
Contributor

adokter commented Apr 8, 2022

I think we should rename rcs_bird to rcs, because the data is more general than just birds

@adokter
Contributor

adokter commented Apr 8, 2022

Some duplicate info is ok from my perspective, especially with direction, because it helps to enforce one convention for defining the angle. Also, for height-integrated vpi quantities it's no longer the case that ff and dd can be derived directly from u and v, so there we have to keep both. So for similarity we might keep them also in vpts
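
For a single height bin, the relation between the two representations can be sketched as follows. The convention used here is an assumption: u is the eastward and v the northward speed component, and dd is the direction in degrees clockwise from north. The function name `speed_direction()` is hypothetical.

```r
# Assumed convention: u = eastward component, v = northward component,
# dd = direction in degrees clockwise from north, in [0, 360).
speed_direction <- function(u, v) {
  data.frame(
    ff = sqrt(u^2 + v^2),                 # horizontal speed
    dd = (atan2(u, v) * 180 / pi) %% 360  # direction in degrees
  )
}

# Movement towards the north, east and west at unit speed.
speed_direction(u = c(0, 1, -1), v = c(1, 0, 0))
```

Under this convention the three rows give directions of 0, 90 and 270 degrees. Writing dd explicitly in the table spares every user from having to pick such a convention themselves.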

@adokter
Contributor

adokter commented Apr 8, 2022

height is currently defined by the bottom of the altitude bin, and that's not very intuitive. I could change vol2bird to output the center of the altitude bin instead, if we have a mechanism to track the versions of old/new data. What do you think @peterdesmet? It's a bit of a hassle to change in bioRad, as there we add half of the height bin size everywhere in height calculations.

@peterdesmet
Member Author

I have renamed rcs_bird to rcs.

@adokter regarding middle or bottom of altitude bins:

  • How is that done for NEXRAD or IRIS data?
  • How is that done for the datetime? Is that the start or the middle timestamp of the scan window?

Personally, I found the bottom (start) of the altitude bin rather intuitive, so I would not change it if we can avoid it.

@bart1 your suggestion to indicate the bottom and top of the height bin would indeed make it explicit that we're talking about a bottom and top and allow people to get the height of a bin from a single row. Same for timestamp. I'm just curious if seeing the increasing height and timestamp over multiple rows was ever a source of confusion? Does it warrant adding one or both columns?

@bart1

bart1 commented Apr 11, 2022

> @bart1 your suggestion to indicate the bottom and top of the height bin would indeed make it explicit that we're talking about a bottom and top and allow people to get the height of a bin from a single row. Same for timestamp. I'm just curious if seeing the increasing height and timestamp over multiple rows was ever a source of confusion? Does it warrant adding one or both columns?

I have not really been confused, although I do not know whether height integration accounts for half bins down to the surface of the earth (I don't think so). I brought it up more because I feel there is room for improvement; it does not necessarily need to happen, but it would make the format work better in the future. I think more confusion can occur about the timestamps. bioRad is also not always consistent here: for plotting, for example, it treats the time as the center timestamp (I think this is how it is often done, with the time treated as a point measurement):

require(bioRad)
#> Loading required package: bioRad
#> Welcome to bioRad version 0.5.2.9499
#> Docker daemon running, Docker functionality enabled (vol2bird version 0.5.0)
ts <- example_vpts[300:302]
# plot density of individuals for profiles 300 to 302, in the altitude
# layer 0-3000 m.
plot(ts, ylim = c(0, 3000))
#> Warning in plot.vpts(ts, ylim = c(0, 3000)): Irregular time-series: missing
#> profiles will not be visible. Use 'regularize_vpts' to make time series regular.

ts$datetime
#> [1] "2016-09-02 02:40:00 UTC" "2016-09-02 02:50:00 UTC"
#> [3] "2016-09-02 02:59:00 UTC"

This 5 minute shift is generally not a big deal, but it would be good to get it right. Note that if not all pvols are analyzed, the original duration of a scan/pvol can't be reconstructed from vp data either. Retaining every second pvol is not uncommon when different scanning patterns occur or when people want to reduce data volumes (e.g. at UvA we have 2 years of German data with only every third pvol, so the duration of a pvol is about 5 minutes but we only have a measurement every 15 minutes).
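
As a small illustration using the three timestamps printed above, the spacing between consecutive profiles can be computed directly. Note that this only gives the spacing between profiles, not the duration of the underlying scan.

```r
# Nominal timestamps of the three profiles shown above.
datetime <- as.POSIXct(
  c("2016-09-02 02:40:00", "2016-09-02 02:50:00", "2016-09-02 02:59:00"),
  tz = "UTC"
)

# Spacing between consecutive profiles, in minutes.
interval_min <- as.numeric(diff(datetime), units = "mins")
interval_min
```

The spacings are 10 and 9 minutes, which is why the series is reported as irregular.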

@bart1

bart1 commented Apr 11, 2022

PS: for this example data, I would not be able to tell from the vpts alone whether a full pvol is scanned every 5 minutes, so that a 2.5 minute shift would place the timestamp at the center of the pvol, or whether a full pvol takes 10 minutes and a 5 minute shift would be correct.

@peterdesmet
Member Author

@bart1 Thanks! So, would your use case be solved if start and end timestamp were included in the tabular data?

@bart1

bart1 commented Apr 11, 2022

I think so; then at least the information for finding the median timestamp is available. @adokter what do you think: if you are treating the vps as point measurements in time, is the median timestamp the best?

@peterdesmet
Member Author

You would calculate the median timestamp, but we are suggesting to add both columns (start and end), right? Not to provide a single column with the median timestamp?

@adokter
Contributor

adokter commented Apr 11, 2022

Met offices typically assign a nominal time to radar polar volumes, and I think we should stick to that convention for simplicity, even though different countries might have different conventions. That nominal time is typically also in the filename, and recalculating a time ourselves gets rather complicated, I feel.

@adokter
Contributor

adokter commented Apr 11, 2022

> @adokter regarding middle or bottom of altitude bins:
>
> • How is that done for NEXRAD or IRIS data?

The same, it's defined by vol2bird, not by the data.

> • How is that done for the datetime? Is that the start or the middle timestamp of the scan window?

See above, I think it might vary from met office to met office.

> Personally, I found the bottom (start) of the altitude bin rather intuitive, so I would not change it if we can avoid it.

OK, I have no problem with that.

@peterdesmet
Member Author

@adokter, ok, so if I understand:

  • A single nominal time. I think it might be good to clarify what is meant by nominal, I wasn't very familiar with the term.
  • Would vol2bird have any notion of the length/interval at which data are collected and provide that? So the issue described by @bart1 in Flat table format for VPTS #25 (comment) can be resolved?

@adokter @bart1 Is there consensus on adding the upper bin height? I have no preference. If we add one, how do we name the fields? height_min and height_max?

@adokter
Contributor

adokter commented Apr 14, 2022

@peterdesmet vol2bird isn't aware of the sampling interval, it can only be determined after the fact when you have a time series of profiles.

Here is how the ODIM format defines nominal time in https://www.eumetnet.eu/wp-content/uploads/2019/01/ODIM_H5_v23.pdf:

Note that all date and time information is for the nominal time of the data, ie. the time for which the data are valid. (The nominal time is not the exact acquisition time which is found elsewhere in the file.)

We could stick to that with a ref to ODIM, although admittedly it isn't very clear, I suspect because of subtle differences in how countries define it. So @bart1 is right that we could improve. We might be able to extract the start and end acquisition time from sweep-specific metadata, but the problem is that these metadata aren't mandatory like the nominal time.

> I think so then at least the information for finding the median timestamp is available. @adokter what do you think, if you are treating the vp's as point measurements in time is the median timestamp the best?

In the bioRad function regularize_vpts() I do use the median time interval as a best guess. But this approach sometimes breaks down because radars change their interval depending on inclement weather, and data gaps further complicate it. There must be existing techniques to estimate repeating intervals and switchpoints in noisy data, something to look into.
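
The median-interval idea can be sketched in base R as follows. This is an illustration on made-up timestamps, not the implementation used in regularize_vpts(); the gap threshold of 1.5 times the median is an arbitrary assumption.

```r
# Made-up nominal times: a 5-minute series with one missing stretch
# between 15 and 30 minutes after the first profile.
t0 <- as.POSIXct("2016-09-02 00:00:00", tz = "UTC")
datetime <- t0 + 60 * c(0, 5, 10, 15, 30, 35)

# Spacing between consecutive profiles, in minutes.
step_min <- as.numeric(diff(datetime), units = "mins")

# Best guess of the sampling interval: the median spacing.
typical_step <- median(step_min)

# Flag steps much longer than the typical one as gaps
# (1.5 is an arbitrary illustrative threshold).
is_gap <- step_min > 1.5 * typical_step
```

Here the median spacing is 5 minutes and only the 15-minute step is flagged. A series in which the radar switches between intervals would not be handled by a single median, which is the limitation described above.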

To keep it simple I would stick to nominal time and stick to 'irregular time series of vertical profiles', the regularization on time-grid, and interpretation of what nominal time is, can then be left to the user.

I would vote for height_lower and height_upper if we really want two height columns. It's mostly useful if we would ever want to support changing bin sizes with altitudes. For now I feel a single height would do, but I don't have a strong preference
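
If two height columns were added, they could be derived from the existing height column roughly as below. The sketch assumes that height is the bottom of each bin (as stated earlier in this thread) and that all bins have the same size; the helper `add_height_bounds()` and the toy values are hypothetical.

```r
# Assumptions: `height` is the bottom of each bin and all bins share one
# size, inferred from the spacing between the unique heights in the table.
add_height_bounds <- function(df) {
  heights <- sort(unique(df$height))
  bin_size <- unique(diff(heights))
  stopifnot(length(bin_size) == 1)  # constant bin size only
  df$height_lower <- df$height
  df$height_upper <- df$height + bin_size
  df
}

# Toy profile (made-up values) with 200 m bins.
profile <- data.frame(height = c(0, 200, 400), dens = c(12, 8, 3))
add_height_bounds(profile)
```

The bin size has to be inferred from several rows here, which is exactly the information a single row does not carry without an explicit upper bound.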

@peterdesmet
Member Author

Thanks for the clarification @adokter. @bart1, how do you feel about one vs two height columns?

@peterdesmet
Member Author

Question: should any VPTS data contain all the columns presented above, or is it ok if some columns are omitted?

peterdesmet added a commit that referenced this issue May 6, 2022
@peterdesmet
Member Author

The format is now described at https://enram.github.io/vpts/format/. Please review and create issues for items that are unclear or need change. Input is also welcome on the issues labelled help wanted.
