Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ensure rejected/missing data in peak flows are flagged as such in output #19

Closed
ilapros opened this issue Jul 30, 2019 · 12 comments
Closed

Comments

@ilapros
Copy link
Collaborator

ilapros commented Jul 30, 2019

For the use I do of the data (and for my knowledge of the data) there is an issue when retrieving the amax-flow and pot-flow: the NRFA flags some years as Rejected (for amax) and as Missing (for the pot) and this information is not available in the time-series format, but is only available in the feh-data format. In my winfapReader package I have functions to process the feh-data formats, and I am happy to rework those to be usable in the rnrfa package (I have a prototyped something this morning). I have two main issues with these:

  • don't know how these would work in parallel (I haven't looked at the parallel implementations in rnrfa)
  • the API downloads files when using the feh-data format in the browser, so I need to figure out if interrogating it via httr effectively saves files somewhere in the user's computer, which would not be good.

Once this issue is resolved I think we should discourage users to use the pot-flow/stage and amax-flow/stage types as they give data which could have some unreliable records.

@ilapros ilapros mentioned this issue Jul 30, 2019
@mattfry-ceh
Copy link

New metadata "field" now added so that excluded years and rejected periods can be accessed. e.g.:
https://nrfaapps.ceh.ac.uk/nrfa/ws/station-info?station=43010&format=json-object&fields=peak-flow-metadata

The list of excluded years is the same as the list in the “.AM” file. It takes the rejected periods into account so you don’t have to look at both if you just want the list of water-years.

This data is used to provide the red shading on the AMAX and POT plots on the NRFA web-site (peak flow tab of the gauging station pages, e.g. https://nrfa.ceh.ac.uk/data/station/peakflow/43010

It may be possible to integrate this information with the time series data, e.g. as data quality flags, in future, but we don't have a consistent pattern for that yet.

“missing” data periods is something else entirely. These are shaded yellow(ish?) on the POT plot, and listed in the missing data tab on the website. They sort-of appear in the “.PT” file, but in a processed form excluding the shorter missing periods. There is no API output for these at the moment.

There are plenty of issues with the reliability of the POT data, even if you make use of the missing data periods we have.

@ilapros
Copy link
Collaborator Author

ilapros commented Oct 22, 2019

Thanks Matt - the new field is quite useful.
I have been thinking about this on and off in the last months and I am a bit stuck deciding what would be best. Mostly the issue I have is that I don't think the output of the get_ts function should be different when type = "amax-flow", while if we include information on the years which are to be rejected (which is a good thing to do), we would need to change the output format.
Possible solutions (and I have some ways to implement most of these):

  1. leave things as they are (possibly adding a warning in the help file or printing it on screen)
  2. if type = "amax-flow", get_ts outputs the same information as now and automatically also gives the information on the missing year retrieved by the newly available field (can be easily done using a reworking of catalogue).
  3. if type = "amax-flow", get_ts gives a different output type than for the other types - this should be clearly documented in the help files
  4. drop the type = "amax-flow" option (but I plan to add a way to use the API within my winfapReader package rather than relying on the user having the files downloaded in a local repository).

All of the above can be done also for the type = "pot-flow" option, with a good amount of good which needs to be added if we go for option 3 (but maybe as Matt says we should really make it easy for people to get to the POT datasets since they are somewhat unreliable).
I am inclined to think option 1 or 2 are the best - and I would possibly add a reference to the winfapReader package in the help files or a vignette for those who might be interested.
@cvitolo - you have much more experience than me (and this is your package!) - so I'll go with your decision on this, but happy to talk further.

@ilapros
Copy link
Collaborator Author

ilapros commented Oct 28, 2019

FYI - I have in the meanwhile created some functions in winfapReader to obtain annual maxima, peaks over threshold and catchment descriptors from the NRFA API. These functions output objects which should be more similar to what one could derive from the winfap files. I have also added some ideas on how the winfapReader package could be combined with rnrfa in the vignette. I hope I have done justice to the NRFA view on the peak flow holdings and to the rnrfa package for its functionality. Comments and feedback would be very welcome!

@mattfry-ceh
Copy link

Has there been further development of this? I think an option that explicitly showed that a year was rejected would be best. Some users want to use it anyway (or reverse the series looking at only the rejected years, e.g. after a reservoir has been built).

We could include rejected flags within the time series formats from the NRFA API, but I don't know exactly how RNRFA makes use of these.

For POT data it's also important to have the missing data periods, and these wouldn't fit with the time series data, so would need to be metadata or a separate end point (which seems excessive). Are these needed explicitly as well as the rejected periods?

@mattfry-ceh
Copy link

And incidentally it seems httr has in built functions to check and manage caching: https://github.com/r-lib/httr/blob/master/R/cache.R

@ilapros
Copy link
Collaborator Author

ilapros commented Dec 17, 2019

Has there been further development of this? I think an option that explicitly showed that a year was rejected would be best. Some users want to use it anyway (or reverse the series looking at only the rejected years, e.g. after a reservoir has been built).

We could include rejected flags within the time series formats from the NRFA API, but I don't know exactly how RNRFA makes use of these.

Hi Matt - changing the output of the NRFA API could make the information easier to collate, but the functions I built in winfapReader (happy to change the name by the way) do give information on the rejected year. I think we need @cvitolo to weigh in on this since it would mean that for some data types the get_ts function would output something with different formats

For POT data it's also important to have the missing data periods, and these wouldn't fit with the time series data, so would need to be metadata or a separate end point (which seems excessive). Are these needed explicitly as well as the rejected periods?

And indeed it is to complicated to add the missing/rejected year information on the POT records. I am inclined to have separate functions for these records, either via winfapReader or with functions within RNRFA. Again - I think @cvitolo gets the final word on this

Re: caching. I was worried about getting too much volume of requests to the API, I'll have a look into it although I know nothing about it, thanks for the pointer.

@cvitolo
Copy link
Owner

cvitolo commented Dec 20, 2019

Hi @ilapros and @mattfry-ceh!
Thanks for reporting this and apologise for the delay in responding. I think it would be very helpful if you could provide a reproducible example. @ilapros can you provide the ID of a station for which there are rejected/missing data?

@ilapros
Copy link
Collaborator Author

ilapros commented Dec 20, 2019

Of sorry - should have thought of that - repex for annual maxima below

remotes::install_github("ilapros/winfapReader")  
## station 43010 has several rejected peak flow value - see 
## https://nrfa.ceh.ac.uk/data/station/peakflow/43010 

## read the data from rnrfa 
r43010 <- rnrfa::get_ts(id = 43010, type = "amax-flow") 
head(r43010) 
class(r43010) ; dim(r43010)

## read the data with winfapReader 
w43010 <- winfapReader::get_amax(station = 43010) 
head(w43010) 
class(w43010) ; dim(w43010)
## the same values
## but with the information on the years in which the record is rejected 

The issue is that rnrfa at present doesn't report the missing information. This is even more complicated for the pot data, where a missing year might be thought of as a year with no peaks above the threshold (see for example https://nrfa.ceh.ac.uk/data/station/peakflow/39003 for a station with pot periods of missing recording and rejected years)

### peaks over threshold 
## read the data from rnrfa 
r39003 <- rnrfa::get_ts(id = 39003, type = "pot-flow") 
head(r39003) 
class(r39003) ; dim(r39003)

## read the data with winfapReader 
w39003 <- winfapReader::get_pot(station = 39003) 
head(w39003$tablePOT); head(w39003$WaterYearInfo) 
class(w39003) ; sapply(w39003,dim)
## the same values
## but the WaterYearInfo table gives us the information of rejected/missing years

The root of the problem is that peak flow data have mostly been used in a different way than the daily data I think.

@mattfry-ceh
Copy link

Just to clarify, I think the model is that there are rejected periods and missing periods, which both have a start and end date which doesn't have to conform to a year start / end.

See https://nrfa.ceh.ac.uk/data/station/peakflow/39003 and select "Missing Data" from the dropdown to see example.

These combine to define "Rejected" Ann Max values, i.e. a significant part of the year was missing or unrepresentative. The significance is manually defined, i.e. it's not a question of automatically rejecting a year because any of it was missing. So the NRFA basically stores a set of rejected years for AM values.

POT data are also flagged as rejected, but this is solely because they fall in an unrepresentative period. There is no concept of rejected years of POT data by default.

To properly use this I think the NRFA API should be able to provide:

  • a service for unrepresentative periods
  • a service for missing data periods
  • enhance the AMAX service with flags to state a value is rejected (due to being unrepresentative or missing data)
  • enhance the POT service with flags to state a values is rejected (due to being unrepresentative)

But any user of the POT data will need to know about the full set of missing data periods in order to properly use the data.

cvitolo added a commit that referenced this issue Dec 22, 2019
Major changes:

1. Fixed issue #19 whereby rejected/missing data in peak flows are flagged as such in output. Added full_info to input parameters to retrieve data quality flags.
2. timeseries are now classed as zoo object, not xts.
3. startseason and endseason in seasonal_averages() are now deprecated, seasons are labelled by the calendar quarter in which the season ends.

Minor changes:

- Added more tests
@cvitolo
Copy link
Owner

cvitolo commented Dec 22, 2019

I just committed my first attempt to fix this issue.

Basically I added an extra input parameter, called full_info to get_ts() which allows to retrieve data quality flags.

Here is how it works with amax, for which flags are derived from peak-flow-rejected-amax-years:

# TEST AMAX
# If I do not use the full_info input argument, I get the usual result
r39003 <- rnrfa::get_ts(id = 39003, type = "amax-flow")
head(r39003)

# If I use the full_info input argument, I get an additional column with
# data quality flags: 1 = rejected and 0 = not rejected
# note the time series class is simply zoo, no longer zoo/xts
r39003 <- rnrfa::get_ts(id = 39003, type = "amax-flow", full_info = TRUE)
head(r39003)
class(r39003) ; dim(r39003)

## read the data with winfapReader 
w39003 <- winfapReader::get_amax(station = 39003)
head(w39003)

# compare results from rnrfa and winfapReader
all(r39003$rejected == w39003$Rejected)

With pot, the quality flags are derived from peak-flow-rejected-periods. In this case, the result is not directly comparable with winfapReader. Here is an example:

# TEST POT
# If I use the full_info input argument, I get an additional column with flags
r39003pot <- rnrfa::get_ts(id = 39003, type = "pot-flow", full_info = TRUE)
head(r39003pot)
table(r39003pot$rejected)

@mattfry-ceh @ilapros please let me know what you think.

@ilapros
Copy link
Collaborator Author

ilapros commented Mar 15, 2020

Hi @cvitolo
the new option sounds great - I have made a small pull request #21 to fix issues when the POT are complete, but I think your fix closes the issue. I will still keep winfapReader as a separate package - will try to tidy things up and make it into a real package.
Thanks!
Ilaria

@cvitolo
Copy link
Owner

cvitolo commented Apr 5, 2020

Thanks @ilapros ! Just merged your PR

@cvitolo cvitolo closed this as completed Apr 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants