2023 data #251

Open
wengraf opened this issue Oct 1, 2024 · 8 comments

Comments

@wengraf
Contributor

wengraf commented Oct 1, 2024

Hi:

I'm struggling to format the 2023 data (actually, I want 2004 to 2023, so I would prefer to use dft-road-casualty-statistics-casualty-1979-latest-published-year.csv etc.).

library(readr)    # provides read_csv()
library(stats19)  # provides format_casualties()
downloader::download("https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-casualty-1979-latest-published-year.csv", "cas.csv")
cas.df <- read_csv("cas.csv")
cas.df <- format_casualties(cas.df)

The code above gives 13 warning messages, and puts in an awful lot of NAs.

I tried this:
x = get_stats19(2023, silent = FALSE, type = "collision", output_format = "data.frame")

with the plan that I'd loop through the relevant years and bind_rows. Unfortunately, it can't find 2023 data. From what I can work out, there is a table of links to which the function (or some other function nested within it) refers. The 2023 links aren't up yet. Would it be possible to make updating that table a user-updateable process? Or, if the single year isn't found when using get_stats19(), to fall back on the 1979 to latest date set of files in the first instance?
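
A minimal sketch of the loop-and-bind plan described above, not code from the package. The helper name `fetch_years` and its `fetch_one` argument are hypothetical; the idea is only that a year whose download fails is skipped and reported instead of stopping the whole loop. Base `rbind` is used to keep the sketch dependency-free; `dplyr::bind_rows` could be swapped in.

```r
# Hypothetical helper (not part of stats19): fetch each year with `fetch_one`,
# skip years that raise an error, and stack the years that succeeded.
fetch_years <- function(years, fetch_one) {
  results <- lapply(years, function(yr) {
    tryCatch(
      fetch_one(yr),
      error = function(e) {
        message("Skipping ", yr, ": ", conditionMessage(e))
        NULL  # mark this year as missing
      }
    )
  })
  is_missing <- vapply(results, is.null, logical(1))
  list(
    data = do.call(rbind, results[!is_missing]),  # NULL if every year failed
    missing = years[is_missing]                   # years to source elsewhere
  )
}

# Intended use (needs network access, so left commented out):
# collisions <- fetch_years(2004:2023, function(yr) {
#   stats19::get_stats19(yr, type = "collision", silent = TRUE,
#                        output_format = "data.frame")
# })
# collisions$missing  # years that could not be downloaded
```

Any years listed in `missing` could then be taken from the 1979 to latest-published-year file instead.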

@Robinlovelace
Member

Boom, great it's out there, thanks for the nudge. On the case! Cc @layik, could be a good 'bonus' one for our hackathon on Thursday: https://github.com/Robinlovelace/netvishack/

@wengraf
Contributor Author

wengraf commented Oct 1, 2024

Hi @Robinlovelace: if it is going to be a quick update, I'll wait before I continue the piece of work I'm currently doing. Thanks!

@Robinlovelace
Member

Good motivation to be fast, will try to do by end of play today.

@R-M-J-P

R-M-J-P commented Nov 7, 2024

@Robinlovelace I think the NAs are being introduced by one line within the format_stats19() function.

Specifically, the line highlighted in yellow below.
I think the line above it successfully retains the original value in cases where NAs are introduced (for example, for variables where -1 isn't declared as a level in the schema).
The yellow-highlighted line introduces NAs because it is (often) trying to convert characters to integers (the original class being integer, at least in some cases).
[image: screenshot of the code with one line highlighted in yellow]

When I declared a version of format_stats19() with that yellow line commented out, the formatting appeared to work correctly when calling format_stats19() directly:

format_stats19(casualties_2023, type = "Casualty")
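
The coercion problem described above can be reproduced in isolation. This is only an illustrative sketch with made-up labels and values; it is not the package's code, and the exact line in the screenshot is not reproduced here. It assumes a lookup step that keeps the original value when a code is missing from the schema, followed by a conversion back to the column's original class.

```r
# Hypothetical schema lookup: code -> label. Note that -1 has no entry.
labels <- c("1" = "Fatal", "2" = "Serious", "3" = "Slight")

# Raw integer codes, as they might be read from the CSV.
original <- c(1L, 3L, -1L, 2L)

# Look up labels; codes not in the schema come back as NA.
lookup <- unname(labels[as.character(original)])

# Step 1: keep the original value wherever the lookup failed.
# The result is now a character vector mixing labels and "-1".
formatted <- ifelse(is.na(lookup), as.character(original), lookup)

# Step 2: convert back to the original class (integer here).
# Labels such as "Fatal" cannot be parsed as integers, so they become NA;
# only the unlabelled "-1" survives.
coerced <- suppressWarnings(as.integer(formatted))

print(formatted)
print(coerced)
```

Dropping step 2 leaves `formatted` intact, which is consistent with the behaviour reported above when the highlighted line is commented out.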

Note, though, that format_stats19() doesn't exist in the current version of the stats19 package; I obtained the function from https://github.com/ropensci/stats19/blob/master/R/format.R

There might be a good reason why that function isn't available in the current version of the package.

@ar0berts

ar0berts commented Dec 25, 2024

I think there is some urgency to getting 2023 and the partial 2024 stats into scope for stats19. In response to a claim by the Secretary of State for Wales in Parliament I looked at the data from 2019 through to the partial data for 2024. Making loads of assumptions and taking Welsh statistics as a proportion of total statistics it seems that the introduction of 20mph speed limits has reduced fatalities in the 2024 data. Robust statistical analysis would have three potential benefits:

  1. Quantifying the benefits in terms of lives saved (and cost savings) would enhance driver compliance.
  2. Geospatial statistics would enable a refinement of the restriction zones, making them more resistant to a change of Senedd government reversing the policy. A machine learning approach to zones relative to shops, schools, play areas, pedestrian routes, housing, etc. could be envisaged.
  3. Getting the application refined would open the way for a similar policy change to be rolled out across England with precision.

image

@Robinlovelace
Member

I will look to fix this in the coming week.

@wengraf
Contributor Author

wengraf commented Dec 29, 2024

Thanks @Robinlovelace. In the end, I modified the package code in an ugly way to get my particular project over the line, but it would be great if the package could handle the year change on its own, with little or no human intervention, in the future. I think provisional data (and how to code in appropriate guidance on its use) should be a separate issue.

@BlaiseKelly

I was using the version that I cloned in July and was able to use the 2023 data with no problem. Digging into the code, it seems the line @R-M-J-P mentions is the difference? I'm not sure when that line was added.
