AFLTables Extract Has Fewer Unique IDs Than Debutants On AFLTables #72

Open
TonyCorke opened this issue Apr 14, 2019 · 8 comments

@TonyCorke

According to AFLTables as at the end of R3 2019, there have been 12,710 debutants. There are only 12,703 unique IDs in the AFLTables extract.

``` r
library(fitzRoy)

stats = get_afltables_stats(start_date = "1897-01-01", end_date = "2019-06-01") 
#> Returning data from 1897-01-01 to 2019-06-01
#> Finished getting afltables data
length(unique(stats$ID))
#> [1] 12703
setdiff(1:12710, unique(stats$ID))
#> [1] 12581 12677 12678 12679 12680 12681 12682 12683
setdiff(unique(stats$ID), 1:12710)
#> [1] 0
```

Created on 2019-04-14 by the reprex package (v0.2.1)
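
The reprex output can be reconciled with a little arithmetic: eight IDs in `1:12710` are absent, but one unexpected ID (`0`) is present, which leaves the net shortfall of seven. A quick check using only the numbers printed above, on the assumption (shared with the reprex) that IDs are meant to run from 1 to 12710:

```r
# Numbers copied from the reprex output; nothing is downloaded here.
debutants_on_afltables <- 12710
missing_ids <- c(12581, 12677, 12678, 12679, 12680, 12681, 12682, 12683)
unexpected_ids <- 0

# Unique IDs = expected range, minus the gaps, plus the stray ID outside the range.
unique_ids <- debutants_on_afltables - length(missing_ids) + length(unexpected_ids)
unique_ids
#> [1] 12703
```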

@afableco

The differences are due to miscodings. The following accounts for the 7-player discrepancy that @TonyCorke highlighted:

  1. Arthur Davidson: there have been two. The first played for Fitzroy in 1897, but the games he played were recorded against his contemporary Alex Davidson (ID 4350). The second (ID 2755) played for Hawthorn in 1939.
  2. George McLeod: there have been two, and both played for St Kilda at the turn of the century. The one who played in 1903 has been recorded as the other George McLeod.
  3. Archie Richardson (ID 4528): this is an interesting one. Archie played in the VFA for Richmond, and it was believed that he also played for St Kilda, although this now seems to be discredited. The games have been credited to three separate Richardsons: 1898 Mr Richardson, 1900 William Richardson, and 1901 Alfred Richardson.
  4. Jim Dorgan (ID 2796), who played for South Melbourne, has been coded in place of Jack Dorgan, who played for Melbourne.
  5. Alex Johnston (ID 5395) played only once for Richmond in 1908, but you have him playing twice. You do not have Walter Johnston, who played in Round 8.

The following are inconsistencies between the names used in the fitzRoy package and AFL Tables. They don't affect the statistics, but may cause problems if the data sitting behind the player IDs is ever rescraped.

  1. Kelly Robinson (ID 4375) should be James Robinson
  2. Alf McDougall (ID 4577) should be Abe McDougall
  3. Alex Barningham (ID 5760) should be Alick Barningham
  4. Phonse Hayes (ID 7212) this matches up with Australianfootball.com, but AFL Tables has Alf Hayes
  5. Allan Rogers (ID 3613) this matches up with Australianfootball.com, but AFL Tables has Allen Rogers
  6. Andy McDonnell (ID 5240) should be McDonell
  7. Arch Middleton (ID 4264) is listed on AFL Tables twice - once as Arch Middleton and once as Arthur Middleton. Australianfootball.com has him as the latter. This is not double counted because all the AFL Tables entries refer to Arch.
  8. Garry Lowe (ID 10792) this matches up with Australianfootball.com, but AFL Tables has Gary Lowe.
  9. Harrison Himmelberg (ID 12462) AFL Tables uses Harry.
  10. Jack Matthews (ID 8376) this matches up with Australianfootball.com, but AFL Tables has Mathews.
  11. Jay Kennedy-Harris (ID 12245) AFL Tables does not hyphenate the Kennedy Harris.
  12. Bob Hooper (ID 4695) AFL Tables uses John rather than the shortened form of 'Bobadil'.
  13. Matthew de Boer (ID 11746) AFL Tables uses Matt.
  14. Patrick Ryder (ID 4144) AFL Tables uses Paddy.
  15. Ernie Blencowe (ID 5234) is listed on AFL Tables twice - once as Percy Blencowe and once as Ernie Blencowe. Australianfootball.com has him as the latter. This is not double counted because all the AFL Tables entries refer to Percy.
  16. Jim Darcy (ID 4318) should be Tom Darcy.
  17. Pos Watson (ID 4570) AFL Tables has this as Unknown Watson, but it has Pos's date of birth etc.
  18. Terry De Konning (ID 11103) should be De Koning.
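
Name mismatches like those listed above could be surfaced by checking whether any player ID maps to more than one distinct name. A minimal sketch in base R, assuming the `ID`, `First.name` and `Surname` columns used elsewhere in this thread; the rows are invented placeholders, not real records:

```r
# Invented placeholder rows; only the column names follow the fitzRoy extract.
toy <- data.frame(
  ID         = c(1, 1, 2, 2, 3),
  First.name = c("Sam", "Sam", "Pat", "Patrick", "Lee"),
  Surname    = c("Example", "Example", "Sample", "Sample", "Demo"),
  stringsAsFactors = FALSE
)

# Count the distinct full names recorded against each ID.
full_name <- paste(toy$First.name, toy$Surname)
names_per_id <- tapply(full_name, toy$ID, function(x) length(unique(x)))

# IDs carrying more than one spelling are candidates for manual review.
ambiguous_ids <- names(names_per_id)[names_per_id > 1]
ambiguous_ids
#> [1] "2"
```

This only catches inconsistencies within one extract. Differences between the extract and the AFL Tables site itself would still need a comparison against a second source.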

@TonyCorke
Author

This is fabulous @afableco. Thank you.

Am I right that we're still one short of the seven we need though, as we get from the changes:

  • one new ID for the 1898 Arthur Davidson
  • one new ID for the 1903 George McLeod
  • three new IDs for the various Richardsons, BUT we lose ID 4528 (Archie Richardson) altogether because no games are associated with that ID, so the net gain is two
  • one new ID for the 1949 Jack Dorgan
  • one new ID for the Round 8 1908 Walter Johnston

So that's +7 and -1 for a net gain of 6.

Or, have I misinterpreted your explanation?

@afableco

Sadly, you are correct. I forgot to net off Archie Richardson. I will try and get back to this on the weekend to see if I can work out who else is missing.

@TonyCorke
Author

No rush at all - and thank you for looking at the issue I raised so quickly!

@afableco

The answer is Tom Darcy. In my original note, I said that Jim Darcy (ID 4318) should have been Tom Darcy, but it seems they are two separate people. Tom played for South Melbourne and had his first game on 1904-09-03, while Jim played for Essendon and had his first game on 1897-05-08.
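
With Tom Darcy included, the tally from earlier in the thread adds up to the seven missing IDs. A quick sketch of that arithmetic (the vector names are just labels for this example):

```r
# Net change in unique IDs per correction discussed in this thread.
net_new_ids <- c(
  arthur_davidson = 1,
  george_mcleod   = 1,
  richardsons     = 3 - 1,  # three new IDs, minus ID 4528, which is left with no games
  jack_dorgan     = 1,
  walter_johnston = 1,
  tom_darcy       = 1
)

total_net_new <- sum(net_new_ids)
total_net_new
#> [1] 7
```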

There are other issues with the data (eg Cam Rayner is recorded as Heber Quinton in 2018).

@TonyCorke
Author

Perfect! Thanks again.

Below is some code that can be used to patch the data:

``` r
library(fitzRoy)

dat <- get_afltables_stats(start_date = "1897-05-01", end_date = "2019-05-21")

# Each fix builds its row mask once, before any column is overwritten, so that
# every assignment in the fix hits the same rows.

# Fix Arthur Davidson (recorded as Alex Davidson)
idx <- dat$ID == 4350 & dat$Playing.for == "Fitzroy" & dat$Season == 1898 & dat$Round %in% c(7, 10)
dat$ID[idx] <- 15000
dat$First.name[idx] <- "Arthur"
dat$Surname[idx] <- "Davidson"

# Fix George McLeod (there were two)
idx <- dat$First.name == "George" & dat$Surname == "McLeod" & dat$Playing.for == "St Kilda" & dat$Season == 1903
dat$ID[idx] <- 15001

# Fix Archie Richardson (three different guys)
archie <- dat$First.name == "Archie" & dat$Surname == "Richardson" & dat$Playing.for == "St Kilda"

idx <- archie & dat$Season == 1898
dat$ID[idx] <- 15002
dat$First.name[idx] <- "Mr"
dat$Surname[idx] <- "Richardson"

idx <- archie & dat$Season == 1900
dat$ID[idx] <- 15003
dat$First.name[idx] <- "William"
dat$Surname[idx] <- "Richardson"

idx <- archie & dat$Season == 1901
dat$ID[idx] <- 15004
dat$First.name[idx] <- "Alfred"
dat$Surname[idx] <- "Richardson"

# Fix Jack Dorgan (recorded as Jim Dorgan)
idx <- dat$First.name == "Jim" & dat$Surname == "Dorgan" & dat$Season == 1949
dat$ID[idx] <- 15005
dat$First.name[idx] <- "Jack"
dat$Surname[idx] <- "Dorgan"

# Fix Walter Johnston (recorded as Alex Johnston)
idx <- dat$First.name == "Alex" & dat$Surname == "Johnston" &
  dat$Playing.for == "Richmond" & dat$Season == 1908 & dat$Round == 8
dat$ID[idx] <- 15006
dat$First.name[idx] <- "Walter"
dat$Surname[idx] <- "Johnston"

# Fix Tom Darcy (recorded as Jim)
idx <- dat$First.name == "Jim" & dat$Surname == "Darcy" &
  dat$Playing.for == "Sydney" & dat$Season == 1904 & dat$Round == 17
dat$ID[idx] <- 15007
dat$First.name[idx] <- "Tom"
dat$Surname[idx] <- "Darcy"
```
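
One pitfall with patches of this shape is worth spelling out: if a condition that tests `ID` is evaluated again after `ID` has been overwritten, it no longer matches any rows, so the follow-up assignments silently do nothing. A minimal sketch on a made-up three-row data frame (the rows are illustrative, not real match records):

```r
# Made-up rows for illustration; column names follow the fitzRoy extract.
toy <- data.frame(
  ID         = c(4350, 4350, 2755),
  First.name = c("Alex", "Alex", "Arthur"),
  Season     = c(1898, 1899, 1939),
  stringsAsFactors = FALSE
)

# Re-evaluating the condition: the second line matches nothing,
# because row 1 no longer has ID 4350.
stale <- toy
stale$ID[stale$ID == 4350 & stale$Season == 1898] <- 15000
stale$First.name[stale$ID == 4350 & stale$Season == 1898] <- "Arthur"
stale$First.name[1]
#> [1] "Alex"

# Computing the mask once keeps both assignments on the same row.
fixed <- toy
idx <- fixed$ID == 4350 & fixed$Season == 1898
fixed$ID[idx] <- 15000
fixed$First.name[idx] <- "Arthur"
fixed$First.name[1]
#> [1] "Arthur"
```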

@jimmyday12
Owner

Thanks heaps for all this guys. I'm going to try block out some time to focus on some of these in the coming weeks.

I will need to work out which issues are to do with fitzRoy, versus which are to do with the underlying data on afltables.com. My general philosophy is to leave things as they appear on afltables.com and try get Paul who runs the website to fix it there. But some helper functions to clean the data may also be useful - will have to think about it!

Thanks for all the work so far identifying them!

@jimmyday12 self-assigned this on Jul 2, 2019
@jimmyday12 added the bug (an unexpected problem or unintended behavior) and data-bug (bugs related to data and not fitzroy itself) labels on Jul 2, 2019
@jimmyday12 reopened this on Jul 2, 2019
@jimmyday12 removed the bug (an unexpected problem or unintended behavior) label on Jan 11, 2022
@peteowen1
Contributor

fixed by #235
