Incorrect Data #15

Closed
probonobuddy opened this issue Jun 4, 2020 · 4 comments

Comments

@probonobuddy

Using the zipped files from legacy data and Python, I have found 1241 instances where the "posteam" is also recorded as the "defteam". It only happens in games involving Jacksonville and the LA Rams (a renaming issue?). The relevant "game_id"s are:
2009110804
2009112204
2009120605
2009121305
2009121700
2010092610
2010100308
2010122608
2015102500
2015111504
2015111900
2015112906
2015120609
2015121309
2015121700
2015122003

In game 2009110804, the only plays where JAX is reported as "posteam" with KC as "defteam" are kickoffs. Every kickoff in that game lists JAX as "posteam" and KC as "defteam", regardless of which team is kicking. All other plays where JAX is the "posteam" also list JAX as the "defteam".

The same pattern holds for 2009112204 (another JAX game).

The same pattern holds for 2015111504 (with LA in place of JAX).

I am not sure if this problem is specific to the .gz files.
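
For anyone who wants to repeat this check, here is a minimal sketch using only the Python standard library. It is not the script behind the count above. The column names `game_id`, `posteam`, and `defteam` come from the data, while the sample rows and the helper name `count_self_opponent_rows` are made up for illustration.

```python
import csv
import io
from collections import Counter

# Tiny made-up sample standing in for one of the legacy CSV files.
SAMPLE_CSV = """game_id,posteam,defteam
2009110804,JAX,KC
2009110804,JAX,JAX
2009110804,JAX,JAX
2009110804,KC,JAX
2015111504,LA,LA
"""


def count_self_opponent_rows(rows):
    """Count, per game_id, the rows where posteam equals defteam."""
    counts = Counter()
    for row in rows:
        posteam = row["posteam"]
        # Skip rows with no team in possession (empty or NA).
        if posteam and posteam != "NA" and posteam == row["defteam"]:
            counts[row["game_id"]] += 1
    return counts


bad_rows = count_self_opponent_rows(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for game_id, n in sorted(bad_rows.items()):
    print(game_id, n)
```

To run it against a real `.gz` file, the `io.StringIO` object can be swapped for a handle from `gzip.open(path, "rt", newline="")`, where `path` is wherever the file was saved.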

@guga31bb
Member

guga31bb commented Jun 4, 2020

Thank you for raising this! I'm fairly certain this is a problem with the underlying data from NFL where team name changes messed them up. We actually discovered this issue yesterday and fixed it for the next version of the package (not released yet), but only noticed it for JAX. So I need to check on STL and will leave this open.

@guga31bb
Member

guga31bb commented Jun 5, 2020

I looked into this with the new data source, and it seems to be fixed. For future reference, here is my code for checking the ratio of away plays to home plays in each game:

library(tidyverse)

seasons <- 1999:2019
pbp <- purrr::map_df(seasons, function(x) {
  readRDS(
    url(
      glue::glue("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_{x}.rds")
    )
  )
})

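# Count plays per game and possession team.
# (Named play_counts rather than sum to avoid masking base::sum.)
play_counts <- pbp %>%
  filter(!is.na(posteam), !is.na(defteam), posteam != "") %>%
  group_by(game_id, posteam, home_team) %>%
  summarize(n = n())

# Spread to one row per game with away and home play counts, then take the ratio.
gs <- play_counts %>%
  ungroup() %>%
  mutate(home = if_else(posteam == home_team, 1, 0)) %>%
  select(game_id, home, n) %>%
  pivot_wider(
    names_from = home,
    values_from = n
  ) %>%
  dplyr::rename(
    away = `0`,
    home = `1`
  ) %>%
  mutate(ratio = away / home)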

Doing this led me to find some games with way too many plays. Here's an example where duplicates need to be removed ('2007_08_IND_CAR'):

[image: screenshot attached to the original comment]
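
For readers working in Python, as the reporter was, the same away-to-home ratio check can be sketched with the standard library alone. This is an illustration of the idea in the R code above, not a port of it. The team codes, game ids, and the helper name `away_home_ratio` are invented.

```python
from collections import defaultdict

# Made-up plays: (game_id, posteam, home_team). Not real data.
PLAYS = [
    ("g1", "AAA", "AAA"),
    ("g1", "AAA", "AAA"),
    ("g1", "BBB", "AAA"),
    ("g1", "BBB", "AAA"),
    ("g2", "CCC", "CCC"),
    ("g2", "DDD", "CCC"),
    ("g2", "DDD", "CCC"),
    ("g2", "DDD", "CCC"),
    ("g2", "", "CCC"),  # no possession team; ignored
]


def away_home_ratio(plays):
    """Return {game_id: away_plays / home_plays}, skipping rows without a posteam."""
    counts = defaultdict(lambda: {"home": 0, "away": 0})
    for game_id, posteam, home_team in plays:
        if not posteam:
            continue
        side = "home" if posteam == home_team else "away"
        counts[game_id][side] += 1
    # A game with zero home plays has no defined ratio, so report None.
    return {
        game_id: (c["away"] / c["home"] if c["home"] else None)
        for game_id, c in counts.items()
    }


ratios = away_home_ratio(PLAYS)
for game_id, ratio in sorted(ratios.items()):
    print(game_id, ratio)
```

A ratio far from 1, or a missing one, is a hint that possession was assigned to the wrong team for much of the game. What counts as "far" is a judgment call that this sketch does not settle.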

@guga31bb
Member

guga31bb commented Jun 5, 2020

This was caused by duplicate play_id values interfering with the step that adds CPOE, which created duplicate rows. It will be fixed when we push the update.
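
A quick way to spot this kind of duplication is to count how often each (game_id, play_id) pair appears. The sketch below assumes that a pair should be unique within the data. The play ids shown are made up, and `duplicated_keys` is a hypothetical helper name.

```python
from collections import Counter

# Made-up rows: (game_id, play_id). The repeated pair is deliberate.
ROWS = [
    ("2007_08_IND_CAR", 1),
    ("2007_08_IND_CAR", 2),
    ("2007_08_IND_CAR", 2),
    ("2007_08_IND_CAR", 3),
    ("other_game", 2),
]


def duplicated_keys(rows):
    """Return the (game_id, play_id) pairs that occur more than once, with counts."""
    counts = Counter(rows)
    return {key: n for key, n in counts.items() if n > 1}


dupes = duplicated_keys(ROWS)
for (game_id, play_id), n in sorted(dupes.items()):
    print(game_id, play_id, n)
```

The same play_id in two different games is not flagged, because the key is the pair rather than play_id alone.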

@guga31bb guga31bb closed this as completed Jun 5, 2020
@guga31bb
Member

guga31bb commented Jun 5, 2020

One more comment: sadly, we can't fix anything in legacy-data because the NFL removed the underlying data source.

2 participants