Plays with Laterals Will Not Have the Correct Yardage Gained #216
Comments
Play in question
Maybe we could fix yards_gained by using the lateral yards on plays with laterals.
Usually within 20 minutes or so when using the R package; the data repo is rebuilt overnight.
That would be wrong, especially on plays with multiple laterals. In my opinion, we have to split this up by pass and rush plays.
library(tidyverse)
pbp_db %>%
  filter(pass == 1, passing_yards != yards_gained) %>%
  select(game_id, play_id, yards_gained, passing_yards, receiving_yards, lateral_receiving_yards, desc) %>%
  collect() %>%
  mutate(lateral = str_detect(tolower(desc), fixed("lateral")))
#> # A tibble: 188 x 8
#> game_id play_id yards_gained passing_yards receiving_yards lateral_receiving_ya~ desc lateral
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <lgl>
#> 1 1999_04_CAR~ 4569 12 14 5 12 (:03) S.Beuerlein pass to M.Muhammad to CAR 31 for 5 yards. Lateral to A.Johnson to CAR 28 for -3 yard~ TRUE
#> 2 1999_04_TEN~ 3968 22 25 3 22 (2:25) (Shotgun) N.O'Donnell pass to M.Roan to TEN 46 for 3 yards. Lateral to E.George ran ob at SF 32~ TRUE
#> 3 1999_05_CHI~ 3887 2 12 10 2 (4:50) R.Cunningham pass to R.Moss to CHI 40 for 10 yards. Lateral to C.Carter ran ob at CHI 38 for 2 ~ TRUE
#> 4 1999_16_DEN~ 3870 5 17 12 5 (:14) C.Batch pass to H.Moore to DET 40 for 12 yards. Lateral to G.Crowell to DET 45 for 5 yards (C.Wa~ TRUE
#> 5 2000_03_MIN~ 756 0 27 27 0 (1:30) D.Culpepper pass to C.Carter to NE 36 for 27 yards. Lateral to R.Moss ran ob at NE 36 for no ga~ TRUE
#> 6 2000_06_TB_~ 3525 26 30 4 26 (:49) (Shotgun) S.King pass to D.Moore to TB 27 for 4 yards. Lateral to W.Dunn pushed ob at MIN 47 for~ TRUE
#> 7 2000_08_DET~ 3774 8 12 4 8 (1:27) S.King pass to D.Moore to TB 48 for 4 yards. Lateral to W.Dunn ran ob at DET 44 for 8 yards (R.~ TRUE
#> 8 2000_09_STL~ 3211 11 18 7 11 (15:00) T.Green pass to I.Bruce to STL 49 for 7 yards. Lateral to R.Holcombe to SF 40 for 11 yards (A.~ TRUE
#> 9 2000_10_CAR~ 3536 3 19 16 3 (:07) T.Green pass to I.Bruce to STL 44 for 16 yards. Lateral to A.Hakim to STL 47 for 3 yards (E.Robi~ TRUE
#> 10 2000_10_KC_~ 4270 6 17 11 6 (:28) E.Grbac pass to S.Morris to OAK 35 for 11 yards. Lateral to T.Richardson to OAK 29 for 6 yards (~ TRUE
#> # ... with 178 more rows
pbp_db %>%
  filter(rush == 1, rushing_yards != yards_gained) %>%
  select(game_id, play_id, yards_gained, rushing_yards, lateral_rushing_yards, desc) %>%
  collect() %>%
  mutate(lateral = str_detect(tolower(desc), fixed("lateral")))
#> # A tibble: 31 x 7
#> game_id play_id yards_gained rushing_yards lateral_rushing_ya~ desc lateral
#> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <lgl>
#> 1 2000_17_TB_~ 1654 11 8 11 (1:50) (Shotgun) S.King left end to TB 44 for 8 yards. Lateral to W.Dunn to GB 45 for 11 yards (B.Harris, C.Hunt). TRUE
#> 2 2001_06_STL~ 1523 44 12 44 (3:22) 81-A.Hakim right end to NYJ 44 for 12 yards. Lateral to 24-T.Canidate for 44 yards, TOUCHDOWN. Warner hands to #8~ TRUE
#> 3 2001_15_NYJ~ 3630 11 12 11 (1:37) 16-V.Testaverde up the middle to IND 32 for 12 yards. Lateral to 28-C.Martin to IND 21 for 11 yards. TRUE
#> 4 2002_10_HOU~ 854 6 0 6 (15:00) (Punt formation) 7-C.Stanley Aborted. 80-S.McDermott FUMBLES at HOU 37, recovered by HOU-7-C.Stanley at HOU 23. ~ FALSE
#> 5 2005_02_JAX~ 605 0 4 0 (3:42) 28-F.Taylor up the middle to IND 39 for 4 yards. Lateral to 18-M.Jones to IND 36 for 3 yards. FUMBLES, and recove~ TRUE
#> 6 2006_03_GB_~ 1501 5 1 5 (6:10) 83-A.Hakim right end to DET 45 for 1 yard. Lateral to 29-B.Calhoun to 50 for 5 yards (22-M.Manuel). TRUE
#> 7 2006_10_NO_~ 3658 4 0 4 (8:09) 26-D.McAllister Aborted. 52-J.Faine FUMBLES at PIT 4, recovered by NO-26-D.McAllister at PIT 4. 26-D.McAllister f~ FALSE
#> 8 2007_04_STL~ 1991 4 0 4 (:56) (Shotgun) 9-T.Romo Aborted. 65-A.Gurode FUMBLES at STL 50, recovered by DAL-9-T.Romo at DAL 17. 9-T.Romo ran ob at~ FALSE
#> 9 2007_15_IND~ 1923 0 5 0 (:01) (Shotgun) 12-J.McCown up the middle to OAK 48 for 5 yards. Lateral to 33-D.Rhodes to OAK 48 for no gain. FUMBLES, ~ TRUE
#> 10 2008_08_OAK~ 1678 19 2 19 (2:00) (Shotgun) 10-T.Smith right end to BAL 46 for 2 yards. Lateral to 27-R.Rice to OAK 35 for 19 yards (31-H.Eugene). TRUE
#> # ... with 21 more rows
Created on 2021-03-11 by the reprex package (v1.0.0)
Solution
So my suggestion is
yards_gained = dplyr::case_when(
  !is.na(.data$passing_yards) &
    .data$yards_gained != .data$passing_yards &
    .data$penalty == 0 ~ .data$passing_yards,
  !is.na(.data$rushing_yards) &
    !is.na(.data$lateral_rushing_yards) &
    .data$yards_gained != .data$rushing_yards &
    .data$penalty == 0 ~ .data$rushing_yards + .data$lateral_rushing_yards,
  TRUE ~ yards_gained
)
and if there will be a play with multiple instances of |
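To see what the rule suggested above does, here is a minimal base R sketch applied to two rows copied from the reprex output in this thread (the 1999 TEN pass play and the 2000 TB rush play). The data frame `plays` and the helper `fix_yards_gained()` are made up for illustration; this is not how nflfastR implements the fix, only a hand-rolled equivalent of the proposed `case_when()` logic.

```r
# Toy data: values copied from two rows of the reprex output in this thread.
plays <- data.frame(
  yards_gained          = c(22, 11),
  passing_yards         = c(25, NA),
  rushing_yards         = c(NA, 8),
  lateral_rushing_yards = c(NA, 11),
  penalty               = c(0, 0)
)

# Hypothetical helper: base R equivalent of the proposed case_when() rule.
fix_yards_gained <- function(df) {
  # `%in% TRUE` turns an NA comparison into FALSE, mirroring case_when(),
  # which treats an NA condition as "no match".
  pass_fix <- (!is.na(df$passing_yards) &
    df$yards_gained != df$passing_yards &
    df$penalty == 0) %in% TRUE
  rush_fix <- (!is.na(df$rushing_yards) &
    !is.na(df$lateral_rushing_yards) &
    df$yards_gained != df$rushing_yards &
    df$penalty == 0) %in% TRUE

  fixed <- df$yards_gained
  # The pass rule comes first in case_when(), so it wins if both masks match.
  rush_only <- rush_fix & !pass_fix
  fixed[rush_only] <- (df$rushing_yards + df$lateral_rushing_yards)[rush_only]
  fixed[pass_fix] <- df$passing_yards[pass_fix]
  fixed
}

plays$yards_gained_fixed <- fix_yards_gained(plays)
print(plays[, c("yards_gained", "yards_gained_fixed")])
```

With these inputs the pass play moves from 22 to 25 yards and the rush play from 11 to 19 yards (8 on the run plus 11 on the lateral).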
Here is code to scrape offensive yards from nfl.com for further checks:
library(dplyr, warn.conflicts = FALSE)
options(dplyr.summarise.inform = FALSE)
options(tibble.print_min = 32)
passing <- rvest::read_html("https://www.nfl.com/stats/team-stats/offense/passing/2020/reg/all") %>%
  rvest::html_table() %>%
  purrr::pluck(1) %>%
  dplyr::mutate(Team = stringr::str_extract(Team, ".+(?=\\n)")) %>%
  janitor::clean_names() %>%
  dplyr::select(team:cmp, pass_yds)
rushing <- rvest::read_html("https://www.nfl.com/stats/team-stats/offense/rushing/2020/reg/all") %>%
  rvest::html_table() %>%
  purrr::pluck(1) %>%
  dplyr::mutate(Team = stringr::str_extract(Team, ".+(?=\\n)")) %>%
  janitor::clean_names() %>%
  dplyr::select(team:rush_yds)
passing %>%
  dplyr::left_join(rushing, by = "team") %>%
  dplyr::mutate(overall = pass_yds + rush_yds) %>%
  dplyr::arrange(dplyr::desc(overall)) %>%
  dplyr::select(team, overall)
#> # A tibble: 32 x 2
#> team overall
#> <chr> <int>
#> 1 Chiefs 6804
#> 2 Vikings 6548
#> 3 Titans 6516
#> 4 Bills 6509
#> 5 Packers 6417
#> 6 Cardinals 6339
#> 7 Chargers 6332
#> 8 Texans 6309
#> 9 Raiders 6299
#> 10 Cowboys 6299
#> 11 Buccaneers 6295
#> 12 Seahawks 6216
#> 13 Saints 6210
#> 14 49ers 6209
#> 15 Rams 6201
#> 16 Colts 6182
#> 17 Falcons 6152
#> 18 Browns 6075
#> 19 Ravens 5990
#> 20 Lions 5896
#> 21 Panthers 5833
#> 22 Eagles 5755
#> 23 Dolphins 5625
#> 24 Broncos 5591
#> 25 Bears 5572
#> 26 Steelers 5480
#> 27 Jaguars 5474
#> 28 Patriots 5470
#> 29 Bengals 5461
#> 30 Football Team 5407
#> 31 Giants 5104
#> 32 Jets 4798
Created on 2021-03-13 by the reprex package (v1.0.0)
I believe that the NFL passing page is showing gross passing yards (sack yards not deducted). When I subtract KC's sack yardage (151) from the 6804 that your scrape reported, I arrive at 6653, which is also KC's total offense on pfref.com. I strongly suspect that if you convert the gross passing yards to net passing yards, the stats will line up; the only necessary work should be to deduct sack yardage. (Sack yardage is on the far right of the NFL page you scraped.) They really should label what they are showing in that table better.
The rushing yardage seems to agree across all three sources. In your screenshot I can see KC's rushing is 1799, and it is also 1799 on NFL.com and pfref.com. It turns out there is no "gross rushing yardage", since negative runs are deducted as they occur.
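The arithmetic behind that check can be written out as a short R sketch. The numbers are the ones quoted in this thread for the 2020 Chiefs; the variable names are made up for illustration.

```r
# Figures quoted in this thread for the 2020 Chiefs.
gross_total <- 6804  # gross passing + rushing yards from the nfl.com scrape
sack_yards  <- 151   # sack yardage listed on the nfl.com passing page
rush_yards  <- 1799  # rushing yards, same on all three sources

# Net total offense: deduct sack yardage from the scraped total.
net_total <- gross_total - sack_yards

# The same result, split into its passing and rushing parts.
gross_pass <- gross_total - rush_yards
net_pass   <- gross_pass - sack_yards

cat("net total offense:", net_total, "\n")
cat("net passing yards:", net_pass, "\n")
```

The net total comes out to 6653, the pfref.com figure mentioned above, so the gap between the sources is the sack yardage.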
Seeing all of this excellent code almost makes it worth asking the question, even if the bugs don't get squished! Thanks for showing me such useful functions like clean_names() and pluck(). I can't wait to try to put these new functions into my arsenal! I have been a pretty rookie scraper myself, but maybe now I can be a little better at it. Thanks also for kindly working on this issue. |
I am sorry if I was supposed to take this code and try it out. I can do that if you would like and see if it clears up my issue. I wasn't sure whether this was an internal discussion between the creators of the tool, in which case I should wait for it to be implemented on your side, or whether I was supposed to incorporate this code into mine and expect the tool to remain unchanged.
QUICK CHECK LATER: Success! I cannot detect any problems with the 2020 season data. All of my YPP calculations are now correct! All I need to know is whether I should leave that snippet in my code or whether it will eventually make it into nflfastR.
Thanks for the incredibly fast reply and sharp work! I wish people who charged for products/support were as responsive as you have been. What an incredible service you are providing to the community.
Oh no worries, you are not supposed to use any of that code. I just dumped it in the issue for me and Ben. |
As soon as I am sure that |
Phew! I was nervous for a sec! Thanks for posting the correction code anyway. That allows your users to use the corrections before the final release. It felt good to see the errors disappear when I inserted the code into mine. I will now leave that snippet in until the new nflfastR is released. Thanks for the swift attention! |
The data in the data repo has been updated with this fix. |
For plays with laterals, the yardage gained is often incorrect. Notice the following 2nd quarter play from the CLE at CIN game in 2020 (2020_07_CLE_CIN).
(10:26) (Shotgun) 6-B.Mayfield pass short middle to 80-J.Landry to CLE 47 for -6 yards. Lateral to 27-K.Hunt to 50 for 3 yards (21-M.Alexander; 90-K.Kareem)
This play is listed as a 0-yard gain, when it should have been listed as a 3-yard loss (-6 yards on the pass plus 3 yards on the lateral).
Also, here is a desperation play by Houston at the end of their game with Cincinnati this year (2020_16_CIN_HOU).
(:09) (Shotgun) 4-D.Watson pass short left to 17-C.Hansen to HOU 38 for 8 yards. Lateral to 16-K.Coutee to HOU 29 for -9 yards. Lateral to 4-D.Watson to HOU 25 for -4 yards. Lateral to 13-B.Cooks to HOU 45 for 20 yards. Lateral to 74-M.Scharping to CIN 48 for 7 yards. Lateral to 67-C.Heck to CIN 40 for 8 yards (23-D.Phillips). FUMBLES (23-D.Phillips), ball out of bounds at CIN 40.
This play is listed as an 8-yard gain. It actually gained 30 yards (43 yards of forward progress less 13 yards of backward movement).
The yardline_100 info appears correct (not thoroughly verified yet), so "internal" plays like the CLE example above might be detectable/correctable by sanity checks against that data. The final play of a game won't have a yardline_100 available on the next play, so that case may have to be dealt with differently. For those plays, the description text may have to be parsed to find all of the gains and losses.
My main concern with this information is calculating yards per play for each team. These discrepancies in the yards gained information make my yards per play inaccurate. If you could suggest a workaround for calculating yards per play for teams until the yards gained on plays with laterals are corrected, I would be very grateful.
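As a rough illustration of the description-parsing idea, the base R sketch below pulls every "for N yards" or "for no gain" segment out of a play description and sums them. The function name `parse_segment_yards()` and the regular expression are assumptions made for this example, not part of nflfastR. The pattern is only known to fit the two descriptions quoted in this report; descriptions with penalties, fumble returns, or other wording would need more care.

```r
# Hypothetical helper: extract the yardage of each segment of a play
# ("for 8 yards", "for -9 yards", "for 1 yard", "for no gain").
parse_segment_yards <- function(desc) {
  pattern <- "for (-?[0-9]+) yards?|for no gain"
  hits <- regmatches(desc, gregexpr(pattern, desc))[[1]]
  # "for no gain" counts as 0; otherwise keep only the digits and minus sign.
  yards <- ifelse(hits == "for no gain", "0", gsub("[^0-9-]", "", hits))
  as.integer(yards)
}

# The two play descriptions quoted in this report.
cle_desc <- "(10:26) (Shotgun) 6-B.Mayfield pass short middle to 80-J.Landry to CLE 47 for -6 yards. Lateral to 27-K.Hunt to 50 for 3 yards (21-M.Alexander; 90-K.Kareem)"
hou_desc <- "(:09) (Shotgun) 4-D.Watson pass short left to 17-C.Hansen to HOU 38 for 8 yards. Lateral to 16-K.Coutee to HOU 29 for -9 yards. Lateral to 4-D.Watson to HOU 25 for -4 yards. Lateral to 13-B.Cooks to HOU 45 for 20 yards. Lateral to 74-M.Scharping to CIN 48 for 7 yards. Lateral to 67-C.Heck to CIN 40 for 8 yards (23-D.Phillips). FUMBLES (23-D.Phillips), ball out of bounds at CIN 40."

cle_total <- sum(parse_segment_yards(cle_desc))
hou_total <- sum(parse_segment_yards(hou_desc))
cat("CLE play:", cle_total, "yards\n")
cat("HOU play:", hou_total, "yards\n")
```

Summing the segments gives -3 for the CLE play and 30 for the HOU play, matching the hand counts above.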
Thanks again for reading my report, and for creating the most magnificent R tool. Have a tremendous day!
OFF TOPIC QUESTION: I am new to the tool. I haven't used it in season yet. How soon after completion of a game is data usually available?