forked from nflverse/nflfastR
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
160 lines (124 loc) · 9.11 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/readme-"
)
```
# **nflfastR** <img src="man/figures/logo.png" align="right" width="25%" min-width="120px"/>
<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version-last-release/nflfastR)](https://CRAN.R-project.org/package=nflfastR)
[![CRAN downloads](http://cranlogs.r-pkg.org/badges/grand-total/nflfastR)](https://CRAN.R-project.org/package=nflfastR)
[![R build status](https://github.com/mrcaseb/nflfastR/workflows/R-CMD-check/badge.svg)](https://github.com/mrcaseb/nflfastR/actions)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
[![Support Server](https://img.shields.io/discord/591914197219016707.svg?color=7289da&label=Support&logo=discord&style=flat-square)](https://discord.com/invite/5Er2FBnnQa)
[![Twitter Follow](https://img.shields.io/twitter/follow/nflfastR.svg?style=social)](https://twitter.com/nflfastR)
<!-- [![Travis build status](https://travis-ci.com/mrcaseb/nflfastR.svg?branch=master)](https://travis-ci.com/mrcaseb/nflfastR) -->
<!-- ![GitHub release (latest by date)](https://img.shields.io/github/v/release/mrcaseb/nflfastR?label=development%20version) -->
<!-- badges: end -->
`nflfastR` is a set of functions to efficiently scrape NFL play-by-play data. `nflfastR` expands upon the features of nflscrapR:
* The package contains NFL play-by-play data back to 1999
* As suggested by the package name, it obtains games **much** faster
* Includes completion probability (`cp`), completion percentage over expected (`cpoe`), and expected yards after the catch (`xyac_epa` and `xyac_mean_yardage`) in play-by-play going back to 2006
* Includes drive information, including drive starting position and drive result
* Includes series information, including series number and series success
* Hosts [a repository of play-by-play data going back to 1999](https://github.com/guga31bb/nflfastR-data) for very quick access
* Features models for Expected Points, Win Probability, Completion Probability, and Yards After the Catch (see section below)
* Includes a function `update_db()` that creates and updates a database
We owe a debt of gratitude to the original [`nflscrapR`](https://github.com/maksimhorowitz/nflscrapR) team, Maksim Horowitz, Ronald Yurko, and Samuel Ventura, without whose contributions and inspiration this package would not exist.
## Installation
The easiest way to get nflfastR is to install it from [CRAN](https://cran.r-project.org/package=nflfastR) with:
```{r, eval=FALSE}
install.packages("nflfastR")
```
To get a bug fix or to use a feature from the development version, you can install the development version of nflfastR from [GitHub](https://github.com/mrcaseb/nflfastR/) with:
``` {r eval = FALSE}
if (!require("remotes")) install.packages("remotes")
remotes::install_github("mrcaseb/nflfastR")
```
## Usage
We have provided some application examples in the **[Getting Started](https://www.nflfastr.com/articles/nflfastR.html)** article. However, these require a basic knowledge of R. For this reason we have the **[nflfastR beginner's guide](https://www.nflfastr.com/articles/beginners_guide.html)**, which we recommend to all those who are looking for an introduction to nflfastR with R.
You can find column names and descriptions in the **[Field Descriptions](https://www.nflfastr.com/articles/field_descriptions.html)** article, or by accessing the `field_descriptions` dataframe from the package.
## Data repository
Even though `nflfastR` is very fast, **for historical games we recommend downloading the data from [here](https://github.com/guga31bb/nflfastR-data)**. These data sets include play-by-play data of complete seasons going back to 1999 and we will update them in 2020 once the season starts. The files contain both regular season and postseason data, and one can use game_type or week to figure out which games occurred in the postseason. Data are available as .csv.gz, .parquet, or .rds.
## nflfastR models
`nflfastR` uses its own models for Expected Points, Win Probability, Completion Probability, and Expected Yards After the Catch. To read about the models, please see [this post on Open Source Football](https://www.opensourcefootball.com/posts/2020-09-28-nflfastr-ep-wp-and-cp-models/). For a more detailed description of the motivation for Expected Points models, we highly recommend this paper [from the nflscrapR team located here](https://arxiv.org/pdf/1802.00998.pdf).
Here is a visualization of the Expected Points model by down and yardline.
``` {r epa-model, warning = FALSE, message = FALSE, results = 'hide', fig.keep = 'all', dpi = 600, echo=FALSE, eval = FALSE}
# This code was used to create the ep model image. Since we don't want to include
# the resulting png file in the package for file size reasons it was uploaded to
# the nflfastR repo and embedded remotely with the next chunk
library(tidyverse)
df <- map_df(2014:2019, ~{
readRDS(url(glue::glue('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_{.x}.rds'))) %>%
filter(!is.na(posteam) & !is.na(ep), !is.na(down)) %>%
select(ep, down, yardline_100, air_yards, pass_location, cp)
})
df %>%
ggplot(aes(x = yardline_100, y = ep, color = as.factor(down))) +
geom_smooth(size = 2) +
labs(x = "Yards from opponent's end zone",
y = "Expected points value",
color = "Down",
title = "Expected Points by Yardline and Down") +
theme_bw() +
scale_y_continuous(expand=c(0,0), breaks = scales::pretty_breaks(10)) +
scale_x_continuous(expand=c(0,0), breaks = seq(from = 5, to = 95, by = 10)) +
theme(
plot.title = element_text(size = 18, hjust = 0.5),
plot.subtitle = element_text(size = 16, hjust = 0.5),
axis.title = element_text(size = 18),
axis.text = element_text(size = 16),
legend.text = element_text(size = 16),
legend.title = element_text(size = 16),
legend.position = c(.90, .80)) +
annotate("text", x = 14, y = -2.2, size = 3, label = "2014-2019 | Model: @nflfastR")
```
```{r echo=FALSE, fig.align='center', fig.cap='', out.width='100%'}
knitr::include_graphics('https://github.com/mrcaseb/nflfastR/raw/master/man/figures/readme-epa-model-1.png')
```
Here is a visualization of the Completion Probability model by air yards and pass direction.
``` {r cp-model, warning = FALSE, message = FALSE, results = 'hide', fig.keep = 'all', dpi = 600, echo=FALSE, eval = FALSE}
# This code was used to create the cp model image. Since we don't want to include
# the resulting png file in the package for file size reasons it was uploaded to
# the nflfastR repo and embedded remotely with the next chunk
df %>%
filter(!is.na(cp), between(air_yards, -5, 45)) %>%
mutate(pass_middle = if_else(pass_location == "middle", "Yes", "No")) %>%
ggplot(aes(x = air_yards, y = cp, color = as.factor(pass_middle))) +
geom_smooth(size = 2) +
labs(x = "Air yards",
y = "Expected completion %",
color = "Pass middle",
title = "Expected Completion % by Air Yards and Pass Direction") +
theme_bw() +
scale_y_continuous(expand=c(0,0), breaks = scales::pretty_breaks(5)) +
scale_x_continuous(expand=c(0,0)) +
theme(
plot.title = element_text(size = 18, hjust = 0.5),
plot.subtitle = element_text(size = 16, hjust = 0.5),
axis.title = element_text(size = 18),
axis.text = element_text(size = 16),
legend.text = element_text(size = 16),
legend.title = element_text(size = 16),
legend.position = c(.80, .80)) +
annotate("text", x = 2, y = .32, size = 3, label = "2014-2019 | Model: @nflfastR")
```
```{r echo=FALSE, fig.align='center', fig.cap='', out.width='100%'}
knitr::include_graphics('https://github.com/mrcaseb/nflfastR/raw/master/man/figures/readme-cp-model-1.png')
```
`nflfastR` includes two win probability models: one with and one without incorporating the pre-game spread.
## Special thanks
* To Nick Shoemaker for [finding and making available JSON-formatted NFL play-by-play back to 1999](https://github.com/CroppedClamp/nfl_pbps) (`nflfastR` uses this source for 1999 and 2000 and previously also used it for 2001-2010)
* To Lau Sze Yui for developing a scraping function to access JSON-formatted NFL play-by-play beginning in 2001
* To Aaron Schatz and Football Outsiders for providing charting data to correctly mark scrambles in the 2005 season
* To Lee Sharpe for curating a resource for game information
* To Timo Riske, Lau Sze Yui, Sean Clement, and Daniel Houston for many helpful discussions regarding the development of the new `nflfastR` models
* To Zach Feldman and Josh Hermsmeyer for many helpful discussions about CPOE models as well as Peter Owen for many helpful suggestions for the CP model
* To Florian Schmitt for the logo design
* The many users who found and reported bugs in `nflfastR` 1.0
* And of course, the original [`nflscrapR`](https://github.com/maksimhorowitz/nflscrapR) team, Maksim Horowitz, Ronald Yurko, and Samuel Ventura, whose work represented a dramatic step forward for the state of public NFL research