499_assignment_3_moneyball.Rmd

<!-- 
This file by Martin Monkman 
is licensed under a Creative Commons Attribution 4.0 International License
https://creativecommons.org/licenses/by/4.0/  
-->

```{r echo = FALSE}
library(knitr)
opts_chunk$set(message = FALSE, warning = FALSE, cache = TRUE)
options(width = 100, dplyr.width = 100)
options(scipen=999)


```

# Assignment 3 - week 4 - Moneyball {-}


The book [_Moneyball: The Art of Winning an Unfair Game_](https://en.wikipedia.org/wiki/Moneyball) (and the movie of the same name) tells the story of the 2002 Oakland Athletics baseball team. That year, the team won 103 of its 162 games (a win-loss percentage^[Yes, I know this isn't a percentage but that's the term used!] of 0.636). Only the New York Yankees had a better record that Major League Baseball (MLB) season; the Yankees also won 103 but played one fewer due to a rain-out, so finished with a win-loss percentage of 0.640. 

Oakland had a payroll of $40 million (only Tampa Bay and Montreal, at $34.4 and $38.7 million, were lower). The Yankees, by contrast, had the highest payroll at $125.9 million, with a roster full of star players.

Another way to compare an individual team's payroll is relative to the Major League Baseball average. The Yankees payroll was 186% of the league average—for every 100 dollars the average team spent, the Yankees spent $186. The next closest team in spending was the Boston Red Sox, whose salary bill was 160% of the league average at $108 million. Oakland, by contrast, spent 59% of the league average—while the average team spent $100 dollars, the Athletics were constrained to spending $60.

And yet these two teams—one packed with highly-paid star players, and the other staffed with players that other teams had cast off—managed to have very similar and enormously successful seasons. 

It's a great story. Part of the drama is about how the Atheltics started using data analysis to identify those players who were undervalued in the market.

![_Moneyball movie poster_](static\img\moneyball-poster.jpg){width=25%}


Let's use the tidyverse packages and some elements in base R to explore the relationship between payroll spending and team success.

```{r setup_499, eval=FALSE}
library(tidyverse)
```


The first thing we'll do is read in the data file. For this we will use "mlb_pay_wl.csv".^[This data was compiled from [baseball-reference.com](https://www.baseball-reference.com/leagues/MLB/2019.shtml), one of the leading sites on the internet for historical professional baseball data.]

```{r}

mlb_pay_wl <- read_csv("data/mlb_pay_wl.csv")

```

This file has a record (observation) for each of the Major League Baseball (MLB) teams for the seasons 1999 through 2019. This is 630 rows of data. 

For each team season, the file has 

* the attendance per game ("attend_g")

* the estimated payroll ("est_payroll")

* the pay as a percent of the league ("pay_index"), where 100 is the league average for that year. A value of 110 would be 10 percent higher than the league average for the season; a value of 90 would be 10 percent lower.

* wins and losses ("w" and "l"); counts of the number of games won and lost in that season. Note that the Major League Baseball season is 162 games long, but the sum of wins and losses isn't always 162. Sometimes teams play fewer because of cancellations due to weather, and they might play an extra game at the end of the season to determine which team will advance to the playoffs.

* and the winning percentage ("w_l_percent") where 0.500 represents winning as many games as losing. The highest value in this period was the Seattle Mariners, whose win-loss "percentage" was 0.716 (71.6%) in 2001 (116 of 162 games). The worst team in the data was the Detroit Tigers in 2003, who _lost_ 119 of their 162 games (a percentage of 0.265).


```{r}
mlb_pay_wl
```

## Visualizing the data {-}

For our exercise, we will explore the relationship between payroll and wins. Does paying star players always lead to success? 

Or phrased another way, can we predict team success based on the team payroll? In this case, the win-loss percentage is the _dependent_ variable, while team payroll (as a percent of the league average) is the _independent_ variable.

First, let's create a scatterplot to visualize the relationship. We will put payroll on the X axis, and win-loss percent on the Y.

```{r}

ggplot(mlb_pay_wl, aes(x = pay_index, y = w_l_percent)) + 
  geom_point()

```

That's a lot of points on our plot. To help us navigate, in the plot below there are two lines on the plot. 

The red line runs vertically at the "1" point on the X axis. The teams to the left of the line spent below the league average for that season; the teams to the right spent more. As you can see, there have been cases when some teams spent twice as much as the league average.

The blue line runs horizontally at the "0.5" point on the Y axis. Above this line, the teams won more games than they lost. Below the line, they lost more games than they won.


```{r}

ggplot(mlb_pay_wl, aes(x = pay_index, y = w_l_percent)) + 
  geom_point() +
  geom_vline(xintercept  = 100, colour = "red", size = 2) +
  geom_hline(yintercept = 0.5, colour = "blue", size = 2)

```

Based on a quick glance, there seems to be more high payroll teams in the top right quadrant (high payroll plus winning season) than bottom right (high payroll plus losing season). By contrast, there might be a few more low payroll teams with losing seasons than low payroll teams with winning seasons.

Let's use our regression modeling skills to see if there's a consistent relationship with paying star players lots of money, and winning lots of games.

First, we can visualize the relationship using the `geom_smooth(method = lm)` function. The "method = lm" part refers to a linear model—that is to say, a regression model.

```{r}

ggplot(mlb_pay_wl, aes(x = pay_index, y = w_l_percent)) + 
  geom_point() +
  geom_smooth(method = lm)

```

## The regression model {-}

We can use the `lm()` method to create our regression statistics.

```{r}
# run the regression model, assign the output to an object
model_mlb_pay_wl <- lm(w_l_percent ~ pay_index, data = mlb_pay_wl)

# a glance at the regression results
model_mlb_pay_wl

```
The regression equation is the same as an equation for defining a line in Cartesian geometry:

$y = b_{0} + b_{1}x_{1} + e$

How to read this:

* the value of `y` (i.e. the point on the `y` axis) can be predicted by

* the intercept value $b_{0}$, plus

* the coefficient $b_{1}$ times the value of `x`

The distance between that predicted point and the actual point is the error term.


The equation calculated by the `lm()` function, to create the model `model_mlb_pay_wl`, is:

```{r, echo=FALSE}
# The {equatiomatic} is no longer on CRAN
# - it can be installed from github:
# - https://datalorax.github.io/equatiomatic/ 
#remotes::install_github("datalorax/equatiomatic")

equatiomatic::extract_eq(model_mlb_pay_wl, 
                         use_coefs = TRUE,
                         coef_digits = 4)
```


This means that any predicted value of `y` (that is, the y-axis value of any point on the line) is equal to 

`r round(summary(model_mlb_pay_wl)$coefficients[1, 1], 2)` + 
`r round(summary(model_mlb_pay_wl)$coefficients[2, 1], 4)` * the value of `x`.

If `x` = 150, then `y` = `r round(summary(model_mlb_pay_wl)$coefficients[1, 1], 2)` + (`r round(summary(model_mlb_pay_wl)$coefficients[2, 1], 4)` * 150), which is `r round(summary(model_mlb_pay_wl)$coefficients[1, 1], 2) + (round(summary(model_mlb_pay_wl)$coefficients[2, 1], 4) * 150)`. 

Look back at the line of best fit—if `x` = 150, the `y` point on the line is just above the 0.5 mark. 

Or another way to think about it: in a 162 game season, a team spending 50% more on payroll than the league average can expect to win 54% (0.54) of their games. This works out to 87 games (162 * 0.54).


We can get more details from the statistical model by using the `summary()` function, including the P value and the R-squared value:

```{r}

summary(model_mlb_pay_wl)

```


## Questions  {-}

### 1. Moneyball  {-}

(4 marks)

Review the outputs from the model summary above. 

  a. Is this model statistically significant? What is the P value?

  b. What is the R-squared value of this model? What does it tell us about the strength of the relationship between these two variables?

  c. If you were managing the budget of a MLB team, would you feel confident telling the shareholders that spending money on star players would turn into success on the field? Explain your answer.


### 2. Does winning lead to bigger crowds?  {-}

(8 marks)

In any sport, successful teams draw bigger crowds. If we were to model this relationship, we would say attendance is the _dependent_ variable, while winning is the _independent_ variable.

  a. Use the data in the "mlb_pay_wl.csv" file to visualize the relationship between team wins (or winning percentage) and attendance.

   b. Use the data in the "mlb_pay_wl.csv" file to create a regression model that evaluates the relationship between team success on the field and attendance. Display the results in your output.

   c. Describe and analyze the coefficients, the R-squared value, and the P value. Can we say that a successful team draw bigger crowds?


### 3. Data organization in spreadsheets  {-}

(4 marks)

Read the following article: 

Karl W. Broman and Kara H. Woo, "Data Organization in Spreadsheets", _The American Statistician_, Vol 72, Issue 1: Special Issue on Data Science, 2018

* it is available from this link: https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989


 a.  What are basic principles for using spreadsheets for good data organization?

 b.  What are good approaches for handling dates in spreadsheets?


## Bonus Mark!  {-}
(1 mark)

Imagine that time travel is possible...
 
You have just been transported back four weeks to the day before the course started. What is one thing you would tell your past self about this course?


-30-