Skip to content

Latest commit

 

History

History
402 lines (343 loc) · 12.9 KB

Intro to Stats.md

File metadata and controls

402 lines (343 loc) · 12.9 KB

GETTING STARTED

First, set your working directory.

setwd("FILE PATH")

Or you can use the drop-down menu: Session -> Set Working Directory -> Choose Directory

We'll be using these libraries.

library(dplyr)
library(ggplot2)
library(janitor)
library(psych)
library(readxl)
library(stringr)
library(tidyr)
library(tidyverse)

We'll work with this data set, which has information on four-year colleges in the US:

Getting started

Read the .csv file into R. In R-speak, we're creating a data frame named colleges. It contains field names, so we include this: col_names = TRUE

colleges <- read_csv("colleges.csv", col_names = TRUE) 

View the data

view(colleges)

What if your file has millions of records and you want to see only the first few? Use head to call up the first rows.

head(colleges, 15)

We see the first several fields in the Console window (lower left of RStudio) but not all. So let's view all columns in first 15 rows.

view(head(colleges, 15))

This way also works.

head(colleges, 15) %>% view()
# A tibble: 15 × 17
   unitid instnm         city    state control ugrads accept_rate act_mid sat_avg
    <dbl> <chr>          <chr>   <chr> <chr>    <dbl>       <dbl>   <dbl>   <dbl>
 1 100654 Alabama A & M… Normal  AL    Public    5271       0.918      17     939
 2 100663 University of… Birmin… AL    Public   13328       0.737      26    1234
 3 100706 University of… Huntsv… AL    Public    7785       0.826      28    1319
 4 100724 Alabama State… Montgo… AL    Public    3750       0.969      17     946
 5 100751 The Universit… Tuscal… AL    Public   31900       0.827      27    1261
 6 100812 Athens State … Athens  AL    Public    2677      NA          NA      NA
 7 100830 Auburn Univer… Montgo… AL    Public    4407       0.904      21    1082
 8 100858 Auburn Univer… Auburn  AL    Public   24209       0.807      28    1300
 9 100937 Birmingham-So… Birmin… AL    Private   1205       0.538      26    1230
10 101116 South Univers… Montgo… AL    For-pr…    251      NA          NA      NA
11 101189 Faulkner Univ… Montgo… AL    Private   1928       0.783      21    1066
12 101435 Huntingdon Co… Montgo… AL    Private   1007       0.593      21    1076
13 101453 Heritage Chri… Floren… AL    Private     59       0.235      11      NA
14 101480 Jacksonville … Jackso… AL    Public    6743       0.548      21    1084
15 101541 Judson College Marion  AL    Private    245       0.372      19    1020
# … with 8 more variables: total_cost <dbl>, net_price <dbl>, ft_faculty <dbl>,
#   pell_grant <dbl>, first_gen <dbl>, default_3yr <dbl>, grad_rate <dbl>,
#   grad_debt <dbl>


Check out the structure of our data - which fields are numeric, character, etc.

str(colleges)

Descriptive statistics

We can get basic statistics for our variables using summary.

summary(colleges)

Note that this works only for numeric fields, not character. We can also look at a single variable.

summary(colleges$total_cost)

Or a single statistic for a single variable.

mean(colleges$total_cost) 

Wait, it returns NA. That's because some colleges have missing values. You have to exclude the null fields with na.rm = TRUE That's Rspeak for Remove NAs? Yes

mean(colleges$total_cost, na.rm = T) 

Other handy stats

median(colleges$total_cost, na.rm = T) # median = middle value
sd(colleges$total_cost, na.rm = T) # standard deviation
min(colleges$total_cost, na.rm = T) # minimum value, i.e. cheapest college
max(colleges$total_cost, na.rm = T) # maximum value, i.e. priciest college

Which gives us

> median(colleges$total_cost, na.rm = T) # median = middle value
[1] 32016
> sd(colleges$total_cost, na.rm = T) # standard deviation
[1] 16482.91
> min(colleges$total_cost, na.rm = T) # minimum value, i.e. cheapest college
[1] 9007
> max(colleges$total_cost, na.rm = T) # maximum value, i.e. priciest college
[1] 78555

Whoa, which college has the highest sticker price of $78,555?

which.max(colleges$total_cost)

[1] 340

It returns 340, which is the row number. Which college is that? This returns the data in that row.

colleges[340,]

Which gives us

> colleges[340,]
# A tibble: 1 × 17
  unitid instnm city  state control ugrads accept_rate act_mid sat_avg total_cost
   <dbl> <chr>  <chr> <chr> <chr>    <dbl>       <dbl>   <dbl>   <dbl>      <dbl>
1 144050 Unive… Chic… IL    Private   6801      0.0617      34    1528      78555
# … with 7 more variables: net_price <dbl>, ft_faculty <dbl>, pell_grant <dbl>,
#   first_gen <dbl>, default_3yr <dbl>, grad_rate <dbl>, grad_debt <dbl>

We can also return just the college name (column 2) in that row.

colleges[340,2]

Same result but we use the column name instead of number.

colleges[340,"instnm"]

Bonus tip: Put those commands together

colleges[which.max(colleges$total_cost),]

Try getting the same info for the cheapest college on your own.

Frequency counts and group statistics

What if you want to know how many colleges there are by control (public, private, for-profit)

colleges %>%
  group_by(control) %>%
  count() 
  
# A tibble: 3 × 2
# Groups:   control [3]
  control        n
  <chr>      <int>
1 For-profit   167
2 Private     1264
3 Public       580

Instead of count, we can use summarize. This is helpful if we want to build on this & add statistics like average, median, etc.

colleges %>%
  group_by(control) %>%
  summarize(freq = n()) 

Note that I called it "freq" but you can call it anything, like count or N

We can also look at the average total cost by type.

colleges %>%
  group_by(control) %>%
  summarize(freq = n(), avg = mean(total_cost, na.rm = T))
  
# A tibble: 3 × 3
  control     freq    avg
  <chr>      <int>  <dbl>
1 For-profit   167 30242.
2 Private     1264 43256.
3 Public       580 22327.

Look only at colleges in Tennessee

colleges %>%
  group_by(control) %>%
  filter(state == "TN") %>%
  summarize(freq = n(), avg = mean(total_cost, na.rm = T))
  
# A tibble: 3 × 3
  control     freq    avg
  <chr>      <int>  <dbl>
1 For-profit     3 25897.
2 Private       34 38152.
3 Public        10 23239.

Compare the average sticker price (total_cost) to net price, which is what students actually pay after grants & scholarships We'll also give our new fields more detailed names

colleges %>%
  group_by(control) %>%
  filter(state == "TN") %>%
  summarize(freq = n(),
            avg_net = mean(net_price, na.rm = T), 
            avg_total = mean(total_cost, na.rm = T),
            avg_default = mean(default_3yr, na.rm = T))
            
# A tibble: 3 × 5
  control     freq avg_net avg_total avg_default
  <chr>      <int>   <dbl>     <dbl>       <dbl>
1 For-profit     3  20338     25897.      0.101 
2 Private       34  19886.    38152.      0.0908
3 Public        10  13871.    23239.      0.0857

Correlations - how much two variables are related to each other

For that we use cor

cor(colleges$total_cost, colleges$net_price)

There's that NA again - we need to specify only complete pairs, i.e. both fields have values

cor(colleges$total_cost, colleges$net_price, use = "complete.obs")


[1] 0.7830763

The correlation is 0.78. What does that mean? Correlations range from -1 to 1.

  • A value of -1 means perfect negative correlation. As one variable goes up, the other goes down.
  • A value of 1 means perfect positive correlation. The two variables go up together.
  • A value of 0 means no correlation - there's no relationship between the two variables. So 0.78 shows a strong correlation between sticker price and net price. Note: This correlation is also known as Pearson's R. It's the default kind with cor()

We can also look at correlations across multiple variables

colnames(colleges)
cor(colleges[7:16], use = "complete.obs")
cor(colleges[7:16], use = "pairwise.complete.obs")

pairwise.complete.obs = Goes pair by pair and excludes missing variables. N will vary. complete.obs = Includes only cases with no missing values at all. N will be consistent.

Let's say we want to see what measures are most strongly correlated with grad_rate. Here's an excerpt from using complete.obs

             ft_faculty pell_grant  first_gen default_3yr  grad_rate
accept_rate -0.04393138  0.1460390  0.2248305  0.16469773 -0.3516391
act_mid      0.17450810 -0.7278785 -0.6772431 -0.67363509  0.8348351
sat_avg      0.18830115 -0.7324468 -0.6709702 -0.67957407  0.8325304
total_cost  -0.03964591 -0.4291535 -0.5164789 -0.41105102  0.5848145
net_price   -0.09809404 -0.4429567 -0.5323789 -0.33552365  0.4594331
ft_faculty   1.00000000 -0.1441360 -0.1880841 -0.04161433  0.1145687
pell_grant  -0.14413599  1.0000000  0.6806644  0.68847561 -0.6889457
first_gen   -0.18808414  0.6806644  1.0000000  0.51159820 -0.6656683
default_3yr -0.04161433  0.6884756  0.5115982  1.00000000 -0.7465700
grad_rate    0.11456872 -0.6889457 -0.6656683 -0.74657005  1.0000000

That would be the median ACT (act_mid) and average SAT scores of incoming students, at 0.83.

The weakest correlation with grad rates is the percent of full-time faculty, at 0.11.

We also want to know if the correlations are significant - did they happen by pure chance? The corr.test function from the psych package shows this. Again, it works only on numeric fields.

corr.test(colleges[11:16], use = "complete.obs")

Generally if you have a lot of cases, it will be statistically significant. What about a smaller slice of the data? Let's create a dataframe with just Tennessee colleges.

tn_colleges <- filter(colleges, state == "TN")
corr.test(tn_colleges[11:16], use = "complete.obs")

We see the correlations again, followed by probability values, also known as p-values. Anything less than 0.05 means it's statistically significant. That means there's less than 5 percent probability that we got these results by pure chance.

Correlation matrix
            net_price ft_faculty pell_grant first_gen default_3yr grad_rate
net_price        1.00      -0.33      -0.36     -0.50       -0.31     **0.39**
ft_faculty      -0.33       1.00      -0.17     -0.13        0.02      0.15
pell_grant      -0.36      -0.17       1.00      0.71        0.82     -0.81
first_gen       -0.50      -0.13       0.71      1.00        0.53     -0.76
default_3yr     -0.31       0.02       0.82      0.53        1.00     -0.76
grad_rate        0.39       0.15      -0.81     -0.76       -0.76      1.00
...

Probability values (Entries above the diagonal are adjusted for multiple tests.) 
            net_price ft_faculty pell_grant first_gen default_3yr grad_rate
net_price        0.00       0.19       0.12      0.01        0.21      **0.07**
ft_faculty       0.03       0.00       1.00      1.00        1.00      1.00
pell_grant       0.02       0.27       0.00      0.00        0.00      0.00
first_gen        0.00       0.41       0.00      0.00        0.00      0.00
default_3yr      0.04       0.92       0.00      0.00        0.00      0.00
grad_rate        0.01       0.32       0.00      0.00        0.00      0.00

Take the moderate correlation of 0.39 between grad rates and net price. It has a p-value of 0.07. That means there's a 7 percent chance we got those results by dumb luck.

VISUALIZE YOUR DATA: Histograms and scatter plots It helps to visualize your data, too. For one variable, draw a histogram.

hist(colleges$total_cost)

For two variables, make a scatter plot.

plot(colleges$total_cost, colleges$net_price)

You can also use pairs. This shows a matrix of columns 12 through 15.

pairs(colleges[12:15])

Or you can select the variables by name.

pairs(~ total_cost + net_price + grad_rate, data = colleges)

More powerful charts with the ggplot2 library

Scatterplots

ggplot(colleges, aes(x=total_cost, y=net_price)) + geom_point()

Change the point size

ggplot(colleges, aes(x=total_cost, y=net_price)) + 
  geom_point(size = 1)

Color by college type

ggplot(colleges, aes(x=total_cost, y=net_price, color = control)) + 
  geom_point(size = 1)

Do the same with Tennessee colleges only Color by college type

ggplot(tn_colleges, aes(x=total_cost, y=net_price, color = control)) + 
  geom_point(size = 1)

Label points with school ID. hjust & vjust reposition the label

ggplot(tn_colleges, aes(x=total_cost, y=net_price, color = control, 
                        label=unitid)) + 
  geom_point(size = 1) +
  geom_text(size = 2, hjust=0, vjust=0)

Histograms

ggplot(colleges, aes(x=total_cost)) +
  geom_histogram()

Change the bin width (based on total dollars)

ggplot(colleges, aes(x=total_cost)) +
  geom_histogram(binwidth=1000)

Change the colors

ggplot(colleges, aes(x=total_cost)) +
  geom_histogram(binwidth=1000, color="blue", fill="white")

Color by group

ggplot(colleges, aes(x=total_cost, color=control)) +
  geom_histogram(binwidth=1000, fill="white")