# Getting Started {#getting-started}
```{r setup_getting_started, include=FALSE}
chap <- 2
lc <- 0
rq <- 0
# **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`**
# **`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
knitr::opts_chunk$set(tidy = FALSE, out.width = '\\textwidth')
# This bit of code is a bug fix on asis blocks, which we use to show/not show LC solutions, which are written like markdown text. In theory, it shouldn't be necessary for knitr versions <=1.11.6, but I've found I still need to for everything to knit properly in asis blocks. More info here:
# https://stackoverflow.com/questions/32944715/conditionally-display-block-of-markdown-text-using-knitr
library(knitr)
knit_engines$set(asis = function(options) {
if (options$echo && options$eval) knit_child(text = options$code)
})
# This controls which LC solutions to show. Options for solutions_shown: "ALL" (to show all solutions), or subsets of c('3-1', '3-2','3-3'), including the null vector c('') to show no solutions.
solutions_shown <- c('')
show_solutions <- function(section){return("ALL" %in% solutions_shown || section %in% solutions_shown)}
```
Before we can start exploring data in R, there are some key concepts to understand first:
1. What are R and RStudio?
1. How do I code in R?
1. What are R packages?
If you are already familiar with these concepts, feel free to skip to Section \@ref(nycflights13) below, which introduces some of the datasets we will explore in depth in this book. Much of this chapter is based on two sources, which you should feel free to use as references if you are looking for additional details:
1. Ismay's [Getting used to R, RStudio, and R Markdown](http://ismayc.github.io/rbasics-book) [@usedtor2016], which includes GIF screen recordings that you can follow along as you learn.
1. DataCamp's online tutorials. DataCamp is a browser-based interactive platform for learning data science and their tutorials will help facilitate your learning of the above concepts (and other topics in this book). Go to [DataCamp](https://www.datacamp.com/) and create an account before continuing.
## What are R and RStudio?
For much of this book, we will assume that you are using R via RStudio. First-time users often confuse the two. At its simplest:
* R is like a car's engine
* RStudio is like a car's dashboard
R: Engine | RStudio: Dashboard
:-------------------------:|:-------------------------:
<img src="images/engine.jpg" alt="Drawing" style="height: 200px;"/> | <img src="images/dashboard.jpg" alt="Drawing" style="height: 200px;"/>
More precisely, R is a programming language that runs computations, while RStudio is an *integrated development environment (IDE)* that provides an interface to R by adding many convenient features and tools. Just as having access to a speedometer, rearview mirrors, and a navigation system makes driving much easier, using RStudio's interface makes using R much easier.
Optional: For a more in-depth discussion on the difference between R and RStudio IDE, watch the following [DataCamp video (2m52s)](https://campus.datacamp.com/courses/working-with-the-rstudio-ide-part-1/orientation?ex=1).
### Installing R and RStudio
*If your instructor has provided you with a link and access to RStudio Server, then you can skip this section. If you don't know what RStudio Server is, then please continue.*
You will first need to download and install both R and RStudio (Desktop version) on your computer.
1. [Download and install R](https://cran.r-project.org/).
    + Note: You must do this first.
    + Click on the download link corresponding to your computer's operating system.
1. [Download and install RStudio](https://www.rstudio.com/products/rstudio/download3/).
    + Scroll down to "Installers for Supported Platforms"
    + Click on the download link corresponding to your computer's operating system.
Optional: If you need more detailed instructions on how to install R and RStudio, watch the following [DataCamp video (1m22s)](https://campus.datacamp.com/courses/working-with-the-rstudio-ide-part-1/orientation?ex=3).
### Using R via RStudio
Recall our car analogy from above. Much as we don't drive a car by interacting directly with the engine but rather by using elements of the car's dashboard, we won't be using R directly but rather through RStudio's interface. After you install R and RStudio on your computer, you'll have two new programs (also known as applications) that you can open. We will always work in RStudio and not R. In other words:
R: Do not open this | RStudio: Open this
:-------------------------:|:-------------------------:
<img src="https://cran.r-project.org/Rlogo.svg" alt="Drawing" style="height: 100px;"/> | <img src="https://www.rstudio.com/wp-content/uploads/2014/06/RStudio-Ball.png" alt="Drawing" style="height: 100px;"/>
After you open RStudio, you should see the following:
![](images/rstudio.png)
Watch the following [DataCamp video (4m10s)](https://campus.datacamp.com/courses/working-with-the-rstudio-ide-part-1/orientation?ex=5) to learn about the different *panes* in RStudio, in particular the *Console pane* where you will later run R code.
## How do I code in R? {#code}
Now that you're set up with R and RStudio, you are probably asking yourself "OK. Now how do I use R?" The first thing to note is that, unlike other software like Excel, STATA, or SAS that provide [point and click](https://en.wikipedia.org/wiki/Point_and_click) interfaces, R is an [interpreted language](https://en.wikipedia.org/wiki/Interpreted_language), meaning you have to enter commands written in R code; in other words, you have to program in R (we use the terms "coding" and "programming" interchangeably in this book).
While it is not required to be a seasoned coder/computer programmer to use R, there is still a set of basic programming concepts that R users need to understand. Consequently, while this book is not a book on programming, you will still learn just enough of these basic programming concepts needed to explore and analyze data effectively.
### Basic Programming Concepts Needed {#programming-concepts}
To introduce you to many of these basic programming concepts, we direct you to the following DataCamp online interactive tutorials. For each of the tutorials, we give a list of the basic programming concepts covered. Note that in this book, we will use a different font to distinguish computer code from regular text.
It is important to note that while these tutorials serve as excellent introductions, a single pass through them is insufficient for long-term learning and retention. The ultimate tools for long-term learning and retention are "learning by doing" and repetition, something we will have you do over the course of the entire book.
* From the [Introduction to R](https://www.datacamp.com/courses/free-introduction-to-r) course, complete the following chapters:
    + [Chapter 1 Intro to basics](https://campus.datacamp.com/courses/free-introduction-to-r/chapter-1-intro-to-basics-1?ex=1):
        + Console pane: where you enter commands.
        + Variables: where values are saved, and how to assign values to variables.
        + Data types: integers, doubles/numerics, logicals, characters.
    + [Chapter 2 Vectors](https://campus.datacamp.com/courses/free-introduction-to-r/chapter-2-vectors-2?ex=1):
        + Vectors: a series of values.
    + [Chapter 4 Factors](https://campus.datacamp.com/courses/free-introduction-to-r/chapter-4-factors-4?ex=1):
        + *Categorical data* (as opposed to *numerical data*) are represented in R as `factor`s.
    + [Chapter 5 Data frames](https://campus.datacamp.com/courses/free-introduction-to-r/chapter-5-data-frames?ex=1):
        + Data frames are analogous to rectangular spreadsheets: they are representations of datasets in R where the rows correspond to *observations* and the columns correspond to *variables* that describe the observations. We will revisit this later in Section \@ref(nycflights13).
* From the [Intermediate R](https://www.datacamp.com/courses/intermediate-r) course, complete the following chapters:
    + [Chapter 1 Conditionals and Control Flow](https://campus.datacamp.com/courses/intermediate-r/chapter-1-conditionals-and-control-flow?ex=1):
        + Testing for equality in R using `==` (and not `=`, which is typically used for assignment). Ex: `2+1 == 3` is correct, while `2+1 = 3` is not and produces an error.
        + Boolean algebra: `TRUE/FALSE` statements and mathematical operators such as `<` (less than), `<=` (less than or equal), and `!=` (not equal to).
        + Logical operators: `&` representing "and", `|` representing "or". Ex: `2+1 == 3 & 2+1 == 4` returns `FALSE` while `2+1 == 3 | 2+1 == 4` returns `TRUE`.
    + [Chapter 3 Functions](https://campus.datacamp.com/courses/intermediate-r/chapter-3-functions?ex=1):
        + Concept of functions: they take in inputs (called *arguments*) and return outputs.
        + You either manually specify a function's arguments or use the function's *defaults*.
This list is by no means an exhaustive list of all the programming concepts needed to become a savvy R user; such a list would be so large it wouldn't be very useful, especially for novices. Rather, we feel this is the bare minimum you need to know before you get started; the rest we feel you can learn as you go.
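
As a quick illustration of the concepts listed above, the following sketch can be run line by line in the Console. The object names (`age`, `heights`, `sizes`, `pets`, `add_tax`) and their values are made up for this example; they do not come from the DataCamp tutorials.

```{r}
# Variables: assign a value with <- and print it by typing its name
age <- 30L              # the L suffix makes this an integer
height <- 1.75          # a double (numeric)
is_student <- TRUE      # a logical
name <- "Ada"           # a character

# Vectors: a series of values of the same type, created with c()
heights <- c(1.75, 1.62, 1.80)
mean(heights)

# Factors: how R represents categorical data
sizes <- factor(c("small", "large", "small"))
levels(sizes)           # the distinct categories

# Data frames: rows are observations, columns are variables
pets <- data.frame(
  name = c("Rex", "Tom", "Nemo"),
  species = factor(c("dog", "cat", "fish")),
  weight_kg = c(30, 4.5, 0.1)
)
nrow(pets)              # number of observations
ncol(pets)              # number of variables

# Testing for equality uses ==, not =
2 + 1 == 3

# Logical operators: & is "and", | is "or"
2 + 1 == 3 & 2 + 1 == 4
2 + 1 == 3 | 2 + 1 == 4

# Functions: inputs (arguments) go in, an output comes back.
# Here `rate` has a default value that is used unless we override it.
add_tax <- function(price, rate = 0.1) {
  price * (1 + rate)
}
add_tax(100)              # uses the default rate
add_tax(100, rate = 0.2)  # manually specifies the rate
```

Try changing one value at a time and re-running the line to see how the output changes.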
### Tips on learning to code
Learning to code/program is very much like learning a foreign language: it can be very daunting and frustrating at first. However, just as with learning a foreign language, anybody can learn if they put in the effort and are not afraid to make mistakes. Lastly, there are a few useful things to keep in mind as you learn to program:
* **Computers are stupid**: You have to tell a computer everything it needs to do. Furthermore, your instructions can't have any mistakes in them, nor can they be ambiguous in any way.
* **Do not code from scratch**: Especially when learning a new programming language, it is often much easier to take existing code and modify it, rather than trying to write new code from scratch. So please take the code we provide throughout this book and play around with it!
## What are R packages? {#packages}
An R package is a collection of functions, data, and documentation that extends the capabilities of R. Packages are written by a worldwide community of R users. For example, among the many packages we will use in this book are the
* `ggplot2` package for data visualization in Chapter \@ref(viz)
* `dplyr` package for data wrangling in Chapter \@ref(wrangling)
However, there are two key things to remember about R packages:
1. *Installation*: Most packages are not installed by default when you install R and RStudio. You need to install a package before you can use it. Once you've installed it, you don't need to install it again unless you want to update it to a newer version.
1. *Loading*: Packages are not loaded automatically when you open RStudio. You need to load them every time you open RStudio.
### Package Installation
There are two ways to install an R package. For example, to install the `ggplot2` package:
1. In the Files pane:
    a) Click on "Packages"
    a) Click on "Install"
    a) Type the name of the package under "Packages (separate multiple with space or comma):" In this case, type `ggplot2`
    a) Click "Install"
1. Alternatively, in the Console pane run `install.packages("ggplot2")` (you must include the quotation marks).
Repeat this for the `dplyr` package.
**Note**: You only have to install a package once, unless you want to update an already installed package to the latest version. If you want to update a package to the latest version, then re-install it by repeating the above steps.
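
Because installation only needs to happen once, one common pattern is to check which packages are already installed and install only the missing ones. The following is a minimal sketch of that idea, not an official recipe; the object names `needed_pkgs`, `already_installed`, and `missing_pkgs` are ours. It relies on the base R function `requireNamespace()`, which checks whether a package is installed without attaching it to your session.

```{r eval=FALSE}
# Packages this chapter asks you to install
needed_pkgs <- c("ggplot2", "dplyr")

# TRUE for each package that is already installed
already_installed <- vapply(needed_pkgs, requireNamespace, logical(1), quietly = TRUE)
already_installed

# Install only the ones that are missing (does nothing if none are missing)
missing_pkgs <- needed_pkgs[!already_installed]
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

Running this a second time should install nothing, since both packages will then be found.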
### Package Loading
After you've installed a package, you can now load it using the `library()` command. For example, to load the `ggplot2` and `dplyr` packages, run the following code in the Console pane:
```{r, eval=FALSE}
library(ggplot2)
library(dplyr)
```
**Note**: You have to reload each package you want to use every time you open a new session of RStudio.
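
To see what loading actually changes, the sketch below uses the `tools` package, which ships with R and so needs no installation (our choice for illustration; the same idea applies to `ggplot2` or `dplyr`). Before `library()` is called, a function from an installed package can still be reached with the `package::function()` notation; after `library()`, its name can be used on its own.

```{r}
# Is the tools package loaded in this session yet?
"package:tools" %in% search()

# The :: notation reaches one function without loading the whole package
tools::file_ext("02-getting-started.Rmd")

# Load the package, then call the function by its name alone
library(tools)
"package:tools" %in% search()
file_ext("02-getting-started.Rmd")
```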
## Putting it all together {#nycflights13}
Let's put everything we've learned so far into practice and start exploring some real data! Data comes to us in a variety of formats, from pictures to text to numbers. Throughout this book, we'll focus on datasets that can be stored in a spreadsheet, as that is among the most common ways data is collected in many fields.
Let's first load all the packages needed for this chapter (this assumes you've already installed them). Read Chapter \@ref(packages) for information on how to install and load R packages. At the beginning of all subsequent chapters in this text, we'll always have a list of packages you should have installed and loaded.
```{r message=FALSE}
library(dplyr)
library(nycflights13)
library(knitr)
```
### nycflights13 Package
We likely have all flown on airplanes or know someone who has. Air travel has become an ever-present aspect in many people's lives. If you live in or are visiting a relatively large city and you walk around that city's airport, you see gates showing flight information from many different airlines. And you will frequently see that some flights are delayed because of a variety of conditions. Are there ways that we can avoid having to deal with these flight delays?
We'd all like to arrive at our destinations on time whenever possible. (Unless you secretly love hanging out at airports. If you are one of these people, pretend for the moment that you are very much anticipating being at your final destination.) Throughout this book, we're going to analyze data related to flights contained in the `nycflights13` package [@R-nycflights13]. Specifically, this package contains 5 datasets saved as "data frames" (see Section \@ref(code)) with information about all domestic flights departing from New York City in 2013, from either Newark Liberty International (EWR), John F. Kennedy International (JFK), or LaGuardia (LGA) airports:
* `flights`: information on all `r scales::comma(nrow(nycflights13::flights))` flights
* `airlines`: translation between two-letter IATA carrier codes and names (`r nrow(nycflights13::airlines)` in total)
* `planes`: construction information about each of `r scales::comma(nrow(nycflights13::planes))` planes used
* `weather`: hourly meteorological data (about `r nycflights13::weather %>% count(origin) %>% .[["n"]] %>% mean() %>% round()` observations) for each of the three NYC airports
* `airports`: airport names and locations
### flights Data Frame
We will begin by exploring the `flights` data frame that is included in the `nycflights13` package and getting an idea of its structure. Run the following code in your Console: it prints the contents of the `flights` data frame. Note that depending on the size of your monitor, the output may vary slightly.
```{r load_flights}
flights
```
Let's unpack this output:
* `A tibble: 336,776 x 19`: a `tibble` is a [kind of data frame](https://blog.rstudio.org/2016/03/24/tibble-1-0-0/#tibbles-vs-data-frames). This particular data frame has
    + `336,776` rows
    + `19` columns corresponding to 19 variables describing each observation
* `year month day dep_time sched_dep_time dep_delay arr_time` are different columns, in other words variables, of this data frame.
* We then have the first 10 rows of observations corresponding to 10 flights.
* `... with 336,766 more rows, and 11 more variables:` indicates that 336,766 more rows of data and 11 more variables could not fit on this screen.
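
The numbers in this printout can also be retrieved directly. The following is a minimal sketch using base R functions, assuming the `nycflights13` package is installed:

```{r}
library(nycflights13)

nrow(flights)    # number of rows (observations)
ncol(flights)    # number of columns (variables)
dim(flights)     # both at once: rows, then columns
names(flights)   # names of all variables, including those that did not fit on screen
```
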
Unfortunately, this output does not allow us to explore the data very well. Let's look at different tools to explore data frames.
### Exploring Data Frames {#explore-dataframes}
Among the many ways of getting a feel for the data contained in a data frame such as `flights`, we present three functions that take the data frame in question as their argument:
1. Using the `View()` function. We will use this the most.
1. Using the `glimpse()` function loaded via the `dplyr` package
1. Using the `kable()` function in the `knitr` package
**1. `View()`**:
Run `View(flights)` in your Console and explore this data frame in the resulting pop-up viewer. You should get into the habit of always `View`ing any data frames that come your way.
***
```{block lc3-2, type='learncheck'}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What does any *ONE* row in this `flights` dataset refer to?
- A. Data on an airline
- B. Data on a flight
- C. Data on an airport
- D. Data on multiple flights
```{asis lc3-2-solution, include=show_solutions('3-2')}
**Learning Check Solutions**
**`r paste0("(LC", chap, ".", (lc), ")")` What does any ONE row in this `flights` dataset refer to?** This is data on a flight. Not a flight path! Example:
* a flight path would be United 1545 to Houston
* a flight would be United 1545 to Houston *2013/1/1 at 5:15am*
```
***
By running `View(flights)`, we see the different *variables* listed in the columns and we see that there are different types of variables. Some of the variables like `distance`, `day`, and `arr_delay` are what we will call *quantitative* variables. These variables are numerical in nature. Other variables here are *categorical*.
Note that if you look in the leftmost column of the `View(flights)` output, you will see a column of numbers. These are the row numbers of the dataset. If you glance across a row with the same number, say row 5, you can get an idea of what each row corresponds to. In other words, this will allow you to identify what object is being referred to in a given row. This is often called the *observational unit*. The *observational unit* in this example is an individual flight departing New York City in 2013. You can identify the observational unit by determining what the *thing* is that is being measured in each of the variables.
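
If you prefer to stay in the Console, a single row can also be pulled out with square brackets. This sketch assumes the `nycflights13` package is installed; the object name `fifth_flight` is ours, and the choice of row 5 simply mirrors the example above.

```{r}
library(nycflights13)

# Rows go before the comma, columns after; leaving columns blank keeps them all
fifth_flight <- flights[5, ]
fifth_flight

# One observational unit: a single flight, described by all of the variables
nrow(fifth_flight)
ncol(fifth_flight)
```
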
**2. `glimpse()`**:
The second way to explore a data frame is using the `glimpse()` function that you can access after you've loaded the `dplyr` package. It provides us with much of the above information and more.
```{r}
glimpse(flights)
```
***
```{block lc3-3, type='learncheck'}
**_Learning check_**
```
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are some examples in this dataset of **categorical** variables? What makes them different than **quantitative** variables?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What do `int`, `dbl`, and `chr` mean in the output above?
```{asis lc3-3-solutions, include=show_solutions('3-3')}
**Learning Check Solutions**
**`r paste0("(LC", chap, ".", (lc - 1), ")")` What are some examples in this dataset of categorical variables? What makes them different than quantitative variables?**
Hint: Type `?flights` in the console to see what all the variables mean!
* Categorical:
    + `carrier` the company
    + `dest` the destination
    + `flight` the flight number. Even though this is a number, it's simply a label. Example: United 1545 isn't "less than" United 1714
* Quantitative:
    + `distance` the distance in miles
    + `time_hour` time
**`r paste0("(LC", chap, ".", (lc), ")")` What do `int`, `dbl`, and `chr` mean in the output above?**
* `int`: integer. Used to count things i.e. a discrete value. Ex: the # of cars parked in a lot
* `dbl`: double. Used to measure things. i.e. a continuous value. Ex: your height in inches
* `chr`: character. i.e. text
```
***
We see that `glimpse()` lists each variable on its own row, followed by the first few entries of that variable. In addition, the *data type* (see Section \@ref(programming-concepts)) of each variable is given immediately after its name inside `< >`. Here, `int` and `dbl` refer to quantitative variables. In contrast, `chr` refers to categorical variables. One more type of variable is given here with the `time_hour` variable: `dttm`. As you may suspect, this variable corresponds to a specific date and time of day.
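
The abbreviations shown by `glimpse()` correspond to what the base R function `class()` reports for each column. A short sketch, assuming the `nycflights13` package is installed:

```{r}
library(nycflights13)

class(flights$year)       # shown as <int> by glimpse()
class(flights$distance)   # shown as <dbl>
class(flights$carrier)    # shown as <chr>
class(flights$time_hour)  # shown as <dttm>; a date-time reports two classes

# The first class of every column at once
sapply(flights, function(column) class(column)[1])
```
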
**3. `kable()`**:
The final way to explore a data frame is using the `kable()` function from the `knitr` package. Let's explore the different carrier codes for all the airlines in our dataset two ways. Run both of these in your Console:
```{r eval=FALSE}
airlines
kable(airlines)
```
At first glance, the two outputs may not appear to differ much. However, we'll see later on, especially when using a tool for document production called [R Markdown](http://rmarkdown.rstudio.com/lesson-1.html), that the latter produces output that is much more legible.
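
To compare the two on a smaller scale, here is a sketch that prints only the first three rows both ways. It assumes the `nycflights13` and `knitr` packages are installed; the object name `first_airlines` is ours.

```{r}
library(nycflights13)
library(knitr)

first_airlines <- head(airlines, 3)   # first three rows only

first_airlines          # default printout
kable(first_airlines)   # the same rows laid out as a table
```

In the Console, `kable()` prints the table as plain text; inside an R Markdown document, the same call is rendered as a formatted table.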
### Help Files
Another nice feature of R is the help system. You can get help in R by entering a `?` before the name of a function or data frame in question and you will be presented with a page showing the documentation. For example, let's look at the help file for the `flights` data frame:
```{r eval=FALSE}
?flights
```
A help file should pop up in the Help pane of RStudio. Note the content of this particular help file is also accessible on the [web](https://cran.r-project.org/web/packages/nycflights13/nycflights13.pdf) on page 3 of the PDF document.
You should get in the habit of consulting the help file of any function or data frame in R about which you have questions.
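
Beyond `?`, a few other base R functions are handy for looking things up. The sketch below uses `sd()` purely as an example function (our choice); the second line assumes the `nycflights13` package is installed.

```{r eval=FALSE}
?sd                                         # same as help("sd")
help("flights", package = "nycflights13")   # help for an object in a specific package
args(sd)                                    # shows a function's arguments and their defaults
example(sd)                                 # runs the examples from the help file
```

`args()` is a quick way to check which arguments have defaults without opening the full help page.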
## Conclusion
We've given you what we feel are the most essential concepts to know before you can start exploring data in R. Is this chapter exhaustive? Absolutely not. To try to include everything in this chapter would make the chapter so large it wouldn't be useful! However, as we stated earlier, the best way to learn R is to learn by doing. Now let's get into learning about how to create good stories about and with data. In Chapter \@ref(viz), we start with what we feel is the most important tool in a data scientist's toolbox: data visualization.
<!--
### Data Packages
Some of the datasets we will analyze in this class are accessible via R packages. For example:
- flights leaving New York City in 2013 in the `nycflights13` package
- profiles of OKCupid users in San Francisco in the `okcupiddata` package
- IMDB movie ratings in the `ggplot2movies` package
By focusing on a few large data sources, it is our hope that you'll be able to see how each of the chapters is interconnected. You'll see how the data being "tidy" (See Chapter \@ref(tidy)) leads into data visualization and manipulation in exploratory data analysis and how those concepts tie into inference and regression.
We will keep a running list of R packages you will need to have installed to complete the analysis as well here in the `needed_pkgs` character vector. You can check if you have all of the needed packages installed by running all of the lines below in the next chunk of R code. The last lines including the `if` will install them as needed (i.e., download their needed files from the internet to your hard drive and install them for your use).
You can run the `library` function on them to load them into your current analysis. Prior to each analysis where a package is needed, you will see the corresponding `library` function in the text. Make sure to check the top of the chapter to see if a package was loaded there.
-->
```{r, echo=FALSE, warning=FALSE, message=FALSE, results='hide'}
# needed_pkgs <- c("nycflights13", "tibble", "dplyr", "ggplot2", "knitr",
# "okcupiddata", "dygraphs", "rmarkdown", "mosaic",
# "ggplot2movies", "fivethirtyeight", "readr")
#
# new.pkgs <- needed_pkgs[!(needed_pkgs %in% installed.packages())]
#
# if(length(new.pkgs)) {
# install.packages(new.pkgs, repos = "http://cran.rstudio.com")
# }
```