---
title: "Lab 2: Basics of R, again"
author: "Fahd Alhazmi"
output:
  html_document:
    toc: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```
# Core functions you need to know and probably share with 5 of your friends
Let’s talk more about functions. Most of what you’ll do in R comes down to learning how to use functions; you’ll rarely need to write things from scratch. Let’s look at some common functions that we’ll use every now and then.
## Housekeeping: install.packages(), library(), getwd(), setwd()
In R, we often deal with packages (sets of functions that others wrote to make our lives easier). To use a package, we first need to install it on our computer. To do this, we use `install.packages(x)`, where x is the name of the package in quotes. Let’s say we want to install a package called "babynames". We can simply type:
```{r, eval=FALSE}
install.packages("babynames")
```
And you’ll soon see some colored output from R telling you how things are going.
Now that we have installed the package on our computer, we need to load it into R before we can start using it. We use `library` to do this.
```{r, eval=FALSE}
library("babynames")
```
This command will make the functions (or data) in this package available for us to use.
Handling working directories in R can be tricky, but let’s prepare for the worst and hope for the best. In R, there is this notion of a “working directory”: the location in your file system that R is currently running from. Later, we’ll need to read and write files, so we'll need to handle the file system properly. To do that, we first need to know where we are so that we can point to the right file. I can’t give you directions to the Empire State Building if I don’t know your starting point.
We’ll use `getwd()` to “get the working directory” that R recognizes. Files in this working directory can be read simply by using their names. For example, let’s say I want to read a table called `names.csv` (the following code will not run; it is just for illustration):
```{r, eval=FALSE}
data <- read.csv('names.csv')
```
It will only work if there is a table called `names.csv` in the current working directory. How do we find out what that directory is? Here we use
```{r, eval=FALSE}
getwd()
```
More often, we want to change that working directory. Let’s say that the table `names.csv` is in another sub-sub-sub-sub-sub-sub directory. In this case, you will need to read the table using the path to the file, starting from the working directory:
```{r, eval=FALSE}
data<- read.csv('folder/subfolder/subsubfolder/subsubsubfolder/names.csv')
```
That’s because R is looking at Franklin Ave station while you are referring to a building at 34th Street station. To teleport from one directory to another, we can use `setwd()` to change the working directory:
```{r, eval=FALSE}
setwd('folder/subfolder/subsubfolder/subsubsubfolder')
```
This will make the subsubsubfolder our new working directory and now we can simply call the file `names.csv` without using any folders before the name.
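As a rough illustration of the idea above, the sketch below creates a throwaway folder with `tempdir()` and writes a small made-up `names.csv` into it. The folder name `lab2_demo` and the file contents are assumptions for this example only, not part of the lab data. It then switches into that folder, reads the file by its bare name, and switches back.

```{r}
# Remember where we started so we can come back later
old_dir <- getwd()

# Make a throwaway folder and put a small, made-up names.csv in it
demo_dir <- file.path(tempdir(), "lab2_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(name = c("Roy", "Tania"), age = c(35, 23)),
          file.path(demo_dir, "names.csv"), row.names = FALSE)

# Teleport into the folder: now the bare file name is enough
setwd(demo_dir)
demo_data <- read.csv("names.csv")

# Go back to where we were
setwd(old_dir)
demo_data
```

Saving the old directory first is a small habit that keeps the rest of a script from silently running in the wrong place.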
This becomes very important as we dive deeper into reading and writing data. For now, however, you can just move on.
## Statistical functions: mean, median, sum, var, sd, max, min, table, sort, unique
R is a statistical language and it has tons of functions for all sorts of statistics. Let’s discuss a few functions and explore the rest as we progress in this course. Those functions usually require a set of numbers as input. Let’s define a set of numbers as we did last time (we called it a vector):
```{r}
numbers <- c(4,3,6,6,3,3,5,3,2,7,3,5,6,3,2,1,5,6,2,12,3,4,5,5)
```
Calculate the sum of those numbers:
```{r}
sum(numbers)
```
Calculate the length of this list of numbers:
```{r}
length(numbers)
```
We can calculate the mean of those numbers by dividing the sum by the length:
```{r}
sum(numbers) / length(numbers)
```
Or we can simply use `mean`
```{r}
mean(numbers)
```
We can also find the median
```{r}
median(numbers)
```
Or the variance of those numbers
```{r}
var(numbers)
```
And from the variance, we can find the standard deviation using another function: the square root
```{r}
sqrt(var(numbers))
```
Or we can use `sd`
```{r}
sd(numbers)
```
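To see what `var` and `sd` are doing, here is a minimal sketch that computes both by hand on a small made-up vector (`demo_numbers` is an assumption for this example). Note that R's `var` divides by `n - 1` rather than `n`, which is the usual sample variance.

```{r}
# A small made-up vector, just for this illustration
demo_numbers <- c(2, 4, 4, 6)

# Step 1: how far is each number from the mean?
deviations <- demo_numbers - mean(demo_numbers)

# Step 2: square the distances, add them up, and divide by n - 1
manual_var <- sum(deviations^2) / (length(demo_numbers) - 1)

# Step 3: the standard deviation is the square root of the variance
manual_sd <- sqrt(manual_var)

# Both comparisons should print TRUE
all.equal(manual_var, var(demo_numbers))
all.equal(manual_sd, sd(demo_numbers))
```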
We can also use sort to sort those numbers from lowest to highest:
```{r}
sort(numbers)
```
or from highest to lowest by changing the value of the `decreasing` input:
```{r}
sort(numbers, decreasing=TRUE)
```
Remember, sort returns another vector that we can easily play with. For example, we can use the result of sort to find the minimum number in the list by selecting the first element after the sorting:
```{r}
sort(numbers)[1]
```
Or the maximum number
```{r}
sort(numbers, decreasing=TRUE)[1]
```
Notice that we can use `length(numbers)` instead of `1` as the index to pick the last element of a vector. For example, `sort(numbers)[length(numbers)]` also gives the maximum.
We can instead use `max` and `min` functions to get the same results:
```{r}
max(numbers)
```
```{r}
min(numbers)
```
Let’s see one more function: `unique` which returns a list of unique elements in a given set of numbers. We make use of this function all the time.
```{r}
numbers <- c(1,1,1,2,2,2,3,3,3,4,4,4)
unique(numbers)
```
Now, I am going to give you a function and ask you to guess what this function is doing!
```{r}
table(numbers)
```
Do you have any ideas? Well, how do we know what functions do and where to read help? We can type a question mark before the name of the function which will give us a readable explanation of what that function is doing with examples and free donuts.
```{r, eval=FALSE}
?table
```
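If you want to check your guess, here is a small sketch on a made-up vector (`demo_answers` is an assumption for this example): `table` counts how many times each distinct value appears.

```{r}
# A made-up vector of answers, only for illustration
demo_answers <- c("yes", "no", "yes", "yes", "no", "maybe")

# table() counts how many times each distinct value appears
demo_counts <- table(demo_answers)
demo_counts

# The counts can be looked up by name and sorted like any other vector
demo_counts["yes"]
sort(demo_counts, decreasing = TRUE)
```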
## Variable Information functions: length, class, is.numeric, as.numeric, is.character, as.character
We have already talked about `length` but we have a few more functions that are designed to manipulate variables or test specific things about those variables. For example, we can use `class` to find the recognized type of any variable:
```{r}
class(numbers)
```
And we can use `is.numeric` to ask whether R recognizes a variable as numeric
```{r}
is.numeric(numbers)
```
There is also `as.numeric`, which converts a given convertible variable to its numeric form. For example, let’s say we have:
```{r}
numbers_in_char_form <- c('100', '-100', '2.5')
```
Now we do recognize those as numerals, but R stores them as characters. We can see that with `class`
```{r}
class(numbers_in_char_form)
```
which gives us the character type, meaning that we can’t really do any calculations on them. Have you ever divided your name by your height? How do we tell R that those are actually numerals and convert them to numbers? Using `as.numeric`
```{r}
as.numeric(numbers_in_char_form)
```
And now we have those numbers in a numeric form. We will need this later when we get to know different classes of variables.
Similarly, we have `as.character()` and `is.character()` to do the same with character data.
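One caveat about `as.numeric` is worth a quick sketch: when an entry cannot be read as a number, R returns `NA` for it and prints a warning. The vector `demo_mixed` below is made up for this example.

```{r}
# A made-up character vector where one entry is not a number
demo_mixed <- c("100", "2.5", "ten")

# as.numeric() converts what it can; "ten" becomes NA (R also prints a warning,
# which we silence here with suppressWarnings)
demo_converted <- suppressWarnings(as.numeric(demo_mixed))
demo_converted

# is.na() tells us which entries failed to convert
is.na(demo_converted)

# Going the other way always works
as.character(c(100, 2.5))
```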
## Custom functions: seq, rep
Now there are still some functions that we’ll use every now and then. Take `seq`, short for sequence.
```{r}
seq(from=10, to=1000, by=2)
```
It makes a list of numbers in a given range, here from 10 to 1000 in steps of 2. We also have `rep`, short for repeat.
```{r}
rep(c(1,2,3), each=3)
```
Which repeats a given sequence a given number of times (hint: try `times=3` instead of `each=3` and see what happens).
So now that we have learned about a few functions, you should ask: how do you know whether a function for what you need actually exists? Nobody really knows them all; we use Google, and so should you. However, things are less painful if you keep a common R cheatsheet as a reference for the functions at your disposal (but you’ll google it anyway, so why bother?). I personally use those cheatsheets just to assess how much I know about R’s core functions. You’ll probably need about 10% of the functions in the cheatsheets, but you still want to be friends with them.
# Jump from vectors to matrices
A vector is just a set of numbers. In real life we often deal with several sets of numbers arranged in rows and columns, usually called a table or a matrix. In this section we’ll see how to create a matrix and how to select specific elements in its rows and columns.
Let’s create a simple matrix:
```{r}
my_matrix <- matrix(seq(from=1,to=20,by=1), nrow=5,ncol=4)
```
Which should create a matrix numbered from 1 to 20 with 5 rows and 4 columns.
```{r}
my_matrix
```
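Notice that the numbers run down the first column before moving on to the second. A small sketch of that behaviour, using made-up matrices (`demo_by_col` and `demo_by_row` are names chosen for this example) and the `byrow` argument of `matrix`:

```{r}
# Default: numbers go down the first column, then the second, and so on
demo_by_col <- matrix(1:6, nrow = 2, ncol = 3)
demo_by_col

# byrow = TRUE: numbers go across the first row, then the second
demo_by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
demo_by_row
```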
Now that we have a matrix, what can we do with it? A matrix is simply a table of numbers. Let’s see some useful functions that help us handle a matrix (or any table). Meet `dim` which prints the dimensions of the matrix, or the number of rows followed by the number of columns.
```{r}
dim(my_matrix)
```
Now, `sum` will return the sum of the whole matrix
```{r}
sum(my_matrix)
```
To find the sum of rows or columns separately, we need to use special functions: `rowSums` which returns a list of the sum of each row, and `colSums` which does the same with columns.
```{r}
rowSums(my_matrix)
```
```{r}
colSums(my_matrix)
```
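As an aside, the same sums can be obtained with the more general `apply` function, which runs any function over the rows or the columns of a matrix. The sketch below uses a small made-up matrix (`demo_matrix`) so that it stands on its own.

```{r}
demo_matrix <- matrix(1:6, nrow = 2, ncol = 3)

# 1 means "go row by row", 2 means "go column by column"
demo_row_sums <- apply(demo_matrix, 1, sum)
demo_col_sums <- apply(demo_matrix, 2, sum)

# Same answers as the dedicated functions (both should print TRUE)
all.equal(demo_row_sums, rowSums(demo_matrix))
all.equal(demo_col_sums, colSums(demo_matrix))

# apply() also works with other functions, e.g. the largest value in each row
apply(demo_matrix, 1, max)
```

`rowSums` and `colSums` are the simpler choice for sums; `apply` is handy when there is no dedicated function for what you need.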
## Row and Column Indexing
Let’s now see how we can select specific rows and columns. To select a specific row, let’s say 3rd row, we simply need to type the position inside the bracket:
```{r}
my_matrix[3,]
```
And to select multiple rows, we can type those rows inside a `c()`
```{r}
my_matrix[c(1,3),]
```
Which will return the first and third row of the table. Let’s now select the 2nd and 4th column
```{r}
my_matrix[,c(2,4)]
```
The only thing that has changed is the position of the index, which now comes after the comma. So anything before the comma indexes the rows, and anything after the comma indexes the columns.
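The two positions can also be combined, and a minus sign drops rows or columns instead of keeping them. A short sketch on a made-up matrix (`demo_grid` is a name chosen for this example):

```{r}
demo_grid <- matrix(1:12, nrow = 3, ncol = 4)

# Row AND column together: the single element in row 2, column 3
demo_grid[2, 3]

# A minus sign drops rows or columns instead of keeping them
demo_grid[-1, ]        # everything except the first row
demo_grid[, -c(1, 2)]  # everything except the first two columns
```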
We can also use a logical index. For example, let’s select only the first two rows. First, we need to create an index (or a sequence) of all rows:
```{r}
row_index <- 1:5 # we can also use seq(from=1, to=5, by=1)
```
Now, we want to retrieve the first two rows using logical indexing. To select the first 2 numbers:
```{r}
row_index < 3
```
And we can use that as our index:
```{r}
my_matrix[row_index < 3,]
```
Which will return the first two rows, because only the first two numbers evaluate to `TRUE`.
We can use logical indexing to filter records from large tables, and you'll definitely make use of it all the time.
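Here is a rough sketch of that kind of filtering, where the condition is based on the values in the table rather than on the row numbers. The matrix `demo_scores` and the cutoff of 25 are made up for this example.

```{r}
demo_scores <- matrix(c(10, 55, 30, 80,   # first column
                        1,  2,  3,  4),   # second column
                      nrow = 4, ncol = 2)

# TRUE for rows whose first-column value is above 25
keep_rows <- demo_scores[, 1] > 25
keep_rows

# Keep only those rows
demo_scores[keep_rows, ]
```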
## Matrices to DataFrames
Now let’s talk about actual data that you might find in the real-world. First of all, we can see that the matrix is already some kind of data, but it lacks labels and names. So let’s make that into a table with names using `data.frame`:
```{r}
df <- data.frame(my_matrix)
df
```
Now, we see that our table has column headings and row numbers. To clean things up a little bit, we can use the `names` function to change the column names into some fictional names (sorry real-world):
```{r}
names(df) <- c('age', 'sex', 'day', 'time')
df
```
Which should improve things for us. Do you know why? Because we can now select columns by their names, instead of by their positions as we did in matrices. To select the first column (i.e., age), we used this before:
```{r}
df[,c(1)]
```
But now in the new world of data.frames, we can do this:
```{r}
df[,c('age')]
```
Or we can simply write the $ sign and then the column name:
```{r}
df$age
```
All those ways go to Rome. But we aren’t really going there, so I'll stick with the $ notation to select single columns. If we want to select multiple columns, then we can do either one of the first two options (by position or by name).
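A small sketch of the multiple-column case, on a made-up data frame (`demo_df` and its values are assumptions for this example):

```{r}
# A made-up data frame, only for illustration
demo_df <- data.frame(age = c(35, 23, 28), day = c(1, 2, 3), time = c(9, 14, 11))

# Two columns by name, and the same two columns by position
demo_df[, c('age', 'time')]
demo_df[, c(1, 3)]

# A single column comes back as a plain vector, whichever way we ask for it
identical(demo_df$age, demo_df[, 'age'])
```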
We can use all the functions we learned about: `dim`, `nrow`, `ncol`, etc. We also have a few more functions to learn about. Let’s use fictional data:
```{r}
x <- data.frame(student_name=c('Roy','Tania','Sara'),
                age=c(35, 23, 28),
                sex=c('m','f','f'))
x
```
We can select the age column and deal with it as a list of numbers:
```{r}
x$age
```
And this means we can filter rows based on age. For example, let’s use logical indexing for rows where age is bigger than 25:
```{r}
x$age > 25
```
And we can use that (with potentially any other conditions) to filter rows:
```{r}
x[x$age>25, ]
```
Let’s find the name of the students whose age is bigger than 25.
```{r}
x$student_name[x$age > 25]
```
Or we can simply type
```{r}
x[x$age > 25, 'student_name']
```
See how in the row section we used a filter and in the column (after the comma) we selected a specific column. We can also do:
```{r}
x[x$age > 25,]$student_name
```
All those are valid ways of filtering and selecting elements in our table.
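Conditions can also be combined, using `&` for "and" and `|` for "or". A minimal sketch, rebuilding the same fictional students under a different name (`demo_students`) so the chunk stands on its own:

```{r}
# Same made-up students as above, rebuilt here so the chunk stands on its own
demo_students <- data.frame(student_name = c('Roy', 'Tania', 'Sara'),
                            age = c(35, 23, 28),
                            sex = c('m', 'f', 'f'))

# & means AND: older than 25 and female
demo_students[demo_students$age > 25 & demo_students$sex == 'f', ]

# | means OR: younger than 25 or male
demo_students[demo_students$age < 25 | demo_students$sex == 'm', 'student_name']
```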
Now that we have explored this fake dataset, let’s see some real data.
We’ll deal with a baby names dataset from the U.S. Social Security Administration that tracks the popularity of individual baby names. To get the data, we’ll install a package and then use the `library` command to load it.
```{r, eval=TRUE}
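# Install babynames only if it is missing (so knitting does not reinstall it every time), then load it
if (!requireNamespace('babynames', quietly = TRUE)) install.packages('babynames')
library(babynames)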
```
We first want to look at the first few rows to see what we have:
```{r}
head(babynames)
```
We have 5 columns: year, sex, name, n (which I assume is the number of babies with that name at the given year and sex) and prop (i.e., proportion).
We can also look at the last few rows using `tail`
```{r}
tail(babynames)
```
## Selecting rows (also known as: filtering)
Just like in matrices, we can filter rows in a dataframe using logical indexing. For example, let's filter only records of 2017:
```{r}
babynames [ babynames$year==2017 , ]
```
Which we can deal with as another table. We can simply ask how many records do we have by using `nrow` or `dim`:
```{r}
nrow(babynames [ babynames$year==2017 , ])
```
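Another way to count matching records is to add up the logical index itself, since R treats `TRUE` as 1 and `FALSE` as 0. A quick sketch on a made-up table (`demo_records` is an assumption for this example, not the real data):

```{r}
# A made-up table with a year column, only for illustration
demo_records <- data.frame(year = c(2016, 2017, 2017, 2015, 2017),
                           n = c(10, 20, 30, 40, 50))

# Option 1: filter first, then count the rows that are left
nrow(demo_records[demo_records$year == 2017, ])

# Option 2: TRUE counts as 1 and FALSE as 0, so sum() counts the matches
sum(demo_records$year == 2017)
```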
Let's make a very simple plot of name frequencies across all years. We will use a function called `plot`, which requires values for the x-axis and the y-axis. Both x and y should be lists of numbers. For example, we will plot the frequency of `Sarah` across all years:
```{r}
result_sarah_n <- babynames$n[babynames$name=='Sarah' & babynames$sex == 'F']
result_sarah_year <- babynames$year[babynames$name=='Sarah' & babynames$sex == 'F']
```
Now we are ready to use `plot`
```{r}
plot(result_sarah_year, result_sarah_n, type='l')
```
Did you like that simple plot? We'll do more plotting next lab and it won't be this ugly, but now let's master the basics.
I want to know the most popular names in the year 1989. How should we approach this? Here we will combine a lot of what we have learned previously: `sort` with logical indexing. To find the most frequent name in 1989, we first need to filter the data down to 1989:
```{r}
data_subset <- babynames[babynames$year == 1989,]
```
Now, we'll sort the counts in `n` with `decreasing = TRUE` and select the first element (or we can use `max(data_subset$n)`):
```{r}
most_freq_n <- sort(data_subset$n, decreasing = TRUE)[1]
most_freq_n
```
Now, we will look for the records whose `n` equals what we just got and print those records:
```{r}
data_subset[ data_subset$n == most_freq_n , ]
```
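To go from the single most frequent name to a top 5, one possible approach is `order`, which returns row positions instead of sorted values. The sketch below uses made-up names and counts (`demo_names` is NOT real babynames data) so that it runs on its own.

```{r}
# Made-up names and counts (NOT real babynames numbers), only for illustration
demo_names <- data.frame(name = c('Ava', 'Ben', 'Cy', 'Dee', 'Eli', 'Fay', 'Gus'),
                         n = c(120, 300, 45, 210, 90, 500, 10))

# order() gives the row positions that would sort n from highest to lowest
row_order <- order(demo_names$n, decreasing = TRUE)

# Reorder the rows, then keep the first five with head()
demo_top5 <- head(demo_names[row_order, ], 5)
demo_top5
```

The same pattern should carry over to the real subset, for example `head(data_subset[order(data_subset$n, decreasing = TRUE), ], 5)`. Keep in mind that this mixes male and female names unless you also filter by `sex`.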
If you want to read a few good articles about this dataset, here are a few links:
* [A couple of cool articles using this dataset](https://www.prooffreader.com/category/baby-names/)
* [Kaggle also has a few good explorations. Although some are in Python, they can still inspire ideas for analyses you could run. For example, what is the effect of U.S. presidents on baby names? What about athletes?](https://www.kaggle.com/kaggle/us-baby-names/kernels?sortBy=voteCount&group=everyone&pageSize=20&datasetId=13)
* [Here is an interesting article on the most gender neutral names in the U.S.](http://www.randalolson.com/2014/12/06/top-25-most-gender-neutral-names-in-the-u-s/)
We'll continue with this dataset later on -- hopefully after you skim through those links.
# Exercise
* Write a script that computes the mean of each column in a matrix (without using the function `mean`), and compare your result with `colMeans`.
* Using the `babynames` dataset, do the following (and you are free to use any function now):
    + How many records (i.e., rows) do we have for female baby names in 1950?
    + What is the most popular male name in 2010? What about the female name?
    + Extract the frequency of the name "Mohammed" across all years and then use the `plot` function. What about other names? Just type as many names as you can until you see names that have interesting trends. When you see something interesting, use it in your final solution and tell me why you think it is interesting (probably in Blackboard when you submit the assignment).