Big change not recorded: all gather/spread calls were changed to pivot_longer/pivot_wider.
Preface, Page xxvii
Zofia Gajdos instead of Zzofia Gajdos
Introduction, Page xxx
In the intro "scrapping" should be "scraping"
Page 3, 31 in book
the New File
Should be
then New File
Page 24, 52 in pdf
Change this paragraph
Note that the levels have an order that is different from the order of appearance in the factor object. The default is for the levels to follow alphabetical order. However, often we want the levels to follow a different order. We will see several examples of this in the Data Visualization part of the book. The function reorder lets us change the order of the levels of a factor variable based on a summary computed on a numeric vector. We will demonstrate this with a simple example.
To
Note that the levels have an order that is different from the order of appearance in the factor object. The default in R is for the levels to follow alphabetical order. However, often we want the levels to follow a different order. You can specify an order through the `levels` argument when creating the factor with the `factor` function. For example, in the murders dataset regions are ordered from east to west. The function `reorder` lets us change the order of the levels of a factor variable based on a summary computed on a numeric vector. We will demonstrate this with a simple example, and will see more advanced ones in the Data Visualization part of the book.
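As context for this erratum, a minimal sketch of the `reorder` behavior the new paragraph describes (toy data, not from the book):
```{r}
# Levels follow alphabetical order by default; reorder sorts them by a summary
value <- c(10, 3, 5, 8)
region <- factor(c("North", "South", "East", "West"))
levels(region)                            # "East" "North" "South" "West"
region <- reorder(region, value, FUN = median)
levels(region)                            # "South" "East" "West" "North"
```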
Page 48, 77 in pdf
name space
Should be
namespace
Page 49, 77 in pdf
If we want to be absolutely sure we use the __dplyr__ `filter` we can use
Should be
If we want to be absolutely sure that we use the __dplyr__ `filter`, we can use
page 52-53 in pdf
The list subsection (2.4.6) changed quite a bit. Better to look at the new version. Here is what it should be now:
Data frames are a special case of _lists_. Lists are useful because you can store any combination of different types. You can create a list using the `list` function like this:
```{r}
record <- list(name = "John Doe",
               student_id = 1234,
               grades = c(95, 82, 91, 97, 93),
               final_grade = "A")
```
The function `c` is described in Section \@ref(vectors).
This list includes a character, a number, a vector with five numbers, and another character.
```{r}
record
class(record)
```
As with data frames, you can extract the components of a list with the accessor `$`.
```{r}
record$student_id
```
We can also use double square brackets (`[[`) like this:
```{r}
record[["student_id"]]
```
You should get used to the fact that in R, there are often several ways to do the same thing, such as accessing entries.
You might also encounter lists without variable names.
```{r}
record2 <- list("John Doe", 1234)
record2
```
If a list does not have names, you cannot extract the elements with `$`, but you can still use the brackets method and instead of providing the variable name, you provide the list index, like this:
```{r}
record2[[1]]
```
We won't be using lists until later, but you might encounter one in your own exploration of R. For this reason, we show you some basics here.
page 56
This is not a visible change, but because we added a reference we need to add a label to the section:
## Vectors {#vectors}
Page 63, 91 in pdf
this is now a special data frame called a grouped data frame and dplyr functions
Add comma:
this is now a special data frame called a grouped data frame, and dplyr functions
Page 64, 92 in pdf
To see the states by population, from smallest to largest, we arrange by `rate` instead
Should be
To see the states by murder rate, from lowest to highest, we arrange by `rate` instead:
Page 65, 94 in pdf
"Remember that the main summarization function"
Change to
"The `mean` and `sd` functions"
Page 65, 94 in pdf
if left blank, top_n, filters
Remove comma
if left blank, top_n filters
Page 70, 98 in pdf
For example to access
Add comma
For example, to access
Page 71, 99 in pdf
__no argument
The closing __ is missing, and the phrase should be bold. So in markdown:
__no argument__
Page 73, 101 in pdf
three groups
Should be
four groups
Page 73, 101 in pdf
for example to check
Should be
for example, to check
Page 74, 102 in pdf
In exercise 3, murders is in the wrong font:
murders
Should be
`murders`
Page 82, 110 in pdf
Remove:
Two other functions that are helpful are `scan`.
Page 84, 112 in pdf
Although there are R packages designed to read this format, if you are choosing a file format to save your own data, you generally want to avoid Microsoft Excel. We recommend Google Sheets as a free software tool for organizing data. We provide more recommendations in the section Data Organization with Spreadsheets. This book focuses on data analysis. Yet often a data scientist needs to collect data or work with others collecting data. Filling out a spreadsheet by hand is a practice we highly discourage and instead recommend that the process be automatized as much as possible. But sometimes you just have to do it. In this section, we provide recommendations on how to store data in a spreadsheet.
Should be
Although this book focuses almost exclusively on data analysis, data management is also an important part of data science. As explained in the introduction, we do not cover this topic. However, quite often a data analyst needs to collect data, or work with others collecting data, in a way that is most conveniently stored in a spreadsheet. Although filling out a spreadsheet by hand is a practice we highly discourage, and we instead recommend the process be automatized as much as possible, sometimes you just have to do it. Therefore, in this section, we provide recommendations on how to organize data in a spreadsheet. Although there are R packages designed to read Microsoft Excel spreadsheets, we generally want to avoid this format. Instead, we recommend Google Sheets as a free software tool. Below we summarize the recommendations made in a paper by Karl Broman and Kara Woo^[https://www.tandfonline.com/doi/abs/10.1080/00031305.2017.1375989]. Please read the paper for important details.
page 97 in pdf
Replace
One other important difference is that by default `data.frame` coerces characters into factors without providing a warning or message:
```{r}
grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"),
                     exam_1 = c(95, 80, 90, 85),
                     exam_2 = c(90, 85, 85, 90))
class(grades$names)
```
To avoid this, we use the rather cumbersome argument `stringsAsFactors`:
```{r}
grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"),
                     exam_1 = c(95, 80, 90, 85),
                     exam_2 = c(90, 85, 85, 90),
                     stringsAsFactors = FALSE)
class(grades$names)
```
with
```{r}
grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"),
                     exam_1 = c(95, 80, 90, 85),
                     exam_2 = c(90, 85, 85, 90))
```
page 110 in pdf
Change
However, there are a couple of important differences. To show this we read-in the data with an R-base function:
```{r}
dat2 <- read.csv(filename)
```
An important difference is that the characters are converted to factors:
```{r}
class(dat2$abb)
class(dat2$region)
```
This can be avoided by setting the argument `stringsAsFactors` to `FALSE`.
```{r}
dat <- read.csv("murders.csv", stringsAsFactors = FALSE)
class(dat$state)
```
In our experience this can be a cause for confusion since a variable that was saved as characters in file is converted to factors regardless of what the variable represents. In fact, we **highly** recommend setting `stringsAsFactors=FALSE` to be your default approach when using the R-base parsers. You can easily convert the desired columns to factors after importing data.
to
You can obtain a data frame like `dat` using:
```{r}
dat2 <- read.csv(filename)
```
page 100 in pdf
Replace the scan subsection.
Change
### `scan`
to
An often useful R-base importing function is `scan`, as it provides much flexibility.
Also add spaces around `=` here:
x <- scan(file.path(path, filename), sep=",", what = "c")
Should be
x <- scan(file.path(path, filename), sep = ",", what = "c")
Page 132, 160 in pdf
10\. Notice that the approximation calculated in question *two* is very close to the exact calculation in the first question. Now perform the same task for more extreme values. Compare the exact calculation and the normal approximation for the interval (79,81]. How many times bigger is the actual proportion than the approximation?
Two should be nine
page 108, 136 in pdf
9. Rewrite the code above to abbreviation as the label through aes
should be
9. Rewrite the code above to use abbreviations as the label through aes
page 108, 136 in pdf
10. Change the color of the labels through blue. How will we do this?
should be
10. Change the color of the labels to blue. How will we do this?
Page 132, 160 in pdf
15. In answering the previous questions, we found that it is not at all rare for a seven footer to become an NBA player. What would be a fair critique of our calculations:
a. Practice and talent are what make a great basketball player, not height.
b. The normal approximation is not appropriate for heights.
c. As seen in question 3, the normal approximation tends to underestimate the extreme values. It’s possible that there are more seven footers than we predicted.
d. As seen in question 3, the normal approximation tends to overestimate the extreme values. It's possible that there are fewer seven footers than we predicted.
Question 3 should be question 10
Page 171, 199 in pdf
For each year, we are simply comparing four quantities – the four percentages. A widely used graphical representation of percentages, popularized by Microsoft Excel, is the pie chart:
Should be (four should be five)
For each year, we are simply comparing five quantities – the five percentages. A widely used graphical representation of percentages, popularized by Microsoft Excel, is the pie chart:
page 182 in pdf
----1----x----10------100---
should be
`----10---x---100------1000---`
Page 199, 147 in pdf
closet should be closest
page 228 in pdf
remove exercises 9 through 12
302 in pdf
There is a - instead of + in CI:
$[\bar{X} - 2\hat{\mbox{SE}}(\bar{X}), \bar{X} - 2\hat{\mbox{SE}}(\bar{X})]$
Should be
$[\bar{X} - 2\hat{\mbox{SE}}(\bar{X}), \bar{X} + 2\hat{\mbox{SE}}(\bar{X})]$
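A quick numeric check of the corrected interval (toy values, not from the book):
```{r}
# With the sign fixed, the interval is centered at x_bar with width 4 SEs
x_bar <- 0.48
se_hat <- 0.016
c(x_bar - 2 * se_hat, x_bar + 2 * se_hat)
```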
304 in pdf
$1 - (1 - p)/2 + (1 - p)/2 = p$
should be
$1 - (1 - p)/2 - (1 - p)/2 = p$
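A quick check of the corrected identity using standard normal quantiles (any p works):
```{r}
# The area between the (1 - p)/2 and 1 - (1 - p)/2 quantiles is p
p <- 0.95
z <- qnorm(1 - (1 - p) / 2)
pnorm(z) - pnorm(-z)   # returns 0.95
```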
Page 344, 372 in pdf
get_slope <- function(x, y) cor(x, y) / (sd(x) * sd(y))
Should be
get_slope <- function(x, y) cor(x, y) * sd(y) / sd(x)
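A quick check that the corrected formula gives the least squares slope (made-up data, not from the book):
```{r}
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
cor(x, y) * sd(y) / sd(x)   # corrected get_slope
coef(lm(y ~ x))[["x"]]      # matches lm's estimate
```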
307 in pdf
$2*\bar{X}-1 = 0.04$
should be
$2\bar{X}-1 = 0.04$
page 314 in pdf
b. These are not t-tests so the connection between p-value and confidence intervals does not apply.
should be
b. These are based on t-statistics so the connection between p-value and confidence intervals does not apply.
page 346 in pdf
Change
$$
Z = \frac{\bar{X} - d}{s/\sqrt{N}}
$$
to
$$
t = \frac{\bar{X} - d}{s/\sqrt{N}}
$$
Add a sentence.
By substituting $\sigma$ with $s$ we introduce some variability.
should be
This is referred to as a _t-statistic_. By substituting $\sigma$ with $s$ we introduce some variability.
The theory tells us that $Z$ follows a t-distribution with $N-1$ _degrees of freedom_.
should be
The theory tells us that $t$ follows a t-distribution with $N-1$ _degrees of freedom_.
then $Z$ follows a _student t-distribution_ with $N-1$ degrees of freedom. So perhaps a better confidence interval for $d$ is:
should be
then $t$ follows a t-distribution with $N-1$ degrees of freedom. So perhaps a better confidence interval for $d$ is:
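As context for these fixes, a minimal sketch of the t-statistic and the t-based interval (toy numbers, not from the book):
```{r}
# t-statistic for the null d = 0 and a 95% t-based confidence interval
N <- 25
x_bar <- 2.1   # sample mean
s <- 1.3       # sample standard deviation
x_bar / (s / sqrt(N))                                    # the t-statistic
x_bar + c(-1, 1) * qt(0.975, df = N - 1) * s / sqrt(N)   # the interval
```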
Page 344, 372 in pdf
summarize(slope = get_slope(R_per_game, BB_per_game))
Should be:
summarize(slope = get_slope(BB_per_game, R_per_game))
Page 344, 372 in pdf
So does this mean that if we go and hire low salary players with many BB, and who therefore increase the number of walks per game by 2, our team will score 4.2 more runs per game?
Should be (4.2 is now 1.5)
So does this mean that if we go and hire low salary players with many BB, and who therefore increase the number of walks per game by 2, our team will score 1.5 more runs per game?
Page 345, 373 in pdf
#> 1 1.3
Should be
#> 1 0.449
Page 346, 374 in pdf
#> 1 0.4 4.19
#> 2 0.5 3.30
#> 3 0.6 2.55
#> 4 0.7 2.03
#> 5 0.8 2.47
Should be
#> 1 0.4 0.734
#> 2 0.5 0.566
#> 3 0.6 0.412
#> 4 0.7 0.285
#> 5 0.8 0.365
Page 346, 374 in pdf
In fact, the values above are closer to the slope we obtained from singles, 1.3, which is more consistent with our intuition.
Should be (1.3 should be 0.45)
In fact, the values above are closer to the slope we obtained from singles, 0.45, which is more consistent with our intuition.
Page 347, 375 in pdf
#> 1 2.8 6.47
#> 2 2.9 8.87
#> 3 3 7.40
#> 4 3.1 7.34
#> 5 3.2 5.89
Should be:
#> 1 2.8 1.52
#> 2 2.9 1.57
#> 3 3 1.52
#> 4 3.1 1.49
#> 5 3.2 1.58
Page 355 in pdf
72 - 69.1) / 2.5
Should be
(72.0 - 69.1) / 2.5
Page 347, 375 in pdf
#> 1 5.33
Should be:
#> 1 1.84
Page 355, 383 in pdf
get_slope <- function(x, y) cor(x, y) / (sd(x) * sd(y))
Should be
get_slope <- function(x, y) cor(x, y) * sd(y) / sd(x)
Page 373, 401 in pdf
over interpret should be over-interpret
three common ways should be four common ways
page 384
Add
Using the t-distribution and the t-statistic is the basis for _t-tests_, a widely used approach for computing p-values. To learn more about t-tests, you can consult any statistics textbook.
right before:
The t-distribution can also be used to model errors
Page 391, 419 in pdf
Add a return after + here
new_tidy_data %>% ggplot(aes(year, fertility, color = country)) +
geom_point()
page 453, 481 in pdf
The make_date function can be used to quickly create a date object. It takes three arguments: year, month, day, hour, minute, seconds, and time zone defaulting to the epoch values on UTC time. So to create a date object representing, for example, July 6, 2019 we write:
make_date(1970, 7, 6)
#> [1] "1970-07-06"
Should be
make_date(2019, 7, 6)
#> [1] "2019-07-06"
Page 449, 477 in pdf
"It turns them into days since the epoch.The as.Date function can convert a character into a date. So to see that the epoch is day 0 we can type"
There should be a space between "epoch" and "The".
page 472 in pdf
Remove the stringAsFactors
```{r}
new_research_funding_rates <- tab[6:14] %>%
  str_trim %>%
  str_split("\\s{2,}", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  setNames(the_names) %>%
  mutate_at(-1, parse_number)
new_research_funding_rates %>% as_tibble()
```
should be
```{r}
new_research_funding_rates <- tab[6:14] %>%
  str_trim %>%
  str_split("\\s{2,}", simplify = TRUE) %>%
  data.frame() %>%
  setNames(the_names) %>%
  mutate_at(-1, parse_number)
new_research_funding_rates %>% as_tibble()
```
Page 485, 513 in pdf
If the outcomes are binary, both RMSE and MSE are equivalent to accuracy, since
Should be
If the outcomes are binary, both RMSE and MSE are equivalent to one minus accuracy, since
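A toy check of this equivalence with binary 0/1 outcomes (made-up values):
```{r}
y <- c(0, 1, 1, 0, 1)       # observed outcomes
y_hat <- c(0, 1, 0, 0, 1)   # predictions
mean((y_hat - y)^2)         # MSE
1 - mean(y_hat == y)        # one minus accuracy: same value
```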
page 484
Turn data frame to tibble. Change
Now we are now ready to extract the words for all our tweets.
```{r}
campaign_tweets <- trump_tweets %>%
  extract(source, "source", "Twitter for (.*)") %>%
  filter(source %in% c("Android", "iPhone") &
           created_at >= ymd("2015-06-17") &
           created_at < ymd("2016-11-08")) %>%
  filter(!is_retweet) %>%
  arrange(created_at)
```
to
```{r}
campaign_tweets <- trump_tweets %>%
  extract(source, "source", "Twitter for (.*)") %>%
  filter(source %in% c("Android", "iPhone") &
           created_at >= ymd("2015-06-17") &
           created_at < ymd("2016-11-08")) %>%
  filter(!is_retweet) %>%
  arrange(created_at) %>%
  as_tibble()
```
page 485
remove ds_theme_set()
Page 538, 566 in pdf
fit_glm <- glm(y ~ x_1 + x_2, data=mnist_27$train, family = "binomial")
p_hat_glm <- predict(fit_glm, mnist_27$test)
y_hat_glm <- factor(ifelse(p_hat_glm > 0.5, 7, 2))
confusionMatrix(y_hat_glm, mnist_27$test$y)$overall["Accuracy"]
#> Accuracy
#>     0.76
Should be (type response, and accuracy changes a bit)
fit_glm <- glm(y ~ x_1 + x_2, data=mnist_27$train, family = "binomial")
p_hat_glm <- predict(fit_glm, mnist_27$test, type = "response")
y_hat_glm <- factor(ifelse(p_hat_glm > 0.5, 7, 2))
confusionMatrix(y_hat_glm, mnist_27$test$y)$overall["Accuracy"]
#> Accuracy
#>     0.75
page 518 in pdf
"These are the images of the digits with the largest and smallest values for $X_1$:
And here are the original images corresponding to the largest and smallest value of $X_2$:"
should be
"To illustrate how to interpret $X_1$ and $X_2$, we include four example images. On the left are the original images of the two digits with the largest and smallest values for $X_1$ and on the right we have the images corresponding to the largest and smallest values of $X_2$:"
Page 558 in pdf
$$ \hat{f}(x) = 52 + 0.25 x $$
change to
$$ \hat{f}(x) = 35 + 0.5 x $$
page 558 change squared loss to mean squared loss
Our root mean squared error is:
```{r}
sqrt(mean((m - test_set$son)^2))
```
add sqrt to
```{r}
y_hat <- fit$coef[1] + fit$coef[2]*test_set$father
mean((y_hat - test_set$son)^2)
```
```{r}
y_hat <- fit$coef[1] + fit$coef[2]*test_set$father
sqrt(mean((y_hat - test_set$son)^2))
```
Page 560 in pdf
add sqrt to mean squared
```{r}
y_hat <- fit$coef[1] + fit$coef[2]*test_set$father
sqrt(mean((y_hat - test_set$son)^2))
```
page 561 in pdf
To illustrate logistic regression, we will apply it to our previous predicting sex example:
to
To illustrate logistic regression, we will apply it to our previous predicting sex example defined in Section \@ref(training-test).
Change unseen code from
library(dslabs)
data("heights")
y <- heights$height
set.seed(2)
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
train_set <- heights %>% slice(-test_index)
test_set <- heights %>% slice(test_index)
to
data(heights)
y <- heights$sex
set.seed(2007)
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
test_set <- heights[test_index, ]
train_set <- heights[-test_index, ]
In confusion-matrix.Rmd change
### Training and test sets
to
### Training and test sets {#training-test}
change
filter(round(height)==66) %>%
to
filter(round(height) == 66) %>%
change
confusionMatrix(y_hat, test_set$sex)[["Accuracy"]]
to
confusionMatrix(y_hat, test_set$sex)$overall[["Accuracy"]]
change
The function $\beta_0 + \beta_1 x$ can take any value including negatives and values larger than 1. In fact, the estimate $\hat{p}(x)$ computed in the linear regression section does indeed become negative at around 76 inches.
to
The function $\beta_0 + \beta_1 x$ can take any value including negatives and values larger than 1. In fact, the estimate $\hat{p}(x)$ computed in the linear regression section does indeed become negative.
change
confusionMatrix(y_hat_logit, test_set$sex)[["Accuracy"]]
to
confusionMatrix(y_hat_logit, test_set$sex)$overall[["Accuracy"]]
change
```{r glm-prediction}
to
```{r glm-prediction, echo = FALSE}
page 587 in pdf
sum of square
should be
sum of squares
page 590 in pdf
"chose cp" should be "choose cp".
Page 621 in pdf
The first two numbers are seven and the third one is a 2.
should be
The first and third numbers are sevens and the second one is a two.
page 592 in pdf
change
Let us look at how a classification tree performs on the digits example we examined before:
```{r, echo=FALSE}
library(dslabs)
data("mnist_27")
```
```{r, echo=FALSE}
plot_cond_prob <- function(p_hat=NULL){
  tmp <- mnist_27$true_p
  if(!is.null(p_hat)){
    tmp <- mutate(tmp, p=p_hat)
  }
  tmp %>% ggplot(aes(x_1, x_2, z=p, fill=p)) +
    geom_raster(show.legend = FALSE) +
    scale_fill_gradientn(colors=c("#F8766D","white","#00BFC4")) +
    stat_contour(breaks=c(0.5),color="black")
}
```
We can use this code to run the algorithm and plot the resulting tree:
to (note we are moving the sentence below)
Let us look at how a classification tree performs on the digits example we examined before by using this code to run the algorithm and plot the resulting accuracy:
```{r, echo=FALSE}
library(dslabs)
data("mnist_27")
```
```{r, echo=FALSE}
plot_cond_prob <- function(p_hat=NULL){
  tmp <- mnist_27$true_p
  if(!is.null(p_hat)){
    tmp <- mutate(tmp, p=p_hat)
  }
  tmp %>% ggplot(aes(x_1, x_2, z=p, fill=p)) +
    geom_raster(show.legend = FALSE) +
    scale_fill_gradientn(colors=c("#F8766D","white","#00BFC4")) +
    stat_contour(breaks=c(0.5),color="black")
}
```
page 604 in pdf
Change:
The accuracy is almost 0.95!
```{r}
y_hat_knn <- predict(fit_knn, x_test[, col_index], type="class")
cm <- confusionMatrix(y_hat_knn, factor(y_test))
cm$overall["Accuracy"]
```
We now achieve an accuracy of about 0.95.
to
We now achieve a high accuracy:
```{r}
y_hat_knn <- predict(fit_knn, x_test[, col_index], type="class")
cm <- confusionMatrix(y_hat_knn, factor(y_test))
cm$overall["Accuracy"]
```
page 605
to make compilation faster, remove:
```{r mnist-rf, message=FALSE, warning=FALSE, cache=TRUE}
library(randomForest)
control <- trainControl(method="cv", number = 5)
grid <- data.frame(mtry = c(1, 5, 10, 25, 50, 100))
train_rf <- train(x[, col_index], y,
                  method = "rf",
                  ntree = 150,
                  trControl = control,
                  tuneGrid = grid,
                  nSamp = 5000)
ggplot(train_rf)
train_rf$bestTune
```
Now that we have optimized our algorithm, we are ready to fit our final model:
```{r, cache=TRUE}
fit_rf <- randomForest(x[, col_index], y,
                       minNode = train_rf$bestTune$mtry)
```
To check that we ran enough trees we can use the plot function:
```{r, eval=FALSE}
plot(fit_rf)
```
page 637
Change
"The predictors for each observation are measured on the same device and experimental procedure. This introduces biases that can affect all the predictors from one observation. For each observation, compute the average across all predictors and then plot this against the first PC with color representing tissue. Report the correlation.
should be"
to
"The predictors for each observation are measured on the same measurement device (a gene expression microarray) after an experimental procedure. A different device and procedure is used for each observation. This may introduce biases that affect all predictors for each observation in the same way. To explore the effect of this potential bias, for each observation, compute the average across all predictors and then plot this against the first PC with color representing tissue. Report the correlation."
Page 667, 695 in pdf
Remove sentence "We have already shown how we can do this with RStudio."
page 656
### Factors analysis
Should be
## Factor analysis {#factor-analysis}
page 656
Add the following
"Note that `p` and `q` are equivalent to the patterns and weights we described in Section \@ref(pca).""
after
"is that we can almost reconstruct $r$, which has 60 values, with a couple of vectors totaling 17 values.""
716 in pdf
```{r echo=FALSE}
summary(pressure)
```
Should be
```{r, echo=FALSE}
summary(pressure)
```
639 in pdf
"seven users"
should be
"six users"
687 in pdf
mv ../cv.tex ./
Should be
mv ../resumes/cv.tex ./
page 648
The first formula should not have 1/N. It should be:
$$ \sum_{u,i} \left(y_{u,i} - \mu - b_i\right)^2 + \lambda \sum_{i} b_i^2 $$
and "The first term is just least squares"
should be "The first term is just the sum of squares"
"
page 643 in pdf
Change
Let's compute the average rating for user $u$ for those that have rated over 100 movies:
```{r user-effect-hist}
train_set %>%
  group_by(userId) %>%
  summarize(b_u = mean(rating)) %>%
  filter(n()>=100) %>%
  ggplot(aes(b_u)) +
  geom_histogram(bins = 30, color = "black")
```
to
Let's compute the average rating for user $u$ for those that have rated 100 or more movies:
```{r user-effect-hist}
train_set %>%
  group_by(userId) %>%
  filter(n()>=100) %>%
  summarize(b_u = mean(rating)) %>%
  ggplot(aes(b_u)) +
  geom_histogram(bins = 30, color = "black")
```
page 651 in pdf
The first formula should not have 1/N.
should be:
$$
\sum_{u,i} \left(y_{u,i} - \mu - b_i - b_u \right)^2 +
\lambda \left(\sum_{i} b_i^2 + \sum_{u} b_u^2\right)
$$
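As a sketch of what minimizing these penalized sums implies, the movie effect has the closed form $\hat{b}_i = \sum_u (y_{u,i} - \hat{\mu}) / (n_i + \lambda)$, which shrinks the estimate toward zero (toy numbers, not from the book):
```{r}
# Penalized least squares estimate of one movie effect
lambda <- 3
mu <- 3.5
ratings <- c(4.0, 3.0, 5.0, 2.5)                # ratings for one movie
sum(ratings - mu) / (length(ratings) + lambda)  # shrunken b_i
```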
NEED TO FIND PAGE:
This package is intended for relatively simple tasks such as converging data into tables.
should be
This package is intended for relatively simple tasks such as converting data into tables.
In section 28.3.2 change
geom_smooth(span = 0.15, method.args = list(degree=1))
To
geom_smooth(method = "loess", span = 0.15, method.args = list(degree=1))
Section 27.4.5
Change
A cutoff of 65 makes more sense than 64. Furthermore, it balances the specificity and sensitivity of our confusion matrix:
To
A cutoff of `r best_cutoff` makes more sense than 64. Furthermore, it balances the specificity and sensitivity of our confusion matrix:
page 502 in pdf
Exercise 3 code needs to read in mnist, so it should be:
library(dslabs)
mnist <- read_mnist()
y <- mnist$train$labels
Page 626 in the pdf
The average distance function has i,j and instead should have 1,j and 2,j
$$\sqrt{ \frac{1}{2} \sum_{j=1}^2 (X_{i,j}-X_{i,j})^2 },$$
Should be
$$\sqrt{ \frac{1}{2} \sum_{j=1}^2 (X_{1,j}-X_{2,j})^2 },$$
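A quick check of the corrected formula with two toy observations:
```{r}
# Rows are observations 1 and 2; columns are the two predictors
X <- matrix(c(1, 3, 2, 5), nrow = 2)
sqrt(0.5 * sum((X[1, ] - X[2, ])^2))   # the formula as written
sqrt(mean((X[1, ] - X[2, ])^2))        # equivalent form
```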
page 659 in pdf
The m should be an n at the end of the first equation:
$$
r_{u,i} = p_{u,1} q_{1,i} + p_{u,2} q_{2,i} + \dots + p_{u,m} q_{m,i}
$$
should be
$$
r_{u,i} = p_{u,1} q_{1,i} + p_{u,2} q_{2,i} + \dots + p_{u,n} q_{n,i}
$$
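A small sketch of this factorization with toy matrices (not from the book):
```{r}
set.seed(1)
p <- matrix(rnorm(4 * 2), 4, 2)   # user factors p_{u,k}
q <- matrix(rnorm(2 * 3), 2, 3)   # item factors q_{k,i}
r <- p %*% q                      # r[u, i] = sum_k p[u, k] * q[k, i]
r[1, 2]
sum(p[1, ] * q[, 2])              # same value, term by term as in the formula
```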
page 669 in pdf
change
To interpret this graph we do the following. To find the distance between any two movies, find the first location from top to bottom that the movies split into two different groups. The height of this location is the distance between the groups. So the distance between the _Star Wars_ movies is 8 or less, while the distance between _Raiders of the Lost of Ark_ and _Silence of the Lambs_ is about 17.
to