Big change not recorded: all gather/spread calls were changed to pivot_longer/pivot_wider.
Preface, Page xxvii
Zofia Gajdos instead of Zzofia Gajdos
Introduction, Page xxx
In the intro "scrapping" should be "scraping"
Page 3, 31 in book
the New File
Should be
then New File
Page 24, 52 in pdf
Change this paragraph
Note that the levels have an order that is different from the order of appearance in the factor object. The default is for the levels to follow alphabetical order. However, often we want the levels to follow a different order. We will see several examples of this in the Data Visualization part of the book. The function reorder lets us change the order of the levels of a factor variable based on a summary computed on a numeric vector. We will demonstrate this with a simple example.
To
Note that the levels have an order that is different from the order of appearance in the factor object. The default in R is for the levels to follow alphabetical order. However, often we want the levels to follow a different order. You can specify an order through the `levels` argument when creating the factor with the `factor` function. For example, in the murders dataset regions are ordered from east to west. The function `reorder` lets us change the order of the levels of a factor variable based on a summary computed on a numeric vector. We will demonstrate this with a simple example, and will see more advanced ones in the Data Visualization part of the book.
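As context for this erratum, a minimal sketch of the `reorder` behavior the new paragraph describes (toy data, not from the book):
```{r}
# Levels follow alphabetical order by default; reorder sorts them by a summary
value <- c(10, 3, 5, 8)
region <- factor(c("North", "South", "East", "West"))
levels(region)                            # "East" "North" "South" "West"
region <- reorder(region, value, FUN = median)
levels(region)                            # "South" "East" "West" "North"
```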
Page 48, 77 in pdf
name space
Should be
namespace
Page 49, 77 in pdf
If we want to be absolutely sure we use the __dplyr__ `filter` we can use
Should be
If we want to be absolutely sure that we use the __dplyr__ `filter`, we can use
page 52-53 in pdf
The list subsection (2.4.6) changed quite a bit. Better to look at the new version. Here is what it should be now:
Data frames are a special case of _lists_. Lists are useful because you can store any combination of different types. You can create a list using the `list` function like this:
```{r}
record <- list(name = "John Doe",
               student_id = 1234,
               grades = c(95, 82, 91, 97, 93),
               final_grade = "A")
```
The function `c` is described in Section \@ref(vectors).
This list includes a character, a number, a vector with five numbers, and another character.
```{r}
record
class(record)
```
As with data frames, you can extract the components of a list with the accessor `$`.
```{r}
record$student_id
```
We can also use double square brackets (`[[`) like this:
```{r}
record[["student_id"]]
```
You should get used to the fact that in R, there are often several ways to do the same thing, such as accessing entries.
You might also encounter lists without variable names.
```{r}
record2 <- list("John Doe", 1234)
record2
```
If a list does not have names, you cannot extract the elements with `$`, but you can still use the brackets method and instead of providing the variable name, you provide the list index, like this:
```{r}
record2[[1]]
```
We won't be using lists until later, but you might encounter one in your own exploration of R. For this reason, we show you some basics here.
page 56
This is not a visible change, but because we added a reference we need to add a label to the section:
## Vectors {#vectors}
Page 63, 91 in pdf
this is now a special data frame called a grouped data frame and dplyr functions
Add comma:
this is now a special data frame called a grouped data frame, and dplyr functions
Page 64, 92 in pdf
To see the states by population, from smallest to largest, we arrange by `rate` instead
Should be
To see the states by murder rate, from lowest to highest, we arrange by `rate` instead:
Page 65, 94 in pdf
"Remember that the main summarization function"
Change to
"The `mean` and `sd` functions"
Page 65, 94 in pdf
if left blank, top_n, filters
Remove comma
if left blank, top_n filters
Page 70, 98 in pdf
For example to access
Add comma
For example, to access
Page 71, 99 in pdf
__no argument
The closing __ is missing, and the phrase should be bold. So in markdown:
__no argument__
Page 73, 101 in pdf
three groups
Should be
four groups
Page 73, 101 in pdf
for example to check
Should be
for example, to check
Page 74, 102 in pdf
In exercise 3, murders is in the wrong font:
murders
Should be
`murders`
Page 82, 110 in pdf
Remove:
Two other functions that are helpful are `scan`.
Page 84, 112 in pdf
Although there are R packages designed to read this format, if you are choosing a file format to save your own data, you generally want to avoid Microsoft Excel. We recommend Google Sheets as a free software tool for organizing data. We provide more recommendations in the section Data Organization with Spreadsheets. This book focuses on data analysis. Yet often a data scientist needs to collect data or work with others collecting data. Filling out a spreadsheet by hand is a practice we highly discourage and instead recommend that the process be automatized as much as possible. But sometimes you just have to do it. In this section, we provide recommendations on how to store data in a spreadsheet.
Should be
Although this book focuses almost exclusively on data analysis, data management is also an important part of data science. As explained in the introduction, we do not cover this topic. However, quite often a data analyst needs to collect data, or work with others collecting data, in a way that is most conveniently stored in a spreadsheet. Although filling out a spreadsheet by hand is a practice we highly discourage, and we instead recommend the process be automatized as much as possible, sometimes you just have to do it. Therefore, in this section, we provide recommendations on how to organize data in a spreadsheet. Although there are R packages designed to read Microsoft Excel spreadsheets, we generally want to avoid this format. Instead, we recommend Google Sheets as a free software tool. Below we summarize the recommendations made in a paper by Karl Broman and Kara Woo^[https://www.tandfonline.com/doi/abs/10.1080/00031305.2017.1375989]. Please read the paper for important details.
page 97 in pdf
Replace
One other important difference is that by default `data.frame` coerces characters into factors without providing a warning or message:
```{r}
grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"),
                     exam_1 = c(95, 80, 90, 85),
                     exam_2 = c(90, 85, 85, 90))
class(grades$names)
```
To avoid this, we use the rather cumbersome argument `stringsAsFactors`:
```{r}
grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"),
                     exam_1 = c(95, 80, 90, 85),
                     exam_2 = c(90, 85, 85, 90),
                     stringsAsFactors = FALSE)
class(grades$names)
```
with
```{r}
grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"),
                     exam_1 = c(95, 80, 90, 85),
                     exam_2 = c(90, 85, 85, 90))
```
page 110 in pdf
Change
However, there are a couple of important differences. To show this we read-in the data with an R-base function:
```{r}
dat2 <- read.csv(filename)
```
An important difference is that the characters are converted to factors:
```{r}
class(dat2$abb)
class(dat2$region)
```
This can be avoided by setting the argument `stringsAsFactors` to `FALSE`.
```{r}
dat <- read.csv("murders.csv", stringsAsFactors = FALSE)
class(dat$state)
```
In our experience this can be a cause for confusion since a variable that was saved as characters in file is converted to factors regardless of what the variable represents. In fact, we **highly** recommend setting `stringsAsFactors=FALSE` to be your default approach when using the R-base parsers. You can easily convert the desired columns to factors after importing data.
to
You can obtain a data frame like `dat` using:
```{r}
dat2 <- read.csv(filename)
```
page 100 in pdf
Replace the scan subsection.
Change
### `scan`
to
An often useful R-base importing function is `scan`, as it provides much flexibility.
Also add spaces around `=` here:
x <- scan(file.path(path, filename), sep=",", what = "c")
Should be
x <- scan(file.path(path, filename), sep = ",", what = "c")
Page 132, 160 in pdf
10\. Notice that the approximation calculated in question *two* is very close to the exact calculation in the first question. Now perform the same task for more extreme values. Compare the exact calculation and the normal approximation for the interval (79,81]. How many times bigger is the actual proportion than the approximation?
Two should be nine
page 108, 136 in pdf
9. Rewrite the code above to abbreviation as the label through aes
should be
9. Rewrite the code above to use abbreviations as the label through aes
page 108, 136 in pdf
10. Change the color of the labels through blue. How will we do this?
should be
10. Change the color of the labels to blue. How will we do this?
Page 132, 160 in pdf
15. In answering the previous questions, we found that it is not at all rare for a seven footer to become an NBA player. What would be a fair critique of our calculations:
a. Practice and talent are what make a great basketball player, not height.
b. The normal approximation is not appropriate for heights.
c. As seen in question 3, the normal approximation tends to underestimate the extreme values. It’s possible that there are more seven footers than we predicted.
d. As seen in question 3, the normal approximation tends to overestimate the extreme values. It's possible that there are fewer seven footers than we predicted.
Question 3 should be question 10
Page 171, 199 in pdf
For each year, we are simply comparing four quantities – the four percentages. A widely used graphical representation of percentages, popularized by Microsoft Excel, is the pie chart:
Should be (four should be five)
For each year, we are simply comparing five quantities – the five percentages. A widely used graphical representation of percentages, popularized by Microsoft Excel, is the pie chart:
page 182 in pdf
----1----x----10------100---
should be
`----10---x---100------1000---`
Page 199, 147 in pdf
closet should be closest
page 228 in pdf
remove exercises 9 through 12
302 in pdf
There is a - instead of + in CI:
$[\bar{X} - 2\hat{\mbox{SE}}(\bar{X}), \bar{X} - 2\hat{\mbox{SE}}(\bar{X})]$
Should be
$[\bar{X} - 2\hat{\mbox{SE}}(\bar{X}), \bar{X} + 2\hat{\mbox{SE}}(\bar{X})]$
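A quick numeric check of the corrected interval (toy values, not from the book):
```{r}
# With the sign fixed, the interval is centered at x_bar with width 4 SEs
x_bar <- 0.48
se_hat <- 0.016
c(x_bar - 2 * se_hat, x_bar + 2 * se_hat)
```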
304 in pdf
$1 - (1 - p)/2 + (1 - p)/2 = p$
should be
$1 - (1 - p)/2 - (1 - p)/2 = p$
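A quick check of the corrected identity using standard normal quantiles (any p works):
```{r}
# The area between the (1 - p)/2 and 1 - (1 - p)/2 quantiles is p
p <- 0.95
z <- qnorm(1 - (1 - p) / 2)
pnorm(z) - pnorm(-z)   # returns 0.95
```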
Page 344, 372 in pdf
get_slope <- function(x, y) cor(x, y) / (sd(x) * sd(y))
Should be
get_slope <- function(x, y) cor(x, y) * sd(y) / sd(x)
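A quick check that the corrected formula gives the least squares slope (made-up data, not from the book):
```{r}
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
cor(x, y) * sd(y) / sd(x)   # corrected get_slope
coef(lm(y ~ x))[["x"]]      # matches lm's estimate
```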
307 in pdf
$2*\bar{X}-1 = 0.04$
should be
$2\bar{X}-1 = 0.04$
page 314 in pdf
b. These are not t-tests so the connection between p-value and confidence intervals does not apply.
should be
b. These are based on t-statistics so the connection between p-value and confidence intervals does not apply.
page 346 in pdf
Change
$$
Z = \frac{\bar{X} - d}{s/\sqrt{N}}
$$
to
$$
t = \frac{\bar{X} - d}{s/\sqrt{N}}
$$
Add a sentence.
By substituting $\sigma$ with $s$ we introduce some variability.
should be
This is referred to as a _t-statistic_. By substituting $\sigma$ with $s$ we introduce some variability.
The theory tells us that $Z$ follows a t-distribution with $N-1$ _degrees of freedom_.
should be
The theory tells us that $t$ follows a t-distribution with $N-1$ _degrees of freedom_.
then $Z$ follows a _student t-distribution_ with $N-1$ degrees of freedom. So perhaps a better confidence interval for $d$ is:
should be
then $t$ follows a t-distribution with $N-1$ degrees of freedom. So perhaps a better confidence interval for $d$ is:
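As context for these fixes, a minimal sketch of the t-statistic and the t-based interval (toy numbers, not from the book):
```{r}
# t-statistic for the null d = 0 and a 95% t-based confidence interval
N <- 25
x_bar <- 2.1   # sample mean
s <- 1.3       # sample standard deviation
x_bar / (s / sqrt(N))                                    # the t-statistic
x_bar + c(-1, 1) * qt(0.975, df = N - 1) * s / sqrt(N)   # the interval
```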
Page 344, 372 in pdf
summarize(slope = get_slope(R_per_game, BB_per_game))
Should be:
summarize(slope = get_slope(BB_per_game, R_per_game))
Page 344, 372 in pdf
So does this mean that if we go and hire low salary players with many BB, and who therefore increase the number of walks per game by 2, our team will score 4.2 more runs per game?
Should be (4.2 is now 1.5)
So does this mean that if we go and hire low salary players with many BB, and who therefore increase the number of walks per game by 2, our team will score 1.5 more runs per game?
Page 345, 373 in pdf
#> 1 1.3
Should be
#> 1 0.449
Page 346, 374 in pdf
#> 1 0.4 4.19
#> 2 0.5 3.30
#> 3 0.6 2.55
#> 4 0.7 2.03
#> 5 0.8 2.47
Should be
#> 1 0.4 0.734
#> 2 0.5 0.566
#> 3 0.6 0.412
#> 4 0.7 0.285
#> 5 0.8 0.365
Page 346, 374 in pdf
In fact, the values above are closer to the slope we obtained from singles, 1.3, which is more consistent with our intuition.
Should be (1.3 should be 0.45)
In fact, the values above are closer to the slope we obtained from singles, 0.45, which is more consistent with our intuition.
Page 347, 375 in pdf
#> 1 2.8 6.47
#> 2 2.9 8.87
#> 3 3 7.40
#> 4 3.1 7.34
#> 5 3.2 5.89
Should be:
#> 1 2.8 1.52
#> 2 2.9 1.57
#> 3 3 1.52
#> 4 3.1 1.49
#> 5 3.2 1.58
Page 355 in pdf
72 - 69.1) / 2.5
Should be
(72.0 - 69.1) / 2.5
Page 347, 375 in pdf
#> 1 5.33
Should be:
#> 1 1.84
Page 355, 383 in pdf
get_slope <- function(x, y) cor(x, y) / (sd(x) * sd(y))
Should be
get_slope <- function(x, y) cor(x, y) * sd(y) / sd(x)
Page 373, 401 in pdf
over interpret should be over-interpret
three common ways should be four common ways
page 384
Add
Using the t-distribution and the t-statistic is the basis for _t-tests_, a widely used approach for computing p-values. To learn more about t-tests, you can consult any statistics textbook.
right before:
The t-distribution can also be used to model errors
Page 391, 419 in pdf
Add a return after + here
new_tidy_data %>% ggplot(aes(year, fertility, color = country)) +
geom_point()
page 453, 481 in pdf
The make_date function can be used to quickly create a date object. It takes three arguments: year, month, day, hour, minute, seconds, and time zone defaulting to the epoch values on UTC time. So to create a date object representing, for example, July 6, 2019 we write:
make_date(1970, 7, 6)
#> [1] "1970-07-06"
Should be
make_date(2019, 7, 6)
#> [1] "2019-07-06"
Page 449, 477 in pdf
"It turns them into days since the epoch.The as.Date function can convert a character into a date. So to see that the epoch is day 0 we can type"
There should be a space between "epoch" and "The".
page 472 in pdf
Remove the stringAsFactors
```{r}
new_research_funding_rates <- tab[6:14] %>%
  str_trim %>%
  str_split("\\s{2,}", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  setNames(the_names) %>%
  mutate_at(-1, parse_number)
new_research_funding_rates %>% as_tibble()
```
should be
```{r}
new_research_funding_rates <- tab[6:14] %>%
  str_trim %>%
  str_split("\\s{2,}", simplify = TRUE) %>%
  data.frame() %>%
  setNames(the_names) %>%
  mutate_at(-1, parse_number)
new_research_funding_rates %>% as_tibble()
```
Page 485, 513 in pdf
If the outcomes are binary, both RMSE and MSE are equivalent to accuracy, since
Should be
If the outcomes are binary, both RMSE and MSE are equivalent to one minus accuracy, since
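A toy check of this equivalence with binary 0/1 outcomes (made-up values):
```{r}
y <- c(0, 1, 1, 0, 1)       # observed outcomes
y_hat <- c(0, 1, 0, 0, 1)   # predictions
mean((y_hat - y)^2)         # MSE
1 - mean(y_hat == y)        # one minus accuracy: same value
```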
page 484
Turn data frame to tibble. Change
Now we are now ready to extract the words for all our tweets.
```{r}
campaign_tweets <- trump_tweets %>%
  extract(source, "source", "Twitter for (.*)") %>%
  filter(source %in% c("Android", "iPhone") &
           created_at >= ymd("2015-06-17") &
           created_at < ymd("2016-11-08")) %>%
  filter(!is_retweet) %>%
  arrange(created_at)
```
to
```{r}
campaign_tweets <- trump_tweets %>%
  extract(source, "source", "Twitter for (.*)") %>%
  filter(source %in% c("Android", "iPhone") &
           created_at >= ymd("2015-06-17") &
           created_at < ymd("2016-11-08")) %>%
  filter(!is_retweet) %>%
  arrange(created_at) %>%
  as_tibble()
```
page 485
remove ds_theme_set()
Page 538, 566 in pdf
fit_glm <- glm(y ~ x_1 + x_2, data=mnist_27$train, family = "binomial")
p_hat_glm <- predict(fit_glm, mnist_27$test)
y_hat_glm <- factor(ifelse(p_hat_glm > 0.5, 7, 2))
confusionMatrix(y_hat_glm, mnist_27$test$y)$overall["Accuracy"]
#> Accuracy
#>     0.76
Should be (type response, and accuracy changes a bit)
fit_glm <- glm(y ~ x_1 + x_2, data=mnist_27$train, family = "binomial")
p_hat_glm <- predict(fit_glm, mnist_27$test, type = "response")
y_hat_glm <- factor(ifelse(p_hat_glm > 0.5, 7, 2))
confusionMatrix(y_hat_glm, mnist_27$test$y)$overall["Accuracy"]
#> Accuracy
#>     0.75
page 518 in pdf
"These are the images of the digits with the largest and smallest values for $X_1$:
And here are the original images corresponding to the largest and smallest value of $X_2$:"
should be
"To illustrate how to interpret $X_1$ and $X_2$, we include four example images. On the left are the original images of the two digits with the largest and smallest values for $X_1$ and on the right we have the images corresponding to the largest and smallest values of $X_2$:"
Page 558 in pdf
$$ \hat{f}(x) = 52 + 0.25 x $$
change to
$$ \hat{f}(x) = 35 + 0.5 x $$
page 558 change squared loss to mean squared loss
Our root mean squared error is:
```{r}
sqrt(mean((m - test_set$son)^2))
```
add sqrt to
```{r}
y_hat <- fit$coef[1] + fit$coef[2]*test_set$father
mean((y_hat - test_set$son)^2)
```
```{r}
y_hat <- fit$coef[1] + fit$coef[2]*test_set$father
sqrt(mean((y_hat - test_set$son)^2))
```
Page 560 in pdf
add sqrt to mean squared
```{r}
y_hat <- fit$coef[1] + fit$coef[2]*test_set$father
sqrt(mean((y_hat - test_set$son)^2))
```
page 561 in pdf
To illustrate logistic regression, we will apply it to our previous predicting sex example:
to
To illustrate logistic regression, we will apply it to our previous predicting sex example defined in Section \@ref(training-test).
Change unseen code from
library(dslabs)
data("heights")
y <- heights$height
set.seed(2)
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
train_set <- heights %>% slice(-test_index)
test_set <- heights %>% slice(test_index)
to
data(heights)
y <- heights$sex
set.seed(2007)
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
test_set <- heights[test_index, ]
train_set <- heights[-test_index, ]
In confusion-matrix.Rmd change
### Training and test sets
to
### Training and test sets {#training-test}
change
filter(round(height)==66) %>%
to
filter(round(height) == 66) %>%
change
confusionMatrix(y_hat, test_set$sex)[["Accuracy"]]
to
confusionMatrix(y_hat, test_set$sex)$overall[["Accuracy"]]
change
The function $\beta_0 + \beta_1 x$ can take any value including negatives and values larger than 1. In fact, the estimate $\hat{p}(x)$ computed in the linear regression section does indeed become negative at around 76 inches.
to
The function $\beta_0 + \beta_1 x$ can take any value including negatives and values larger than 1. In fact, the estimate $\hat{p}(x)$ computed in the linear regression section does indeed become negative.
change
confusionMatrix(y_hat_logit, test_set$sex)[["Accuracy"]]
to
confusionMatrix(y_hat_logit, test_set$sex)$overall[["Accuracy"]]
change
```{r glm-prediction}
to
```{r glm-prediction, echo = FALSE}
page 587 in pdf
sum of square
should be
sum of squares
page 590 in pdf
"chose cp" should be "choose cp".
Page 621 in pdf
The first two numbers are seven and the third one is a 2.
should be
The first and third numbers are sevens and the second one is a two.
page 592 in pdf
change
Let us look at how a classification tree performs on the digits example we examined before:
```{r, echo=FALSE}
library(dslabs)
data("mnist_27")
```
```{r, echo=FALSE}
plot_cond_prob <- function(p_hat=NULL){
  tmp <- mnist_27$true_p
  if(!is.null(p_hat)){
    tmp <- mutate(tmp, p=p_hat)
  }
  tmp %>% ggplot(aes(x_1, x_2, z=p, fill=p)) +
    geom_raster(show.legend = FALSE) +
    scale_fill_gradientn(colors=c("#F8766D","white","#00BFC4")) +
    stat_contour(breaks=c(0.5),color="black")
}
```
We can use this code to run the algorithm and plot the resulting tree:
to (note we are moving the sentence below)
Let us look at how a classification tree performs on the digits example we examined before by using this code to run the algorithm and plot the resulting accuracy:
```{r, echo=FALSE}
library(dslabs)
data("mnist_27")
```
```{r, echo=FALSE}
plot_cond_prob <- function(p_hat=NULL){
  tmp <- mnist_27$true_p
  if(!is.null(p_hat)){
    tmp <- mutate(tmp, p=p_hat)
  }
  tmp %>% ggplot(aes(x_1, x_2, z=p, fill=p)) +
    geom_raster(show.legend = FALSE) +
    scale_fill_gradientn(colors=c("#F8766D","white","#00BFC4")) +
    stat_contour(breaks=c(0.5),color="black")
}
```
page 604 in pdf
Change:
The accuracy is almost 0.95!
```{r}
y_hat_knn <- predict(fit_knn, x_test[, col_index], type="class")
cm <- confusionMatrix(y_hat_knn, factor(y_test))
cm$overall["Accuracy"]
```
We now achieve an accuracy of about 0.95.
to
We now achieve a high accuracy:
```{r}
y_hat_knn <- predict(fit_knn, x_test[, col_index], type="class")
cm <- confusionMatrix(y_hat_knn, factor(y_test))
cm$overall["Accuracy"]
```
page 605
to make compilation faster, remove:
```{r mnist-rf, message=FALSE, warning=FALSE, cache=TRUE}
library(randomForest)
control <- trainControl(method="cv", number = 5)
grid <- data.frame(mtry = c(1, 5, 10, 25, 50, 100))
train_rf <- train(x[, col_index], y,
                  method = "rf",
                  ntree = 150,
                  trControl = control,
                  tuneGrid = grid,
                  nSamp = 5000)
ggplot(train_rf)
train_rf$bestTune
```
Now that we have optimized our algorithm, we are ready to fit our final model:
```{r, cache=TRUE}
fit_rf <- randomForest(x[, col_index], y,
                       minNode = train_rf$bestTune$mtry)
```
To check that we ran enough trees we can use the plot function:
```{r, eval=FALSE}
plot(fit_rf)
```
page 637
Change
"The predictors for each observation are measured on the same device and experimental procedure. This introduces biases that can affect all the predictors from one observation. For each observation, compute the average across all predictors and then plot this against the first PC with color representing tissue. Report the correlation.
should be"
to
"The predictors for each observation are measured on the same measurement device (a gene expression microarray) after an experimental procedure. A different device and procedure is used for each observation. This may introduce biases that affect all predictors for each observation in the same way. To explore the effect of this potential bias, for each observation, compute the average across all predictors and then plot this against the first PC with color representing tissue. Report the correlation."
Page 667, 695 in pdf
Remove sentence "We have already shown how we can do this with RStudio."
page 656
### Factors analysis
Should be
## Factor analysis {#factor-analysis}
page 656
Add the following
"Note that `p` and `q` are equivalent to the patterns and weights we described in Section \@ref(pca).""
after
"is that we can almost reconstruct $r$, which has 60 values, with a couple of vectors totaling 17 values.""
716 in pdf
```{r echo=FALSE}
summary(pressure)
```
Should be
```{r, echo=FALSE}
summary(pressure)
```
639 in pdf
"seven users"
should be
"six users"
687 in pdf
mv ../cv.tex ./
Should be
mv ../resumes/cv.tex ./
page 648
The first formula should not have 1/N. It should be:
$$ \sum_{u,i} \left(y_{u,i} - \mu - b_i\right)^2 + \lambda \sum_{i} b_i^2 $$
and "The first term is just least squares"
should be "The first term is just the sum of squares"
"
page 643 in pdf
Change
Let's compute the average rating for user $u$ for those that have rated over 100 movies:
```{r user-effect-hist}
train_set %>%
  group_by(userId) %>%
  summarize(b_u = mean(rating)) %>%
  filter(n()>=100) %>%
  ggplot(aes(b_u)) +
  geom_histogram(bins = 30, color = "black")
```
to
Let's compute the average rating for user $u$ for those that have rated 100 or more movies:
```{r user-effect-hist}
train_set %>%
  group_by(userId) %>%
  filter(n()>=100) %>%
  summarize(b_u = mean(rating)) %>%
  ggplot(aes(b_u)) +
  geom_histogram(bins = 30, color = "black")
```
page 651 in pdf
The first formula should not have 1/N.
should be:
$$
\sum_{u,i} \left(y_{u,i} - \mu - b_i - b_u \right)^2 +
\lambda \left(\sum_{i} b_i^2 + \sum_{u} b_u^2\right)
$$
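As a sketch of what minimizing these penalized sums implies, the movie effect has the closed form $\hat{b}_i = \sum_u (y_{u,i} - \hat{\mu}) / (n_i + \lambda)$, which shrinks the estimate toward zero (toy numbers, not from the book):
```{r}
# Penalized least squares estimate of one movie effect
lambda <- 3
mu <- 3.5
ratings <- c(4.0, 3.0, 5.0, 2.5)                # ratings for one movie
sum(ratings - mu) / (length(ratings) + lambda)  # shrunken b_i
```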
NEED TO FIND PAGE:
This package is intended for relatively simple tasks such as converging data into tables.
should be
This package is intended for relatively simple tasks such as converting data into tables.
In section 28.3.2 change
geom_smooth(span = 0.15, method.args = list(degree=1))
To
geom_smooth(method = "loess", span = 0.15, method.args = list(degree=1))
Section 27.4.5
Change
A cutoff of 65 makes more sense than 64. Furthermore, it balances the specificity and sensitivity of our confusion matrix:
To
A cutoff of `r best_cutoff` makes more sense than 64. Furthermore, it balances the specificity and sensitivity of our confusion matrix:
page 502 in pdf
Exercise 3 code needs to read in mnist, so it should be:
library(dslabs)
mnist <- read_mnist()
y <- mnist$train$labels
Page 626 in the pdf
The average distance function has i,j and instead should have 1,j and 2,j
$$\sqrt{ \frac{1}{2} \sum_{j=1}^2 (X_{i,j}-X_{i,j})^2 },$$
Should be
$$\sqrt{ \frac{1}{2} \sum_{j=1}^2 (X_{1,j}-X_{2,j})^2 },$$
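A quick check of the corrected formula with two toy observations:
```{r}
# Rows are observations 1 and 2; columns are the two predictors
X <- matrix(c(1, 3, 2, 5), nrow = 2)
sqrt(0.5 * sum((X[1, ] - X[2, ])^2))   # the formula as written
sqrt(mean((X[1, ] - X[2, ])^2))        # equivalent form
```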
page 659 in pdf
The m should be an n at the end of the first equation:
$$
r_{u,i} = p_{u,1} q_{1,i} + p_{u,2} q_{2,i} + \dots + p_{u,m} q_{m,i}
$$
should be
$$
r_{u,i} = p_{u,1} q_{1,i} + p_{u,2} q_{2,i} + \dots + p_{u,n} q_{n,i}
$$
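A small sketch of this factorization with toy matrices (not from the book):
```{r}
set.seed(1)
p <- matrix(rnorm(4 * 2), 4, 2)   # user factors p_{u,k}
q <- matrix(rnorm(2 * 3), 2, 3)   # item factors q_{k,i}
r <- p %*% q                      # r[u, i] = sum_k p[u, k] * q[k, i]
r[1, 2]
sum(p[1, ] * q[, 2])              # same value, term by term as in the formula
```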
page 669 in pdf
change
To interpret this graph we do the following. To find the distance between any two movies, find the first location from top to bottom that the movies split into two different groups. The height of this location is the distance between the groups. So the distance between the _Star Wars_ movies is 8 or less, while the distance between _Raiders of the Lost of Ark_ and _Silence of the Lambs_ is about 17.
to