Mission498Solutions.Rmd

---
title: "Guided Project Solutions: Creating An Efficient Data Analysis Workflow"
output: html_document
---

```{r}
library(tidyverse)
reviews <- read_csv("book_reviews.csv")
```


# Getting Familiar With The Data

```{r}
# How big is the dataset?
dim(reviews)

# What are the column names?
colnames(reviews)

# What are the column types?
for (c in colnames(reviews)) {
  print(typeof(reviews[[c]]))
}
```

```{r}
# What are the unique values in each column?
for (c in colnames(reviews)) {
  print("Unique values in the column:")
  print(c)
  print(unique(reviews[[c]]))
  print("")
}
```

All of the columns seem to contain strings. The `reviews` column represents what the score that the reviewer gave the book. The `book` column indicates which particular textbook was purchased. The `state` column represents the state where the book was purchased. The `price` column represents the price that the book was purchased for.

# Handling Missing Data

From the previous exercise, it's apparent that that the `review` column contains some `NA` values. We don't want any missing values in the dataset, so we need to get rid of them.

```{r}
complete_reviews = reviews %>% 
  filter(!is.na(review))

dim(complete_reviews)
```

There were about 200 reviews that were removed from the dataset. This is about 10% of the original dataset. This isn't too big of an amount, so we would feel comfortable continuing with our analysis.

# Dealing With Inconsistent Labels

We'll use the shortened postal codes instead since they're shorter.

```{r}
complete_reviews <- complete_reviews %>% 
  mutate(
    state = case_when(
      state == "California" ~ "CA",
      state == "New York" ~ "NY",
      state == "Texas" ~ "TX",
      state == "Florida" ~ "FL",
      TRUE ~ state # ignore cases where it's already postal code
    )
  )
```

# Transforming the Review Data

```{r}
complete_reviews <- complete_reviews %>% 
  mutate(
    review_num = case_when(
      review == "Poor" ~ 1,
      review == "Fair" ~ 2,
      review == "Good" ~ 3,
      review == "Great" ~ 4,
      review == "Excellent" ~ 5
    ),
    is_high_review = if_else(review_num >= 4, TRUE, FALSE)
  )
```

# Analyzing The Data

We'll define most profitable book in terms of how many books there was sold. 

```{r}
complete_reviews %>% 
  group_by(book) %>% 
  summarize(
    purchased = n()
  ) %>% 
  arrange(-purchased)
```

The books are relatively well matched in terms of purchasing, but "Fundamentals of R For Beginners" has a slight edge over everyone else.