PA1_template.Rmd

# Reproducible Research: Peer Assessment 1


## Loading and preprocessing the data
```{r}
library(data.table)

raw_data <- read.csv("activity.csv")
clean_data <- raw_data[!is.na(raw_data$steps), ]

raw_data <- data.table(raw_data)
clean_data <- data.table(clean_data)
```

## What is mean total number of steps taken per day?
```{r}
daily_data <- clean_data[, sum(steps), by = date]
setnames(daily_data, c("date", "steps"))
mean_total_per_day <- daily_data[, mean(steps), ]
median_total_per_day <- daily_data[, median(steps), ]
```

The mean total number of steps taken per day is `r mean_total_per_day` and the median is `r median_total_per_day`.

## What is the average daily activity pattern?
```{r}
interval_data <- clean_data[, mean(steps), by = interval]
plot(interval_data, type="l")

max_value = interval_data[, max(V1),]
max_interval = interval_data[V1 == max_value]$interval
```
The interval time with maximum mean value (`r max_value`) is `r max_interval`.
## Imputing missing values

### Rows with missing values
```{r}
na_data <- raw_data[is.na(raw_data$steps), ]
na_num_rows <- nrow(na_data)
```

There are `r na_num_rows` rows with missing values.

### Inputing missing values strategy 
The missing values will be replaced by the mean of the same interval across all the same *weekday* in the dataset

Add weekday column to data tables:
```{r}
raw_data$wday <- wday(as.Date(raw_data$date))
na_data$wday <- wday(as.Date(na_data$date))
clean_data$wday <- wday(as.Date(clean_data$date))

head(raw_data, n=10)
```

Compute the mean steps for all interval per weekday:

```{r}
data_wdays <- merge(na_data, clean_data, by=c("interval", "wday"), allow.cartesian=TRUE)

interval_weekday_mean <- data_wdays[, mean(steps.y), by = c("interval", "wday")]
setnames(interval_weekday_mean, c("interval", "wday", "steps"))

head(interval_weekday_mean, n=10)
```

Impute those means to the intervals with missing steps, joining by weekday and interval:
```{r}
inputted_data <- merge(na_data, interval_weekday_mean, by=c("wday", "interval"))
inputted_data <- inputted_data[, c("steps.y", "date", "interval", "wday"), with=FALSE]
setnames(inputted_data, c("steps", "date", "interval", "wday"))

head(inputted_data, n=10)
```

Append the inputted data on missing values with the known data:

```{r}
final_data <- rbind(inputted_data, clean_data)
head(final_data)
```

### Histogram of the total number of steps taken each day

```{r}
daily_inputted <- final_data[, sum(steps), by = date]
setnames(daily_inputted, c("date", "steps"))
hist(daily_inputted$steps, main="Histogram of total steps per day", xlab="Steps")
```

Compute median and mean:
```{r}
new_mean_daily <- daily_inputted[, mean(steps), ]
new_median_daily <- daily_inputted[, median(steps), ]

diff_mean =  new_mean_daily - mean_total_per_day
diff_median =  new_median_daily - median_total_per_day
```

When imputing missing values, the mean total number of steps taken per day is `r new_mean_daily` and the median is `r new_median_daily`. Compared to the estimates from the first part, the new mean is incremented by `r diff_mean` and the median by `r diff_median`.

## Are there differences in activity patterns between weekdays and weekends?

Create new factor variable to determine if the data is either a weekday or a weekend day. Wekeend is 1 (sunday) and 6 (saturday):
```{r}

final_data$wday_type <- c("weekend", "weekday", "weekend")[ findInterval(final_data$wday, c(1, 2, 6, Inf)) ]
```

Aggregate per interval and weekday_type:
```{r}
agg <- final_data[, sum(steps), by = c("interval", "wday_type")]
setnames(agg, c("interval", "wday_type", "steps"))
head(agg, n=10)
```

Time series plot panel:
```{r}
library(lattice) 
xyplot(agg$steps ~ agg$interval|agg$wday_type, main="Activity pattern", type="l", layout=(c(1,2)), ylab="steps", xlab="Time interval")
```