forked from rdpeng/RepData_PeerAssessment1
# Reproducible Research: Peer Assessment 1
## Loading and preprocessing the data
```{r}
library(data.table)  # also provides wday(), used in the imputation steps

# Read the activity data, then keep a copy without the rows where steps is NA
raw_data <- data.table(read.csv("activity.csv"))
clean_data <- raw_data[!is.na(steps)]
```
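As a side note on the loading step, `fread()` from data.table returns a data.table directly, which avoids the separate conversion. The sketch below is only an illustration: it writes a tiny made-up stand-in for `activity.csv` to a temporary file (the names `tmp`, `toy_raw` and `toy_clean` are hypothetical) and assumes the same three columns as the real file.

```r
library(data.table)

# Write a tiny stand-in for activity.csv to a temporary file (made-up values)
tmp <- tempfile(fileext = ".csv")
writeLines(c('"steps","date","interval"',
             'NA,"2012-10-01",0',
             '4,"2012-10-02",0',
             '7,"2012-10-02",5'), tmp)

toy_raw   <- fread(tmp)              # fread() returns a data.table
toy_clean <- toy_raw[!is.na(steps)]  # drop rows with missing step counts

nrow(toy_raw)    # all rows, including the one with NA
nrow(toy_clean)  # only rows with a known step count
```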
## What is mean total number of steps taken per day?
```{r}
daily_data <- clean_data[, sum(steps), by = date]
setnames(daily_data, c("date", "steps"))
mean_total_per_day <- daily_data[, mean(steps), ]
median_total_per_day <- daily_data[, median(steps), ]
```
The mean total number of steps taken per day is `r mean_total_per_day` and the median is `r median_total_per_day`.
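The grouping step above can be illustrated on a small made-up table. This is a minimal sketch with hypothetical values (`toy`, two days with three intervals each), not the activity data. It names the aggregated column directly in `j` rather than calling `setnames()` afterwards.

```r
library(data.table)

# Hypothetical toy data: two days, three 5-minute intervals each
toy <- data.table(
  date     = rep(c("2012-10-02", "2012-10-03"), each = 3),
  interval = rep(c(0, 5, 10), times = 2),
  steps    = c(0, 10, 20, 5, 5, 50)
)

# Total steps per day; .() names the result column in the same step
toy_daily <- toy[, .(steps = sum(steps)), by = date]

toy_mean   <- mean(toy_daily$steps)    # mean of the daily totals
toy_median <- median(toy_daily$steps)  # median of the daily totals
```

Here the daily totals are 30 and 60, so both the mean and the median are 45.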
## What is the average daily activity pattern?
```{r}
interval_data <- clean_data[, mean(steps), by = interval]
plot(interval_data, type = "l", xlab = "Interval", ylab = "Mean steps")
max_value <- interval_data[, max(V1)]
max_interval <- interval_data[V1 == max_value]$interval
```
The interval with the highest mean number of steps (`r max_value`) is `r max_interval`.
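An alternative way to locate the peak is `which.max()`, which returns the index of the first maximum and so always yields a single row, even when several intervals tie. The sketch below uses hypothetical interval means (`toy_means`), not values from the activity data.

```r
library(data.table)

# Hypothetical interval means, not taken from the activity data
toy_means <- data.table(interval = c(0, 5, 10, 15), V1 = c(1.5, 7.25, 3, 0))

# which.max() gives the row index of the (first) largest mean
peak <- toy_means[which.max(V1)]

peak$interval  # interval with the highest mean
peak$V1        # the highest mean itself
```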
## Imputing missing values
### Rows with missing values
```{r}
na_data <- raw_data[is.na(raw_data$steps), ]
na_num_rows <- nrow(na_data)
```
There are `r na_num_rows` rows with missing values.
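The count can also be obtained without building a separate table of missing rows. The sketch below runs on a made-up table (`toy_na` is a hypothetical name) and shows three related counts: missing values in `steps`, rows with a missing value in any column, and missing values per column.

```r
library(data.table)

# Hypothetical toy data with two missing step counts
toy_na <- data.table(
  steps    = c(NA, 3, NA, 8),
  date     = c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02"),
  interval = c(0, 5, 0, 5)
)

n_missing_steps <- sum(is.na(toy_na$steps))      # NAs in the steps column
n_incomplete    <- sum(!complete.cases(toy_na))  # rows with an NA in any column
missing_by_col  <- colSums(is.na(toy_na))        # NAs per column
```

In the toy table only `steps` contains missing values, so the first two counts agree.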
### Imputation strategy for missing values
Each missing value is replaced by the mean number of steps for the same interval on the same *weekday* (for example, all Mondays) across the dataset.

Add a weekday column to the data tables:
```{r}
raw_data$wday <- wday(as.Date(raw_data$date))
na_data$wday <- wday(as.Date(na_data$date))
clean_data$wday <- wday(as.Date(clean_data$date))
head(raw_data, n=10)
```
Compute the mean steps for each interval and weekday:
```{r}
# Mean of the observed steps for each interval and weekday.
# clean_data holds only rows with known steps, so no cartesian join with na_data is needed.
interval_weekday_mean <- clean_data[, .(steps = mean(steps)), by = .(interval, wday)]
head(interval_weekday_mean, n=10)
```
Assign those means to the rows with missing steps, joining by weekday and interval:
```{r}
inputted_data <- merge(na_data, interval_weekday_mean, by=c("wday", "interval"))
inputted_data <- inputted_data[, c("steps.y", "date", "interval", "wday"), with=FALSE]
setnames(inputted_data, c("steps", "date", "interval", "wday"))
head(inputted_data, n=10)
```
Combine the imputed rows with the observed rows:
```{r}
final_data <- rbind(inputted_data, clean_data)
head(final_data)
```
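The merge, select and append steps above could also be written as a single data.table update join. The following is a minimal sketch on made-up data, not the method used in this report: `toy_obs` and `toy_group_means` are hypothetical names, and the three dates are consecutive Mondays so that they fall in the same (interval, weekday) group.

```r
library(data.table)

# Hypothetical toy data: one interval, observed on two Mondays, missing on a third
toy_obs <- data.table(
  date     = as.Date(c("2012-10-01", "2012-10-08", "2012-10-15")),
  interval = 0L,
  steps    = c(10, 20, NA)
)
toy_obs[, wday := wday(date)]  # data.table::wday(), Sunday = 1

# Mean of the observed values per (interval, weekday) group
toy_group_means <- toy_obs[!is.na(steps),
                           .(mean_steps = mean(steps)),
                           by = .(interval, wday)]

# Update join: replace only the missing values, modifying toy_obs in place
toy_obs[toy_group_means, on = .(interval, wday),
        steps := ifelse(is.na(steps), i.mean_steps, steps)]

toy_obs$steps  # the NA is replaced by the mean of the two observed Mondays
```

One design difference is that the update join keeps the rows in their original order, whereas `rbind()` places all imputed rows before the observed ones.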
### Histogram of the total number of steps taken each day
```{r}
daily_inputted <- final_data[, sum(steps), by = date]
setnames(daily_inputted, c("date", "steps"))
hist(daily_inputted$steps, main="Histogram of total steps per day", xlab="Steps")
```
Compute median and mean:
```{r}
new_mean_daily <- daily_inputted[, mean(steps), ]
new_median_daily <- daily_inputted[, median(steps), ]
diff_mean <- new_mean_daily - mean_total_per_day
diff_median <- new_median_daily - median_total_per_day
```
After imputing the missing values, the mean total number of steps taken per day is `r new_mean_daily` and the median is `r new_median_daily`. Compared with the estimates from the first part, the mean changes by `r diff_mean` and the median by `r diff_median`.
## Are there differences in activity patterns between weekdays and weekends?
Create a new factor variable indicating whether each date is a weekday or a weekend day. `wday()` from data.table numbers the days from 1 (Sunday) to 7 (Saturday), so the weekend days are 1 and 7:
```{r}
final_data$wday_type <- factor(ifelse(final_data$wday %in% c(1, 7), "weekend", "weekday"))
```
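The day numbering is easy to get wrong, so a quick check on a few known dates may help. The sketch below assumes the `data.table::wday()` convention (Sunday = 1, Saturday = 7). The names `check_dates`, `day_num` and `day_type` are hypothetical, and the four dates run from a Friday to a Monday.

```r
library(data.table)

# 2012-10-05 to 2012-10-08 run Friday, Saturday, Sunday, Monday
check_dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))
day_num <- wday(check_dates)  # Sunday = 1, ..., Saturday = 7

# Only day numbers 1 and 7 count as weekend
day_type <- factor(ifelse(day_num %in% c(1, 7), "weekend", "weekday"),
                   levels = c("weekday", "weekend"))

data.table(date = check_dates, day_num, day_type)
```

Friday has day number 6 under this convention, so it is classified as a weekday.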
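Average the steps per interval and day type:
```{r}
# Use the mean rather than the sum: there are more weekdays than weekend days,
# so totals would not be comparable between the two panels.
agg <- final_data[, mean(steps), by = c("interval", "wday_type")]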
setnames(agg, c("interval", "wday_type", "steps"))
head(agg, n=10)
```
Time series plot panel:
```{r}
library(lattice)
xyplot(steps ~ interval | wday_type, data = agg, main = "Activity pattern", type = "l", layout = c(1, 2), ylab = "Steps", xlab = "Time interval")
```