---
title: "Reproducible Research: Peer Assessment 1"
output:
  html_document:
    keep_md: true
---
## Loading and preprocessing the data
```{r load_preprocess}
# packages
suppressPackageStartupMessages(library('dplyr'))
suppressPackageStartupMessages(library('ggplot2'))
suppressPackageStartupMessages(library('knitr'))
# load data (expects activity.csv in the working directory)
steps <- read.csv('./activity.csv', header = TRUE)
```
## What is mean total number of steps taken per day?
```{r steps_a_day}
# Number of steps/day
daystepcount <- steps %>% group_by(date) %>% summarise(day.count = sum(steps))
```
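Because `sum(steps)` is called without `na.rm`, any day that contains a missing value gets an `NA` total rather than a partial sum. A minimal sketch on made-up data illustrates the behaviour; the `toy` data frame and its values are hypothetical and not taken from `activity.csv`:

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical toy data: day 'd2' contains one missing value
toy <- data.frame(
  date  = c('d1', 'd1', 'd2', 'd2'),
  steps = c(10, 20, NA, 5)
)

# Same pattern as daystepcount: NA propagates through sum()
toy_strict <- toy %>% group_by(date) %>% summarise(day.count = sum(steps))

# With na.rm = TRUE the missing value is skipped, giving a partial sum
toy_lenient <- toy %>% group_by(date) %>%
  summarise(day.count = sum(steps, na.rm = TRUE))
```

Keeping the `NA` totals means that incomplete days are left out of the mean and median (through `na.rm = TRUE`) instead of being counted as low-activity days.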
The total number of steps per day is calculated in `daystepcount`; the histogram below shows its distribution.
```{r steps_a_day_hist}
ggplot(daystepcount) + geom_histogram(aes(x = day.count), binwidth = 1000) +
  theme_bw() + ggtitle('Total number of steps per day') + xlab('Number of steps per day')
```
```{r mean_steps_a_day}
# Mean and median number of steps/day (days with an NA total are excluded)
mstepcount <- daystepcount %>% ungroup() %>%
  summarise(mean = mean(day.count, na.rm = TRUE), median = median(day.count, na.rm = TRUE))
mstepcount %>% data.frame
```
The mean total number of steps taken per day was `r format(mstepcount$mean, scientific = FALSE)` and the median was `r mstepcount$median`.
## What is the average daily activity pattern?
```{r meansteps_interval}
# Mean step count per interval (misc)
misc <- steps %>% group_by(interval) %>%
  summarise(meansteps = mean(steps, na.rm = TRUE))
# Interval with the highest mean step count, plus a label for the plot
maxinterval <- misc[which(misc$meansteps == max(misc$meansteps)), ]
maxinterval$label <- paste('Maximum occurs at', round(maxinterval$meansteps, 0),
                           'steps, at interval', maxinterval$interval)
ggplot(misc, aes(x = interval, y = meansteps)) + geom_line() +
  geom_point(data = maxinterval, colour = 'red', size = 2) +
  geom_text(data = maxinterval, aes(x = interval + 20, label = label),
            hjust = 0, colour = 'red', size = 4) +
  ylab('Average step count') + xlab('5-min interval during day') + theme_bw() +
  ggtitle('Average step count over time intervals')
```
`r maxinterval$label`
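
An interval identifier can be easier to read as a clock time. Assuming the identifiers encode hours and minutes as `HHMM` without leading zeros (an assumption about `activity.csv` that this report does not state), a small hypothetical helper, here called `interval_to_time`, could convert them:

```r
# Hypothetical helper: convert an HHMM-style interval id (e.g. 835) to 'HH:MM'.
# Assumes the id is hours * 100 + minutes; this is not confirmed by the report.
interval_to_time <- function(interval) {
  interval <- as.integer(interval)
  sprintf('%02d:%02d', interval %/% 100L, interval %% 100L)
}

# Example ids chosen for illustration only
example_times <- interval_to_time(c(0, 55, 835, 2355))
```

Under that assumption, the label could report a time of day instead of a raw identifier.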
## Imputing missing values
There are `r sum(is.na(steps$steps))` missing values in the dataset.
I will impute using the rounded mean number of steps at each interval.
These values are calculated above in the variable `misc`.
```{r missing_values}
# Add the mean steps at each interval to steps
steps.imputed <- left_join(steps, misc, by = 'interval')
# If steps is NA, replace it with the rounded interval mean
steps.imputed <- steps.imputed %>%
  mutate(steps = ifelse(is.na(steps), round(meansteps, 0), steps))
```
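To see what the join-and-replace step does, here is a minimal sketch on made-up data. The `toy`, `toy_means`, and `toy_imputed` objects are hypothetical and only mirror the roles of `steps`, `misc`, and `steps.imputed`:

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical toy data: two intervals observed on three days
toy <- data.frame(
  date     = rep(c('d1', 'd2', 'd3'), each = 2),
  interval = rep(c(0, 5), times = 3),
  steps    = c(1, 10, NA, 20, 5, NA)
)

# Mean per interval, ignoring missing values (same role as misc)
toy_means <- toy %>% group_by(interval) %>%
  summarise(meansteps = mean(steps, na.rm = TRUE))

# Join the interval means and fill only the missing entries
toy_imputed <- toy %>%
  left_join(toy_means, by = 'interval') %>%
  mutate(steps = ifelse(is.na(steps), round(meansteps, 0), steps))
```

Observed values are left untouched; only the two `NA` entries are replaced, by 3 (the mean of 1 and 5) and 15 (the mean of 10 and 20).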
The new dataset is stored in `steps.imputed`.
```{r imputed_daycount}
# Number of steps/day imputed
daystepcount.imp <- steps.imputed %>% group_by(date) %>%
  summarise(day.count = sum(steps))
# Mean and median number of steps/day imputed
mstepcount.imp <- daystepcount.imp %>% ungroup() %>%
  summarise(mean = mean(day.count, na.rm = TRUE), median = median(day.count, na.rm = TRUE))
ggplot(daystepcount.imp) + geom_histogram(aes(x = day.count), binwidth = 500) +
  ggtitle('Histogram of total number of steps taken each day\nAfter imputation') + theme_bw() +
  xlab('Daily step count')
```
The mean and median number of steps per day of the imputed data:
```{r}
mstepcount.imp %>% data.frame
```
For comparison, the mean and median number of steps per day of the original data:
```{r}
mstepcount %>% data.frame
```
Pretty close! The effect on the mean and median was small, and the imputed dataset no longer contains missing values.
## Are there differences in activity patterns between weekdays and weekends?
I will use the original dataset here, as I did not take the day of the week into account when imputing.
```{r weekdays}
# Set the locale to English for English weekday names
Sys.setlocale("LC_TIME", 'en_US.UTF-8')
# Convert dates to Date objects and calculate the weekday
steps.wd <- steps %>% mutate(date = as.Date(date), weekday = weekdays(date)) %>%
  # Classify each day as weekend or weekday
  mutate(weekend.day = ifelse(weekday %in% c('Saturday', 'Sunday'), 'Weekend', 'Weekday'))
# Reorder weekend/weekday factor
steps.wd$weekend.day <- relevel(factor(steps.wd$weekend.day), ref = 'Weekend')
# Average number of steps per interval during weekends/weekdays
steps.wd.mean <- steps.wd %>% group_by(interval, weekend.day) %>%
  summarise(mean.steps = mean(steps, na.rm = TRUE))
# Plot the average steps during weekends/weekdays as a panel plot
ggplot(steps.wd.mean) + geom_line(aes(x = interval, y = mean.steps)) +
  facet_wrap(~weekend.day, nrow = 2) + theme_bw() +
  ggtitle('Mean number of steps per interval by weekday or weekend') +
  ylab('Mean number of steps')
```
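The chunk above matches on English weekday names, which is why it sets the locale first. As a possible alternative, the weekend check can be made independent of the locale by using the numeric weekday. The sketch below uses a hypothetical helper named `is_weekend` and is not part of the original analysis:

```r
# Hypothetical helper: classify dates without relying on weekday names.
# as.POSIXlt()$wday is numeric: 0 = Sunday, 1 = Monday, ..., 6 = Saturday.
is_weekend <- function(date) {
  as.POSIXlt(as.Date(date))$wday %in% c(0, 6)
}

# 2012-10-06 was a Saturday, 2012-10-07 a Sunday, 2012-10-08 a Monday
weekend_flags <- is_weekend(c('2012-10-06', '2012-10-07', '2012-10-08'))
```

The result could feed the same `ifelse(..., 'Weekend', 'Weekday')` step, so the `Sys.setlocale` call would no longer be needed on systems where the `en_US.UTF-8` locale is unavailable.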