Skip to content

This repo is set up for the course project in the Coursera course Getting and Cleaning Data.

Notifications You must be signed in to change notification settings

laurenfitch/tidydata

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 

Repository files navigation

title author date output
README.md
Lauren Fitch
Sunday, March 22, 2015
html_document

I started by getting the list of files in the test and train directories.

test_files <- list.files("tidy_data/test", 
                         pattern = "*.txt", full.names = TRUE)
training_files <- list.files("tidy_data/train",
                             pattern = "*.txt", full.names = TRUE)

Then I read those files into R, which required the stringr package.

install.packages("stringr")
library(stringr)

I made names for the data frames that match the original filenames.

test_file_names <- str_sub(test_files, 16, -5)
train_file_names <- str_sub(training_files, 17, -5)

ldf_test <- lapply(test_files, read.table)
names(ldf_test) <- test_file_names

ldf_train <- lapply(training_files, read.table)
names(ldf_train) <- train_file_names

Read in features

features <- read.table("tidy_data//features.txt")

Just take the second column

features <- features[ ,2]

Remove special characters

features <- str_replace_all(features, "[-(),]", "")

Concatenate together into one data frame

merge_data_test <- data.frame(ldf_test[1], ldf_test[2], ldf_test[3])
names(merge_data_test) <- c("subject", features, "y")

merge_data_train <- data.frame(ldf_train[1], ldf_train[2], ldf_train[3])
names(merge_data_train) <- c("subject", features, "y")

merge_data <- rbind(merge_data_test, merge_data_train)

Extracts only the measurements on the mean and standard deviation for each measurement. I looked for either "mean" or "std" in the variable name. I did not remove any variables from the resulting list.

col_names <- str_detect(names(merge_data), "mean|std")
col_names <- names(merge_data)[col_names]
col_names <- c("subject", col_names, "y")

extract_data <- merge_data[ , col_names]

I use descriptive activity names to name the activities in the data set, by reading in activity labels and matching them to the right row.

activity <- read.table("tidy_data/activity_labels.txt")
names(activity) <- c("id", "name")
extract_data$activity <- activity$name[match(extract_data$y, activity$id)]

I labeled the data set with descriptive variable names

names(extract_data) <- c(names(extract_data)[1:80], 
                         "activity_code", "activity")

From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject.

Requires the dplyr package.

install.packages("dplyr")
library(dplyr)

Group the data by subject and activity

who <- group_by(extract_data, subject, activity)
tidy <- summarise_each(who, funs(mean))

Remove the activity code column

tidy <- tidy[ ,1:81]

Output the tidy data set

write.table(tidy, "tidy.txt", row.name = FALSE)

About

This repo is set up for the course project in the Coursera course Getting and Cleaning Data.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages