
Getting and Cleaning Data - Course Project

TL;DR

  1. Check out this repository.
  2. Download the raw dataset (see References below) and unzip it into the working copy of the repository.
  3. From the working copy of the repository, start R. Then, from R...
  4. Run the analysis: source('run_analysis.R')
  5. View the data: View(tidy.data)
  6. Load the tidy_data.txt file: test <- read.table("tidy_data.txt", header=TRUE)
  7. View the loaded data: View(test)


Goal

Our course project demonstrates the steps of creating a "tidy data set" (as per Hadley Wickham's Tidy Data paper) using the Human Activity Recognition Using Smartphones Dataset (see References for the download link).

The project requires an R script, run_analysis.R, that (quoting from the project description):

  1. Merges the training and the test sets to create one data set.
  2. Extracts only the measurements on the mean and standard deviation for each measurement.
  3. Uses descriptive activity names to name the activities in the data set
  4. Appropriately labels the data set with descriptive variable names.
  5. From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject.

Repository Contents

The repository includes the following files:

  • ReadMe.md: This file.
  • CodeBook.md: Code Book describing the data set tidy_data.txt in detail.
  • run_analysis.R: This script runs the analysis and creates a tidy data set tidy_data.txt containing the results. Details in Steps below.
  • loadActivityLabels.R: Defines a function that is used to load and clean the activity factor labels.
  • filterFeatures.R: Defines the filterFeatures function that loads the feature list from the raw data and returns a list containing the tidy variable names and a vector of column classes that is used to filter the columns when they are loaded.
  • normalizeName.R: Defines the normalizeName function that is used in filterFeatures to process the raw variable names into R-friendly variable names.

Steps

The run_analysis.R script proceeds in several steps.

Activity Labels

loadActivityLabels is run to load the activities from the original data set's activity_labels.txt file. The original labels are upper-case words separated by underscores (e.g. WALKING_UPSTAIRS). These are transformed to pseudo-camel-case, with the underscores retained for readability (e.g. Walking_Upstairs). I didn't use "." as a separator because these values are factor levels rather than row or column names, so they will always be quoted, unlike column names.
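
A minimal sketch of this transformation (the actual loadActivityLabels.R may differ; the path assumes activity_labels.txt sits at the top of the working copy):

loadActivityLabels <- function(path = "activity_labels.txt") {
    labels <- read.table(path, col.names = c("id", "label"),
                         stringsAsFactors = FALSE)
    # WALKING_UPSTAIRS -> Walking_Upstairs: lower-case everything, then
    # upper-case the first letter of each underscore-separated word.
    labels$label <- gsub("(^|_)([a-z])", "\\1\\U\\2",
                         tolower(labels$label), perl = TRUE)
    labels
}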

This step satisfies Step 3.

Features

Next, the set of variables of interest is computed by filterFeatures. This function finds the features in the original data set that are "mean" or "std" statistics of the raw measurements.

The original feature names are in the form:

"([tf])(.*)-(mean|std)\\(\\)(-[XYZ])?"

These are converted to:

MeasurementName[.(X|Y|Z)]?.[Time|Freq].[Mean|StdDev]

For example,

fBodyGyro-std()-X 

becomes

BodyGyro.X.Freq.StdDev

Also, a few raw feature names contain "BodyBody", which appears to be a mistake in the original data; the duplicated "Body" is collapsed to a single "Body".
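
A sketch of the kind of transformation normalizeName performs (illustrative only; the actual normalizeName.R may differ):

normalizeName <- function(raw) {
    # Split a raw name such as "fBodyGyro-std()-X" into its parts.
    m <- regmatches(raw, regexec("([tf])(.*)-(mean|std)\\(\\)(-([XYZ]))?", raw))[[1]]
    domain <- if (m[2] == "t") "Time" else "Freq"
    stat   <- if (m[4] == "mean") "Mean" else "StdDev"
    name   <- sub("BodyBody", "Body", m[3])  # repair the duplicated "Body"
    axis   <- m[6]                           # "" when there is no -X/-Y/-Z suffix
    paste(c(name, if (nzchar(axis)) axis, domain, stat), collapse = ".")
}

normalizeName("fBodyGyro-std()-X")  # "BodyGyro.X.Freq.StdDev"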

filterFeatures returns the list of normalized variable names along with a character vector that is used when loading the actual data to specify the colClasses. Filtering the data on load significantly speeds up the load step and reduces memory use.
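
The colClasses filtering works because read.table skips any column whose class is given as "NULL". A condensed illustration of the idea (not the script's actual code):

# Flag the mean/std features among all 561 raw features, then mark
# everything else "NULL" so read.table never parses or stores it.
wanted  <- grepl("-(mean|std)\\(\\)", readLines("features.txt"))
classes <- ifelse(wanted, "numeric", "NULL")
X <- read.table("test/X_test.txt", colClasses = classes)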

This step satisfies Step 4.

Building the Data Frames

After the activity labels and features are loaded, the script uses the nested function loadData to load the training and test data into two dataframes, train and test. This is done by creating three dataframes:

  • The Subject data, read from the appropriate "subject" file (e.g. test/subject_test.txt)
  • The Activity data, read from the corresponding "y" file (e.g. test/y_test.txt)
  • The measurement data, read from the corresponding "X" file (e.g. test/X_test.txt)

Care is taken to turn the Subject and Activity data into factor data.

We use the column filter supplied by filterFeatures to load only the "mean" and "std" data, as per Step 2, and we take care to rename the columns with the tidy variable names.

Finally, these three dataframes are combined column-wise with cbind, and the result is returned.
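
A sketch of what loadData might look like (argument names and the structure of the filterFeatures result are assumptions; the actual script may differ):

loadData <- function(set, features, activityLabels) {
    # e.g. set = "test" reads test/subject_test.txt, test/y_test.txt, test/X_test.txt
    subject  <- read.table(file.path(set, paste0("subject_", set, ".txt")),
                           col.names = "Subject", colClasses = "factor")
    activity <- read.table(file.path(set, paste0("y_", set, ".txt")),
                           col.names = "Activity")
    # Map numeric activity codes to descriptive labels. activityLabels is
    # assumed to be a character vector of tidy labels ordered by activity
    # code (e.g. the label column returned by loadActivityLabels).
    activity$Activity <- factor(activity$Activity,
                                levels = seq_along(activityLabels),
                                labels = activityLabels)
    # features is assumed to carry $classes (the colClasses filter) and
    # $names (the tidy names of the columns that survive the filter).
    measures <- read.table(file.path(set, paste0("X_", set, ".txt")),
                           colClasses = features$classes)
    names(measures) <- features$names
    cbind(subject, activity, measures)
}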

Merging and Averaging the Data

Finally, as per Step 1, we merge the test and train datasets. The resulting merged dataframe is "piped" to the grouping and averaging steps required by Step 5. This is quite succinct using dplyr:

tidy.data <-
    rbind(train, test) %>%
    group_by(Subject, Activity) %>%
    summarise_each(funs(mean))

rbind does the row-merge of the train and test datasets, group_by performs the required grouping, and summarise_each applies the specified function(s) to each variable in the resulting set.
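
One caveat: summarise_each and funs have since been deprecated in dplyr. On a current version of dplyr, the equivalent spelling would be roughly:

tidy.data <-
    rbind(train, test) %>%
    group_by(Subject, Activity) %>%
    summarise(across(everything(), mean))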

Writing the Data

Finally, we write the data to a space-delimited text file named tidy_data.txt:

write.table(tidy.data, row.names=FALSE, file="tidy_data.txt")

Dependencies

The script requires the dplyr package (for group_by, summarise_each, and the %>% pipe). Install it with install.packages("dplyr") if necessary.

References

  • Hadley Wickham, "Tidy Data", Journal of Statistical Software, 59(10), 2014.
  • Human Activity Recognition Using Smartphones Data Set, UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
