This repository contains code for cleaning a dataset as required for the course project for "Getting and Cleaning Data".
The user should download the raw data and use the run_analysis.R script in this repository to clean the data and produce the tidy dataset. The script can be run as
run_analysis()
This code will clean the raw dataset and write a single text file containing the tidy data called "tidy_data.txt", which is also included in this repository. This file contains 66 different variables from the raw dataset which have been averaged over subject and activity.
There is a code book in this repository called "CodeBook.md" which contains a description of the raw and tidy data, the cleaning procedure, and descriptions of the variables.
-
Load activity labels and description from "activity_labels.txt" in the raw dataset. This matches an ID number to an activity description.
-
Load variable names from "features.txt" in the raw dataset.
-
Load data files for the training and test datasets ("train/X_train.txt" and "test/X_test.txt") and assign to data tables.
-
Load the activity IDs for each observation in the training and test datasets ("train/y_train.txt" and "test/y_test.txt").
-
Load the subject IDs for each observation in the training and test datasets ("train/subject_train.txt" and "test/subject_test.txt").
-
Assign column names to each data table using the variable names from step 2.
-
Pick out only variables that are mean or standard deviation measurements.
-
Add the activity IDs from step 4 to a new variable in each data table.
-
Add the subject IDs from step 5 to a new variable in each data table.
-
Bind the training and test data tables into one data table.
-
Use a left_join to add activity descriptions to the data table based on the activity IDs.
-
Remove activity IDs since they are not useful.
-
Group observations by subject ID and activity.
-
Calculate the mean of each measurement in each group.
-
Clean up variable names using the following format:
- Replace "t" and "f" prefixes with "time" and "freq", respectively.
- Replace erroneous occurrences of "BodyBody" with "Body".
- Replace "-" by "_".
- Replace "Acc" by "Accel".
- Remove occurrences of "()".
These steps were taken to improve variable readability and remove problematic variable name formatting.
-
Save the new tidy data in a space-separated text file.