Response to the Coursera course "Getting and Cleaning Data" project
run_analysis.R cleans up and prepares for further analysis the Human Activity Recognition Using Smartphones Data Set, originally collected from the accelerometers of the Samsung Galaxy S smartphone on 30 volunteers within an age bracket of 19-48 years. The original data set is described at http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones and is provided for students via https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip
- download the script into your working directory
- download the data from the link above into a subdirectory 'data' under your working directory
- open the script in RStudio
- make sure the working directory is set correctly at the top of the script
- run the script
The script writes two files to the working directory:
- tidy_data.txt, the cleaned data before any aggregation is done.
- aggrigated_by_mean_tidy_data.txt, the final cleaned and aggregated data.
##The run_analysis.R script does the following:
- Merges the training and the test sets to create one data set.
- Extracts only the measurements on the mean and standard deviation for each measurement.
- Uses descriptive activity names to name the activities in the data set
- Appropriately labels the data set with descriptive activity names.
- Creates a second, independent tidy data set with the average of each variable for each activity and each subject.
##The detailed steps followed to meet the above goals are given below:
###Set your working directory
I used setwd("D:/Trainings/R/Project")
###Reading data from files
- read both training and test data
- read both training and test activities data
- read subjects of both training and test
- read Features list
- read Activities names from file
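The reading steps above can be sketched with read.table(). This is only an illustration: a temporary file with made-up content stands in for the activity names file, and the names `tmp` and `activity_names` and the column names are my own assumptions, not taken from the script.

```r
# Illustrative sketch: read a whitespace-delimited text file into a data frame.
# A temporary file with made-up content stands in for a real dataset file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("1 WALKING", "2 WALKING_UPSTAIRS", "3 SITTING"), tmp)

# col.names gives the columns readable names instead of V1, V2
activity_names <- read.table(tmp,
                             col.names = c("id", "name"),
                             stringsAsFactors = FALSE)
unlink(tmp)  # remove the temporary file again

print(activity_names)
```

The other files (data, activities, subjects, features list) can be read the same way, one read.table() call per file.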
###Merges the training and the test sets to create one data set.
- Merging the training and test data
- Merging the training and test activities data
- Merging the training and test subjects
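A minimal sketch of the merging step, assuming the training and test pieces have the same columns so they can be stacked with rbind(). The small data frames and their names are hypothetical stand-ins for the real data.

```r
# Hypothetical stand-ins for the training and test pieces
train_data     <- data.frame(V1 = c(0.1, 0.2), V2 = c(1.1, 1.2))
test_data      <- data.frame(V1 = 0.3, V2 = 1.3)
train_subjects <- data.frame(V1 = c(1L, 1L))
test_subjects  <- data.frame(V1 = 2L)

# Stack training rows first, then test rows.
# The same order must be used for data, activities and subjects,
# otherwise the rows no longer line up when they are combined later.
all_data     <- rbind(train_data, test_data)
all_subjects <- rbind(train_subjects, test_subjects)

print(dim(all_data))
```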
###Extracts only the measurements on the mean and standard deviation for each measurement.
- Identify the columns that represent means or standard deviations by grepping the Features list for names that contain mean() or std()
- Keep only the portion of the dataset that corresponds to the mean and standard deviation columns identified above
- Format the column headers by removing non-alphanumeric characters such as "(", ")" and "-", and camel-casing multi-word headers
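The selection and header clean-up could look roughly like the sketch below. The feature names are illustrative samples, and the exact camel-casing rule (turning mean() into Mean and std() into Std) is my assumption about one reasonable way to do it.

```r
# Illustrative feature names and a matching 4-column data frame
features <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
              "tBodyAcc-mad()-X", "fBodyAcc-meanFreq()-X")
all_data <- as.data.frame(matrix(1:8, nrow = 2))

# Positions of names containing the literal text "mean()" or "std()".
# The parentheses are escaped so that "meanFreq()" is not matched.
keep <- grep("mean\\(\\)|std\\(\\)", features)
selected <- all_data[, keep, drop = FALSE]

# Camel-case the function part, then drop every non-alphanumeric character
clean <- features[keep]
clean <- gsub("mean\\(\\)", "Mean", clean)
clean <- gsub("std\\(\\)", "Std", clean)
clean <- gsub("[^[:alnum:]]", "", clean)
names(selected) <- clean

print(names(selected))
```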
###Uses descriptive activity names to name the activities in the data set
- review the list of activities read from file to check for inconsistent or hard-to-read names
- remove all non-alphanumeric characters such as underscores and convert everything to lower case
- change multi-word names to camel case
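One way to do this clean-up is sketched here. The helper `to_camel` is hypothetical (it is not a function from the script), and the sample names are only illustrative.

```r
# Hypothetical helper: lower-case a name and camel-case the part after "_"
to_camel <- function(x) {
  x <- tolower(x)
  # Upper-case the letter that follows an underscore (needs perl = TRUE for \\U)
  x <- gsub("_([a-z])", "\\U\\1", x, perl = TRUE)
  # Drop anything that is still not a letter or digit
  gsub("[^[:alnum:]]", "", x)
}

raw_names   <- c("WALKING", "WALKING_UPSTAIRS", "LAYING")
clean_names <- to_camel(raw_names)
print(clean_names)
```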
###Appropriately labels the data set with descriptive activity names.
- apply the formatted activity names above to the merged activity data labels
- give proper column names to the activities data and the subject data
- now we are ready to combine the pieces into a single tidy dataset
- write the output dataset to a file, in my case a file called "tidy_data.txt"
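The labelling and combining steps can be sketched as follows. All object names and values are hypothetical stand-ins, and the file is written to a temporary directory here rather than the working directory.

```r
# Hypothetical stand-ins for the pieces prepared in the earlier steps
activity_names <- c("walking", "walkingUpstairs", "sitting")
all_activities <- data.frame(V1 = c(3L, 1L, 2L))   # numeric activity codes
all_subjects   <- data.frame(V1 = c(1L, 1L, 2L))
selected       <- data.frame(tBodyAccMeanX = c(0.1, 0.2, 0.3))

# Replace each numeric code by the name at that position
all_activities$V1 <- activity_names[all_activities$V1]

# Give the single-column data frames descriptive headers
names(all_activities) <- "activity"
names(all_subjects)   <- "subject"

# Combine the pieces column-wise into one tidy data frame and write it out
tidy_data <- cbind(all_subjects, all_activities, selected)
out_file  <- file.path(tempdir(), "tidy_data.txt")
write.table(tidy_data, out_file, row.names = FALSE)

print(tidy_data)
```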
###Creates a second, independent tidy data set with the average of each variable for each activity and each subject.
NOTE: I have used the reshape2 library to reshape the data and produce the aggregated data frame.
- install the reshape2 library if it is not available, then load it
- melt the tidy data frame produced above with ids "subject" and "activity" to prepare the dataset for aggregation
- dcast the melted data frame with the "mean" function to produce the required aggregated data
- write the tidy aggregated data to a file, in my case a file named "aggrigated_by_mean_tidy_data.txt"
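The melt and dcast steps can be sketched on a tiny made-up data frame. The values and object names are hypothetical. So that the sketch also runs where reshape2 is not installed, it falls back to base R's aggregate(), which is a swapped-in alternative and not what the script uses.

```r
# Tiny made-up tidy data frame: one measurement column, two id columns
tidy_data <- data.frame(
  subject       = c(1L, 1L, 1L, 2L),
  activity      = c("walking", "walking", "sitting", "walking"),
  tBodyAccMeanX = c(0.2, 0.4, 1.0, 0.6),
  stringsAsFactors = FALSE
)

if (requireNamespace("reshape2", quietly = TRUE)) {
  # Long format: one row per (subject, activity, variable, value)
  melted <- reshape2::melt(tidy_data, id.vars = c("subject", "activity"))
  # Back to wide format, averaging the values within each subject/activity pair
  aggregated <- reshape2::dcast(melted, subject + activity ~ variable,
                                fun.aggregate = mean)
} else {
  # Fallback (assumption: base R only), same grouping and same mean
  aggregated <- aggregate(. ~ subject + activity, data = tidy_data, FUN = mean)
}

# Sort so the row order is the same whichever branch ran
aggregated <- aggregated[order(aggregated$subject, aggregated$activity), ]
print(aggregated)
```

The result has one row per subject and activity pair, with each measurement column holding the mean for that pair.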