This repository is created as a part of the course project for Coursera Getting and Cleaning Data course.
It contains automated steps for creating a tidy data set from a raw data set.
The raw data is not reproduced here. It is available from https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip. The zipfile contains a detailed description of the raw data.
A general description of the raw data can be found on http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones.
In short, the raw data is based on samplings from the accelerometer and gyroscope in a smartphone during physical acitivties performed by test subjects. Each subject has produced more than one data series per activity. The variables in the raw data set were each based on one such data series. The variables were, among others, mean and standard deviation of the data series or their Fourier transforms.
The tidy data set is based on the mean and standard deviation variables in the raw data set. Each variable in the tidy set is calculated as the mean of the raw data values with respect to subject and activity, i.e. generating one value for each combination of subject and activity.
Example: If subject 1 produced 17 data series for the activity "walking", then there are 17 values on e.g. the standard deviation of the untransformed (i.e. time domain) accelerometer readings in the X direction for gravity. The tidy data set only contains one value for the corresponding variable, calculated as the arithmetic mean of the 17 values.
README.md
This file.CodeBook.md
Describes the variables in thetidy_set.txt
file.run_analysis.R
Script that can be used to recreatetidy_set.txt
according to below.tidy_set.txt
Tidy data set, bases on the raw data set above. It is described in CodeBook.md.
The script run_analysis.R
, when sourced, creates two symbols in the global R scope, and one text file in the current working directory.
It creates the following symbols in the global environment:
- The function
run_analysis
. - The variable
run_analysis_result
, which contains the result fromrun_analysis()
. The variable contains a list with the following entries:run_analysis_result$merged_set
: Intermediate result, containing mean and standard deviation variables for all training and test data in the raw data set. These are the data that are used to calculate the mean values in the tidy set. There is thus more than one row for each combination ofsubject
andactivity
here.run_analysis_result$tidy_set
: The tidy set, which is also available intidy_set.txt
.
The script automatically calls the run_analysis
function. It also creates the file tidy_set.txt
in the current working directory.
General instructions:
- Download and unpack the raw data set from https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip.
- Clone this repository, using the command
git clone https://github.com/andreaskalin/GettingAndCleaningDataCourseProject.git
- Start
R
. - Change directory to the unzipped raw data directory, by typing something similar to
setwd("/path/to/UCI HAR Dataset/")
from theR
prompt. (The directoryUCI HAR Dataset
is the unzipped directory.) - Source
run_analysis.R
by typing something similar tosource "/path/to/GettingAndCleaningDataCourseProject/run_analysis.R
from theR
prompt. (The directoryGettingAndCleaningDataCourseProject
is where you cloned the repository to.) - The script automatically generates the
tidy_data.txt
file according to the previous section.
Detailed commands, to use in bash
(in Linux, on Mac, or in Windows/Cygin, provided that git, wget, and zip are installed):
bash$ wget https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip -O raw.zip
bash$ unzip raw.zip
bash$ git clone https://github.com/andreaskalin/GettingAndCleaningDataCourseProject.git
bash$ cd "UCI HAR Dataset"
bash$ ln -s ../GettingAndCleaningDataCourseProject/run_analysis.R # To avoid specifying the path to run_analysis.R
bash$ R
> source("run_analysis.R")