---
title: "Parkinson's telemonitoring"
author: "Rushan Shakya, Tahrima Mustafa, Cam Moy, Praveen Hariharasubramanian"
date: "April 21, 2018"
output:
word_document: default
html_document:
df_print: paged
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction:
### Description and literature:
Parkinson's disease is a neurodegenerative disorder of the central nervous system that causes partial or full loss of motor reflexes, speech, behavior, mental processing, and other vital functions [1]. It is generally observed in elderly people and causes disorders in speech and motor abilities (writing, balance, etc.) in 90% of patients [2]. After Alzheimer's, PD is the second most common neurological disorder of old age, and it is estimated that nearly 10 million people around the world, and approximately 100,000 people in Turkey, suffer from the disease [3], [4]. PD is generally seen in roughly one out of every hundred people aged over 65, and there is currently no known cure [5], [6]. Although many drug therapies exist to reduce the difficulties caused by the disorder, PD is usually diagnosed and treated using invasive methods [7], which complicates diagnosis and treatment for patients suffering from the disease. Our main motivation for working with this dataset is to find, by means of multivariate analysis, the variables that are most significant in identifying patients suffering from Parkinson's disease.
In this study, we analyze data from patients who have been diagnosed with the disease. Speech data from subjects is expected to help the development of noninvasive diagnostics, and there are important examples of such Alzheimer's and PD studies around the world [8]. Studies of PD focus on symptoms such as slowness of movement, poor balance, trembling, or stiffness of some body parts, but especially on voice problems. The main reason for the popularity of PD diagnosis from speech impairments is that tele-diagnosis and tele-monitoring systems based on speech signals are low in cost and easy to self-administer [6], [8]. Such systems lower the inconvenience and cost of physical visits by PD patients to the clinic, enable early diagnosis of the disease, and lessen the workload of medical personnel [7], [8]. People with Parkinsonism (PWP) suffer from speech impairments such as dysphonia (defective use of the voice), hypophonia (reduced volume), monotone (reduced pitch range), and dysarthria (difficulty with articulation of sounds or syllables). Even though many studies aim to diagnose and monitor PD using these impairments, their origins lie in the diagnosis of basic voice disorders [8]. Therefore, our analysis in this project is based on the voice parameters of the affected subjects. The following section gives a short description of how the dataset was formed and how we plan to approach the problem.
### Data:
The dataset was created by Athanasios Tsanas and Max Little of the University of Oxford, in collaboration with 10 medical centers in the US and Intel Corporation, who developed the tele-monitoring device to record the speech signals. The original study [9] used a range of linear and nonlinear regression methods to predict the clinician's Parkinson's disease symptom score on the UPDRS scale.
The dataset is composed of a range of biomedical voice measurements from 42 people with early-stage Parkinson's disease recruited to a six-month trial of a tele-monitoring device for remote symptom progression monitoring. The recordings were automatically captured in the patients' homes.
Columns in the dataset contain subject number, subject age, subject gender, time interval from baseline recruitment date, motor UPDRS, total UPDRS, and 16 biomedical voice measures. Each row corresponds to one of 5,875 voice recordings from these individuals.
The main aim of the data is to predict the motor and total UPDRS scores ('motor_UPDRS' and 'total_UPDRS') from the 16 voice measures.
The data is in ASCII CSV format. Each row of the CSV file contains an instance corresponding to one voice recording. There are around 200 recordings per patient, and the subject number of the patient is identified in the first column [10], [11].
### Attribute Information:
Subject: Integer that uniquely identifies each subject
Age: Subject age
Sex: Subject gender '0' - male, '1' - female
Test_time: Time since recruitment into the trial. The integer part is the number of days since recruitment
Motor_UPDRS: Clinician's motor UPDRS score, linearly interpolated
Total_UPDRS: Clinician's total UPDRS score, linearly interpolated
Jitter(%), Jitter(Abs), Jitter:RAP, Jitter:PPQ5, Jitter:DDP: Several measures of variation in fundamental frequency (frequency parameters)
Shimmer, Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, Shimmer:APQ11, Shimmer:DDA: Several measures of variation in amplitude (amplitude parameters)
NHR, HNR: Two measures of ratio of noise to tonal components in the voice
RPDE: A nonlinear dynamical complexity measure
DFA: Signal fractal scaling exponent
PPE: A nonlinear measure of fundamental frequency variation
### Checking Multivariate normality
```{r}
#Read the data file
parkinsons <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/telemonitoring/parkinsons_updrs.data", header = TRUE)
#parkinsons <- read.csv("C:\\Users\\Praveen\\Documents\\MSDS\\ISQS6350-multivariateanalysis\\project\\parkinsons_updrs.csv", header = TRUE, stringsAsFactors = TRUE)
str(parkinsons)
```
```{r}
library(mvnormtest)
x <- parkinsons
cm <- colMeans(x)
S <- cov(x)
# Squared Mahalanobis distance of each observation from the column means
d <- apply(x, 1, function(row) t(row - cm) %*% solve(S) %*% (row - cm))
# Chi-square Q-Q plot of the ordered distances
plot(qchisq((1:nrow(x) - 1/2) / nrow(x), df = ncol(x)), sort(d),
     xlab = expression(paste(chi[22]^2, " Quantile")),
     ylab = "Ordered distances")
abline(a = 0, b = 1)
```
From the above figure, the ordered distances deviate from the reference line, so our multivariate data is not perfectly normally distributed, and there may be some outliers in the data set.
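To put a rough number on this, we can count how many observations exceed a conventional chi-square cutoff using the Mahalanobis distances computed above (the 97.5% quantile here is just an illustrative threshold):
```{r}
# Count of observations whose squared Mahalanobis distance (d, computed above)
# exceeds the 97.5% quantile of the chi-square distribution with 22 degrees of
# freedom -- a rough indication of how many potential multivariate outliers exist.
sum(d > qchisq(0.975, df = ncol(x)))
```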
# Data Cleaning and Outlier Removal:
Our first step is to go through the dataset, identify any missing values or outliers, and take the necessary measures. This step is essential to prepare the data for fruitful analysis.
#### Check null values
```{r}
missing <- apply(parkinsons, 2, function(x)
round(100 * (length(which(is.na(x))))/length(x) , digits = 1))
as.data.frame(missing)
```
There are no missing values in our dataset.
#### Check correlations between the variables
```{r}
library(corrplot)
corrplot(cor(parkinsons), type="full", method ="color", title = "Parkinson correlation plot", mar=c(0,0,1,0), tl.cex= 0.8, outline= T, tl.col="indianred4")
```
We can see that the jitter variables are highly correlated with the shimmer variables; a quick numeric check follows.
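As a check, we can look at the jitter-shimmer block of the correlation matrix directly (column positions taken from the str() output above: columns 7-11 are the jitter measures, 12-17 the shimmer measures):
```{r}
# Correlations between the jitter measures (columns 7-11) and the
# shimmer measures (columns 12-17).
round(cor(parkinsons[, 7:11], parkinsons[, 12:17]), 2)
```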
#### Outlier detection
```{r}
summary(parkinsons[,-3])
```
The total_UPDRS (Unified Parkinson's Disease Rating Scale) score is our main variable of interest, since it reflects the clinical impression of Parkinson's disease (PD) severity. We therefore plot total_UPDRS against the other variables in our data set to look for outliers.
```{r}
#Scattered plot to look into data distribution
plot(jitter(total_UPDRS)~., parkinsons)
```
In the scatterplot of total_UPDRS against Jitter, we can see some outlying observations. Similarly, the plots of total_UPDRS against Shimmer, NHR, RPDE, DFA, and PPE show some outliers.
We now look at bivariate boxplots to identify outlying observations in the data.
```{r}
library(MVA)
#boxplots
bvbox(parkinsons[,6:7], xlab = "total_UPDRS", ylab = "Jitter")
bvbox(parkinsons[,c(6,12)], xlab = "total_UPDRS", ylab = "Shimmer")
bvbox(parkinsons[,c(6,18)], xlab = "total_UPDRS", ylab = "NHR")
bvbox(parkinsons[,c(6,20)], xlab = "total_UPDRS", ylab = "RPDE")
bvbox(parkinsons[,c(6,21)], xlab = "total_UPDRS", ylab = "DFA")
bvbox(parkinsons[,c(6,22)], xlab = "total_UPDRS", ylab = "PPE")
```
The bivariate boxplots flag a large number of observations as outliers. We therefore cross-check with the convex hull method, since we do not want to change the distribution of the data by removing too many points.
```{r}
#Convex hull method
hull1 <- chull(parkinsons[, 6:7])
parkhull <- match(lab <- rownames(parkinsons[hull1, ]), rownames(parkinsons))
plot(parkinsons[, 6:7], xlab = "total_UPDRS", ylab = "Jitter")
polygon(parkinsons$total_UPDRS[hull1], parkinsons$Jitter...[hull1])
text(parkinsons[parkhull, 6:7], labels = lab, cex = 0.9)
```
```{r}
#Removing outlier observations according to successive convex hulls.
#Note: despite its name, `outlier` below holds the data *after* the hull points
#have been removed; `parkinsons` and `outlier` alternate at each step.
outlier <- parkinsons[-hull1,]
dim(outlier)
dim(parkinsons)
hull2 <- chull(outlier[,c(6,12)])
parkinsons <- outlier[-hull2,]
hull3 <- chull(parkinsons[,c(6,18)])
outlier <- parkinsons[-hull3,]
hull4 <- chull(outlier[,c(6,20)])
parkinsons <- outlier[-hull4,]
hull5 <- chull(parkinsons[,c(6,21)])
outlier <- parkinsons[-hull5,]
hull6 <- chull(outlier[,c(6,22)])
parkinsons <- outlier[-hull6,]
dim(parkinsons)
```
# Dimensionality Reduction:
Our next step is dimensionality reduction. The dataset has 22 variables, and some of them are highly correlated, so we expect to be able to reduce the number of dimensions for better interpretation of the data.
#### Multi-dimensional scaling
First we try multidimensional scaling (MDS), which helps us visualize the relationships in two-dimensional plots.
```{r}
#Multi dimensional scaling
parkcorr <- cor(parkinsons)
colnames(parkcorr) <- row.names(parkcorr) <- parklabs <- c(colnames(parkinsons))
rge <- sapply(parkinsons, function(x) diff(range(x)))
sparkinsons <- sweep(parkinsons, 2, rge, FUN = "/")
parkdist <- dist(sparkinsons)
parkdist_mds <- cmdscale(parkdist, k = 21, eig = TRUE)
parkdistpoints <- parkdist_mds$points
lam <- parkdist_mds$eig
criterion1 <- cumsum(abs(lam)) / sum(abs(lam))
criterion2 <- cumsum(lam^2) / sum(lam^2)
#criterion 2 suggests that the first two coordinates are enough to represent the majority of the data points, since the cumulative proportion is 0.78, close to the threshold of 0.8
#hence the MDS solution can be shown on a 2D scatterplot
x <- parkdist_mds$points[,1]
y <- parkdist_mds$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="Parkinsons MDS",pch=20,cex=0.1)
text(x, y, labels = parkinsons[,3], cex=0.8)
#the MDS plot shows that gender (the plotted label) separates the observations, with females on one side and males on the other
#this separation arises because voice pitch, frequency, and amplitude fall in different ranges for the two genders
```
```{r}
parkcorr <- cor(parkinsons)
colnames(parkcorr) <- row.names(parkcorr) <- parklabs <- c(colnames(parkinsons))
rge <- sapply(parkinsons, function(x) diff(range(x)))
sparkinsons <- sweep(parkinsons, 2, rge, FUN = "/")
parkdist <- dist(parkcorr)
parkdist_mds <- cmdscale(parkdist, k = 21, eig = TRUE)
parkdistpoints <- parkdist_mds$points
lam <- parkdist_mds$eig
criterion1 <- cumsum(abs(lam)) / sum(abs(lam))
criterion2 <- cumsum(lam^2) / sum(lam^2)
#criterion 1 and criterion 2 suggest that the first two coordinates can represent the majority of the data points, since the cumulative proportion is above the threshold of 0.8
#hence the MDS plot can be on a 2D scatterplot
x <- parkdist_mds$points[,1]
y <- parkdist_mds$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="Parkinsons MDS",pch=20,cex=0.1)
text(x, y, labels = colnames(parkcorr), cex=0.8)
```
Below are the findings from multidimensional scaling on the Parkinson's data:
The statistical technique of multidimensional scaling, through this plot, confirms that the attributes form groups with similar patterns:
1. All jitter variables follow one pattern
2. All shimmer variables follow one pattern
3. PPE and NHR form a pattern
4. total UPDRS and motor UPDRS form a pattern
5. test time and sex form a pattern
6. Age contributes as a separate attribute to the variation in the data
7. HNR contributes as a separate attribute to the variation in the data
These findings can be understood as follows:
1. The jitter variables are all frequency measures, so it is logical that they form a pattern
2. The shimmer variables are all amplitude measures, so it is logical that they form a pattern
3. Pitch period entropy (PPE) and the noise-to-harmonics ratio (NHR) have an underlying relationship, which is observable in this finding
4. The motor UPDRS score contributes to the total score, which makes sense
5. Test time and sex are two independent factors; their apparent association is coincidental and this inference can be ignored
6. Age contributes as a separate attribute to the variation in the data
7. HNR contributes as a separate attribute to the variation in the data
# Random forest
To confirm that the observations from multidimensional scaling can help identify patients with the disease, select suitable measures for estimating its progression, and interpret the similarity between certain measures, we apply another technique: the random forest algorithm. It uses the Gini index, which, like multidimensional scaling, is another way of measuring dissimilarity in the data.
Random forests help identify the measures that best differentiate patients, ranking the variables by their contribution to the variance of the dataset and by their significance in measuring progression of the disease.
```{r}
library(randomForest)
#scaling the data
rge <- sapply(parkinsons, function(x) diff(range(x)))
sparkinsons <- sweep(parkinsons, 2, rge, FUN = "/")
# Create the forest.
output.forest <- randomForest(total_UPDRS ~ age + sex + test_time + Jitter... + Jitter.Abs. + Jitter.RAP + Jitter.PPQ5 + Jitter.DDP + Shimmer + Shimmer.dB. + Shimmer.APQ3 + Shimmer.APQ5 + Shimmer.APQ11 + Shimmer.DDA + NHR + HNR + RPDE + DFA + PPE, data = sparkinsons, mtry = 6)
# View the forest results.
print(output.forest)
# Importance of each predictor.
impfactors <- importance(output.forest,type = 2)
impfactors <- data.frame(impfactors)
impfactorsranked <- impfactors[order(-impfactors$IncNodePurity),,drop=FALSE]
print(impfactorsranked)
#On applying random forest we observe that certain attributes contribute more to the splits in the data, i.e. they help categorise patients based on UPDRS scores and contribute more to measuring disease progression or severity
#motor UPDRS is ignored since it is a subset of total UPDRS
#subject id is ignored since it is just the identifier of each patient
#These attributes are
#Age - Subject age
#Sex - Subject gender '0' - male, '1' - female
#test_time - Time since recruitment into the trial. The integer part is the number of days since recruitment.
#HNR - Harmonics to noise ratio (HNR) quantifies the ratio of actual signal information to noise
#RPDE - Recurrence Period Density Entropy (RPDE) quantifies the stochastic component of the deviation of vocal fold periodicity, i.e. the periodicity/frequency of vibration of the vocal folds
#DFA - Detrended Fluctuation Analysis (DFA) quantifies the stochastic self-similarity of the noise caused by turbulent airflow
#PPE - Pitch Period Entropy (PPE). In speech disorders it is very difficult to sustain a stable pitch due to incomplete vocal fold closure; PPE quantifies this impaired control of pitch
#Jitter(Abs) - Average absolute difference between consecutive periods, divided by the average period of the fundamental frequency
```
Random forest has helped identify the top factors that influence disease progression: DFA, age, Jitter(Abs), sex, PPE, HNR, RPDE, and test_time. A variable-importance plot after the descriptions below gives a quick visual check of this ranking.
Age - Subject age
Sex - Subject gender '0' - male, '1' - female
test_time - Time since recruitment into the trial. The integer part is the number of days since recruitment.
Jitter(Abs) - Average absolute difference between consecutive periods, divided by the average period of the fundamental frequency
HNR - Harmonics to noise ratio (HNR) quantifies the ratio of actual signal information to noise
RPDE - Recurrence Period Density Entropy (RPDE) quantifies the stochastic component of the deviation of vocal fold periodicity, i.e. the periodicity/frequency of vibration of the vocal folds
DFA - Detrended Fluctuation Analysis (DFA) quantifies the stochastic self-similarity of the noise caused by turbulent airflow
PPE - Pitch Period Entropy (PPE). In speech disorders it is very difficult to sustain a stable pitch due to incomplete vocal fold closure; PPE quantifies this impaired control of pitch
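A variable-importance plot gives a quick visual check of the ranking above; a minimal sketch using the forest fitted earlier:
```{r}
# Plot the importance (IncNodePurity) of each predictor from the fitted forest.
varImpPlot(output.forest, main = "Random forest variable importance")
```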
# Exploratory Factor Analysis
Next we try exploratory factor analysis on the data to identify important factors.
```{r}
#Exploratory factor analysis
library(MVA)
options(digits = 3)
# EFA
#head(parkinsons) #2:4,8,16,18:22
parkinson.EFA <- factanal(parkinsons[, c(2:5,8,16,18:22)], 3, n.obs = nrow(parkinsons), rotation="varimax", control=list(trace=T))
parkinson.EFA
print(parkinson.EFA$loadings, cut = 0.45)
```
First, when we tried exploratory factor analysis with all the variables, the model did not run. After some research we concluded that the algorithm does not converge because of high multicollinearity between some variables (specifically the jitter and shimmer measures). So we decided to drop variables that are highly correlated with each other. From the correlation plot we can see that the jitter and shimmer variables have high correlations (0.9+) among themselves, so we tried building the model with one jitter and one shimmer variable. From the random forest analysis, Jitter(Abs) and Shimmer(APQ11) have the highest importance in their respective frequency and amplitude groups, so we took these two variables into the exploratory factor analysis. The two UPDRS variables also have a 0.9+ correlation between them, so we kept only one from that group.
With the variables mentioned above we explored different numbers of factors. With 2 or 3 factors, only 40-45% of the variance is explained (see the check below). Also, age, sex, and test_time have very small factor coefficients and large (0.8+) uniquenesses. With 5 or 6 factors the uniquenesses of these variables decrease, but they remain greater than 0.7.
From the 3-factor analysis we can see that jitter, shimmer, NHR, HNR, RPDE, and PPE have higher coefficients on Factor 1. However, the random forest analysis showed that age and DFA are the most significant variables, and they have very low coefficients in this analysis.
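To check the proportion of variance explained that we refer to above, we can compute it directly from the loadings of the fitted factanal object (a minimal sketch using the three-factor model from the chunk above):
```{r}
# Sum of squared loadings per factor, and the cumulative proportion of the
# total variance of the 11 variables that the factors explain.
ss <- colSums(parkinson.EFA$loadings^2)
round(rbind(SS.loadings = ss,
            Cumulative.Prop = cumsum(ss) / nrow(parkinson.EFA$loadings)), 3)
```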
# Confirmatory Factor Analysis
Next we try confirmatory factor analysis to check whether the factor structure suggested by the EFA holds.
```{r}
parkinson.EFA <- factanal(parkinsons[, c(2:8,17,18:22)], 2, n.obs = nrow(parkinsons), rotation="varimax", control=list(trace=T))
parkinson.EFA
print(parkinson.EFA$loadings, cut = 0.5)
#Implementing confirmatory factor analysis from the factors observed in the exploratory factor analysis
library(sem)
library(semPlot)
parkinsonscov<- cor(parkinsons[,-c(1)])
parkinson_model <- specifyModel(file = "C:\\ttu\\spring18\\MVA\\project\\ParkinsonsDiseaseDataAnalysis-master\\parkinson_sem_model_efa2.txt")
opt <- options(fit.indices = c("GFI", "AGFI", "SRMR"))
parkinson_sem <- sem(parkinson_model, parkinsonscov, nrow(parkinsons))
summary(parkinson_sem)
# restricted Cor matrix
rescor <- parkinson_sem$C
# non-restricted Cor matrix
nonrescor <- parkinson_sem$S
#differences of the elements of the observed covariance matrix and the covariance matrix of the fitted model
covresiduals <- round(parkinson_sem$S - parkinson_sem$C, 3)
semPaths(parkinson_sem, "est",edge.label.cex=1.5)
```
```{r}
library(sem)
library(semPlot)
parkinsonscov<- cor(parkinsons[,-c(1)])
parkinson_model <- specifyModel(file = "C:\\ttu\\spring18\\MVA\\project\\ParkinsonsDiseaseDataAnalysis-master\\parkinson_sem_model_efa1.txt")
opt <- options(fit.indices = c("GFI", "AGFI", "SRMR"))
parkinson_sem <- sem(parkinson_model, parkinsonscov, nrow(parkinsons))
summary(parkinson_sem)
# restricted Cor matrix
rescor <- parkinson_sem$C
# non-restricted Cor matrix
nonrescor <- parkinson_sem$S
#differences of the elements of the observed covariance matrix and the covariance matrix of the fitted model
covresiduals <- round(parkinson_sem$S - parkinson_sem$C, 3)
semPaths(parkinson_sem, "est",edge.label.cex=1.5)
```
Since age and sex have very high uniqueness, we did not include them in the model. NHR and HNR also have a high correlation (0.9), so we included only one of them. After removing the highly correlated and highly unique variables we are left with only 6 variables. We then ran exploratory factor analysis on them with 2 factors and found that shimmer, HNR, and RPDE have higher coefficients on factor 1, so we named factor 1 "amplitude". The jitter, DFA, and PPE variables have higher coefficients on factor 2, so we named factor 2 "frequency". Next we built the confirmatory factor analysis model from these observations; a sketch of what the model specification might look like is shown below.
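The model specification files referenced in the chunks above (parkinson_sem_model_efa1.txt / parkinson_sem_model_efa2.txt) are not reproduced here. A minimal, hypothetical sketch of what such a specification might contain, assuming the two-factor structure just described (the variable choices and parameter names are illustrative, not the actual file contents):
```{r, eval=FALSE}
library(sem)
# Hypothetical two-factor CFA specification: an "Amplitude" factor loading on
# the shimmer/HNR/RPDE measures and a "Frequency" factor loading on the
# jitter/DFA/PPE measures. Factor variances are fixed at 1; error variances
# are added automatically by specifyModel().
parkinson_model_sketch <- specifyModel(text = "
Amplitude -> Shimmer.APQ11, lam1, NA
Amplitude -> HNR,           lam2, NA
Amplitude -> RPDE,          lam3, NA
Frequency -> Jitter.Abs.,   lam4, NA
Frequency -> DFA,           lam5, NA
Frequency -> PPE,           lam6, NA
Amplitude <-> Amplitude,    NA,   1
Frequency <-> Frequency,    NA,   1
Amplitude <-> Frequency,    rho,  NA
")
```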
From the summary we can see that this model has GFI and AGFI indices greater than 0.95, which indicates a good model. The SRMR is 0.032, which is below 0.05 and also indicates a good model.
However, we know that age is among the most significant variables for identifying whether a patient has Parkinson's disease; it is an important factor in the UPDRS score that helps identify Parkinson's in a patient. Dimensionality reduction with factor analysis cannot acknowledge age because of the high uniqueness of the variable. High multicollinearity is another likely reason why the factor analysis algorithm does not converge with all the variables of the dataset. We therefore conclude that exploratory and confirmatory factor analysis are not suitable dimensionality reduction techniques for this dataset.
# Principal Component Analysis:
Next we try principal component analysis (PCA), which is a possible solution to the multicollinearity problem.
```{r}
#Principal Components Analysis
library(stats)
#outliers have already been removed, so PCA does not require any changes in the data
#standard deviations of the variables
p_sd <- sapply(parkinsons, sd)
# creating covariance matrix for entire dataset
p_cov <- cov(parkinsons, use = "everything")
#creating correlation matrix for the entire dataset
p_corr <-cor(parkinsons, use = "everything")
#Principal Components Analysis for correlation matrix
# we have chosen to utilize the correlation matrix for the PCA since the variables have different scales and variances
parkinsons_pca_corr <- princomp(parkinsons, cor = T, scores = TRUE)
summary(parkinsons_pca_corr, loadings = T)
```
Below are our findings from the principal component analysis:
##### Covariance Matrix Interpretation:
All of the covariances have small magnitudes and show both positive and negative linear relationships.
##### Correlation Matrix Interpretation:
From the correlation matrix we have the following observations:
1. HNR, the ratio of noise to tonal components in the voice, has an inverse relationship with all the other variables (see the check below).
2. Variables containing the word "Shimmer" have high positive correlations with each other.
3. Variables containing the word "Jitter" have high positive correlations with each other.
4. There are high correlations between the "Shimmer" variables and the "Jitter" variables.
5. The PPE variable represents a nonlinear measure of fundamental frequency variation. It has strong positive correlations with the "Jitter" and "Shimmer" variables and with RPDE, a nonlinear dynamical complexity measure.
6. total_UPDRS and motor_UPDRS show a strong positive correlation; total_UPDRS is the clinician's total UPDRS score and motor_UPDRS the clinician's motor UPDRS score, both linearly interpolated.
##### PCA Interpretation:
1. The number of components was determined according to the following rule: the total amount of variation that the components represent should be between 70% and 90%.
2. According to this rule, components 1 through 4 were selected. These components account for about 76% of the total variation (see the check below).
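The cumulative proportions behind this choice can be read directly from the PCA object (a quick check of the 70-90% rule):
```{r}
# Cumulative proportion of total variance explained by the principal components.
pc_var <- parkinsons_pca_corr$sdev^2
round(cumsum(pc_var) / sum(pc_var), 3)
```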
# Cluster Analysis
```{r}
pca.scores <- parkinsons_pca_corr$scores[,1:4] #since we used princomp(parkinsons, cor = T, scores = TRUE), the scores are already scaled
park.dist <- dist(pca.scores)
```
### Hierarchical clustering
```{r}
hc1 <- hclust(park.dist, "complete")
hc2 <- hclust(park.dist, "average")
hc3 <- hclust(park.dist, "single")
```
#### Dendrograms
```{r}
plot(hc1, cex = 0.5, main= "complete")
plot(hc2, cex = 0.5, main= "average")
plot(hc3, cex = 0.5, main= "single")
```
The dendrograms are unreadable with this many observations, so we perform an elbow test to determine the number of clusters.
#### Elbow test to determine the number of clusters
```{r}
plot(rev(hc1$height), xlim = c(0,20), main="Elbow test with complete method")
plot(rev(hc2$height), xlim = c(0,20), main="Elbow test with average method")
plot(rev(hc3$height), xlim = c(0,20), main="Elbow test with single method")
```
From the above figures, it looks like there are about 6 clusters for hierarchical clustering using the complete method, about 4 using the average method, and about 4 using the single method.
```{r}
clust1 <- cutree(hc1, 6)
table(clust1)
clust2 <- cutree(hc2, 4)
table(clust2)
clust3 <- cutree(hc3, 4)
table(clust3)
```
We can show a 2-D plot of the clustering using the PC1 and PC2 scores.
```{r}
xlim <- range(pca.scores[,1])
plot(pca.scores[,1:2], xlim = xlim, ylim = xlim, col = clust1)
plot(pca.scores[,1:2], xlim = xlim, ylim = xlim, col = clust2)
plot(pca.scores[,1:2], xlim = xlim, ylim = xlim, col = clust3)
```
Hierarchical clustering does not seem to be a good method for this data.
## K-Means Clustering
Determining the number of clusters
```{r}
plot.wgss = function(mydata, maxc) {
wss = numeric(maxc)
for (i in 1:maxc) wss[i] = kmeans(mydata,centers=i, nstart = 10)$tot.withinss
plot(1:maxc, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares", main="Scree Plot") }
```
##### Elbow test
```{r}
plot.wgss(pca.scores, 20) # Elbow test
```
From the above scree plot, it looks like we can use about 8 clusters for our data.
```{r}
km <- kmeans(pca.scores, 8)
table(km$cluster)
```
We can show a 2-D plot of the clustering using the PC1 and PC2 scores.
```{r}
plot(pca.scores[,1:2], col=km$cluster)
```
## Model-based clustering
```{r}
library(mclust)
mc <- Mclust(pca.scores, 8)
summary(mc)
```
We can show a 2-D plot of the clustering using the PC1 and PC2 scores.
```{r}
plot(pca.scores[,1:2], col = mc$classification)
```
We can see that model-based clustering suits our data set much better; a rough numeric check follows below.
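One way to back this visual impression with a number is to compare average silhouette widths for the k-means and model-based partitions (a rough sketch; it assumes the cluster package is available and reuses the distance matrix computed earlier):
```{r}
library(cluster)
# Average silhouette width for each partition (closer to 1 is better).
mean(silhouette(km$cluster, park.dist)[, "sil_width"])
mean(silhouette(mc$classification, park.dist)[, "sil_width"])
```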
## Pros and Cons of the study:
1. From our study of the dataset, we conclude that age, sex, test_time, and DFA are the variables that are most difficult to capture in a factor analysis model, even though random forest analysis and the literature indicate that these variables are among the most significant in determining the UPDRS score.
2. In PCA, the age variable has a high coefficient in component 7. But if we keep only the principal components with standard deviation above 1, we retain the first 5 components and discard the 7th. Similarly, test_time and DFA have high coefficients in the 6th and 8th components.
3. The data have high correlations within the "Jitter" and "Shimmer" variable groups, which caused multicollinearity problems during the analysis.
4. Future study of the Parkinson's dataset should include a more detailed analysis of the above-mentioned high-uniqueness variables and how to handle them.
5. Our data set is not perfectly multivariate normal.
6. In the cluster analysis on the reduced dimensions, we used only the first 4 principal components, which explain about 76% of the variance; the clusters therefore cannot account for all of the variance in the data.
7. Determining the number of clusters was a challenge. Sometimes it was not clear how many clusters to use, even with the elbow test.
8. Through dimensionality reduction with MDS we observed that men and women form separate sets of points on the plot, because the frequency and amplitude of male and female voices fall in different ranges; this observation is needed to track progression of the disease by gender.
### References:
[1] J. Jankovic, "Parkinson's disease: Clinical features and diagnosis," J. Neurol. Neurosurgery Psychiatry, vol. 79, no. 4, pp. 368-376, 2007.
[2] S. B. O'Sullivan and T. J. Schmitz, "Parkinson disease," in Physical Rehabilitation, 5th ed. Philadelphia, PA, USA: F. A. Davis Company, 2007, pp. 856-894.
[3] Parkinson Derneği. (2011). [Online]. Available: http://www.parkinsondernegi.org/Icerik.aspx?Page=parkinsonnedir&ID=5
[4] L. M. de Lau and M. M. Breteler, "Epidemiology of Parkinson's disease," Lancet Neurol., vol. 5, no. 6, pp. 525-535, 2006.
[5] N. Singh, V. Pillay, and Y. E. Choonara, "Advances in the treatment of Parkinson's disease," Prog. Neurobiol., vol. 81, no. 1, pp. 29-44, 2007.
[6] M. A. Little, P. E. McSharry, E. J. Hunter, J. Spielman, and L. O. Ramig, "Suitability of dysphonia measurements for telemonitoring of Parkinson's disease," IEEE Trans. Biomed. Eng., vol. 56, no. 4, pp. 1010-1022, Apr. 2009.
[7] National Collaborating Centre for Chronic Conditions, Parkinson's Disease, London, U.K.: Royal College of Physicians, 2006.
[8] Betul Erdogdu Sakar, Muhammed Erdem Isenkul, C. Okan Sakar, et al., "Collection and Analysis of a Parkinson Speech Dataset With Multiple Types of Sound Recordings," IEEE Journal of Biomedical and Health Informatics, vol. 17, no. 4, pp. 828-834, July 2013, DOI: 10.1109/JBHI.2013.2245674.
[9] Athanasios Tsanas and Max Little, 'Accurate telemonitoring of Parkinson's disease symptom severity using nonlinear speech signal processing and statistical machine learning'
[10] Parkinsons Telemonitoring Data Set , Online link: https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/telemonitoring/parkinsons_updrs.names
[11] Athanasios Tsanas, Max A. Little, Patrick E. McSharry, Lorraine O. Ramig (2009), 'Accurate telemonitoring of Parkinson's disease progression by non-invasive speech tests', IEEE Transactions on Biomedical Engineering.
[12] Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2009), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering, 56(4):1015-1022