---
title: "Parkinson's telemonitoring"
author: "Rushan Shakya, Tahrima Mustafa, Cam Moy, Praveen Hariharasubramanian"
date: "April 21, 2018"
output:
word_document: default
html_document:
df_print: paged
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction:
### Description and literature:
Parkinson's disease is a neurodegenerative disorder of the central nervous system that causes partial or full loss of motor reflexes, speech, behavior, mental processing, and other vital functions [1]. It is generally observed in elderly people and causes disorders in speech and motor abilities (writing, balance, etc.) in 90% of patients [2]. After Alzheimer's, PD is the second most common neurological disorder of old age, and it is estimated that nearly 10 million people around the world, and approximately 100,000 people in Turkey, suffer from the disease [3], [4]. PD is generally seen in roughly one out of every hundred people aged over 65, and there is currently no known cure [5], [6]. Although many drug therapies exist to reduce the difficulties caused by the disorder, PD is usually diagnosed and treated using invasive methods [7], which complicates diagnosis and treatment for patients suffering from the disease. Our main motivation for working with this dataset is to find, by means of multivariate analysis, the variables that are most significant in identifying patients suffering from Parkinson's disease.
In this study, we analyze data from patients who have been diagnosed with the disease. Speech data from subjects is expected to help the development of noninvasive diagnostics, and there are important examples of such Alzheimer's and PD studies around the world [8]. Studies of PD focus on symptoms such as slowness of movement, poor balance, trembling, or stiffness of some body parts, but especially on voice problems. The main reason for the popularity of PD diagnosis from speech impairments is that tele-diagnosis and tele-monitoring systems based on speech signals are low in cost and easy to self-administer [6], [8]. Such systems lower the inconvenience and cost of physical visits by PD patients to the clinic, enable early diagnosis of the disease, and lessen the workload of medical personnel [7], [8]. People with Parkinsonism (PWP) suffer from speech impairments such as dysphonia (defective use of the voice), hypophonia (reduced volume), monotone (reduced pitch range), and dysarthria (difficulty with articulation of sounds or syllables). Even though many studies aim to diagnose and monitor PD using these impairments, their origins lie in the diagnosis of basic voice disorders [8]. Therefore, our analysis in this project is based on the voice parameters of the affected subjects. The following section gives a short description of how the dataset was formed and how we plan to approach the problem.
### Data:
The dataset was created by Athanasios Tsanas and Max Little of the University of Oxford, in collaboration with 10 medical centers in the US and Intel Corporation, who developed the tele-monitoring device to record the speech signals. The original study [9] used a range of linear and nonlinear regression methods to predict the clinician's Parkinson's disease symptom score on the UPDRS scale.
The dataset is composed of a range of biomedical voice measurements from 42 people with early-stage Parkinson's disease recruited to a six-month trial of a tele-monitoring device for remote symptom progression monitoring. The recordings were automatically captured in the patients' homes.
Columns in the dataset contain subject number, subject age, subject gender, time interval from baseline recruitment date, motor UPDRS, total UPDRS, and 16 biomedical voice measures. Each row corresponds to one of 5,875 voice recordings from these individuals.
The main aim of the data is to predict the motor and total UPDRS scores ('motor_UPDRS' and 'total_UPDRS') from the 16 voice measures.
The data is in ASCII CSV format. Each row of the CSV file contains an instance corresponding to one voice recording. There are around 200 recordings per patient, and the subject number of the patient is identified in the first column [10], [11].
### Attribute Information:
Subject: Integer that uniquely identifies each subject
Age: Subject age
Sex: Subject gender '0' - male, '1' - female
Test_time: Time since recruitment into the trial. The integer part is the number of days since recruitment
Motor_UPDRS: Clinician's motor UPDRS score, linearly interpolated
Total_UPDRS: Clinician's total UPDRS score, linearly interpolated
Jitter(%), Jitter(Abs), Jitter:RAP, Jitter:PPQ5, Jitter:DDP: Several measures of variation in fundamental frequency (frequency parameters)
Shimmer, Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, Shimmer:APQ11, Shimmer:DDA: Several measures of variation in amplitude (amplitude parameters)
NHR, HNR: Two measures of ratio of noise to tonal components in the voice
RPDE: A nonlinear dynamical complexity measure
DFA: Signal fractal scaling exponent
PPE: A nonlinear measure of fundamental frequency variation
### Checking Multivariate normality
```{r}
#Read the data file
parkinsons <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/telemonitoring/parkinsons_updrs.data", header = TRUE)
#parkinsons <- read.csv("C:\\Users\\Praveen\\Documents\\MSDS\\ISQS6350-multivariateanalysis\\project\\parkinsons_updrs.csv", header = TRUE, stringsAsFactors = TRUE)
str(parkinsons)
```
```{r}
library(mvnormtest)
x <- parkinsons
cm <- colMeans(x)
S <- cov(x)
# Squared Mahalanobis distance of each observation from the column means
d <- apply(x, 1, function(row) t(row - cm) %*% solve(S) %*% (row - cm))
# Chi-square Q-Q plot of the ordered distances
plot(qchisq((1:nrow(x) - 1/2) / nrow(x), df = ncol(x)), sort(d),
     xlab = expression(paste(chi[22]^2, " Quantile")),
     ylab = "Ordered distances")
abline(a = 0, b = 1)
```
From the above figure, the ordered distances deviate from the reference line, so our multivariate data is not perfectly normally distributed, and there may be some outliers in the data set.
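To put a rough number on this, we can count how many observations exceed a conventional chi-square cutoff using the Mahalanobis distances computed above (the 97.5% quantile here is just an illustrative threshold):
```{r}
# Count of observations whose squared Mahalanobis distance (d, computed above)
# exceeds the 97.5% quantile of the chi-square distribution with 22 degrees of
# freedom -- a rough indication of how many potential multivariate outliers exist.
sum(d > qchisq(0.975, df = ncol(x)))
```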
# Data Cleaning and Outlier Removal:
Our first step is to go through the dataset, identify any missing values or outliers, and take the necessary measures. This step is essential to prepare the data for fruitful analysis.
#### Check null values
```{r}
missing <- apply(parkinsons, 2, function(x)
round(100 * (length(which(is.na(x))))/length(x) , digits = 1))
as.data.frame(missing)
```
There are no missing values in our dataset.
#### Check correlations between the variables
```{r}
library(corrplot)
corrplot(cor(parkinsons), type="full", method ="color", title = "Parkinson correlation plot", mar=c(0,0,1,0), tl.cex= 0.8, outline= T, tl.col="indianred4")
```
We can see that the jitter variables are highly correlated with the shimmer variables; a quick numeric check follows.
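As a check, we can look at the jitter-shimmer block of the correlation matrix directly (column positions taken from the str() output above: columns 7-11 are the jitter measures, 12-17 the shimmer measures):
```{r}
# Correlations between the jitter measures (columns 7-11) and the
# shimmer measures (columns 12-17).
round(cor(parkinsons[, 7:11], parkinsons[, 12:17]), 2)
```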
#### Outlier detection
```{r}
summary(parkinsons[,-3])
```
The total_UPDRS (Unified Parkinson's Disease Rating Scale) score is our main variable of interest, since it reflects the clinical impression of Parkinson's disease (PD) severity. We therefore plot total_UPDRS against the other variables in our data set to look for outliers.
```{r}
#Scattered plot to look into data distribution
plot(jitter(total_UPDRS)~., parkinsons)
```
In the scatterplot of total_UPDRS against Jitter, we can see some outlying observations. Similarly, the plots of total_UPDRS against Shimmer, NHR, RPDE, DFA, and PPE show some outliers.
We now look at bivariate boxplots to identify outlying observations in the data.
```{r}
library(MVA)
#boxplots
bvbox(parkinsons[,6:7], xlab = "total_UPDRS", ylab = "Jitter")
bvbox(parkinsons[,c(6,12)], xlab = "total_UPDRS", ylab = "Shimmer")
bvbox(parkinsons[,c(6,18)], xlab = "total_UPDRS", ylab = "NHR")
bvbox(parkinsons[,c(6,20)], xlab = "total_UPDRS", ylab = "RPDE")
bvbox(parkinsons[,c(6,21)], xlab = "total_UPDRS", ylab = "DFA")
bvbox(parkinsons[,c(6,22)], xlab = "total_UPDRS", ylab = "PPE")
```
The bivariate boxplots flag a large number of observations as outliers. We therefore cross-check with the convex hull method, since we do not want to change the distribution of the data by removing too many points.
```{r}
#Convex hull method
hull1 <- chull(parkinsons[, 6:7])
parkhull <- match(lab <- rownames(parkinsons[hull1, ]), rownames(parkinsons))
plot(parkinsons[, 6:7], xlab = "total_UPDRS", ylab = "Jitter")
polygon(parkinsons$total_UPDRS[hull1], parkinsons$Jitter...[hull1])
text(parkinsons[parkhull, 6:7], labels = lab, cex = 0.9)
```
```{r}
#Removing outlier observations according to successive convex hulls.
#Note: despite its name, `outlier` below holds the data *after* the hull points
#have been removed; `parkinsons` and `outlier` alternate at each step.
outlier <- parkinsons[-hull1,]
dim(outlier)
dim(parkinsons)
hull2 <- chull(outlier[,c(6,12)])
parkinsons <- outlier[-hull2,]
hull3 <- chull(parkinsons[,c(6,18)])
outlier <- parkinsons[-hull3,]
hull4 <- chull(outlier[,c(6,20)])
parkinsons <- outlier[-hull4,]
hull5 <- chull(parkinsons[,c(6,21)])
outlier <- parkinsons[-hull5,]
hull6 <- chull(outlier[,c(6,22)])
parkinsons <- outlier[-hull6,]
dim(parkinsons)
```
# Dimensionality Reduction:
Our next step is dimensionality reduction. The dataset has 22 variables, and some of them are highly correlated, so we expect to be able to reduce the number of dimensions for better interpretation of the data.
#### Multi-dimensional scaling
First we try multidimensional scaling (MDS), which helps us visualize the relationships in two-dimensional plots.
```{r}
#Multi dimensional scaling
parkcorr <- cor(parkinsons)
colnames(parkcorr) <- row.names(parkcorr) <- parklabs <- c(colnames(parkinsons))
rge <- sapply(parkinsons, function(x) diff(range(x)))
sparkinsons <- sweep(parkinsons, 2, rge, FUN = "/")
parkdist <- dist(sparkinsons)
parkdist_mds <- cmdscale(parkdist, k = 21, eig = TRUE)
parkdistpoints <- parkdist_mds$points
lam <- parkdist_mds$eig
criterion1 <- cumsum(abs(lam)) / sum(abs(lam))
criterion2 <- cumsum(lam^2) / sum(lam^2)
#criterion 2 suggests that the first two coordinates are enough to represent the majority of the data points, since the cumulative proportion is 0.78, close to the threshold of 0.8
#hence the MDS solution can be shown on a 2D scatterplot
x <- parkdist_mds$points[,1]
y <- parkdist_mds$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="Parkinsons MDS",pch=20,cex=0.1)
text(x, y, labels = parkinsons[,3], cex=0.8)
#the MDS plot shows that gender (the plotted label) separates the observations, with females on one side and males on the other
#this separation arises because voice pitch, frequency, and amplitude fall in different ranges for the two genders
```
```{r}
parkcorr <- cor(parkinsons)
colnames(parkcorr) <- row.names(parkcorr) <- parklabs <- c(colnames(parkinsons))
rge <- sapply(parkinsons, function(x) diff(range(x)))
sparkinsons <- sweep(parkinsons, 2, rge, FUN = "/")
parkdist <- dist(parkcorr)
parkdist_mds <- cmdscale(parkdist, k = 21, eig = TRUE)
parkdistpoints <- parkdist_mds$points
lam <- parkdist_mds$eig
criterion1 <- cumsum(abs(lam)) / sum(abs(lam))
criterion2 <- cumsum(lam^2) / sum(lam^2)
#criterion 1 and criterion 2 suggest that the first two coordinates can represent the majority of the data points, since the cumulative proportion is above the threshold of 0.8
#hence the MDS plot can be on a 2D scatterplot
x <- parkdist_mds$points[,1]
y <- parkdist_mds$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="Parkinsons MDS",pch=20,cex=0.1)
text(x, y, labels = colnames(parkcorr), cex=0.8)
```
Below are the findings from multidimensional scaling on the Parkinson's data:
The statistical technique of multidimensional scaling, through this plot, confirms that the attributes form groups with similar patterns:
1. All jitter variables follow one pattern
2. All shimmer variables follow one pattern
3. PPE and NHR form a pattern
4. total UPDRS and motor UPDRS form a pattern
5. test time and sex form a pattern
6. Age contributes as a separate attribute to the variation in the data
7. HNR contributes as a separate attribute to the variation in the data
These findings can be understood as follows:
1. The jitter variables are all frequency measures, so it is logical that they form a pattern
2. The shimmer variables are all amplitude measures, so it is logical that they form a pattern
3. Pitch period entropy (PPE) and the noise-to-harmonics ratio (NHR) have an underlying relationship, which is observable in this finding
4. The motor UPDRS score contributes to the total score, which makes sense
5. Test time and sex are two independent factors; their apparent association is coincidental and this inference can be ignored
6. Age contributes as a separate attribute to the variation in the data
7. HNR contributes as a separate attribute to the variation in the data
# Random forest
To confirm that the observations from multidimensional scaling can help identify patients with the disease, select suitable measures for estimating its progression, and interpret the similarity between certain measures, we apply another technique: the random forest algorithm. It uses the Gini index, which, like multidimensional scaling, is another way of measuring dissimilarity in the data.
Random forests help identify the measures that best differentiate patients, ranking the variables by their contribution to the variance of the dataset and by their significance in measuring progression of the disease.
```{r}
library(randomForest)
#scaling the data
rge <- sapply(parkinsons, function(x) diff(range(x)))
sparkinsons <- sweep(parkinsons, 2, rge, FUN = "/")
# Create the forest.
output.forest <- randomForest(total_UPDRS ~ age + sex + test_time + Jitter... + Jitter.Abs. + Jitter.RAP + Jitter.PPQ5 + Jitter.DDP + Shimmer + Shimmer.dB. + Shimmer.APQ3 + Shimmer.APQ5 + Shimmer.APQ11 + Shimmer.DDA + NHR + HNR + RPDE + DFA + PPE, data = sparkinsons, mtry = 6)
# View the forest results.
print(output.forest)
# Importance of each predictor.
impfactors <- importance(output.forest,type = 2)
impfactors <- data.frame(impfactors)
impfactorsranked <- impfactors[order(-impfactors$IncNodePurity),,drop=FALSE]
print(impfactorsranked)
#On applying random forest we observe that certain attributes contribute more to the splits in the data, i.e. they help categorise patients based on UPDRS scores and contribute more to measuring disease progression or severity
#motor UPDRS is ignored since it is a subset of total UPDRS
#subject id is ignored since it is just the identifier of each patient
#These attributes are
#Age - Subject age
#Sex - Subject gender '0' - male, '1' - female
#test_time - Time since recruitment into the trial. The integer part is the number of days since recruitment.
#HNR - Harmonics to noise ratio (HNR) quantifies the ratio of actual signal information to noise
#RPDE - Recurrence Period Density Entropy (RPDE) quantifies the stochastic component of the deviation of vocal fold periodicity, i.e. the periodicity/frequency of vibration of the vocal folds
#DFA - Detrended Fluctuation Analysis (DFA) quantifies the stochastic self-similarity of the noise caused by turbulent airflow
#PPE - Pitch Period Entropy (PPE). In speech disorders it is very difficult to sustain a stable pitch due to incomplete vocal fold closure; PPE quantifies this impaired control of pitch
#Jitter(Abs) - Average absolute difference between consecutive periods, divided by the average period of the fundamental frequency
```
Random forest has helped identify the top factors that influence disease progression: DFA, age, Jitter(Abs), sex, PPE, HNR, RPDE, and test_time. A variable-importance plot after the descriptions below gives a quick visual check of this ranking.
Age - Subject age
Sex - Subject gender '0' - male, '1' - female
test_time - Time since recruitment into the trial. The integer part is the number of days since recruitment.
Jitter(Abs) - Average absolute difference between consecutive periods, divided by the average period of the fundamental frequency
HNR - Harmonics to noise ratio (HNR) quantifies the ratio of actual signal information to noise
RPDE - Recurrence Period Density Entropy (RPDE) quantifies the stochastic component of the deviation of vocal fold periodicity, i.e. the periodicity/frequency of vibration of the vocal folds
DFA - Detrended Fluctuation Analysis (DFA) quantifies the stochastic self-similarity of the noise caused by turbulent airflow
PPE - Pitch Period Entropy (PPE). In speech disorders it is very difficult to sustain a stable pitch due to incomplete vocal fold closure; PPE quantifies this impaired control of pitch
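A variable-importance plot gives a quick visual check of the ranking above; a minimal sketch using the forest fitted earlier:
```{r}
# Plot the importance (IncNodePurity) of each predictor from the fitted forest.
varImpPlot(output.forest, main = "Random forest variable importance")
```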
# Exploratory Factor Analysis
Next we try exploratory factor analysis on the data to identify important factors.
```{r}
#Exploratory factor analysis
library(MVA)
options(digits = 3)
# EFA
#head(parkinsons) #2:4,8,16,18:22
parkinson.EFA <- factanal(parkinsons[, c(2:5,8,16,18:22)], 3, n.obs = nrow(parkinsons), rotation="varimax", control=list(trace=T))
parkinson.EFA
print(parkinson.EFA$loadings, cut = 0.45)
```
First, when we tried exploratory factor analysis with all the variables, the model did not run. After some research we concluded that the algorithm does not converge because of high multicollinearity between some variables (specifically the jitter and shimmer measures). So we decided to drop variables that are highly correlated with each other. From the correlation plot we can see that the jitter and shimmer variables have high correlations (0.9+) among themselves, so we tried building the model with one jitter and one shimmer variable. From the random forest analysis, Jitter(Abs) and Shimmer(APQ11) have the highest importance in their respective frequency and amplitude groups, so we took these two variables into the exploratory factor analysis. The two UPDRS variables also have a 0.9+ correlation between them, so we kept only one from that group.
With the variables mentioned above we explored different numbers of factors. With 2 or 3 factors, only 40-45% of the variance is explained (see the check below). Also, age, sex, and test_time have very small factor coefficients and large (0.8+) uniquenesses. With 5 or 6 factors the uniquenesses of these variables decrease, but they remain greater than 0.7.
From the 3-factor analysis we can see that jitter, shimmer, NHR, HNR, RPDE, and PPE have higher coefficients on Factor 1. However, the random forest analysis showed that age and DFA are the most significant variables, and they have very low coefficients in this analysis.
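To check the proportion of variance explained that we refer to above, we can compute it directly from the loadings of the fitted factanal object (a minimal sketch using the three-factor model from the chunk above):
```{r}
# Sum of squared loadings per factor, and the cumulative proportion of the
# total variance of the 11 variables that the factors explain.
ss <- colSums(parkinson.EFA$loadings^2)
round(rbind(SS.loadings = ss,
            Cumulative.Prop = cumsum(ss) / nrow(parkinson.EFA$loadings)), 3)
```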
# Confirmatory Factor Analysis
Next we try confirmatory factor analysis to check whether the factor structure suggested by the EFA holds.
```{r}
parkinson.EFA <- factanal(parkinsons[, c(2:8,17,18:22)], 2, n.obs = nrow(parkinsons), rotation="varimax", control=list(trace=T))
parkinson.EFA
print(parkinson.EFA$loadings, cut = 0.5)
#Implementing confirmatory factor analysis from the factors observed in the exploratory factor analysis
library(sem)
library(semPlot)
parkinsonscov<- cor(parkinsons[,-c(1)])
parkinson_model <- specifyModel(file = "C:\\ttu\\spring18\\MVA\\project\\ParkinsonsDiseaseDataAnalysis-master\\parkinson_sem_model_efa2.txt")
opt <- options(fit.indices = c("GFI", "AGFI", "SRMR"))
parkinson_sem <- sem(parkinson_model, parkinsonscov, nrow(parkinsons))
summary(parkinson_sem)
# restricted Cor matrix
rescor <- parkinson_sem$C
# non-restricted Cor matrix
nonrescor <- parkinson_sem$S
#differences of the elements of the observed covariance matrix and the covariance matrix of the fitted model
covresiduals <- round(parkinson_sem$S - parkinson_sem$C, 3)
semPaths(parkinson_sem, "est",edge.label.cex=1.5)
```
```{r}
library(sem)
library(semPlot)
parkinsonscov<- cor(parkinsons[,-c(1)])
parkinson_model <- specifyModel(file = "C:\\ttu\\spring18\\MVA\\project\\ParkinsonsDiseaseDataAnalysis-master\\parkinson_sem_model_efa1.txt")
opt <- options(fit.indices = c("GFI", "AGFI", "SRMR"))
parkinson_sem <- sem(parkinson_model, parkinsonscov, nrow(parkinsons))
summary(parkinson_sem)
# restricted Cor matrix
rescor <- parkinson_sem$C
# non-restricted Cor matrix
nonrescor <- parkinson_sem$S
#differences of the elements of the observed covariance matrix and the covariance matrix of the fitted model
covresiduals <- round(parkinson_sem$S - parkinson_sem$C, 3)
semPaths(parkinson_sem, "est",edge.label.cex=1.5)
```
Since age and sex have very high uniqueness, we did not include them in the model. NHR and HNR also have a high correlation (0.9), so we included only one of them. After removing the highly correlated and highly unique variables we are left with only 6 variables. We then ran exploratory factor analysis on them with 2 factors and found that shimmer, HNR, and RPDE have higher coefficients on factor 1, so we named factor 1 "amplitude". The jitter, DFA, and PPE variables have higher coefficients on factor 2, so we named factor 2 "frequency". Next we built the confirmatory factor analysis model from these observations; a sketch of what the model specification might look like is shown below.
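The model specification files referenced in the chunks above (parkinson_sem_model_efa1.txt / parkinson_sem_model_efa2.txt) are not reproduced here. A minimal, hypothetical sketch of what such a specification might contain, assuming the two-factor structure just described (the variable choices and parameter names are illustrative, not the actual file contents):
```{r, eval=FALSE}
library(sem)
# Hypothetical two-factor CFA specification: an "Amplitude" factor loading on
# the shimmer/HNR/RPDE measures and a "Frequency" factor loading on the
# jitter/DFA/PPE measures. Factor variances are fixed at 1; error variances
# are added automatically by specifyModel().
parkinson_model_sketch <- specifyModel(text = "
Amplitude -> Shimmer.APQ11, lam1, NA
Amplitude -> HNR,           lam2, NA
Amplitude -> RPDE,          lam3, NA
Frequency -> Jitter.Abs.,   lam4, NA
Frequency -> DFA,           lam5, NA
Frequency -> PPE,           lam6, NA
Amplitude <-> Amplitude,    NA,   1
Frequency <-> Frequency,    NA,   1
Amplitude <-> Frequency,    rho,  NA
")
```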
From the summary we can see that this model has GFI and AGFI indices greater than 0.95, which indicates a good model. The SRMR is 0.032, which is below 0.05 and also indicates a good model.
However, we know that age is among the most significant variables for identifying whether a patient has Parkinson's disease; it is an important factor in the UPDRS score that helps identify Parkinson's in a patient. Dimensionality reduction with factor analysis cannot acknowledge age because of the high uniqueness of the variable. High multicollinearity is another likely reason why the factor analysis algorithm does not converge with all the variables of the dataset. We therefore conclude that exploratory and confirmatory factor analysis are not suitable dimensionality reduction techniques for this dataset.
# Principal Component Analysis:
Next we try principal component analysis (PCA), which is a possible solution to the multicollinearity problem.
```{r}
#Principal Components Analysis
library(stats)
#outliers have already been removed, so PCA does not require any changes in the data
#standard deviations of the variables
p_sd <- sapply(parkinsons, sd)
# creating covariance matrix for entire dataset
p_cov <- cov(parkinsons, use = "everything")
#creating correlation matrix for the entire dataset
p_corr <-cor(parkinsons, use = "everything")
#Principal Components Analysis for correlation matrix
# we have chosen to utilize the correlation matrix for the PCA since the variables have different scales and variances
parkinsons_pca_corr <- princomp(parkinsons, cor = T, scores = TRUE)
summary(parkinsons_pca_corr, loadings = T)
```
Below are our findings from the principal component analysis:
##### Covariance Matrix Interpretation:
All of the covariances have small magnitudes and show both positive and negative linear relationships.
##### Correlation Matrix Interpretation:
From the correlation matrix we have the following observations:
1. HNR, the ratio of noise to tonal components in the voice, has an inverse relationship with all the other variables (see the check below).
2. Variables containing the word "Shimmer" have high positive correlations with each other.
3. Variables containing the word "Jitter" have high positive correlations with each other.
4. There are high correlations between the "Shimmer" variables and the "Jitter" variables.
5. The PPE variable represents a nonlinear measure of fundamental frequency variation. It has strong positive correlations with the "Jitter" and "Shimmer" variables and with RPDE, a nonlinear dynamical complexity measure.
6. total_UPDRS and motor_UPDRS show a strong positive correlation; total_UPDRS is the clinician's total UPDRS score and motor_UPDRS the clinician's motor UPDRS score, both linearly interpolated.
##### PCA Interpretation:
1. The number of components was determined according to the following rule: the total amount of variation that the components represent should be between 70% and 90%.
2. According to this rule, components 1 through 4 were selected. These components account for about 76% of the total variation (see the check below).
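The cumulative proportions behind this choice can be read directly from the PCA object (a quick check of the 70-90% rule):
```{r}
# Cumulative proportion of total variance explained by the principal components.
pc_var <- parkinsons_pca_corr$sdev^2
round(cumsum(pc_var) / sum(pc_var), 3)
```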
# Cluster Analysis
```{r}
pca.scores <- parkinsons_pca_corr$scores[,1:4] #since we used princomp(parkinsons, cor = T, scores = TRUE), the scores are already scaled
park.dist <- dist(pca.scores)
```
### Hierarchical clustering
```{r}
hc1 <- hclust(park.dist, "complete")
hc2 <- hclust(park.dist, "average")
hc3 <- hclust(park.dist, "single")
```
#### Dendrograms
```{r}
plot(hc1, cex = 0.5, main= "complete")
plot(hc2, cex = 0.5, main= "average")
plot(hc3, cex = 0.5, main= "single")
```
The dendrograms are unreadable with this many observations, so we perform an elbow test to determine the number of clusters.
#### Elbow test to determine the number of clusters
```{r}
plot(rev(hc1$height), xlim = c(0,20), main="Elbow test with complete method")
plot(rev(hc2$height), xlim = c(0,20), main="Elbow test with average method")
plot(rev(hc3$height), xlim = c(0,20), main="Elbow test with single method")
```
From the above figures, it looks like there are about 6 clusters for hierarchical clustering using the complete method, about 4 using the average method, and about 4 using the single method.
```{r}
clust1 <- cutree(hc1, 6)
table(clust1)
clust2 <- cutree(hc2, 4)
table(clust2)
clust3 <- cutree(hc3, 4)
table(clust3)
```
We can show a 2-D plot of the clustering using the PC1 and PC2 scores.
```{r}
xlim <- range(pca.scores[,1])
plot(pca.scores[,1:2], xlim = xlim, ylim = xlim, col = clust1)
plot(pca.scores[,1:2], xlim = xlim, ylim = xlim, col = clust2)
plot(pca.scores[,1:2], xlim = xlim, ylim = xlim, col = clust3)
```
Hierarchical clustering does not seem to be a good method for this data.
## K-Means Clustering
Determining the number of clusters
```{r}
plot.wgss = function(mydata, maxc) {
wss = numeric(maxc)
for (i in 1:maxc) wss[i] = kmeans(mydata,centers=i, nstart = 10)$tot.withinss
plot(1:maxc, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares", main="Scree Plot") }
```
##### Elbow test
```{r}
plot.wgss(pca.scores, 20) # Elbow test
```
From the above scree plot, it looks like we can use about 8 clusters for our data.
```{r}
km <- kmeans(pca.scores, 8)
table(km$cluster)
```
We can show a 2-D plot of the clustering using the PC1 and PC2 scores.
```{r}
plot(pca.scores[,1:2], col=km$cluster)
```
## Model-based clustering
```{r}
library(mclust)
mc <- Mclust(pca.scores, 8)
summary(mc)
```
We can show a 2-D plot of the clustering using the PC1 and PC2 scores.
```{r}
plot(pca.scores[,1:2], col = mc$classification)
```
We can see that model-based clustering suits our data set much better; a rough numeric check follows below.
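One way to back this visual impression with a number is to compare average silhouette widths for the k-means and model-based partitions (a rough sketch; it assumes the cluster package is available and reuses the distance matrix computed earlier):
```{r}
library(cluster)
# Average silhouette width for each partition (closer to 1 is better).
mean(silhouette(km$cluster, park.dist)[, "sil_width"])
mean(silhouette(mc$classification, park.dist)[, "sil_width"])
```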
## Pros and Cons of the study:
1. From our study of the dataset, we conclude that age, sex, test_time, and DFA are the variables that are most difficult to capture in a factor analysis model, even though random forest analysis and the literature indicate that these variables are among the most significant in determining the UPDRS score.
2. In PCA, the age variable has a high coefficient in component 7. But if we keep only the principal components with standard deviation above 1, we retain the first 5 components and discard the 7th. Similarly, test_time and DFA have high coefficients in the 6th and 8th components.
3. The data have high correlations within the "Jitter" and "Shimmer" variable groups, which caused multicollinearity problems during the analysis.
4. Future study of the Parkinson's dataset should include a more detailed analysis of the above-mentioned high-uniqueness variables and how to handle them.
5. Our data set is not perfectly multivariate normal.
6. In the cluster analysis on the reduced dimensions, we used only the first 4 principal components, which explain about 76% of the variance; the clusters therefore cannot account for all of the variance in the data.
7. Determining the number of clusters was a challenge. Sometimes it was not clear how many clusters to use, even with the elbow test.
8. Through dimensionality reduction with MDS we observed that men and women form separate sets of points on the plot, because the frequency and amplitude of male and female voices fall in different ranges; this observation is needed to track progression of the disease by gender.
### References:
[1] J. Jankovic, "Parkinson's disease: Clinical features and diagnosis," J. Neurol. Neurosurgery Psychiatry, vol. 79, no. 4, pp. 368-376, 2007.
[2] S. B. O'Sullivan and T. J. Schmitz, "Parkinson disease," in Physical Rehabilitation, 5th ed. Philadelphia, PA, USA: F. A. Davis Company, 2007, pp. 856-894.
[3] Parkinson Derneği. (2011). [Online]. Available: http://www.parkinsondernegi.org/Icerik.aspx?Page=parkinsonnedir&ID=5
[4] L. M. de Lau and M. M. Breteler, "Epidemiology of Parkinson's disease," Lancet Neurol., vol. 5, no. 6, pp. 525-535, 2006.
[5] N. Singh, V. Pillay, and Y. E. Choonara, "Advances in the treatment of Parkinson's disease," Prog. Neurobiol., vol. 81, no. 1, pp. 29-44, 2007.
[6] M. A. Little, P. E. McSharry, E. J. Hunter, J. Spielman, and L. O. Ramig, "Suitability of dysphonia measurements for telemonitoring of Parkinson's disease," IEEE Trans. Biomed. Eng., vol. 56, no. 4, pp. 1010-1022, Apr. 2009.
[7] National Collaborating Centre for Chronic Conditions, Parkinson's Disease, London, U.K.: Royal College of Physicians, 2006.
[8] Betul Erdogdu Sakar, Muhammed Erdem Isenkul, C. Okan Sakar, et al., "Collection and Analysis of a Parkinson Speech Dataset With Multiple Types of Sound Recordings," IEEE Journal of Biomedical and Health Informatics, vol. 17, no. 4, pp. 828-834, July 2013, DOI: 10.1109/JBHI.2013.2245674.
[9] Athanasios Tsanas and Max Little, 'Accurate telemonitoring of Parkinson's disease symptom severity using nonlinear speech signal processing and statistical machine learning'
[10] Parkinsons Telemonitoring Data Set , Online link: https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/telemonitoring/parkinsons_updrs.names
[11] Athanasios Tsanas, Max A. Little, Patrick E. McSharry, Lorraine O. Ramig (2009), 'Accurate telemonitoring of Parkinson's disease progression by non-invasive speech tests', IEEE Transactions on Biomedical Engineering.
[12] Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2009), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering, 56(4):1015-1022