## Chapter 5: Dimensionality reduction techniques
Let's start by loading some libraries we might use.
```{r}
library(MASS)
library(ggplot2)
library(GGally)
library(corrplot)
library(dplyr)
# Set default figure size
knitr::opts_chunk$set(fig.width=12, fig.height=8)
```
### Data wrangling
The data wrangling exercise is performed by the
``data/create_human.R`` script in my GitHub repository, as specified
by the instructions. It can be found
[here](https://github.com/tatuylonen/IODS-project/blob/master/data/create_human.R).
The first part of the script was prepared as part of Exercise 4; the
new code for Exercise 5 is towards the end of the file.
### Analysis
Let's start the analysis by loading the dataset we created as part of
the data wrangling exercise. It should have 155 observations and 8
variables, with country as the row name.
```{r}
human <- read.csv("data/human5.csv", row.names=1)
```
#### (1) Show graphical overview and summaries of the variables
```{r}
# Plot an overview of the relationships and distribution of the variables
ggpairs(human)
# Plot correlations of the variables
corrplot(cor(human))
# Look at the variables numerically
str(human)
summary(human)
```
We can see that some variables are highly correlated, for example
``Ado.Birth`` and ``Mat.Mor``. Some variables are not significantly
correlated (e.g., ``Edu2.FM`` and ``Labo.FM``), and some show a
strong negative correlation (e.g., ``Mat.Mor`` and ``Edu2.FM``).
Looking at the distributions of the variables, ``GNI`` has a very
large range (581 to 123124), while the other variables have much
smaller ranges; e.g., ``Edu2.FM`` ranges from 0.1717 to 1.4967.
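The same correlations can also be printed numerically to put exact
numbers behind these observations:
```{r}
# Print the correlation matrix numerically, rounded for readability
round(cor(human), 2)
```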
#### (2) Perform PCA on the original (non-standardized) data
```{r}
# Perform PCA on the original (non-standardized) data
pca1 <- prcomp(human)
summary(pca1)
```
We can see that the first principal component (PC1) alone captures 99.99% of the
variance. Let's plot the observations using PC1 and PC2.
```{r}
# Draw a biplot of the PCA representation (first two components) with arrows
biplot(pca1, choices=1:2)
```
It is easy to see from the plot that PC1 is essentially the same as
``GNI``. This makes sense, as ``GNI`` had a much larger range and
variance than the other variables (we did not actually plot the
variances, but the quartiles of the original ``human`` data tell the
story).
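We can, however, easily compute the column variances directly:
```{r}
# Column variances of the original data; GNI dominates by far
sort(sapply(human, var), decreasing=TRUE)
```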
#### (3) Standardize the variables and repeat the above analysis
```{r}
# Normalize the mean and variance of each variable
human_std <- scale(human)
# Perform PCA on the normalized data
pca2 <- prcomp(human_std)
summary(pca2)
```
We can see that PC1 now explains 53.6% of the variance, PC2 16.24%,
PC3 9.6%, and so on. This is much more sensible. Let's see the plots.
This time we also add custom axis labels that show what the axes
describe and what percentage of the variance they cover.
```{r}
# Draw a biplot of the normalized PCA (first two components) with arrows
# and useful axis labels
s <- summary(pca2)
pca_pr <- round(100 * s$importance[2,], digits=1)
pca_lab <- paste0("Human data ", names(pca_pr), " (", pca_pr, "%)")
biplot(pca2, choices=1:2, xlab=pca_lab[1], ylab=pca_lab[2])
```
We can see that the result is very different from the non-standardized
case. This is because PCA identifies the dimensions with the highest
variance. When one variable has a much higher variance than the
others, the direction of highest overall variance, and hence PC1, will
be closely aligned with that variable, and the plot will not give much
information about the overall relationships of the variables. The
scale of the original variable does not necessarily reflect its
importance in any way. If the variables are standardized, each
variable has equal weight in how the principal components are
selected, which usually gives much more useful results.
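As a quick sanity check, each standardized variable should now have
zero mean and unit standard deviation:
```{r}
# Verify the standardization: means should be (numerically) zero, sds one
round(colMeans(human_std), 10)
apply(human_std, 2, sd)
```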
#### (4) Give your personal interpretations of the first two PCA components
We can see that PC1 and PC2 are not directly aligned with any
particular variable, even though PC2 is highly correlated with
``Parli.F`` and ``Labo.FM``, and PC1 with the other variables.
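We can confirm these alignments by inspecting the loadings of the
first two components:
```{r}
# Inspect the loadings (rotation matrix) of the first two components
round(pca2$rotation[, 1:2], 2)
```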
I think we can interpret PC1 as describing material and technological
well-being. ``Mat.Mor``, ``Ado.Birth``, and ``Life.Exp`` reflect the
quality of health care, which probably correlates with general
technical and scientific prowess. ``Edu.Exp`` and ``Edu2.FM`` indicate
the quality and availability of education, and it makes sense that
education correlates with medical well-being, health, and life
expectancy. ``GNI`` is also correlated with the same axis; it
describes economic wealth, which usually correlates strongly with
education and health: higher income makes it possible to provide
health care and education, while education and health in turn help
achieve higher income through the ability to produce and sell
technology, medicines, and other goods.
On the other hand, PC2 is correlated with ``Parli.F`` and ``Labo.FM``,
which relate to the role of women in society and particularly to their
equality. There are countries that are wealthy and have good education
and health care, but where women do not have a big role in public life
(politics and the workplace). I think some resource-rich countries
(e.g., oil producers) may fall into this category, and it could also
have described many Western countries quite well just a few decades
ago. While female education seems to be strongly correlated with
economic and medical well-being, in this data female participation in
politics and the labor force does not appear to be.
#### (5) Multiple correspondence analysis using the tea dataset
Let's first load the tea dataset from FactoMineR, look at its
structure, and visualize the dataset.
```{r}
# Load the tea dataset from FactoMineR
library(FactoMineR)
data(tea)
# Look at its structure
str(tea)
# Visualize the dataset. There are too many variables to reasonably
# visualize with ggpairs; let's instead draw a bar plot of each variable.
library(miscset)
ggplotGrid(ncol=4,
           lapply(colnames(tea),
                  function(col) {
                    ggplot(tea, aes_string(col)) + geom_bar()
                  }))
```
Let's then do Multiple Correspondence Analysis on the data. I'm only
going to compute it for the ``Tea``, ``How``, ``how``, ``sugar``,
``where``, and ``lunch`` columns, as trying all columns does not seem
to work out-of-the-box.
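My guess (an assumption, not verified here) is that the numeric
``age`` column is the culprit, since MCA expects categorical
variables. A minimal sketch of a possible workaround, keeping only the
factor columns, is shown below but not evaluated:
```{r, eval=FALSE}
# Sketch (assumption): age is the only non-factor column in tea, so keeping
# just the factor columns should make the full dataset acceptable to MCA
tea_fct <- tea[, sapply(tea, is.factor)]
mca_all <- MCA(tea_fct, graph=FALSE)
```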
```{r}
# Extract a subset of the data with only the selected columns
keep <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, one_of(keep))
summary(tea_time)
str(tea_time)
# Compute Multiple Correspondence Analysis of the tea data
mca1 <- MCA(tea_time, graph=FALSE)
# Print a summary of the model
summary(mca1)
```
Looking at the summary, we can see that no dimension explains more
than 15.3% of the variance, and that the 7th dimension still explains
7.8%. Thus there does not appear to be a simple way to map the
observations to a very low-dimensional space cleanly.
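The full eigenvalue table shows the exact percentages per dimension:
```{r}
# Eigenvalues and (cumulative) percentages of variance per dimension
head(mca1$eig, 7)
```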
Let's then draw a biplot of the variables, also including the
individuals in the same plot.
```{r}
# Visualize the model using a biplot from the factoextra package
library(factoextra)
fviz_mca_biplot(mca1, repel=TRUE, ggtheme=theme_minimal())
```
We can see that the first two dimensions together account for roughly
30% of the variance. Quite a few individuals are mapped to the exact
same locations; it seems likely that they have identical values for
the variables we looked at.
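We can check that hypothesis by counting exact duplicate rows in the
subset:
```{r}
# Count rows that are exact duplicates of an earlier row in tea_time
sum(duplicated(tea_time))
```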