This repository has been archived by the owner on Jul 20, 2021. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 3
/
sotu_cluster.Rmd
319 lines (259 loc) · 15 KB
/
sotu_cluster.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
---
title: "State of the Union Text Clustering"
author: "Frank D. Evans"
output: html_document
---
**Overview**
The President's State of the Union address has changed considerably over time, more reflecting
the era in which it is delivered than the individual president or the political affiliation
of the president delivering the speech. However, there is one major exception in President George
W. Bush, whose style and content marks a sharp departure from both his predecessors and contemporaries.
This analysis takes all Presidential State of the Union addresses from Harry S Truman to Barack
Obama and clusters them by the content of the text in order to better understand how they have
changed over the course of history.
**Environment Setup & Pre-processing**
Load necessary libraries and set random seed (necessary to make K-Means deterministic & replicable).
```{r echo = TRUE, warning = FALSE, message = FALSE}
library(stringr)
library(plyr)
library(dplyr)
library(magrittr)
library(tm)
library(proxy)
library(ggplot2)
library(RColorBrewer)
library(wordcloud)
set.seed(1300)
```
Gather list of all document files to be processed.
```{r}
files <- list.files(path = './data/sotu/')
```
Register parse function closure for vectorization. Each instance of the parse function will open
a given file, parse its contents, and output the results as a single entry DataFrame. The DataFrames
are then bound into a single object. The output object is mutated to extract the 4 digit year from
the file name, which may be useful for modeling the time dimension in Exploratory Analysis.
```{r}
parse_sotu_text <- function(file_name) {
full_path <- paste('./data/sotu/', file_name, sep = '')
raw <- scan(file = full_path, what = character(), sep = '\n')
raw <- raw[nchar(raw) > 1] # delete empty lines, single whitespace char
sotu_content <- raw[3:length(raw)] # first 2 lines are headers
output <- data_frame(file_name = file_name,
head1 = raw[1],
head2 = raw[2],
char_content = sum(nchar(sotu_content)),
content = paste0(sotu_content, sep = ' ', collapse = ' '))
return(output)
}
sotu <- rbind_all(alply(.data = files,
.margins = 1,
.fun = parse_sotu_text))
sotu <- sotu %>%
mutate(year = as.integer(str_extract(string = file_name, pattern = '[0-9]{4}')))
```
**Data Cleaning**
To clean the data, the Text Mining `tm` library is leveraged. This library includes multiple function
closures to do the main text pre-processing steps. Using the `magrittr` library pipeline operator
`%>%` allows this pre-processing series of function maps to bind to a single variable. The result is
a Corpus object with each document an element of the corpus.
```{r}
sotu_corpus <- Corpus(VectorSource(sotu$content)) %>%
tm_map(x = ., FUN = PlainTextDocument) %>%
tm_map(x = ., FUN = removePunctuation) %>%
tm_map(x = ., FUN = removeNumbers) %>%
tm_map(x = ., FUN = removeWords, stopwords(kind = 'en')) %>%
tm_map(x = ., FUN = stripWhitespace)
```
**Model: TF-IDF**
The first step in modeling is to create a Document Term Matrix, with the documents as the rows,
the individual words along the columns, and a frequency count as the content. The Corpus object
populates the term meta-data automatically during the tokenization process. However, it will be
necessary to manually import the document names into the meta-data of the matrix.
```{r}
doc_term <- DocumentTermMatrix(sotu_corpus)
doc_term$dimnames$Docs <- sotu$file_name
```
From the Document Term Matrix, it is possible to create a Term Frequency - Inverse Document Frequency
matrix. This object is a matrix of the same dimentions as the Document Term Frequency matrix above,
except each frequency has been normalized to the frequency of the term in the entire document. This
gives additional weight to a term that is common in a given document, but comparably rare in the
entire corpus. At the same time, terms that are common across many or all documents are penalized in
frequency for any single document as they provide a smaller amount of unique information about a
given document. At the same time, a native matrix object version is created and stored to be more
compatable with further processing needs.
```{r}
tf_idf <- weightTfIdf(m = doc_term, normalize = TRUE)
tf_idf_mat <- as.matrix(tf_idf)
```
Once the frequencies are calculated, a square matrix of all documents is processed as a distance
lookup for clustering. The documents have a large number of terms, which means that the distance
exists in very high dimensional space. It is for this reason that distance is computed as cosine
similarity rather than normal euclidean distance.
```{r}
tf_idf_dist <- dist(tf_idf_mat, method = 'cosine')
```
**Model: Hierarchical Clustering**
The first model is to create a hierarchical cluster, which will show the cohesion among clusters at
all levels since the hierarchical method preserves all intermediate clusters. The Ward Method of
clustering is used to place a heavy weight on the cohesiveness of a formed cluster at each step of
the process. The goal of clustering is to learn implicit features about groups of documents, and thus
the most interpretable clustering should weight its decisions on those documents that could be merged
and would create the least mutated new document.
```{r fig.align='center', fig.height=8, fig.width=10}
clust_h <- hclust(d = tf_idf_dist, method = 'ward.D2')
plot(clust_h,
main = 'Cluster Dendrogram: Ward Cosine Distance',
xlab = '', ylab = '', sub = '')
```
From this cluster dendrogram the first patterns are apparent:
1. Speeches made by the same president are almost always the first to cluster.
2. Speeches from the same Political Party and Era are commonly next to cluster (Obama & Clinton,
Reagan & Bush Sr., Eisenhower & Truman)
3. There is a set of potential outliers in 6 of George W. Bush's speeches that are the last to merge
with any others, though the balance of his speeches cluster as would be expected with Bush Sr. and
Reagan.
_Cutting the Tree_: to find an optimal number of clusters, it is useful to investigate the curve of
cluster cohesiveness that exists within the space of where the tree can be cut and separate clusters
are defined. To produce this curve, a loop will cycle through all possible levels the hierarchical
tree can be cut (from 1 to number-of-documents minus 1). For each cut level, the mean number of
documents in each cluster and the mean distance between sets of points in each cluster is stored.
```{r}
dist_mat <- as.matrix(tf_idf_dist)
df_clust_cuts <- data_frame(cut_level = 1:length(sotu$file_name),
avg_size = 0,
avg_dist = 0)
for (i in 1:(nrow(df_clust_cuts) - 1)) {
df_clust_cuts[df_clust_cuts$cut_level == i, 'avg_size'] <- mean(table(cutree(tree = clust_h, k = i)))
df_dist <- data_frame(doc_name = doc_term$dimnames$Docs,
clust_cut = cutree(tree = clust_h, k = i)) %>%
inner_join(x = ., y = ., by = 'clust_cut') %>%
filter(doc_name.x != doc_name.y)
df_dist$cos_dist <- NA
for (t in 1:nrow(df_dist)) {
df_dist$cos_dist[t] <- dist_mat[df_dist$doc_name.x[t], df_dist$doc_name.y[t]]
}
df_dist <- df_dist %>%
group_by(clust_cut) %>%
summarise(cos_dist = mean(cos_dist))
df_clust_cuts[df_clust_cuts$cut_level == i, 'avg_dist'] <- mean(df_dist$cos_dist)
}
```
With this data frame of outcomes, it is straightforward to graph the space of possibilities.
```{r fig.align='center', fig.height=5, fig.width=8}
ggplot(data = df_clust_cuts, aes(x = cut_level, y = avg_dist)) +
geom_line(color = 'steelblue', size = 2) +
labs(title = 'Cosine Distance of Hierarchical Clusters by Tree Cut',
x = 'Tree Cut Level / Number of Final Clusters',
y = 'Intra-Cluster Mean Cosine Distance')
```
The above graph shows an expected gradual decline as the tree is cut from 20 clusters up to
approaching the number of clusters as there are documents (i.e. each document in its own cluster).
Optimally, it is preferable to have as few clusters as necessary such that the Mean Cosine Distance
within each cluster is minimized (indicating maximum cluster cohesiveness). There is a local minimum
at about 5 clusters that is not surpassed in cohesiveness until closer to 40 clusters are computed.
Having the smaller number of clusters is far more preferable.
```{r fig.align='center', fig.height=5, fig.width=8}
ggplot(data = df_clust_cuts, aes(x = avg_size, y = avg_dist, color = cut_level)) +
geom_point(size = 4) +
labs(title = 'Mean Cluster Size & Cosine Distance by Tree Cut',
x = 'Mean Number of Documents per Cluster',
y = 'Intra-Cluster Mean Cosine Distance')
```
This is confirmed when looking at the balance of cluster size for the chosen cut level. A cut that
yields about 5 clusters results in about 15-18 documents per cluster, which appears to be a solid
balance agaist higher cut levels.
**Model: K-Means Cluster**
To confirm the hierarchical cluster analysis, a K-Means cluster analysis is computed which will
provide a more visual representation of the cluster space. Since K-Means relies on Euclidean Distance
rather than Cosine Dissimilarity, it is first necesary to normalize the TF-IDF matrix. The K-Means
process itself will cluster for 5 centroids, and increase the maximum iterations from the default
of 10 to 25.
```{r}
tf_idf_norm <- tf_idf_mat / apply(tf_idf_mat, MARGIN = 1, FUN = function(x) sum(x^2)^0.5)
km_clust <- kmeans(x = tf_idf_norm, centers = 5, iter.max = 25)
```
The data contains thousands of dimensions, a dimension for each term. Although the clusters are
built using the full dimensional feature space, it will not be practical to visualize this many
dimensions. To make visualization more palatable, Principal Components Analysis is performed and the
2 most important components are mapped to a plot along with meta-data for markup purposes.
```{r fig.align='center', fig.height=8, fig.width=8}
pca_comp <- prcomp(tf_idf_norm)
pca_rep <- data_frame(sotu_name = sotu$file_name,
pc1 = pca_comp$x[,1],
pc2 = pca_comp$x[,2],
clust_id = as.factor(km_clust$cluster))
ggplot(data = pca_rep, mapping = aes(x = pc1, y = pc2, color = clust_id)) +
scale_color_brewer(palette = 'Set1') +
geom_text(mapping = aes(label = sotu_name), size = 2.5, fontface = 'bold') +
labs(title = 'K-Means Cluster: 5 clusters on PCA Features',
x = 'Principal Component Analysis: Factor 1',
y = 'Principal Component Analysis: Factor 2') +
theme_grey() +
theme(legend.position = 'right',
legend.title = element_blank())
```
Many of the same patterns are apparent. The amount of distance among the same 6 George W. Bush speeches
is stark from both the rest of the same era and party. It is additionally easy to see the same early
level clusters among party and era combined. However, this visualization takes that context a step
further. With the exception of George W. Bush's speeches, the balance of speeches largely exist
along a kind of spectrum that has roughly ordered the speeches across time, despite the fact that no
date data is present in the model (it was stripped out in the loading step). Thus, it appears that
the content of the State of the Union addresses is largely driven by the era it is reflecting more
that the political association of the president giving it.
**Analysis**
In order to dig deeper into underlying drivers, examining the most common terms may help shed light
on those factors that mean the most to each cluster. To produce a word cloud specifically for this
purpose, it is necessary to produce a Term Document Matrix. This matrix is largely similar to the
Document Term matrix created earlier, but due to the needs of the library, will have its rows and
columns switched. At the same time, a larger set of common stop words will be scrubbed using the
Cornell SMART list.
```{r warning=FALSE}
term_doc <- TermDocumentMatrix(sotu_corpus)
term_doc$dimnames$Docs <- sotu$file_name
td_mat <- as.matrix(term_doc)
td_mat <- td_mat[!row.names(td_mat) %in% stopwords(kind = 'SMART'),]
commonality.cloud(term.matrix = td_mat)
```
The terms common across all speeches are no surprise. They are the staple terms of presidential
patriotism and political populism.
```{r warning=FALSE}
commonality.cloud(term.matrix = td_mat[,km_clust$cluster != 3], max.words = 50)
```
_Spectrum vs. Outliers_: Inspecting the words most common across the large spectrum shows a much
larger emphasis on use of the word "world", with many of the other terms occuring in s similar capacity.
```{r warning=FALSE}
commonality.cloud(term.matrix = td_mat[,km_clust$cluster == 3], max.words = 50)
```
The potential outlier cluster housing 6 of George W. Bush's speeches is heavily influenced by the
frequent use of the word "applause". If this term is temporarily removed from the set, it may provide
more information as to secondary drivers.
```{r warning=FALSE}
td_mat <- td_mat[!row.names(td_mat) %in% c('applause'),]
commonality.cloud(term.matrix = td_mat[,km_clust$cluster == 3], max.words = 50)
```
With "applause" gone, other differences immediately crop up. Where the large spectrum commonly used
the term "nation", these 6 speeches use other terms far more often like "america" and "country".
Additionally, other terms show up in higher frequency like "freedom" and "security" .
_Within the Spectrum_: In order to understand the differences within the spectrum, the post-WWII set
(cluster 4) is compared to the most modern set (cluster 2).
```{r warning=FALSE}
commonality.cloud(term.matrix = td_mat[,km_clust$cluster == 4], max.words = 50)
```
The post-WWII/Korean War era speeches largely emphasize the "nation" and related terms, and reflect
a concern for the recovering economy fresh from the scars of the Great Depression.
```{r warning=FALSE}
commonality.cloud(term.matrix = td_mat[,km_clust$cluster == 2], max.words = 50)
```
The tone is markedly different at the other end of the spectrum in the modern era. Multiple terms
related to time are apparent, like "year/years" and "time". There is additional a stark contrast in
the use of "nation" in the earler era to the explicit "america" in multiple forms.
**Conclusion**
The presidential State of the Union speech largely reflects the time in which it is delivered. Though
stylistic differences are commonly notable and detectable across a single president as well as a
combination of the political party and era--its magnitude is secondary. On its own, political affiliation
is not a characteristic driver compared to the position in time. Democrat Jimmy Carter's 1978 State of
the Union has more in common with Republican Ronald Reagan's 1981 speech than it does with fellow
Democrat Bill Clinton's 1993 speech. This pattern is pervasive and leads to the conclusion that with
the exception of George W. Bush, the content of the State of the Union is most largely driven by the
evolution of time.