---
title: "What if We Take the Cup out of Melbourne?"
date: '2019-05-02'
slug: sydney-fc
tags:
- R
- Viz
categories: Post
---
+ [Introduction](#introduction)
+ [Data](#data)
+ [Tidytext Analysis](#tidytext-analysis)
+ [Data Visualisation](#data-visualisation)
+ [Conclusions](#conclusions)
+ [References](#references)
## Introduction
Word associations can be created from word vectors using relatively simple linear algebra, following Julia Silge's blog post *Word Vectors with Tidy Data Principles*.
This approach is a step beyond a simple word cloud of word counts, but not as complex as word2vec.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
```
```{r load packages, message=FALSE, warning=FALSE}
library(tidyverse)
library(tidytext)
library(widyr)
library(irlba)
library(broom)
options(stringsAsFactors = FALSE)
```
## Data
The dataset contains headlines published over a period of seventeen years by the Australian Broadcasting Corporation (ABC) and is freely available on Kaggle as [One Million News Headlines](https://www.kaggle.com/therohk/million-headlines).
```{r import data}
# Import offline news query
news <- readr::read_csv(here::here("blogdown_source/data/abcnews.csv"))
```
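Before tokenising, take a quick look at the structure. This is a minimal sketch, assuming the standard Kaggle columns `publish_date` and `headline_text` (only the latter is used below).
```{r glimpse data, eval=FALSE}
# Peek at the imported headlines; headline_text holds the raw text
# that is tokenised throughout the analysis.
glimpse(news)
```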
## Tidytext Analysis
### Generate Ngrams and Probabilities
We will generate unigrams, bigrams and trigrams from the original headlines. With these ngrams we will calculate word counts and probabilities.
**Unigrams**
Calculate the unigram probabilities (used later to normalise the skipgram probabilities). These are based on the counts, i.e. how often each word occurs relative to the full text contents.
```{r unigrams}
# Create a post id
news$postID <- row.names(news)
# First convert the posts into tidytext format
tidy_posts <- news %>%
  unnest_tokens(word, headline_text) %>%
  # Remove stopwords
  anti_join(get_stopwords()) %>%
  # Remove null for empty contents
  filter(word != "null")
# Now calculate the unigram probabilities
unigram_probs <- tidy_posts %>%
  dplyr::count(word, sort = TRUE) %>%
  mutate(p = n / sum(n)) %>%
  # Remove words that occur only once
  filter(n > 1)
# View the unigrams
unigram_probs %>%
  head(15)
```
The most common word is *police*, which is perhaps to be expected in news reporting. The remaining unigrams are not very meaningful on their own.
**Bigrams**
Now do the same for two-word, i.e. bigram, combinations.
```{r bigrams}
# Calculate bigram probabilities
tidy_posts_bigrams <- news %>%
  unnest_tokens(bigram, headline_text, token = "ngrams", n_min = 2, n = 2)
# Separate a character column into multiple columns using a regular expression separator
bigrams_separated <- tidy_posts_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")
# Remove stopwords
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
# Calculate new bigram counts
bigram_counts <- bigrams_filtered %>%
  dplyr::count(word1, word2, sort = TRUE)
# Unite multiple columns into one by pasting strings together
bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")
# Calculate the probabilities from the counts
bigram_probs <- bigrams_united %>%
  dplyr::count(bigram, sort = TRUE) %>%
  mutate(p = n / sum(n)) %>%
  # Remove words that occur only once
  filter(n > 1) %>%
  drop_na()
# View the bigrams
bigram_probs %>%
  head(40)
```
The most common bigram is *country hour*, which refers to an [ABC radio show](https://www.abc.net.au/radio/programs/). The other bigrams carry some meaning, including references to police and to events on the Gold Coast and in Broken Hill.
**Trigrams**
Now do the same for three-word, i.e. trigram, combinations.
```{r trigrams}
# Calculate trigram probabilities. First convert the posts into tidytext format
tidy_posts_trigrams <- news %>%
  unnest_tokens(trigram, headline_text, token = "ngrams", n_min = 3, n = 3)
trigram_probs <- tidy_posts_trigrams %>%
  dplyr::count(trigram, sort = TRUE) %>%
  mutate(p = n / sum(n)) %>%
  # Remove words that occur only once
  filter(n > 1) %>%
  drop_na()
# View the trigrams
trigram_probs %>%
  head(15)
```
The trigrams mostly extract generic word combinations rather than adding further context.
**Skipgrams**
Now calculate the skipgram probabilities, i.e. how often each word is found near every other word within a context window of six words, then normalise these using the unigram probabilities calculated earlier.
```{r skipgrams unigrams}
# Now calculate the skipgram probabilities. Create context window with length 6
tidy_skipgrams <- news %>%
  unnest_tokens(ngram, headline_text, token = "ngrams", n = 6) %>%
  mutate(ngramID = row_number()) %>%
  tidyr::unite(skipgramID, postID, ngramID) %>%
  unnest_tokens(word, ngram)
# Calculate probabilities
skipgram_probs <- tidy_skipgrams %>%
  pairwise_count(word, skipgramID, diag = TRUE, sort = TRUE) %>%
  mutate(p = n / sum(n))
# Normalize probabilities
normalized_prob <- skipgram_probs %>%
  filter(n > 4) %>%
  rename(word1 = item1, word2 = item2) %>%
  left_join(unigram_probs %>%
              select(word1 = word, p1 = p),
            by = "word1") %>%
  left_join(unigram_probs %>%
              select(word2 = word, p2 = p),
            by = "word2") %>%
  mutate(p_together = p / p1 / p2)
```
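In symbols, `p_together` is the joint skipgram probability of a word pair divided by the product of the two unigram probabilities, and its logarithm (the SVD step below uses `log10`) is the pointwise mutual information (PMI):

$$
p_{\text{together}}(w_1, w_2) = \frac{p(w_1, w_2)}{p(w_1)\,p(w_2)}, \qquad
\text{PMI}(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\,p(w_2)}
$$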
**Word Probabilities**
View the probability of each of the following city words occurring in the context windows of other words:
- sydney
- melbourne
```{r sydney probability}
# The variable p_together describes the probability that word2 occurs within the context window of word1.
# A more instructive output can be created by filtering this data frame for an individual word - let's try "sydney":
normalized_prob %>%
  filter(word1 == "sydney") %>%
  arrange(-p_together) %>%
  select(-p, -p1, -p2)
```
```{r melbourne}
normalized_prob %>%
  filter(word1 == "melbourne") %>%
  arrange(-p_together) %>%
  select(-p, -p1, -p2)
```
**SVD**
Use singular value decomposition (SVD) to reduce the dimensionality of the large, sparse matrix; each row of the resulting matrix of left singular vectors is a word vector.
```{r SVD }
# Simple singular value decomposition from the irlba package.
pmi_matrix <- normalized_prob %>%
  mutate(pmi = log10(p_together)) %>%
  cast_sparse(word1, word2, pmi)
# Remove missing data
pmi_matrix@x[is.na(pmi_matrix@x)] <- 0
# Run SVD
pmi_svd <- irlba(pmi_matrix, 256, maxit = 500)
# Next we output the word vectors:
word_vectors <- pmi_svd$u
rownames(word_vectors) <- rownames(pmi_matrix)
```
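A quick sanity check on the decomposition (a sketch, not part of the original analysis): the word-vector matrix should have one row per vocabulary word and 256 columns, the rank requested from `irlba()`.
```{r check svd, eval=FALSE}
# Hedged sketch: confirm the shape and row labels of the word vectors.
dim(word_vectors)
head(rownames(word_vectors))
```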
**Word Synonyms**
Now find word synonyms using the word vectors created above.
```{r synonyms}
# Create a function to identify synonyms using the word vectors created above:
search_synonyms <- function(word_vectors, selected_vector) {
  similarities <- word_vectors %*% selected_vector %>%
    tidy() %>%
    as_tibble() %>%
    rename(token = .rownames,
           similarity = unrowname.x.)
  similarities %>%
    arrange(-similarity)
}
# Let's see what the top synonyms are for the term "sydney"
search_synonyms(word_vectors,word_vectors["wind",])
```
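Note that calling `broom::tidy()` on a bare matrix is deprecated in recent broom releases, so the `.rownames`/`unrowname.x.` columns may not appear. A hedged drop-in alternative that builds the tibble directly:
```{r synonyms tibble, eval=FALSE}
# Assumed equivalent of search_synonyms() without tidying a bare matrix.
search_synonyms_tbl <- function(word_vectors, selected_vector) {
  sims <- word_vectors %*% selected_vector
  tibble(token = rownames(word_vectors),
         similarity = as.numeric(sims)) %>%
    arrange(desc(similarity))
}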
```{r melbourne synonym}
# Let's see what the top synonyms are for the term "melbourne"
search_synonyms(word_vectors,word_vectors["melbourne",])
```
## Data Visualisation
Now create a synonym plot comparing the two cities.
```{r synonym plot}
sydney <- search_synonyms(word_vectors, word_vectors["sydney",])
melbourne <- search_synonyms(word_vectors, word_vectors["melbourne",])
sydney %>%
  mutate(selected = "sydney") %>%
  bind_rows(melbourne %>%
              mutate(selected = "melbourne")) %>%
  group_by(selected) %>%
  top_n(20, similarity) %>%
  ungroup() %>%
  mutate(token = reorder(token, similarity)) %>%
  ggplot(aes(token, similarity, fill = selected)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~selected, scales = "free") +
  coord_flip() +
  theme(strip.text = element_text(hjust = 0)) +
  scale_fill_viridis_d(option = "E") +
  labs(x = NULL, title = "What word vectors are most similar to 'Sydney' and 'Melbourne'?")
```
So, what if we take the Cup out of Melbourne?
```{r word math}
mystery_sport <- word_vectors["melbourne",] - word_vectors["cup",] + word_vectors["sydney",]
search_synonyms(word_vectors, mystery_sport)
```
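The same vector arithmetic generalises to any analogy of the form *a - b + c*; a small hypothetical helper (not in the original post):
```{r word math helper, eval=FALSE}
# Hedged sketch: wrap the "a - b + c" analogy query around
# search_synonyms() so other word-math questions can be asked the same way.
word_math <- function(word_vectors, a, b, c) {
  search_synonyms(word_vectors,
                  word_vectors[a, ] - word_vectors[b, ] + word_vectors[c, ])
}
word_math(word_vectors, "melbourne", "cup", "sydney")
```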
## Conclusions
From these plots, the two cities are close synonyms of each other, as expected.
But it appears that the sport talked about in Sydney is football, in particular the Wanderers Football Club.
## References
```{r package citations}
# Package citations as seen in freerangestats.info/blog/2019/03/30/afl-elo-adjusted
thankr::shoulders() %>%
  mutate(maintainer = str_squish(gsub("<.+>", "", maintainer))) %>%
  group_by(maintainer) %>%
  summarise(`Number packages` = sum(no_packages),
            packages = paste(packages, collapse = ", ")) %>%
  knitr::kable()
```