Commit

Produce word clouds for haiku, colors, and fauna.
Notgnoshi committed Aug 28, 2019
1 parent f886965 commit 8d4e2e6
Showing 1 changed file with 182 additions and 15 deletions.
197 changes: 182 additions & 15 deletions experiments/eda/word_clouds.ipynb
@@ -1,29 +1,196 @@
{
"cells": [
{
"cell_type": "raw",
"metadata": {},
"source": [
"\\author{Austin Gill}\n",
"\\title{Exploratory Data Analysis -- Word Clouds}\n",
"\\maketitle\n",
"\\tableofcontents"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The goal of this notebook is to examine word frequency in the haiku dataset in a more qualitative and subjective manner.\n",
"\n",
"The intent is to build a word cloud not only for all of the words in the corpus, but also for\n",
"\n",
"* flowers\n",
"* colors\n",
"* animals"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Automagically reimport haikulib if it changes.\n",
"%load_ext autoreload\n",
"%autoreload 2\n",
"%aimport haikulib.utils.data\n",
"%aimport haikulib.utils.nlp\n",
"\n",
"%config InlineBackend.figure_format = 'svg'\n",
"%matplotlib inline\n",
"\n",
"from collections import Counter\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from wordcloud import WordCloud\n",
"\n",
"plt.rcParams[\"figure.figsize\"] = (16, 9)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Word Cloud with Stop Words\n",
"\n",
"If we build the word cloud without removing stop words, the results are less illuminating."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bag = haikulib.utils.data.get_bag_of(column=\"haiku\", kind=\"words\")\n",
"wordcloud = WordCloud(max_words=500, width=1600, height=900).generate_from_frequencies(bag)\n",
"\n",
"plt.imshow(wordcloud, interpolation=\"bilinear\")\n",
"plt.axis(\"off\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Word Cloud without Stop Words\n",
"\n",
"However, once all of the stop words are removed, we begin to see more interesting results.\n",
"\n",
"As it was put to me, the results are quite stereotypical, but then stereotypes exist for a reason, and in this particular case they seem to be supported by evidence."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bag = haikulib.utils.data.get_bag_of(column=\"nostopwords\", kind=\"words\")\n",
"wordcloud = WordCloud(max_words=500, width=1600, height=900).generate_from_frequencies(bag)\n",
"\n",
"plt.imshow(wordcloud, interpolation=\"bilinear\")\n",
"plt.axis(\"off\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Flower Word Cloud\n",
"\n",
"A large variety of flora is mentioned in the haiku, so I thought it would be entertaining to look at a word cloud of flowers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"flower_names = haikulib.utils.data.get_flowers()\n",
"flowers = Counter()\n",
"for word, count in bag.items():\n",
"    if word in flower_names:\n",
"        flowers[word] = count\n",
"\n",
"wordcloud = WordCloud(max_words=500, width=1600, height=900).generate_from_frequencies(flowers)\n",
"\n",
"plt.imshow(wordcloud, interpolation=\"bilinear\")\n",
"plt.axis(\"off\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There is a slight downside to this way of finding flower and tree names that I have yet to find a workaround for.\n",
"Some flowers, trees, and colors have multi-word names.\n",
"However, a bag-of-words representation counts each token separately, which makes it impossible to find multi-token names.\n",
"\n",
"This is an outstanding problem that I'm unsure how to approach at this time."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Color Word Cloud\n",
"\n",
"One of the most interesting and unexpected applications of programming that I have found was a PyCon 2017 conference talk titled [Gothic Colors: Using Python to understand color in nineteenth century literature](https://www.youtube.com/watch?v=3dDtACSYVx0).\n",
"This was the first application of programming to a soft science that I recall having been exposed to, and it's made a lasting impression.\n",
"\n",
"Ever since watching the talk, I've wanted to apply scientific techniques to solve non-scientific problems.\n",
"\n",
"I still intend to produce a color palette for haiku, but in the meantime, a word cloud of (single-token) color names will do.\n",
"The color names and their RGB values have been taken from [https://xkcd.com/color/rgb/](https://xkcd.com/color/rgb/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"color_names = haikulib.utils.data.get_colors()\n",
"colors = Counter()\n",
"for word, count in bag.items():\n",
"    if word in color_names:\n",
"        colors[word] = count\n",
"\n",
"wordcloud = WordCloud(max_words=500, width=1600, height=900).generate_from_frequencies(colors)\n",
"\n",
"# Set the colors to the actual RGB color values experimentally determined to be associated with that color name.\n",
"# See: https://xkcd.com/color/rgb/\n",
"# Each layout entry is ((word, frequency), font_size, position, orientation, color).\n",
"for i, layout in enumerate(wordcloud.layout_):\n",
"    (name, freq), font_size, position, orientation, _ = layout\n",
"    # Black on a black background doesn't look so hot.\n",
"    rgb = color_names[name] if name != \"black\" else color_names[\"dark\"]\n",
"    wordcloud.layout_[i] = ((name, freq), font_size, position, orientation, rgb)\n",
"\n",
"plt.imshow(wordcloud, interpolation=\"bilinear\")\n",
"plt.axis(\"off\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
-"# Exploratory Data Analysis - Word Clouds\n",
+"# Animal Word Cloud\n",
"\n",
-"* Build a word cloud after removing stop words\n",
-"* Identify haiku-specific stop words?\n",
-"* Build word cloud after stemming/lemmatization\n",
-"* Build word clouds for\n",
-" * flowers\n",
-" * colors\n",
-" * season n-grams\n",
-" * wind n-grams\n",
-" * birds\n",
-" * animals"
+"A host of flora and fauna are mentioned in the haiku dataset, so I want to produce a word cloud for animals mentioned in haiku as well.\n",
+"However, compiling a list of animal names is nontrivial, and I prefer to defer the production of an animal word cloud until the previously mentioned parsing problems have been addressed."
]
}
],
"metadata": {
"kernelspec": {
-"display_name": "Python 3",
+"display_name": "research",
"language": "python",
-"name": "python3"
+"name": "research"
},
"language_info": {
"codemirror_mode": {
@@ -35,9 +202,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.6.7"
+"version": "3.7.3"
}
},
"nbformat": 4,
-"nbformat_minor": 2
+"nbformat_minor": 4
}
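
The notebook leaves the multi-token name problem open. One possible approach is to scan each tokenized haiku for the longest matching name at every position, rather than counting single words from a bag. The following is only a minimal sketch under stated assumptions: `count_names`, its parameters, and the sample data are hypothetical and not part of haikulib, and it assumes each haiku is available as a list of lowercase tokens and that the names are a plain set of space-separated strings.

```python
from collections import Counter


def count_names(haiku_tokens, names, max_len=3):
    """Count single- and multi-token names (hypothetical helper, not in haikulib).

    haiku_tokens: iterable of token lists, one list per haiku (assumed format).
    names: set of lowercase names, e.g. {"cherry blossom", "rose"}.
    max_len: longest name length, in tokens, to search for.
    """
    counts = Counter()
    for tokens in haiku_tokens:
        i = 0
        while i < len(tokens):
            # Try the longest candidate starting at position i first.
            for n in range(min(max_len, len(tokens) - i), 0, -1):
                candidate = " ".join(tokens[i:i + n])
                if candidate in names:
                    counts[candidate] += 1
                    # Skip past the matched tokens so they are not recounted.
                    i += n
                    break
            else:
                # No name starts at this token; move on.
                i += 1
    return counts


# Made-up sample data for illustration only.
sample = [
    ["cherry", "blossom", "petals", "fall"],
    ["a", "rose", "beside", "the", "cherry", "blossom"],
]
flowers = count_names(sample, {"cherry blossom", "rose", "cherry"})
```

Because the longest match wins, "cherry blossom" is counted as one name rather than as a separate "cherry". The resulting `Counter` maps names to counts, so it could be passed to `generate_from_frequencies` in the same way as the single-token counters in the notebook.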
