- Marti Hearst. 1999. Untangling Text Data Mining.
Post a little bit about yourself in the Introductions forum, following the instructions there.
This week's lab task is mostly to play! It is intended to get you comfortable with out-of-the-box text analysis tools.
Use Voyant to visualize a text or set of texts. It can be anything you want: a book, a set of lyrics, scripts from a show you like, news articles. Try out the various features in Voyant: phrases, keywords in context, etc.
Once you've had a chance to play with Voyant, post a short response to the lab task forum (no more than 300 words) about your experience. Some possible things to post about: What was interesting or confusing about the tool? Did you find anything intriguing about your text or texts? Did it find any recurring patterns or phrases? Did you find any visualizations beyond the word cloud to be interesting? Any other thoughts? Don't forget to tell us what text you used with Voyant.
Just a reminder that 'readings' refer to the readings you should have done by the lecture, while lab tasks are done by next week. The intention is that both relate to the current week's theme: the readings prepare you for the lecture, and the lab task lets you practice that learning.
- Sections 4.1, 4.3, and 4.4 of Search Engines: Information Retrieval in Practice (Croft, Metzler and Strohman). Starts on page 72.
- Parts of Chapter 2, Introduction to Information Retrieval (Manning, Raghavan, Schütze): Intro, Tokenization, Stop lists
This week's lab task is about getting started with powerful tools that will underlie many of the skills you learn in the course. The lab task is posted in Jupyter notebook format on GitHub.
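If you'd like a taste before the notebook, here is a minimal sketch of the tokenization and stop list ideas from the readings, using Python's NLTK library (NLTK is an assumption here; the lab may use different tools, and the sample sentence is invented):

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# One-time downloads: the tokenizer model and the stop list.
nltk.download("punkt")
nltk.download("stopwords")

text = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(text.lower())      # split the text into word tokens
stops = set(stopwords.words("english"))   # common function words to drop
content = [t for t in tokens if t.isalpha() and t not in stops]
print(content)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```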
- 2.2.3 of Intro to IR: Normalization. If you missed 2.2.1 and 2.2.2 last week, catch up on those also.
- Term Weighting for Humanists. Peter Organisciak.
Supplemental:
- Term frequency and weighting. Intro to IR.
This week's lab task is again a series of questions, following along with a worksheet. Find it here.
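Since this week's readings center on term weighting, here is a small worked sketch of tf-idf in plain Python (the three toy 'documents' are invented, and real schemes vary in their log base and normalization):

```python
import math

docs = [
    ["cat", "sat", "mat"],
    ["cat", "cat", "hat"],
    ["dog", "sat", "log"],
]

def tf_idf(term, doc, docs):
    """tf-idf = (term frequency in the doc) * log(N / document frequency)."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)  # how many docs contain the term
    return tf * math.log(len(docs) / df)

# 'cat' appears in 2 of 3 docs, so each occurrence is down-weighted
# relative to 'hat', which appears in only 1.
print(tf_idf("cat", docs[1], docs))  # 2 * log(3/2) ≈ 0.81
print(tf_idf("hat", docs[1], docs))  # 1 * log(3/1) ≈ 1.10
```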
- Liza Daly's Generative Blackout Poetry - This work uses some simple language rules of the kind that will be useful later in the course.
The following three readings are web articles related to Twitter bots: for activism, for recontextualization, and a roundup of interesting bots. Not all of these are text-related, but they serve as a good overview.
- How Twitter Bots Turn Tweeters into Activists
- Introducing censusAmericans, A Twitter Bot For America
- 12 Weird, Excellent Twitter Bots Chosen by Twitter’s Best Bot-Makers
- Optional: The Rise of Twitter Bots
Slides
The Twitter Bot assignment is posted on the Assignments page. A draft post (describing your plans) is due next week, and the final bot is due in two weeks.
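To make the scale of the assignment concrete, the simplest bots are just template-filling rules; here is a minimal sketch (all the word lists are invented, and posting to Twitter itself is left out):

```python
import random

# Word lists for a fixed template; a real bot would draw on richer data.
adjectives = ["tiny", "forgotten", "luminous", "restless"]
nouns = ["archive", "census", "poem", "algorithm"]
verbs = ["hums", "waits", "unfolds", "dreams"]

def make_post():
    """Fill a fixed template with random word choices."""
    return f"The {random.choice(adjectives)} {random.choice(nouns)} {random.choice(verbs)}."

for _ in range(3):
    print(make_post())  # e.g. "The luminous archive waits."
```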
- Submit Twitter bot draft
- Lab 4 Worksheet.
Against Cleaning - Katie Rawson, Trevor Muñoz
- Natural Language Processing for Programmers, Part 2 - Liza Daly
- This talks about an old concept, but is written from a beginner perspective and is useful for your assignment.
- Part of Speech Tagging - Chapter 10 (up to 10.4) of Speech and Language Processing (3rd ed. draft)
- Chapter 5.7 of the NLTK Book - Bird et al.
- Just section 7, but sections 1-2, 4-6 are useful as supplements to the SLP reading if you need more info or simply find it interesting. Section 7 is the conclusion of the chapter, which succinctly describes the ways that we understand a part of speech.
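If you want to try tagging before class, here is a minimal sketch with NLTK's off-the-shelf tagger (the sentence is the classic ambiguity example from the NLTK book):

```python
import nltk

# One-time downloads: the tokenizer model and the default POS tagger.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit.")
print(nltk.pos_tag(tokens))
# 'refuse' and 'permit' each show up as both a verb (VB*) and a noun (NN) -
# exactly the ambiguity that part-of-speech taggers must resolve.
```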
Twitter bot: Post to the Twitter Bot Final forum.
No lab task. Complete your bot!
Week 6: Understanding Words - Natural Language Processing 2, Information Extraction and Dependency Parsing
- Information Extraction. Section 4.6 of Search Engines: Information Retrieval in Practice (Croft, Metzler and Strohman). Starts on page 113.
- Information Extraction (up to and including section 21.2.3). Speech and Language Processing (3rd ed. draft).
Optional Reading
- Google's approach to dependency parsing, SyntaxNet, and the model trained with it, Parsey McParseFace, are the current state of the art. This tutorial, while optional, offers a look at part-of-speech tagging using feed-forward neural networks and includes a clear description of transition-based dependency parsing.
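To see one small piece of information extraction in action, here is a named entity recognition sketch using NLTK's chunker (NLTK is an assumption here; the readings describe the underlying techniques in more depth):

```python
import nltk

# One-time downloads for the tokenizer, tagger, and NE chunker.
for pkg in ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]:
    nltk.download(pkg)

sentence = "Barack Obama was born in Hawaii and worked in Washington."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  # POS tags feed the chunker
tree = nltk.ne_chunk(tagged)                         # groups tagged tokens into entities
for subtree in tree.subtrees():
    if subtree.label() != "S":
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
# expect labels like PERSON for 'Barack Obama' and GPE for 'Hawaii'
```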
06 - Natural Language Processing 1 - Part of Speech Tagging
Naive Bayes Classification and Sentiment, Speech and Language Processing (3rd edition). Dan Jurafsky and James H. Martin.
Notation
We're getting to the point of the term where some mathematical notation is necessary for our readings to communicate the underlying theory.
If you are unfamiliar with Bayesian inference, the description on the 3rd page of this chapter might not satisfy your curiosity. The introduction to Bayes' Theorem from Khan Academy can help equip you with some more background about what we use Bayes' Theorem for.
Since we're looking at classes, you'll start seeing set notation, like c ∈ C. This means 'c' is an element of 'C', or in the context of our reading, this class (c) is part of the set of all possible classes (C). Why is that something we'd want to state? Because for Naive Bayes classification, we'll be choosing the class c with the highest probability given the evidence. The equations simply need a way to state "consider P(c|d) for all possible classes and choose the class with the highest value", which they do with ĉ = argmax_{c ∈ C} P(c|d).
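To make that argmax concrete, here is a minimal Naive Bayes sketch in Python (the classes, words, and probabilities are all invented for illustration):

```python
import math

# Toy log-probabilities: log P(c) and log P(w|c) for two invented classes.
log_prior = {"pos": math.log(0.5), "neg": math.log(0.5)}
log_likelihood = {
    "pos": {"great": math.log(0.5), "boring": math.log(0.1)},
    "neg": {"great": math.log(0.1), "boring": math.log(0.5)},
}

def classify(words):
    """Return the c in C maximizing log P(c) + sum of log P(w|c)."""
    scores = {}
    for c in log_prior:  # consider every class c ∈ C
        score = log_prior[c]
        for w in words:
            score += log_likelihood[c].get(w, math.log(1e-6))  # crude smoothing
        scores[c] = score
    return max(scores, key=scores.get)  # the argmax over classes

print(classify(["great", "great", "boring"]))  # -> 'pos'
```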
- 07 - Classification
- Includes material from: SLP v.3 slides (Jurafsky and Martin)
No required readings this week, focus on the lab task!
Optional Reading
- Brent Daniel Mittelstadt, Patrick Allo, Mariarosaria Taddeo, Sandra Wachter, Luciano Floridi. 2016. "The ethics of algorithms: Mapping the debate". *Big Data & Society*, Vol 3, Issue 2.
- Recent BBC2 Story (audio): Controlling the Unaccountable Algorithm
As with our class on art and criticism, some of the most accessible work on ethics is from the bot-making community.
- Week 08 - Classification 2 and Ethics in Text Mining
- Includes material from: SLP v.3 slides (Jurafsky and Martin)
- Textual Analysis - John Burrows, A Companion to Digital Humanities
- Clustering - scikit-learn documentation: Read the overview and the intros to 2.3.2 (K-Means) and 2.3.6 (Hierarchical clustering). A short k-means sketch follows the supplemental readings below.
Supplemental Readings
- Cluster Analysis - Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining
- Beyond tokens: what character counts say about a page. Peter Organisciak
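As promised above, here is a minimal k-means sketch with scikit-learn (the toy documents are invented; real corpora need far more data for the clusters to be meaningful):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the cat sat on the mat",
    "the cat likes the soft mat",
    "markets fell sharply today",
    "markets fall as investors worry",
]

# Represent each document as a tf-idf vector, then cluster the vectors.
X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # e.g. [0 0 1 1]: cat documents vs. market documents
```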
Spring Break. No class.
Topic modeling made just simple enough. 2012. Ted Underwood.
Probabilistic Topic Models. 2012. David Blei.
Supplemental
Introduction to Latent Dirichlet Allocation. 2011. Edwin Chen.
Lab task 09 - Dimensionality Reduction and Sentiment Analysis
Recommended: Get started on your topic modeling assignment. Make sure you can get MALLET running on your system.
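While you sort out MALLET, the same model can be sketched in Python with gensim (gensim is an assumption here, not a course requirement, and its LDA is a different implementation than MALLET's):

```python
from gensim import corpora, models

# Toy, pre-tokenized "documents", invented for illustration.
texts = [
    ["topic", "models", "find", "patterns", "of", "words"],
    ["lda", "is", "a", "probabilistic", "topic", "model"],
    ["reviews", "describe", "restaurants", "and", "their", "food"],
    ["good", "food", "earns", "restaurants", "good", "reviews"],
]

dictionary = corpora.Dictionary(texts)           # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words vectors
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)  # each topic is a weighted list of words
```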
Topic Modeling Assignment Due. See description on the Assignments page.
Post the Problem Statement for your Text Mining Project. See description on the Assignments page.
Narrative framing of consumer sentiment in online restaurant reviews. Dan Jurafsky, Victor Chahuneau, Bryan R. Routledge, Noah A. Smith.
Optional but Recommended
Indexing by Latent Semantic Analysis. Deerwester, Dumais, Furnas, Landauer, Harshman.
This is one of our core papers in Library and Information Science - 13k citations can't be wrong. You'll notice that these famous papers are particularly easy to read - ChengXiang Zhai's smoothing paper is another example - a good reminder that being clever is only useful if you can communicate it.
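The method the paper describes, LSA, is easy to try yourself. Here is a minimal sketch using scikit-learn's TruncatedSVD over tf-idf vectors (scikit-learn is an assumption, not what the authors used; the documents paraphrase the paper's toy example):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "the generation of random binary ordered trees",
    "the intersection graph of paths in trees",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)  # 2 latent dimensions
X_lsa = lsa.fit_transform(X)
print(X_lsa.shape)  # (4, 2): each document as a point in latent semantic space
```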
Topic Modeling II and Sentiment Analysis
It's a busy time, so there are no readings this week!
- Literature Review and Data Collection for your final project.
- Word Embeddings for the digital humanities. 2015. Benjamin Schmidt.
- Vector Representations of Words (stop at 'Building the Graph'). TensorFlow Tutorials.
Supplemental (Optional)
Bonus
Something to play with: the "Bonus App" at the bottom of Radim Řehůřek's Word2Vec tutorial.
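If you want one step beyond the bonus app, here is a minimal gensim Word2Vec sketch (parameter names follow gensim 4.x, and the tiny invented corpus is far too small for good vectors):

```python
from gensim.models import Word2Vec

# A tiny pre-tokenized corpus; real embeddings need millions of words.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "cat"],
    ["the", "cat", "chases", "the", "mouse"],
] * 50  # repeat so the toy model has something to learn from

model = Word2Vec(sentences, vector_size=20, window=2, min_count=1, seed=0)
print(model.wv["king"][:5])                   # first few dimensions of a vector
print(model.wv.most_similar("king", topn=2))  # nearest neighbors in the space
```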
May 3rd is the last day to turn in late lab tasks! Get them in!