- Marti Hearst. 1999. Untangling Text Data Mining.
Post a little bit about yourself in the Introductions forum, following the instructions there.
This week's lab task is mostly to play! It is intended to get you comfortable with out-of-the-box text analysis tools.
Use Voyant to visualize a text or set of texts. It can be anything you want: a book, a set of lyrics, scripts from a show you like, news articles. Try out the various features in Voyant: phrases, keywords in context, etc.
Once you've had a chance to play with Voyant, post a short response to the lab task forum (no more than 300 words) about your experience. Some possible things to post about: What was interesting or confusing about the tool? Did you find anything intriguing about your text or texts? Did it find any recurring patterns or phrases? Did you find any visualizations beyond the word cloud to be interesting? Any other thoughts? Don't forget to tell us what text you used with Voyant.
Just a reminder that 'readings' refer to the readings you should have done by the lecture, while lab tasks are done by next week. The intention is that both relate to the current week's theme: the readings prepare you for the lecture, and the lab task lets you practice that learning.
- Sections 4.1, 4.3, and 4.4 of Search Engines: Information Retrieval in Practice (Croft, Metzler and Strohman). Starts on page 72.
- Parts of Chapter 2, Introduction to Information Retrieval (Manning, Raghavan, Schütze): Intro, Tokenization, Stop lists
This week's lab task is about getting started with powerful tools that will underlie many of the skills you learn in the course. The lab task is posted in Jupyter notebook format on GitHub.
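If you'd like a taste before the notebook, here is a minimal sketch of the tokenization and stop list ideas from the readings, using Python's NLTK library (NLTK is an assumption here; the lab may use different tools, and the sample sentence is invented):

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# One-time downloads: the tokenizer model and the stop list.
nltk.download("punkt")
nltk.download("stopwords")

text = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(text.lower())      # split the text into word tokens
stops = set(stopwords.words("english"))   # common function words to drop
content = [t for t in tokens if t.isalpha() and t not in stops]
print(content)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```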
- 2.2.3 of Intro to IR: Normalization. If you missed 2.2.1 and 2.2.2 last week, catch up on those also.
- Term Weighting for Humanists. Peter Organisciak.
Supplemental:
- Term frequency and weighting. Intro to IR.
This week's lab task is again a series of questions, following along with a worksheet. Find it here.
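Since this week's readings center on term weighting, here is a small worked sketch of tf-idf in plain Python (the three toy 'documents' are invented, and real schemes vary in their log base and normalization):

```python
import math

docs = [
    ["cat", "sat", "mat"],
    ["cat", "cat", "hat"],
    ["dog", "sat", "log"],
]

def tf_idf(term, doc, docs):
    """tf-idf = (term frequency in the doc) * log(N / document frequency)."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)  # how many docs contain the term
    return tf * math.log(len(docs) / df)

# 'cat' appears in 2 of 3 docs, so each occurrence is down-weighted
# relative to 'hat', which appears in only 1.
print(tf_idf("cat", docs[1], docs))  # 2 * log(3/2) ≈ 0.81
print(tf_idf("hat", docs[1], docs))  # 1 * log(3/1) ≈ 1.10
```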
- Liza Daly's Generative Blackout Poetry - This work uses some simple language rules of the kind that will be useful later in the course.
The following three readings are web articles related to Twitter bots: for activism, for recontextualization, and a roundup of interesting bots. Not all of these are text-related, but they serve as a good overview.
- How Twitter Bots Turn Tweeters into Activists
- Introducing censusAmericans, A Twitter Bot For America
- 12 Weird, Excellent Twitter Bots Chosen by Twitter’s Best Bot-Makers
- Optional: The Rise of Twitter Bots
Slides
The Twitter Bot assignment is posted on the Assignments page. A draft post (describing your plans) is due next week, and the final bot is due in two weeks.
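To make the scale of the assignment concrete, the simplest bots are just template-filling rules; here is a minimal sketch (all the word lists are invented, and posting to Twitter itself is left out):

```python
import random

# Word lists for a fixed template; a real bot would draw on richer data.
adjectives = ["tiny", "forgotten", "luminous", "restless"]
nouns = ["archive", "census", "poem", "algorithm"]
verbs = ["hums", "waits", "unfolds", "dreams"]

def make_post():
    """Fill a fixed template with random word choices."""
    return f"The {random.choice(adjectives)} {random.choice(nouns)} {random.choice(verbs)}."

for _ in range(3):
    print(make_post())  # e.g. "The luminous archive waits."
```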
- Submit Twitter bot draft
- Lab 4 Worksheet.
Against Cleaning - Katie Rawson, Trevor Muñoz
- Natural Language Processing for Programmers, Part 2 - Liza Daly
- This talks about an old concept, but is written from a beginner perspective and is useful for your assignment.
- Part of Speech Tagging - Chapter 10 (up to 10.4) of Speech and Language Processing (3rd ed. draft)
- Chapter 5.7 of the NLTK Book - Bird et al.
- Just section 7, but sections 1-2, 4-6 are useful as supplements to the SLP reading if you need more info or simply find it interesting. Section 7 is the conclusion of the chapter, which succinctly describes the ways that we understand a part of speech.
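If you want to try tagging before class, here is a minimal sketch with NLTK's off-the-shelf tagger (the sentence is the classic ambiguity example from the NLTK book):

```python
import nltk

# One-time downloads: the tokenizer model and the default POS tagger.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit.")
print(nltk.pos_tag(tokens))
# 'refuse' and 'permit' each show up as both a verb (VB*) and a noun (NN) -
# exactly the ambiguity that part-of-speech taggers must resolve.
```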
Twitter bot: Post to the Twitter Bot Final forum.
No lab task. Complete your bot!
Week 6: Understanding Words - Natural Language Processing 2, Information Extraction and Dependency Parsing
- Information Extraction. Section 4.6 of Search Engines: Information Retrieval in Practice (Croft, Metzler and Strohman). Starts on page 113.
- Information Extraction (up to and including section 21.2.3). Speech and Language Processing (3rd ed. draft).
Optional Reading
- Google's approach to dependency parsing, SyntaxNet, and the model trained with it, Parsey McParseFace, are the current state of the art. This tutorial, while optional, offers a look at part-of-speech tagging using feed-forward neural networks and includes a clear description of transition-based dependency parsing.
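To see one small piece of information extraction in action, here is a named entity recognition sketch using NLTK's chunker (NLTK is an assumption here; the readings describe the underlying techniques in more depth):

```python
import nltk

# One-time downloads for the tokenizer, tagger, and NE chunker.
for pkg in ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]:
    nltk.download(pkg)

sentence = "Barack Obama was born in Hawaii and worked in Washington."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  # POS tags feed the chunker
tree = nltk.ne_chunk(tagged)                         # groups tagged tokens into entities
for subtree in tree.subtrees():
    if subtree.label() != "S":
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
# expect labels like PERSON for 'Barack Obama' and GPE for 'Hawaii'
```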
06 - Natural Language Processing 1 - Part of Speech Tagging
Naive Bayes Classification and Sentiment, Speech and Language Processing (3rd edition). Dan Jurafsky and James H. Martin.
Notation
We're getting to the point of the term where some mathematical notation is necessary for our readings to communicate the underlying theory.
If you are unfamiliar with Bayesian inference, the description on the 3rd page of this chapter might not satisfy your curiosity. The introduction to Bayes' Theorem from Khan Academy can help equip you with some more background about what we use Bayes' Theorem for.
Since we're looking at classes, you'll start seeing set notation, like c ∈ C. This means 'c' is an element of 'C', or in the context of our reading, this class (c) is part of the set of all possible classes (C). Why is that something we'd want to state? Because for Naive Bayes classification, we'll be choosing the class c with the highest probability given the evidence. The equations simply need a way to state "consider P(c|d) for all possible classes and choose the class with the highest value", which they do with ĉ = argmax_{c ∈ C} P(c|d).
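To make that argmax concrete, here is a minimal Naive Bayes sketch in Python (the classes, words, and probabilities are all invented for illustration):

```python
import math

# Toy log-probabilities: log P(c) and log P(w|c) for two invented classes.
log_prior = {"pos": math.log(0.5), "neg": math.log(0.5)}
log_likelihood = {
    "pos": {"great": math.log(0.5), "boring": math.log(0.1)},
    "neg": {"great": math.log(0.1), "boring": math.log(0.5)},
}

def classify(words):
    """Return the c in C maximizing log P(c) + sum of log P(w|c)."""
    scores = {}
    for c in log_prior:  # consider every class c ∈ C
        score = log_prior[c]
        for w in words:
            score += log_likelihood[c].get(w, math.log(1e-6))  # crude smoothing
        scores[c] = score
    return max(scores, key=scores.get)  # the argmax over classes

print(classify(["great", "great", "boring"]))  # -> 'pos'
```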
- 07 - Classification
- Includes material from: SLP v.3 slides (Jurafsky and Martin)
No required readings this week, focus on the lab task!
Optional Reading
- Brent Daniel Mittelstadt, Patrick Allo, Mariarosaria Taddeo, Sandra Wachter, Luciano Floridi. 2016. "The ethics of algorithms: Mapping the debate". *Big Data & Society*, Vol 3, Issue 2.
- Recent BBC2 Story (audio): Controlling the Unaccountable Algorithm
As with our class on art and criticism, some of the most accessible work on ethics is from the bot-making community.
- Week 08 - Classification 2 and Ethics in Text Mining
- Includes material from: SLP v.3 slides (Jurafsky and Martin)
- Textual Analysis - John Burrows, A Companion to Digital Humanities
- Clustering - scikit-learn documentation: Read the overview and the intros to 2.3.2 (K-Means) and 2.3.6 (Hierarchical clustering). A short k-means sketch follows the supplemental readings below.
Supplemental Readings
- Cluster Analysis - Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining
- Beyond tokens: what character counts say about a page. Peter Organisciak
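As promised above, here is a minimal k-means sketch with scikit-learn (the toy documents are invented; real corpora need far more data for the clusters to be meaningful):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the cat sat on the mat",
    "the cat likes the soft mat",
    "markets fell sharply today",
    "markets fall as investors worry",
]

# Represent each document as a tf-idf vector, then cluster the vectors.
X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # e.g. [0 0 1 1]: cat documents vs. market documents
```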
Spring Break. No class.
Topic modeling made just simple enough. 2012. Ted Underwood.
Probabilistic Topic Models. 2012. David Blei.
Supplemental
Introduction to Latent Dirichlet Allocation. 2011. Edwin Chen.
Lab task 09 - Dimensionality Reduction and Sentiment Analysis
Recommended: Get started on your topic modeling assignment. Make sure you can get MALLET running on your system.
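While you sort out MALLET, the same model can be sketched in Python with gensim (gensim is an assumption here, not a course requirement, and its LDA is a different implementation than MALLET's):

```python
from gensim import corpora, models

# Toy, pre-tokenized "documents", invented for illustration.
texts = [
    ["topic", "models", "find", "patterns", "of", "words"],
    ["lda", "is", "a", "probabilistic", "topic", "model"],
    ["reviews", "describe", "restaurants", "and", "their", "food"],
    ["good", "food", "earns", "restaurants", "good", "reviews"],
]

dictionary = corpora.Dictionary(texts)           # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words vectors
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)  # each topic is a weighted list of words
```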
Topic Modeling Assignment Due. See description on the Assignments page.
Post the Problem Statement for your Text Mining Project. See description on the Assignments page.
Narrative framing of consumer sentiment in online restaurant reviews. Dan Jurafsky, Victor Chahuneau, Bryan R. Routledge, Noah A. Smith.
Optional but Recommended
Indexing by Latent Semantic Analysis. Deerwester, Dumais, Furnas, Landauer, Harshman.
This is one of our core papers in Library and Information Science - 13k citations can't be wrong. You'll notice that these famous papers are particularly easy to read - ChengXiang Zhai's smoothing paper is another example - a good reminder that being clever is only useful if you can communicate it.
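The method the paper describes, LSA, is easy to try yourself. Here is a minimal sketch using scikit-learn's TruncatedSVD over tf-idf vectors (scikit-learn is an assumption, not what the authors used; the documents paraphrase the paper's toy example):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "the generation of random binary ordered trees",
    "the intersection graph of paths in trees",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)  # 2 latent dimensions
X_lsa = lsa.fit_transform(X)
print(X_lsa.shape)  # (4, 2): each document as a point in latent semantic space
```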
Topic Modeling II and Sentiment Analysis
It's a busy time, so there are no readings this week!
- Literature Review and Data Collection for your final project.
- Word Embeddings for the digital humanities. 2015. Benjamin Schmidt.
- Vector Representations of Words (stop at 'Building the Graph'). TensorFlow Tutorials.
Supplemental (Optional)
Bonus
Something to play with: the "Bonus App" at the bottom of Radim Řehůřek's Word2Vec tutorial.
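If you want one step beyond the bonus app, here is a minimal gensim Word2Vec sketch (parameter names follow gensim 4.x, and the tiny invented corpus is far too small for good vectors):

```python
from gensim.models import Word2Vec

# A tiny pre-tokenized corpus; real embeddings need millions of words.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "cat"],
    ["the", "cat", "chases", "the", "mouse"],
] * 50  # repeat so the toy model has something to learn from

model = Word2Vec(sentences, vector_size=20, window=2, min_count=1, seed=0)
print(model.wv["king"][:5])                   # first few dimensions of a vector
print(model.wv.most_similar("king", topn=2))  # nearest neighbors in the space
```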
May 3rd is the last day to turn in late lab tasks! Get them in!