Materials for the Text to Tech workshop at the Digital Humanities Oxford Summer School by Kaspar von Beelen, Mariona Coll Ardanuy and Federico Nanni.
The workshop will mostly rely on Google Colab for the hands-on activities.
- Welcome slides
- Intro to Python (a)
- Intro to Python (b)
- Intro to Python (c)
- Functions
- Opening Files
- Basic Text Processing
- Regular Expressions
- Lists, sets and tuples
- Dictionaries and JSON
- Text Processing Exercises
- Data Structures Exercises
- Introduction to Machine Learning for NLP: slides
- Intro to NLP (1)
- Intro to NLP (2)
- Intro to NLP (3)
- Intro to NLP (4)
- Introduction to Language Modelling: slides
- Word embeddings (1)
- Word embeddings (2)
- Introduction to Foundation Models and Transfer Learning: slides
- Transformers for NLP
- Introduction to Generative AI: slides
- Poking LLMs with HuggingFace
- Using local LLMs
Our entire course will be on Google Colab. If you want to set up the notebooks locally on your machine, follow the instructions below. However, bear in mind that some of the tools might not work well on older laptops (especially from Day 4 onwards).
- Install Anaconda
- Download the content of this repository and unzip
- Open Anaconda Navigator
- From Anaconda, create environment py39
- Install JupyterLab in environment
- Launch JupyterLab
- Open terminal in Jupyter Lab
- Write the following in the terminal, step-by-step: `conda activate py39`
- Update pip: `pip install --upgrade pip`
- Change directory using the `cd` command in the terminal until you are in the course folder. There you should run: `pip install -r requirements.txt`
- Add the environment to Jupyter, either by following the instructions from here or by running `ipython kernel install --user --name=py39`

You can then start using the notebooks: select `py39` as the kernel (restart JupyterLab if the correct kernel does not show).
You can find more detailed instructions here.
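Once the local setup is done, it can help to confirm from inside a notebook that the selected kernel actually sees the installed packages. The following is a minimal sketch, not part of the course materials: the helper names (`parse_requirement_names`, `find_missing`, `check_environment`) are hypothetical, and the parser assumes `requirements.txt` contains only simple lines such as `name` or `name==version`. It does not handle URLs or other advanced pip syntax.

```python
import re
import sys
from importlib import metadata


def parse_requirement_names(text):
    """Extract bare package names from the text of a requirements file.

    Assumption: only simple lines such as "numpy>=1.20" or "pkg[extra]".
    """
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        # The name is everything before a version specifier or extra.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def find_missing(names):
    """Return the names that are not installed in the current environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


def check_environment(requirements_path="requirements.txt"):
    """Print the Python version in use and report missing requirements."""
    print("Python version:", sys.version.split()[0])
    with open(requirements_path, encoding="utf-8") as handle:
        names = parse_requirement_names(handle.read())
    missing = find_missing(names)
    if missing:
        print("Not installed in this kernel:", ", ".join(missing))
    else:
        print("All listed packages are installed.")
    return missing
```

Calling `check_environment()` in a notebook opened from the course folder prints the Python version of the active kernel and lists any package from `requirements.txt` that the kernel cannot find. A non-empty list usually suggests that the wrong kernel is selected or that `pip install -r requirements.txt` was run in a different environment.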
Datasets used:
- The Living Machines atypical animacy dataset, freely available here.
- MuSe: The Musical Sentiment Dataset Muse
- A historical dataset on popular baby names in the United States from 1880 onwards. Available here.
- A sample of British Library 19th Century Books collected from here.
- A sample of British newspaper articles, digitized by Heritage Made Digital.
- Walsh, Melanie. Introduction to Cultural Analytics & Python, https://melaniewalsh.github.io/Intro-Cultural-Analytics/welcome
- Karsdorp, Folgert. Python Programming for Humanists. http://www.karsdorp.io/python-course/.
- Montfort, Nick. Exploratory Programming for the Arts and Humanities. Cambridge, Massachusetts: The MIT Press, 2016. https://mitpress.mit.edu/books/exploratory-programming-arts-and-humanities.
- Sinclair, Stéfan, and Geoffrey Rockwell. The Art of Literary Text Analysis. Melissa Mony., 2016. https://github.com/sgsinclair/alta/blob/77b256f7c3ff3ceb6643d53da401096c8cdcc468/ipynb/ArtOfLiteraryTextAnalysis.ipynb.
- Graham, Shawn, Ian Milligan, Scott Weingart. The Historian's Macroscope. Under contract with Imperial College Press. Open Draft Version, Autumn 2013, http://themacroscope.org
- Downey, Allen, Peter Wentworth, Jeffrey Elkner, and Chris Meyers. “How To Think Like A Computer Scientist: Learning with Python 3.” (2016).
- Karsdorp, Folgert, Mike Kestemont and Allen Riddell, Humanities Data Analysis: Case Studies with Python, https://www.humanitiesdataanalysis.org
- Jurafsky, Daniel, and J. H. Martin. "Vector semantics and embeddings." Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (2019): 94-122. https://web.stanford.edu/~jurafsky/slp3/6.pdf
- Smith, Noah A. "Contextual word representations: A contextual introduction." arXiv preprint arXiv:1902.06006 (2019). https://arxiv.org/pdf/1902.06006.pdf
- Boleda, Gemma. "Distributional semantics and linguistic theory." Annual Review of Linguistics 6 (2020): 213-234. https://arxiv.org/pdf/1905.01896.pdf
- Rogers, Anna. "Changing the World by Changing the Data." arXiv preprint arXiv:2105.13947 (2021). https://arxiv.org/pdf/2105.13947.pdf
- Wevers, Melvin, and Marijn Koolen. "Digital begriffsgeschichte: Tracing semantic change using word embeddings." Historical Methods: A Journal of Quantitative and Interdisciplinary History 53, no. 4 (2020): 226-243. https://www.tandfonline.com/doi/pdf/10.1080/01615440.2020.1760157
- Bender, Emily M., and Alexander Koller. "Climbing towards NLU: On meaning, form, and understanding in the age of data." In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5185-5198. 2020. https://www.aclweb.org/anthology/2020.acl-main.463.pdf
- Grimmer, Justin, Margaret E. Roberts, and Brandon M. Stewart. "Text as data: A new framework for machine learning and the social sciences." Princeton University Press, 2022. https://press.princeton.edu/books/paperback/9780691207551/text-as-data
This course is based upon many previous resources. Apart from the ones above:
- Nilo Pedrazzini's introduction notebook to Word2Vec.
- Materials from previous editions of this course, written by Barbara McGillivray and Gard Jenset
- The Turing's Research Software Engineering and Research Data Science Courses
- The Turing Way
- The Turing Digital Humanities & Research Software Engineering Summer School
- Fede's Computational Text Analysis Course
Resources mentioned during the workshop: slides