Columbia Data Science Institute and Columbia Entrepreneurship
This interactive course covers the basics of text mining using Python. Unstructured text (e.g. news) contains a vast amount of information but can be overwhelming to process. Text mining techniques can help you navigate, organize, and discover insights in unstructured text data. After the course, you should understand basic text manipulation in Python, standard pre-processing methods for English text, common transformations of unstructured text to quantitative data, and intuitive statistics to help navigate through a corpus.
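To give a flavor of what "transforming unstructured text to quantitative data" can look like, here is a minimal sketch using only the Python standard library. It is not taken from the course materials: the two-sentence corpus, the tiny stop-word list, and the helper names `preprocess` and `bag_of_words` are all made up for illustration, and the course may use different tools and steps.

```python
import re
from collections import Counter

# Hypothetical two-document corpus, invented for this illustration.
corpus = [
    "Stocks rallied on Friday as investors cheered strong earnings.",
    "Investors worried about earnings; stocks fell on Monday.",
]

# A tiny illustrative stop-word list (real lists are much longer).
STOP_WORDS = {"on", "as", "about", "the", "a"}


def preprocess(text):
    """Lowercase, extract alphabetic tokens with a regex, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def bag_of_words(docs):
    """Return (vocab, matrix), where matrix[i][j] counts word j in document i."""
    counts = [Counter(preprocess(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[w] for w in vocab] for c in counts]
    return vocab, matrix


vocab, matrix = bag_of_words(corpus)
for word, column in zip(vocab, zip(*matrix)):
    print(f"{word:10s} {column}")
```

Each row of `matrix` is one document and each column is one vocabulary word, so the text has become a small table of counts that ordinary statistics can be applied to.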
- When: Friday, January 17th, 2020
- Where: Room 903 SSW
- Instructor info:
- Wayne Tai Lee
- email: wtl2109 | columbia.edu
- Software setup: In this course we'll be using Anaconda to manage dependencies and Jupyter Notebooks to run code. Please follow these instructions to ensure you have the correct setup.
- If you haven't used Python before, I recommend starting with the tutorials on learnpython.org until the regular expression lesson, then moving on to DataCamp's free Introduction to Python course for more practice.
9:00am - 10:00am: Morning coffee
10:00am - 12:00pm Lecture + Lab:
- Introduction to text mining
- Regular expressions and string manipulation in Python
- Common pre-processing steps for text
12:00pm - 1:00pm Lunch on own
1:00pm - 1:45pm Lecture:
1:45pm - 2:30pm Lab
2:30pm - 3:00pm Break
Optional lectures or lab time:
3:00pm - 4:00pm
- Putting things together
- Validation discussion
- Basics of statistics (mean, variance, t-test etc.)
- Basic programming skills in Python
- Basic understanding of data structures (data frames)