This project aims to model topics in environmental news covered by various sources over several years. My goal here is to explore topics using unsupervised learning techniques and to assess how well they detect subtopics. These techniques include (1) matrix decomposition/factorization, e.g., NMF (Non-negative Matrix Factorization), LDA (Latent Dirichlet Allocation), and PCA (Principal Component Analysis), and (2) clustering algorithms, e.g., KMeans.
One critical assumption I made in this project is that each article can be described by only one topic, so one-hot encoding was used to categorize the articles. In reality, an article may touch upon many topics, and a model can certainly capture this nuance. However, this project focuses less on the subtlety of the topics than on the amount of coverage different news sources gave to them. This simplification made comparison between sources much easier.
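The one-topic-per-article simplification can be sketched as follows, assuming a document-topic weight matrix such as the one produced by NMF or LDA. The weights, source labels, and column names below are made up for illustration; the dominant topic is taken to be the one with the largest weight, which is one plausible way to do the assignment rather than necessarily the exact rule used in this project.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic weights (rows: articles, columns: topics).
doc_topic = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.10, 0.80],
    [0.30, 0.60, 0.10],
    [0.05, 0.90, 0.05],
])
sources = ["NYTimes", "NYTimes", "Fox News", "Fox News"]

# Keep only the highest-weighted topic for each article.
dominant = doc_topic.argmax(axis=1)

# One-hot encode the dominant topic: exactly one 1 per row.
n_topics = doc_topic.shape[1]
one_hot = np.eye(n_topics, dtype=int)[dominant]

# The share of each source's articles per topic is then a simple group mean.
topic_cols = [f"topic_{k}" for k in range(n_topics)]
coverage = (
    pd.DataFrame(one_hot, columns=topic_cols)
    .assign(source=sources)
    .groupby("source")
    .mean()
)
print(coverage)
```

Because each row of `one_hot` sums to 1, the per-source means sum to 1 as well, so they can be read directly as coverage shares.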
I obtained full-text articles from NYTimes and Fox News to compare their coverage. NYTimes was selected for its extensively developed API, and Fox News because it is a good comparison point to NYTimes. Additionally, I used NewsAPI to get articles from a wide range of sources; the free plan only returns articles up to 1 month old. Results of this project are displayed as an interactive Tableau dashboard in 3 tabs: (1) evolution of environmental topics in NYTimes over 16 years, (2) comparison between NYTimes and Fox News, and (3) topic distribution in articles obtained through NewsAPI.
The full-text articles are stored in MongoDB on an AWS-EC2 instance. MongoDB is a NoSQL database that uses JSON-like documents and syntax. I chose it because the NewsAPI output, the Fox News website, and the NYTimes API output share no fixed data structure, and because full-text articles are long-form, unstructured data. The database is hosted on an AWS-EC2 instance due to the large file size. The CSVs in this GitHub repository consist only of the URLs for the articles:
- NYTimes: 13654 articles (08/2002-07/2018)
- Fox News: 3132 articles (09/2012-08/2018)
- NYTimes/Fox News subset (same time frame): 6876 articles (09/2012-08/2018)
- NewsAPI (various sources): 20628 articles (09/2017-08/2018)
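To illustrate the MongoDB storage described above, here is a minimal sketch that wraps articles from different sources in one loose document shape and inserts them with pymongo. The helper names (`to_document`, `store_articles`), the field names, the database and collection names, and the connection URI are all hypothetical; the project's real schema may differ.

```python
from typing import Any, Dict, List, Optional


def to_document(source: str, url: str, text: str,
                extra: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a JSON-like document; source-specific fields go under 'extra'."""
    return {
        "source": source,
        "url": url,
        "text": text,
        # MongoDB does not require every document to share the same fields.
        "extra": dict(extra or {}),
    }


def store_articles(docs: List[Dict[str, Any]],
                   uri: str = "mongodb://localhost:27017",  # assumed URI
                   db_name: str = "news",                    # hypothetical name
                   coll_name: str = "articles") -> int:      # hypothetical name
    """Insert the documents and return how many were written."""
    # Imported here so the helper above works without a MongoDB driver installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    try:
        result = client[db_name][coll_name].insert_many(docs)
        return len(result.inserted_ids)
    finally:
        client.close()


# Example documents with different extra fields per source.
example_docs = [
    to_document("NYTimes", "https://example.com/a", "Full text A",
                {"section": "Climate"}),
    to_document("NewsAPI", "https://example.com/b", "Full text B",
                {"outlet": "Example Outlet", "author": "Jane Doe"}),
]
```

Calling `store_articles(example_docs)` requires a reachable MongoDB server, so it is left out of the snippet itself.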
This project consists of four parts:
- Getting URLs for environment-related articles (code available here)
- Downloading full-text articles from URLs using the newspaper API and storing them in MongoDB on an AWS-EC2 instance
- Using NLP techniques to model topics
- Integrating modeling data into Tableau for visualization and comparison
Python packages required: pandas, numpy, seaborn, matplotlib, sklearn, pymongo