Kaggle data used.
Tableau Story Here.
This repository contains Python scripts for cleaning and processing text data, including:
merge_and_categorize.py
: script to merge and categorize news articles, indicating a viable news source and fake news source.extract_spacy.py
: script to extract entities (e.g. countries, organizations, norp, etc.) from news articles using SpaCy Libraryword_count.py
: script to count the frequency of words in the news articlessentiment_analysis.py
: script to perform sentiment analysis on news articles rating a polarity from -1.0 (most negative) to 1.0(most positive)clean_corrupted.py
: script to clean corrupted news articles
- Python 3.x
- pandas
python -m pip install pandas
- spacy
python -m pip install spacy
- SpaCy model
python -m spacy download en_core_web_md
To use the scripts, follow these steps:
- Clone this repository to your local machine
- Download the Kaggle datasets and put them in a folder
..root/data
- Install the necessary requirements (pandas and spacy) using pip
- Run the
main.py
script or scripts in the following order:
cd data-cleaning-tools
py main.py
Or
cd data-cleaning-tools
py merge_and_categorize.py
py extract_country.py
py sentiment_analysis.py
cd misc-tools-and-parallelization
py word_count.py
py clean_corrupted.py