Fake news classifier web application now available! Click here!
This is an end-to-end data science/machine learning project that explores a fake news dataset through exploratory data analysis and uses NLP tools and machine learning to classify news as fake or genuine. The model is used in a Django web application that takes a news article URL as input and predicts whether the article is genuine or fake.
"0. fake-news-analysis-training" contains exploratory data analysis of the data, model training and evaluation.
"1. hyperparameter-tuning" contains a notebook examining how to tune the hyperparameters of a Random Forest model. Because the dataset is large, only a sample (2,000 examples per class) is used for this investigation. The current model is NOT tuned; it is up to the user whether to go down this route.
"2. app" contains the web application, using Django and Heroku. Using a virtual environment is highly advisable. See "How to load virtual environment" below for details.
The data used for this project can be found on Kaggle. Credit goes to Clément Bisaillon for creating the dataset.
The data contains over 23k examples of fake news and over 21k examples of genuine news.
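With a dataset of this size, the tuning notebook works on a sample of 2,000 examples per class. The sketch below shows one way such a per-class sample could be drawn using only the standard library. The `sample_per_class` helper, the `label` key, and the toy records are assumptions for illustration and are not taken from the notebooks.

```python
import random
from collections import defaultdict


def sample_per_class(rows, label_key, n_per_class, seed=42):
    """Return up to n_per_class randomly chosen rows for each label value."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        k = min(n_per_class, len(group))  # never ask for more rows than exist
        sample.extend(rng.sample(group, k))
    rng.shuffle(sample)  # mix the classes so they are not grouped together
    return sample


# Toy stand-in for the real dataset (hypothetical records and field names).
articles = (
    [{"text": f"fake {i}", "label": "fake"} for i in range(50)]
    + [{"text": f"real {i}", "label": "genuine"} for i in range(40)]
)
subset = sample_per_class(articles, label_key="label", n_per_class=20)
print(len(subset))
```

For the real data, `n_per_class=2000` would match the sample size described above.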
When working with web applications, it is important to work within a virtual environment. The project requires specific versions of certain modules/libraries, and a virtual environment keeps those versions boxed up with the project so they do not affect the versions installed locally on your computer.
If virtualenv is not yet installed, run the following command:
pip install virtualenv
Next, in the directory where you are working, create a virtual environment:
virtualenv <ENVIRONMENT_NAME>
Once created, activate it from the command line in the root directory. For Windows:
.\<ENVIRONMENT_NAME>\Scripts\activate
and for Mac/Linux:
source <ENVIRONMENT_NAME>/bin/activate
You can tell you are in the virtual environment when its name appears in brackets at the beginning of the command prompt: (ENVIRONMENT_NAME)
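The prompt prefix is a visual cue only. As a rough sketch, the same check can be made from inside Python by comparing interpreter prefixes. The function name `in_virtual_environment` is a hypothetical helper and is not part of this project.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer virtualenv
    # releases make sys.prefix differ from sys.base_prefix instead.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("virtual environment active:", in_virtual_environment())
```

Running this with the environment activated and then deactivated should show the value change.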
Once you have the virtual environment up and running, you can go ahead and install the dependencies. This is done by running the following command:
pip install -r requirements.txt
You can see what's installed by running pip freeze or pip list.
When you finish working within the environment, you can deactivate it by entering deactivate in the command line.
In the root of the application directory, where manage.py is located, run the following in the command line (while in the virtual environment):
python manage.py runserver
This will run the Django application, which you can view by entering localhost:8000 in the address bar of a web browser.
There is room for improvement in the application. The model is by no means perfect and could be retrained on a new dataset with current news. The application requires a valid news URL and breaks if a non-URL is entered, so further testing is needed to handle invalid input and other what-ifs.
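One possible guard against non-URL input is to validate the text before it reaches the model. The sketch below uses only the standard library. `is_valid_url` is a hypothetical helper, and where it would be called in the Django application is an assumption, since the view code is not shown here.

```python
from urllib.parse import urlparse


def is_valid_url(text):
    """Return True if text looks like an absolute http(s) URL."""
    if not isinstance(text, str):
        return False
    try:
        parts = urlparse(text.strip())
    except ValueError:
        # urlparse raises ValueError for some malformed inputs,
        # such as unbalanced IPv6 brackets.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


for candidate in ["https://www.example.com/news/article", "not a url", ""]:
    print(repr(candidate), is_valid_url(candidate))
```

This only checks the shape of the input. It does not confirm that the page exists or that it is a news article, so the application would still need to handle failed downloads separately.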
Currently, the model is trained only on English-language articles, so additional models may be required for other languages.
Fake news classifier web application now available! Updated Django files and necessary Heroku files are available in the 2. app folder.
Added fake news notebook from Kaggle containing exploratory data analysis and machine learning model training, plus the saved model pkl file.
Added Random Forest hyperparameter tuning notebook. Contains RandomizedSearchCV, GridSearchCV, training with the best hyperparameters, and a comparison of the best model to the base model.