Home
Status: WIP
Replicating the fake news detection ML model from this paper, published in 2017.
Develop a machine learning program to identify when a news source may be producing fake news. We use a corpus of labeled real and fake news articles to build a classifier that decides whether an article is likely fake based on its content.
Text preprocessing uses the NLTK Python library:
- Tokenize the body and headline into sentences with the Punkt sentence tokenizer from NLTK (a minimal sketch follows this list)
- Tokenize sentences into words
- Lemmatize the word tokens
- Visualize the data with a word cloud
- Collect token frequencies for fake news and true news separately
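A minimal sketch of that pipeline, assuming the NLTK punkt and wordnet data packages are downloaded and that `fake_articles` and `true_articles` are hypothetical lists of raw document strings:

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt

nltk.download("punkt")    # Punkt sentence tokenizer models
nltk.download("wordnet")  # WordNet data for the lemmatizer

lemmatizer = WordNetLemmatizer()

def preprocess(text):
    """Sentence-tokenize, word-tokenize, and lemmatize one document."""
    tokens = []
    for sentence in sent_tokenize(text):      # Punkt sentence tokenizer
        for word in word_tokenize(sentence):  # word tokenization
            if word.isalpha():
                tokens.append(lemmatizer.lemmatize(word.lower()))
    return tokens

# Separate token frequencies for the fake and true corpora
fake_counts = Counter(t for doc in fake_articles for t in preprocess(doc))
true_counts = Counter(t for doc in true_articles for t in preprocess(doc))

# Word-cloud visualization of the fake-news token frequencies
cloud = WordCloud(width=800, height=400).generate_from_frequencies(fake_counts)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```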
Token selection thresholds:
- Titles: tokens with a frequency greater than 10 over the entire title dataset
- Body: tokens with a frequency greater than 200 over the entire body dataset (only tokens longer than 3 characters were kept)
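Applied to frequency counters like those in the sketch above, the thresholds amount to a simple filter; `title_counts` and `body_counts` are assumed `Counter` objects built the same way:

```python
# Thresholds from above: >10 for titles, >200 (and length > 3) for bodies
title_vocab = {tok for tok, n in title_counts.items() if n > 10}
body_vocab = {tok for tok, n in body_counts.items()
              if n > 200 and len(tok) > 3}
```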
Our average hypothesis model combines the hypotheses from Naive Bayes, Logistic Regression, and SVM by averaging the output probabilities of the three models.
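A sketch of the averaging step, using scikit-learn stand-ins for the three models; `X_train`, `y_train`, and `X_test` are assumed feature matrices and labels built from the vocabularies above:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Scikit-learn stand-ins for the three base hypotheses
models = [
    MultinomialNB(),
    LogisticRegression(max_iter=1000),
    SVC(probability=True),  # probability=True enables predict_proba
]
for m in models:
    m.fit(X_train, y_train)

# Average hypothesis: mean of the three models' predicted probabilities
avg_proba = np.mean([m.predict_proba(X_test) for m in models], axis=0)
y_pred = avg_proba.argmax(axis=1)
```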
A one-layer neural network was trained on the 80 tokens identified as most causal for source classification.
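One reading of this is a network with a single hidden layer; a sketch with scikit-learn's MLPClassifier, where the hidden-layer width and the `X80_*` matrices (counts of the 80 selected tokens) are assumptions:

```python
from sklearn.neural_network import MLPClassifier

# X80_train / X80_test: (n_samples, 80) matrices over the 80 selected tokens
nn = MLPClassifier(hidden_layer_sizes=(80,), max_iter=500, random_state=0)
nn.fit(X80_train, y_train)
print("accuracy:", nn.score(X80_test, y_test))
```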
Lemmatization: the process of reducing inflected forms of a word while ensuring that the reduced form still belongs to the language. This reduced form, or root word, is called a lemma. For example, organizes, organized, and organizing are all forms of organize; here, organize is the lemma. Lemmatization is useful because it lets the inflected forms of a word be analyzed as a single item, and it helps normalize the text.
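The same example with NLTK's WordNetLemmatizer (the `pos="v"` hint tells WordNet to treat each word as a verb):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for form in ["organizes", "organized", "organizing"]:
    print(lemmatizer.lemmatize(form, pos="v"))  # prints "organize" each time
```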
- Fake News Dataset
- Real News Dataset
Dataset features:
- title
- content
- publication
- label
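A sketch of how the two datasets could be loaded into a single frame with these columns; the file names are hypothetical and the use of pandas is an assumption:

```python
import pandas as pd

# Hypothetical file names; the real datasets are linked above
fake = pd.read_csv("fake_news.csv")
real = pd.read_csv("real_news.csv")
df = pd.concat([fake, real], ignore_index=True)
print(df[["title", "content", "publication", "label"]].head())
```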
Web-browser extension that marks articles as fake/true:
- Serialise the ML model with Pickle
- Build an API with Flask (see the sketch after this list)
- Build the extension's UI and call the API from it
- Zip and upload it to Mozilla add-ons
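A minimal sketch of the Pickle and Flask steps; the file name, route, and `vectorize` helper are hypothetical and must mirror however the model was trained:

```python
import pickle
from flask import Flask, jsonify, request

# Serialise once after training: pickle.dump(model, open("model.pkl", "wb"))
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # The extension POSTs the article text as JSON
    text = request.get_json()["content"]
    features = vectorize(text)  # hypothetical helper matching training features
    proba = model.predict_proba([features])[0]
    # Class order depends on how the labels were encoded during training
    return jsonify({"fake_probability": float(proba[0])})

if __name__ == "__main__":
    app.run()
```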