Tweeting about Disaster

Using Neural Networks to Detect Tweets About Real Crises

Flatiron School Data Science: Project 4

Advanced Machine Learning Topics

  • Author: Zaid Shoorbajee
  • Instructor: Morgan Jones
  • Pace: Flex, 40 weeks



Business Understanding

An international news outlet, The Flatiron Post, wants to report on crises and natural disasters promptly. Plane crashes, hurricanes, earthquakes, terrorist threats, and similar events happen without warning. Being late to the story can mean not only losing to the competition, but also leaving the audience in the dark while speculation runs amok.

The Post wants to tap into Twitter as a resource in order to detect such disasters in real time, and it’s employing a data scientist for the task. Twitter is a fire hose of information; there is a lot more noise than signal, and reporters would waste a lot of time staring at their Twitter feeds just waiting for disaster tweets. But chances are that if a disaster is occurring, someone is tweeting about it.

The task of the data scientist is to use natural language processing (NLP) and machine learning in order to systematically tell if a tweet is about a real disaster or not. Such tweets can then theoretically be presented to the newsroom in a separate feed. Reporters can then choose to pursue that story or not.


Data Understanding

NLP

The core type of data being used for this task is the text of tweets. This is unstructured data and requires natural language processing (NLP) techniques in order to be interpretable by a machine learning model, such as a neural network.

Working with natural language is messy; disaster and non-disaster tweets can use many of the same words, but context changes everything. The following two tweets both contain the words "explosion" and "fire." For any literate person, it's obvious which is about a real disaster and which is not.

Tweet about a literal explosion

Tweet about an explosion of flavor

For a computer, however, it's not so simple. To make tweets interpretable by a neural network, this project uses the following NLP techniques:

  • Tokenization
  • Lemmatization
  • Removing stop words
  • TF-IDF Vectorization
  • Part-of-speech tagging
  • Named-entity recognition
  • Meta-feature extraction
    • Character count, word count, stop word rate, etc.

The idea is that converting tweets into the signals listed above should help a machine learning model discern the difference between a disaster tweet and a non-disaster tweet.

Dataset

This project uses the Natural Language Processing with Disaster Tweets dataset from Kaggle. This is a dataset recommended by Kaggle for those looking to get started with NLP.

The labeled training set contains 7,613 entries with the following features:

  • id: Arbitrary identifier
  • keyword: Search phrase used to collect tweet
  • location: User-generated location for the tweet's account
  • text: The text of the tweet
  • target: Binary label for disaster (1) and non-disaster (0) tweets. Labeled by humans.
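
As a quick sketch of loading and inspecting the labeled set (the local file path is an assumption; the CSV comes from the Kaggle competition page):

```python
import pandas as pd

# Path is illustrative -- assumes the Kaggle train.csv was downloaded locally.
df = pd.read_csv("data/train.csv")

print(df.shape)  # expected: (7613, 5)
print(df[["keyword", "location", "text", "target"]].head())
```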

Data Analysis & Feature Engineering

Class distribution

The class breakdown of the dataset is as follows:

  • 42% class 0 (non-disaster)
  • 58% class 1 (disaster)

The keyword column shows what was used to search for relevant tweets. Realistically, keyword isn't a feature that will always be available when trying to predict tweets spotted "in the wild." Furthermore, the client might add or remove keywords from its repertoire of search terms. For these reasons, it is not used as a feature to make predictions.

However, this column can give us insight into what kinds of tweets the keywords yield. The charts below show the keywords with the highest and lowest yields of disaster tweets.

Class distribution by keyword

We find that there are many more keywords on the lower end. This is perhaps a sign that the newsroom should revise the search terms it uses to find these tweets.

Tokenization & Lemmatization

Each tweet was tokenized and lemmatized in order to create a standardized version of the tweet.

Example:

  • input: 'shootings explosions hand grenades thrown at cars and houses and vehicles and buildings set on fire. it all just baffles me.is this sweden?'
  • output: ['shooting', 'explosion', 'hand', 'grenade', 'throw', 'at', 'car', 'and', 'house', 'and', 'vehicle', 'and', 'building', 'set', 'on', 'fire', 'it', 'all', 'just', 'baffle', 'i', 'be', 'this', 'sweden']
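
A minimal sketch of this step using spaCy (the model name and cleaning rules are assumptions; the notebook's exact preprocessing may differ):

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def lemmatize(text):
    """Lowercase a tweet and return the lemma of every alphabetic token."""
    doc = nlp(text.lower())
    return [token.lemma_ for token in doc if token.is_alpha]

lemmatize("shootings explosions hand grenades thrown at cars and houses...")
```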

When tweets are lemmatized and stop words are removed, the breakdown of the most frequent words in each class looks very different.

Figure A simply shows the top tokens in common between classes. Figure B shows the top lemmas in common, excluding stop words.

Fig. A

Top 20 tokens. Split by class.

Fig. B

Top 20 lemmas with stop words removed. Split by class.

There is a stark difference; the lemmatized tweets have far fewer words in common in their top-20 lists.

The lemmatized tweets were then used to make term frequency-inverse document frequency (TF-IDF) vectors, including the top 500 lemmas from the entire corpus of tweets, excluding stop words.
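
A sketch of that vectorization step with scikit-learn's TfidfVectorizer; the 500-term cap matches the description above, while the input list is illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative input: lemmatized tweets with stop words already removed.
lemmatized_tweets = [
    "shooting explosion hand grenade throw car house vehicle building set fire",
    "explosion flavor try new burger",
]

tfidf = TfidfVectorizer(max_features=500)         # keep the 500 most frequent terms
X_tfidf = tfidf.fit_transform(lemmatized_tweets)  # sparse matrix, one row per tweet
```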

Tagging (POS & NER)

Other data that I can vectorize includes each tweet's parts of speech and named entities, such as places, companies, dates, people, and more.

I used SpaCy to make count vectors of these features. Here are some examples of what I used SpaCy to identify for vectorization:

Part-of-speech tagging (POS):

Example 1: "firefighters from connecticut are headed to california to fight wild fires" POS example 1

Example 2: "watch this airport get swallowed up by a sandstorm in under a minute" POS example 2

Named-entity recognition (NER):

Example 1: "a brief violent storm swept through the chicago area sunday afternoon leading to one death and an evacuation of lollapalooza and more"

NER example 1

Example 2: "after a suicide bombing in suru that killed people turkey launches airstrikes against isil and kurdistan workers party camps in iraq" NER example 2

For the purposes of identifying disaster tweets, here are the NER tags I am interested in:

  • GPE: Countries, cities, states.
  • LOC: Non-GPE locations, mountain ranges, bodies of water.
  • NORP: Nationalities or religious or political groups.
  • ORG: Companies, agencies, institutions, etc.
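
A sketch of how the POS and NER count vectors can be built with spaCy, reusing the nlp pipeline from the lemmatization sketch (the notebook's exact implementation may differ):

```python
from collections import Counter

NER_LABELS = ["GPE", "LOC", "NORP", "ORG"]

def pos_ner_counts(text):
    """Count part-of-speech tags and selected named-entity labels in a tweet."""
    doc = nlp(text)  # nlp loaded earlier with spacy.load("en_core_web_sm")
    pos_counts = Counter(token.pos_ for token in doc)
    ner_counts = Counter(ent.label_ for ent in doc.ents if ent.label_ in NER_LABELS)
    return pos_counts, ner_counts

pos_ner_counts("a brief violent storm swept through the chicago area sunday afternoon")
```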

NER on location column:

In the raw dataset, the location column is not very useful. The values are user-generated and many of them are nonsense. To extract some value, I used NER to make a binary variable indicating whether each location value is recognized with a "GPE" tag.
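
A sketch of that binary flag, reusing the same spaCy pipeline and the df loaded earlier (the new column name is illustrative):

```python
def location_is_gpe(location):
    """Return 1 if spaCy tags anything in the location string as a GPE, else 0."""
    if not isinstance(location, str):   # location is missing for many tweets
        return 0
    doc = nlp(location)
    return int(any(ent.label_ == "GPE" for ent in doc.ents))

df["location_is_gpe"] = df["location"].apply(location_is_gpe)
```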

Meta-features

I was able to engineer more features from each tweet using surface-level information. Here are the features I engineered (a sketch of a few of them follows the list):

  • Whether the tweet contains a URL
  • Character count
  • Number of stop words
  • Character count of non-stop-words divided by total character count
  • Average length of lemmas
  • Number of lemmas
  • Number of unique lemmas
  • Proportion of stop words
  • Proportion of words that are hashtags (#)
  • Proportion of words that are mentions (@)
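
A sketch of a few of these meta-features, reusing the df and spaCy pipeline from earlier (definitions and column names are illustrative; the notebook's versions may be computed differently):

```python
STOP_WORDS = nlp.Defaults.stop_words   # spaCy's built-in English stop word list

def meta_features(text):
    words = text.split()
    n_words = max(len(words), 1)
    return {
        "has_url": int("http" in text),
        "char_count": len(text),
        "stop_word_rate": sum(w.lower() in STOP_WORDS for w in words) / n_words,
        "hashtag_rate": sum(w.startswith("#") for w in words) / n_words,
        "mention_rate": sum(w.startswith("@") for w in words) / n_words,
    }

meta_df = pd.DataFrame(df["text"].apply(meta_features).tolist())
```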

The density plots below show that the distribution of some of these features is clearly different when separated by class.

mean lemma length distribution

stop word rate distribution

Final dataset

The preprocessed dataset has 537 features:

  • 500 TF-IDF values
  • meta-features
  • POS vectors
  • NER vectors
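
How the blocks might be stitched together into one 537-column matrix (a sketch; pos_df and ner_df stand in for the POS and NER count vectors and are hypothetical names):

```python
import numpy as np

X = np.hstack([
    X_tfidf.toarray(),     # 500 TF-IDF values
    meta_df.to_numpy(),    # meta-features
    pos_df.to_numpy(),     # POS count vectors (hypothetical DataFrame)
    ner_df.to_numpy(),     # NER count vectors (hypothetical DataFrame)
])
y = df["target"].to_numpy()
```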

Modeling

To make models that can predict if a tweet is in class 0 or 1, I built neural networks using TensorFlow through the Keras interface.

The data was split into train, validation, and test sets. The models were trained on the training set and the final model was chosen based on performance on the validation set. The final model was given a score based on its performance on the test set.

I made a simple baseline model and five additional models, experimenting with the number of layers, number of nodes, L2 regularization, and dropout regularization. The models were configured to run for 150 epochs, with early stopping if validation loss didn't improve for 20 epochs.
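
A sketch of the split and the early-stopping callback (the split proportions and random seed are assumptions; the 20-epoch patience matches the description above):

```python
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping

# Hold out a test set, then carve a validation set out of the remainder.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Stop training if validation loss hasn't improved in 20 epochs.
early_stop = EarlyStopping(monitor="val_loss", patience=20, restore_best_weights=True)
```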

Scoring and Evaluation:

I monitored several metrics for each model (loss, accuracy, precision, recall, F1, ROC-AUC).

Ultimately, I'm looking for the model with the best recall score. The business case is that a news outlet wants to make sure it doesn't miss important crises that should be reported on. Therefore, it's important to know what level of false negatives the model produces, which recall aptly measures.

recall = (true positives) / (false negatives + true positives)

Baseline model architecture:

  • Input layer: 537 units
  • Hidden layer: 268 units
    • Activation: ReLU
  • Output layer: 1 unit
    • Activation: Sigmoid
  • Optimizer: Stochastic gradient descent
  • Loss: Binary crossentropy (log loss)

Final model architecture (Model 5):

  • Input layer: 537 units
  • Dropout layer: 20%
  • Hidden layer: 134 units
    • Activation: ReLU
    • L2 regularization: 0.05
  • Dropout layer: 20%
  • Output layer: 1 unit
    • Activation: Sigmoid
  • Optimizer: Stochastic gradient descent
  • Loss: Binary crossentropy (log loss)
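
A Keras sketch of the final architecture described above; the layer sizes, dropout rates, L2 penalty, optimizer, and loss come from the list, while the metrics and the early_stop callback are carried over from the earlier sketch or assumed:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.metrics import Precision, Recall
from tensorflow.keras.regularizers import l2

model = Sequential([
    Input(shape=(537,)),                                          # 537 input features
    Dropout(0.2),
    Dense(134, activation="relu", kernel_regularizer=l2(0.05)),
    Dropout(0.2),
    Dense(1, activation="sigmoid"),                               # probability of class 1
])

model.compile(
    optimizer="sgd",
    loss="binary_crossentropy",
    metrics=["accuracy", Recall(), Precision()],
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=150,
    callbacks=[early_stop],
)
```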

Results:

Several metrics were monitored for each model to make sure there were no red flags.

Validation results

baseline model evaluation

final model evaluation

Test results

  • Accuracy: 0.78
  • Recall: 0.72
  • Precision: 0.73
  • F1: 0.69

final model results

Here's what these results mean about the final model.

  • Accuracy: The model will correctly classify 78% of all tweets.
  • Recall: The model will correctly classify 72% of actual disaster tweets. The other 28% are false negatives.
  • Precision: Of all the tweets the model puts in the disaster category, 73% of them will be correct. The other 27% are false positives.

Conclusion

Recommendations

  • Because false negatives are still an issue, reporters should still look at all tweets, but can also be given the model's probability that a tweet is about a disaster.
  • Discard search terms that don't yield many disaster tweets, such as "harm," "bloody," "screaming," "ruin," etc.
  • Narrow the criteria for what constitutes a "disaster." This dataset sometimes applies the "disaster" label to long-term crises like droughts and to past disasters like the Hiroshima bombing. Perhaps The Flatiron Post should focus on so-called "kinetic events" and more unpredictable crises (bombings, earthquakes, crashes, etc.). This would require either relabeling the dataset or gathering new data.

Limitations and Future Work

  • The training of this model is limited by the tweets provided, as well as the search terms used to obtain them. Searching for terms like "explosion," "fire," and "suicide bomber" seems like it should yield tweets about disasters, but there may be disaster tweets that don't contain such blatant keywords. Having access to a less biased sample of tweets might yield better results.
  • The tweets in the provided dataset show if a tweet originally contained a URL, but not if it contained a picture or video. Having that as a feature might have improved the model's performance.
  • The purpose of this model is to provide The Flatiron Post with a feed-like tool that shows tweets related to disasters and crises. This model is just one piece of the pipeline. Other pieces include a tool that automatically requests tweets through Twitter's API, as well as a user-friendly interface.

For more information

You can see the full analysis in this Jupyter Notebook. A non-technical presentation about the findings can be found here.

For any questions or feedback, you can contact me here.
