This folder contains the data and the analysis done in the paper:
@inproceedings{ribeiro2018characterizing,
title={Characterizing and Detecting Hateful Users on Twitter},
author={Horta Ribeiro, Manoel and Calais, Pedro and
Santos, Yuri and Almeida, Virg{\'\i}lio and Meira Jr, Wagner},
booktitle={Proceedings of the International AAAI Conference on Web and Social Media},
year={2018}
}
The experiments with the GraphSage algorithm are in another repository.
The dataset can be downloaded from Kaggle.
This dataset contains a network of 100k users, of which ~5k were annotated as hateful or not. For each user, several content-related, network-related, and activity-related features are provided. Some of the files used are not shared because sharing them would violate Twitter's guidelines.
You can download the following files here:
- bad_words.txt: list of bad words matched in the tweets.
- lexicon.txt: lexicon used in the diffusion method.
And the following files on Kaggle:
- users_(hate|suspended)_(glove|all).content: files with the feature vector for each user and their classes. The ones with hate label users as either hateful, normal, or other, whereas the ones with suspended label users as either suspended or active. The ones with glove have only the GloVe vectors as features; the ones with all also have other attributes related to user activity and network centrality. These files are used only for the GraphSage algorithm.
- user.edges: file with all the (directed) edges in the retweet graph.
- users_clean.graphml: networkx-compatible file with the retweet network. User IDs correspond to those in users_anon_neighborhood.csv.
- users_anon_neighborhood.csv: file with several features for each user, as well as the average of some features over their 1-neighborhood (the people they tweeted). Columns prefixed with c_ are attributes calculated for the 1-neighborhood of a user in the retweet network (averaged out). The columns are:
hate :("hateful"|"normal"|"other")
if user was annotated as hateful, normal, or not annotated.
(is_50|is_50_2) :bool
whether user was deleted up to 12/12/17 or 14/01/18.
(is_63|is_63_2) :bool
whether user was suspended up to 12/12/17 or 14/01/18.
(hate|normal)_neigh :bool
whether the user is in the neighborhood of a (hateful|normal) user.
[c_] (statuses|follower|followees|favorites)_count :int
number of (tweets|follower|followees|favorites) a user has.
[c_] listed_count :int
number of lists a user is in.
[c_] (betweenness|eigenvector|in_degree|outdegree) :float
centrality measurements for each user in the retweet graph.
[c_] *_empath :float
occurrences of empath categories in the user's latest 200 tweets.
[c_] *_glove :float
GloVe vector calculated for the user's latest 200 tweets.
[c_] (sentiment|subjectivity) :float
average sentiment and subjectivity of the user's tweets.
[c_] (time_diff|time_diff_median) :float
average and median time difference between tweets.
[c_] (tweet|retweet|quote) number :float
percentage of direct tweets, retweets and quotes of a user.
[c_] (number urls|number hashtags|baddies|mentions) :float
average number of (urls|hashtags|bad words|mentions) per tweet.
[c_] status length :float
average status length.
hashtags :string
all hashtags employed by the user separated by spaces.
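As a rough illustration of the column conventions above, the sketch below reads a CSV with the standard library, separates a user's own columns from the c_-prefixed neighborhood averages, and counts the values of the hate column. The three sample rows are made up, and the user_id column name is an assumption used only for illustration; the real file has many more columns.

```python
import csv
import io
from collections import Counter

# Prefix that marks 1-neighborhood averages, per the column list above.
NEIGHBORHOOD_PREFIX = "c_"


def split_columns(fieldnames):
    """Separate a user's own columns from the c_-prefixed neighborhood averages."""
    own = [name for name in fieldnames if not name.startswith(NEIGHBORHOOD_PREFIX)]
    neighborhood = [name for name in fieldnames if name.startswith(NEIGHBORHOOD_PREFIX)]
    return own, neighborhood


def summarize(handle):
    """Count annotation labels and return the column split for an open CSV file."""
    reader = csv.DictReader(handle)
    own, neighborhood = split_columns(reader.fieldnames)
    labels = Counter(row["hate"] for row in reader)
    return labels, own, neighborhood


# Hypothetical three-row sample standing in for users_anon_neighborhood.csv.
# The values are invented and "user_id" is an assumed column name.
SAMPLE = io.StringIO(
    "user_id,hate,statuses_count,c_statuses_count,sentiment,c_sentiment\n"
    "0,normal,120,340.5,0.10,0.05\n"
    "1,hateful,980,410.0,-0.20,-0.10\n"
    "2,other,15,77.2,0.00,0.02\n"
)

labels, own, neighborhood = summarize(SAMPLE)
print(labels)
print(neighborhood)
```

To run this against the real data, the sample could be replaced with `open("users_anon_neighborhood.csv", newline="")`, assuming the file downloaded from Kaggle sits in the working directory.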
These are the main folders, reproducible with the dataset downloaded from Kaggle:
- ./analysis/: contains the script exploring the collected dataset.
- ./classification/: contains scripts with the boosting classifier.
These folders are not reproducible, but they are included for completeness:
- ./crawler/: contains the code used to extract the dataset. You need to set up neo4j to run it.
- ./prepreprocessing/: contains scripts to select the users to be annotated and extract their tweets.
- ./features/: contains scripts to compute the features that are analyzed and fed into the classifier.
Auxiliary folders:
- ./data/: data generated by data wrangling.
- ./secrets/: API and database authentication files.
- ./tmp/: auxiliary scripts.
- ./img/: images generated by the analyses.