Project CRANE (Crisis Racism and Narrative Evaluation) aims to support researchers and anti-racist organisations that wish to use state-of-the-art text analysis algorithms to study how specific events impact online hate speech and racist narratives. CRANE Toolbox is a Python package: once installed, its tools are available as functions that users can call in their Python programs or directly through their terminal. CRANE targets users with basic programming skills but no machine learning skills.
Table of Contents
- Introduction
- Quick-start
- Package documentation
- Contributing
- Frequently asked questions
Crisis Racism and Narrative Evaluation, or CRANE for short, was born during the Resiliency Challenge, a COVID-19-driven hackathon. Over nine weeks, an interdisciplinary team of students and professional designers, programmers and computer science researchers developed a prototype website for real-time monitoring of racist rhetoric linked to the COVID-19 pandemic. The project was proposed by Gianluca Stringhini, Boston University, following his work on the early emergence of online sinophobic behaviour during the COVID-19 (SARS-CoV-2) pandemic. Readers may find more details on the approach taken for the Resiliency Challenge on the corresponding Devpost page and the resiliency-challenge-legacy branch of this repository.
While reviewing the literature, we noticed that few papers on online hate speech monitoring combined a sociological research approach with recent natural language processing tools. The literature is divided mainly into methodological papers that develop new machine learning tools to detect and characterise racism, and traditional sociological studies, both qualitative and quantitative, that do not take full advantage of the available data and analysis methods.
We are aware that this is due in part to the difficulty of gathering interdisciplinary teams from very different fields, and in part to the rather young and still-maturing nature of automated hate speech detection.
Our experience during the hackathon led us to believe that these issues could be somewhat alleviated by providing social science researchers with user-friendly tools that leverage the more robust of the current hate speech detection and analysis methods.
CRANE Toolbox is designed to support researchers and anti-racist organisations in studying how specific events impact online hate speech and racist narratives. State-of-the-art text analysis algorithms evolve quickly, and their spread is often limited to the computer science research community. CRANE Toolbox wraps a selection of these methods, chosen to fit typical social science questions, into user-friendly tools that require no machine learning skills.
CRANE Toolbox is a Python package. Once installed, users can call its functions in their Python scripts or use the tools it provides directly through their terminal. Those tools are organised into several sub-packages called modules, each corresponding to a step in the general analysis pipeline.
The first version of CRANE Toolbox provides tools and documentation for users to filter and format the data they have extracted from Twitter (import module), prepare it for analysis (preprocessing module), carry out the analyses of their choice (analysis module), and get a basic graphical representation of the results (visualisation module).
ℹ️ This section is for experienced programmers who want to dive head-first into the deep end. If you have no idea what we are talking about here, please don't run away screaming! Though the length of the Package Documentation section might make it a bit daunting, we wrote it with non-specialists in mind, so you should find all the instructions and advice you need to use CRANE Toolbox.
```
pip install cranetoolbox
```
The first step is to transform your data into our common data exchange format, a three-column CSV file. We use the `crane-import` module to perform this step. It takes source files that contain a single JSON tweet object per line and transforms them into a lightweight CSV file. The available parameters are detailed in the Import Module section.
```
crane-import --source-folder ./my_source --output-folder ./my_output
```
It is important to clean up text data before analysis. We provide a CLI tool to perform this step, with several preprocessing options. The available parameters are detailed in the Preprocessing Module section.
```
crane-preprocess ./my_output ./my_preproc_output
```
We currently provide a single statistical analysis tool that performs basic frequency analysis on the data set using a provided set of keywords. The required input formats and available parameters are detailed in the Analysis Module section.
```
crane-analysis-quanti ./my_preproc_output keywords.json quanti_results.csv
```
This section is divided into two parts: the first describes how to install CRANE Toolbox and run its tools, either in command-line mode in a terminal or through function calls in Python scripts; the second gives details on each module, in particular data formats and parameter choices.
Advanced users can find additional information on the content of each module in the code documentation.
In order to successfully install CRANE Toolbox, you will need:
- A basic internet connection
- A terminal
- Python 3
- pip for Python 3
Some basic knowledge of programming and the command line is strongly recommended, but the brave and the studious may choose to proceed without it.
If the above list is clear to you, skip to the installation instructions. Otherwise, read the following paragraphs to get your environment ready.
The terminal is an interface in which you can type and execute text-based commands. You may find a more detailed explanation here and pictures of what a terminal looks like here.
A terminal is already installed on your computer. If you are using a Unix-based operating system (either macOS or a Linux distribution), it is aptly named Terminal and lives with your other applications. If you are using a Windows operating system, it is named Command Prompt and can be found in your Start Menu. On Windows 8 and 10, it is located in the Windows System folder.
To run a command in the terminal, simply type it in and press Enter.
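For example, you can check that Python 3 is installed by running the following command (shown here as an illustration; the exact output depends on your installation):

```
python --version
```

If this reports Python 2, try `python3 --version` instead.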
Python 3 is a programming language.
The installation process depends on your operating system. Basic instructions can be found on the official Python website, but you will easily find more detailed guides on the internet.
pip is a package manager: it helps you download Python packages and update them when necessary.
Basic instructions for its installation can be found on its official website. It is widely used so step-by-step guides and other resources are easily found on the internet.
If the `pip` command on your system is tied to Python 2, use the `pip3` command instead. You can check which version of pip is used by running `pip --version` in your terminal (`python -m pip --version` for Windows users).
We strongly recommend installing CRANE Toolbox using `pip`. Run `pip install cranetoolbox` in a terminal. The command might take some time, because `pip` will also automatically install all the Python packages required for CRANE Toolbox to run.
You can check the list of dependencies on the PyPi page of CRANE Toolbox.
CRANE Toolbox can also be installed manually from the source code available in this repository.
The command-line mode is the easiest way to use this package: it enables the user to quickly run all parts of the package and produce usable statistics in a very short time.
Three CLI commands are available in the toolbox: `crane-import`, `crane-preprocess` and `crane-analysis-quanti`. In order, they handle importing and transforming raw data into a standard CSV format, preprocessing that data into a usable form, and finally computing the daily frequencies of given keywords.
For specifics on the parameters of each CLI tool, please check the detailed documentation below.
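Assuming your raw tweet files live in a `./raw_tweets` folder and your keywords are in `keywords.json` (both placeholder names), a complete run of the pipeline might look like:

```
crane-import --source-folder ./raw_tweets --output-folder ./imported
crane-preprocess ./imported ./preprocessed
crane-analysis-quanti ./preprocessed keywords.json results.csv
```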
This package has been written with reuse in mind. While the CLI tools are easier to use standalone, CRANE Toolbox can be imported and used within another Python project. All functions are available to the user through standard Python package usage. In addition, a single high-level function corresponds to each CLI tool, so users can easily call their desired pipeline directly from Python.
The import module is accessible from the `crane-import` command-line entry point.
The goal of this module is to transform raw Twitter JSON data into something more manageable and lightweight for the rest of the pipeline. This step is not required: since data is passed between modules as files, it can be performed manually by the user or may already have been done with some other tool.
This module can process a large number of large (multi-gigabyte) files without relying heavily on swap or exhausting memory. This is accomplished by reading and writing files in chunks (the user can specify the number of lines per chunk in the options), which ensures that only a bounded amount of data is held in memory at any given moment.
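As an illustration only (this is a sketch of the technique, not the package's actual code), reading a file in bounded chunks can be done like this:

```python
def read_in_chunks(path, max_lines=50000):
    """Yield lists of at most max_lines lines, so that the whole
    file is never held in memory at once."""
    chunk = []
    with open(path, encoding="utf-8") as source:
        for line in source:
            chunk.append(line)
            if len(chunk) >= max_lines:
                yield chunk
                chunk = []
    if chunk:  # lines left over after the last full chunk
        yield chunk
```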
In addition, it removes extra tweet data that is not required for the analysis pipeline, reducing file size and increasing the performance of later steps in this package.
The module will attempt to read any file in the specified folder, regardless of extension or name. It supports any text-based file format (.json, .csv, .txt). In addition, it can handle compressed `tar` archives and process multiple files or folders within a given archive; `tar.gz` and similar formats are supported.
Regardless of the exact file type, data should always be stored with one JSON tweet object per line, using `\n` as the end-of-line character.
The module will concatenate all files into a single CSV file with the following columns:
| id | text | created_at |
|---|---|---|
| unique id of the tweet | full text of the tweet | timestamp of the tweet |
The CSV file has no header row, uses comma separators, and optionally wraps the text in double quotes.
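For illustration, here are two hypothetical rows of such a file (made-up IDs and timestamps):

```
1245670000000000001,"Stay safe everyone, wash your hands",Wed Apr 01 09:15:32 +0000 2020
1245670000000000002,hello world,Wed Apr 01 09:16:05 +0000 2020
```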
The text column is selected dynamically depending on whether the tweet exceeds 140 characters; for more information, see the official Twitter documentation on extended tweets.
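A minimal sketch of that selection logic, assuming the standard Twitter JSON layout (this is not the package's actual code):

```python
def extract_text(tweet: dict) -> str:
    """Return the tweet text, preferring the extended form used
    for tweets longer than 140 characters."""
    if "extended_tweet" in tweet:
        return tweet["extended_tweet"].get("full_text", tweet.get("text", ""))
    return tweet.get("text", "")
```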
If the output file already exists, the module will append to it. If it does not exist, a new file is created, at the default path ("./filtered_data.csv") if none is specified.
The import package can be used as a command-line tool and supports several options.
- (Required) `--source-folder` The source folder or file to scan for files/archives to process.
- (Optional) `--output-folder` The folder to save the output in. The specified folder must exist. Defaults to `./`.
- (Optional) `--output-name` The name, including extension, of the output file. Defaults to `output.csv`.
- (Optional) `--text-name` The name of the text field, in case this field has a different name in your data.
- (Optional) `--date-name` The name of the created_at field, in case this field has a different name in your data.
- (Optional) `--id-name` The name of the id field, in case this field has a different name in your data.
- (Optional) `--tweet-language` The language of tweets saved to the file, based on the language field in the JSON object. Defaults to `en`.
- (Optional) `--max-lines-in-memory` The maximum number of lines held in memory. This can be adjusted to optimize performance or to run on machines with limited memory. Defaults to `50000`.
- (Optional) `--retweets` Use this flag to include retweets in the output set. Defaults to `false`.
A complete example for the command-line entry-point:
```
crane-import --source-folder tweets/november --output-folder mydataset/data --max-lines-in-memory 2000 --output-name dataset.csv
```
The preprocessing pipeline is accessible from the `crane-preprocess` command-line entry point.
The proposed preprocessing is adapted from Effrosynidis et al. (2017), using techniques 9, 0, 1, 3, 5, 7 and 4.
The dataset given as the first positional argument can be either a folder of CSV files or a single CSV file. Each CSV file should contain three columns (a unique integer tweet ID; the text of the tweet, as a string; the timestamp of the tweet, as a string), no header row, comma separators, and optional double quotes around the text.
| id | text | created_at |
|---|---|---|
| unique id of the tweet | full text of the tweet | timestamp of the tweet |
In order, for a given string, it will:
- Transform uppercase characters to lowercase,
- Remove escaped unicode strings (e.g. `\u002c`),
- Remove non-ASCII characters,
- Replace URLs with "url", or remove them if `-url` is specified,
- Replace mentions (e.g. "@BobTheSurfer") with "atUser", or remove them if `-mention` is specified,
- Remove the hash symbol in front of hashtags (e.g. "#spreadTheLove"), and segment the hashtags if `-hashtag` is specified,
- Replace contractions with their full spelling (e.g. "I'm" becomes "I am"),
- Replace repeated punctuation signs with a textual equivalent (e.g. "!!!" becomes "multiExclamation") and newlines with spaces, or remove all punctuation except underscore characters if `-punct` is specified,
- Replace numbers with their English text version, or remove them if `-num` is specified.
Hashtags are segmented, i.e. separated into words, using the wordsegment package.
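For example, here is how the wordsegment package splits a concatenated hashtag back into words (a quick sketch; `load()` must be called once to load the package's word statistics):

```python
from wordsegment import load, segment

load()  # load the corpus statistics used for segmentation
print(segment("spreadthelove"))  # expected: ['spread', 'the', 'love']
```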
The processed dataset is saved in the folder given as the second positional argument. If it does not exist, the folder will be created. For each input file, a processed file is generated; file names are built by appending "_preprocessed" to the name of the corresponding input file. Each CSV file contains four columns (a unique integer tweet ID; the original text of the tweet, as a string; the preprocessed text of the tweet, as a string; the timestamp of the tweet, as a string), no header row, comma separators, and optional double quotes around the text.
| id | original_text | clean_text | created_at |
|---|---|---|---|
| unique id of the tweet | full text of the tweet | preprocessed text of the tweet | timestamp of the tweet |
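For illustration, a row of such a file might look like this with default options (made-up tweet; the exact clean text depends on the flags used):

```
1245670000000000001,"Stay safe everyone!!! http://example.com","stay safe everyone multiExclamation url",Wed Apr 01 09:15:32 +0000 2020
```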
The pipeline has two mandatory positional arguments and five optional arguments:
- (Required) Position 1. Path to the folder containing the dataset formatted with the import module, or a single dataset file.
- (Required) Position 2. Path to the folder to save the preprocessed dataset in. If it does not exist, the folder will be created.
- (Optional) `-url` or `--remove-url` Use this flag to remove URLs from the tweets instead of replacing them with "url".
- (Optional) `-mention` or `--remove-mentions` Use this flag to remove user mentions ("@userHandle") from the tweets instead of replacing them with "atUser".
- (Optional) `-hashtag` or `--segment-hashtags` Use this flag to segment hashtags instead of simply removing the preceding "#" character.
- (Optional) `-punct` or `--remove-punctuation` Use this flag to remove all punctuation except hyphens, instead of replacing repeated symbols and newlines.
- (Optional) `-num` or `--remove-numbers` Use this flag to remove all numbers from the tweets instead of replacing them with their text version.
A complete example for the command-line entry-point:
```
crane-preprocess mydataset/data mydataset/preprocessedData -punct
```
Currently, the analysis module only offers a simple quantitative analysis pipeline that computes the daily frequency of given keywords.
This analysis pipeline is accessible from the `crane-analysis-quanti` command-line entry point.
It computes the daily frequency of given keywords in a dataset, and allows for variants of a keyword: for example, occurrences of "boys" and "boyz" can be counted together.
The dataset given as the first positional argument can be either a folder of CSV files or a single CSV file. Each CSV file should contain four columns (a unique integer tweet ID; the original text of the tweet, as a string; the preprocessed text of the tweet, as a string; the timestamp of the tweet, as a string), no header row, comma separators, and optional double quotes around the text. The preprocessed text is assumed to be lower-case.
| id | original_text | clean_text | created_at |
|---|---|---|---|
| unique id of the tweet | full text of the tweet | preprocessed text of the tweet | timestamp of the tweet |
The keywords are defined in a JSON dictionary where the keys are the main variant for each keyword and the values are lists of variants. All keywords should be lower-case strings. For example:
```json
{
    "color": [
        "colour",
        "color"
    ],
    "chinese": [
        "chinese",
        "chineze",
        "chines"
    ]
}
```
The output is a CSV file with:
- a day column with the date (format "%Y-%m-%d"),
- a total_count column with the daily total number of tweets in the dataset,
- a [keyword]_count column for each keyword (named after its main variant) with the daily number of tweets containing at least one variant of the keyword,
- a [keyword]_freq column for each keyword (also named after its main variant) with the daily frequency of tweets containing at least one variant of the keyword.
For example, for the keywords set given above:
| day | total_count | color_count | chinese_count | color_freq | chinese_freq |
|---|---|---|---|---|---|
| date | daily number of tweets | daily number of tweets with "color" or a variant | daily number of tweets with "chinese" or a variant | daily frequency of tweets with "color" or a variant | daily frequency of tweets with "chinese" or a variant |
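For illustration, with made-up numbers, the first rows could look like this (each `_freq` value is the corresponding `_count` divided by `total_count`):

```
2020-04-01,2000,24,150,0.012,0.075
2020-04-02,1800,18,90,0.01,0.05
```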
The pipeline has three mandatory positional arguments and one optional argument:
- (Required) Position 1. Path to the folder containing the dataset preprocessed with the preprocess module, or a single dataset file.
- (Required) Position 2. Path to the JSON file containing the keywords and their variants. See above for the expected format.
- (Required) Position 3. Path for the result file.
- (Optional) `-d` or `--date-format` String defining the format of dates in the dataset. Defaults to `"%a %b %d %H:%M:%S %z %Y"`.
A complete example for the command-line entry-point:
```
crane-analysis-quanti mydataset/preprocessedData keywords.json quanti_results.csv -d "%d %b %a %H:%M:%S %z %Y"
```
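The default format matches Twitter's classic `created_at` timestamps (e.g. "Wed Apr 01 09:15:32 +0000 2020"). If you are unsure about your format string, you can test it against one of your own timestamps with Python's standard library:

```python
from datetime import datetime

# Parse a timestamp in the default format used by this tool
parsed = datetime.strptime("Wed Apr 01 09:15:32 +0000 2020",
                           "%a %b %d %H:%M:%S %z %Y")
print(parsed.date())  # 2020-04-01
```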
The visualisation module is not implemented yet.
We welcome all contributions! If you have questions, a feature request or some feedback, please use the issues feature of GitHub.
The Issues page works rather like a good old forum page, with many additional features specific to programming and git versioning. GitHub provides a detailed guide to them, but here are the basics:
- Search the existing issues for similar questions/feature requests/feedback
- If you find an issue similar enough, you can add to its discussion with your own details
- If you don't find an issue similar enough, create a new one
- Give as much relevant information as possible. For example, if you want to report a bug or get help with an error, provide both your code/command-line and the error message. If possible, link to the data.
- Add appropriate labels/tags to the issue. For example, "documentation" if you're missing info in the doc to use the toolbox.
- Always keep in mind that this is a volunteer project, and contributors do their best to help. Be nice and patient.
You are welcome to submit Pull Requests for open issues. In particular, issues tagged "help-wanted" are usually things the core team is struggling with. Here are the guidelines we ask you to follow when contributing to the code base.
- Comment on the issue to announce that you intend to work on it, so that no one else works on the same problem.
- Tackle only one issue per PR, unless you've discussed it with the core team before and they agree a grouped PR makes more sense.
- Reference the issue you are tackling in your PR.
- Tag your PR with "need-review" when making your initial submission or after completing the changes requested by your reviewers.
- Check your PR for reviews and be open to suggestions. If a reviewer is requesting changes, they will change the label of your PR from "need-review" to "in-progress".
If you wish to involve yourself further (reviewing PRs, planning for new features, researching machine learning methods, doing user research, ...), you can join the core team by emailing bolduc2 (at) hotmail (dot) fr to get onboarded. We welcome developers, of course, but also designers, researchers from all academic fields, technical writers...
We have chosen to onboard people privately instead of sharing all our resources in the repo for two reasons. First, we wanted to make it easier for users not used to open source projects to find what they need. Second, some of our resources contain private information from user research.
https://crane-toolbox.github.io/#/ is a prototype website built during the Resiliency Challenge hackathon, in an attempt to quantify and characterise the impact of the COVID-19 pandemic on online sinophobic hate speech. We left it online as a very basic example of the type of data analysis that can be carried out with CRANE Toolbox.
We do not plan to at the moment. Due to the composition of the team, we have decided to focus our efforts on CRANE Toolbox, where we believe we can make a difference.
This project follows the all-contributors specification. Contributions of any kind welcome!
The list of hackathon contributors for the first phase of the project can be found on the CRANE Devpost page.
Yes! Check out the Contributing section.
Python is a very common programming language for data analysis tasks. Packages are installable add-ons to the basic Python language. Roughly speaking, a module is a file with Python code.
That said, if you are entirely unfamiliar with Python, you might want to experiment with it some before using CRANE Toolbox.
Please use the Issues page of this repo. Check out the Contributing section for more details.
CRANE Toolbox is distributed under GNU Affero General Public License v3.0.
You can use it however you want provided you respect the license requirements: include a license and copyright notice, state the changes you made, disclose your source (this repo), and distribute your code under the same license. Please note that we are not liable for whatever use you make of this code, and we provide no warranty.
Please link to this repo.
Sharing of tweet datasets is restricted by the Twitter Developer Terms. You will not find publicly available datasets containing tweet content, but rather datasets of tweet IDs. Those can be hydrated, using for example Hydrator.
From there, it really depends on your research topic and questions. Some are very time-sensitive, some less so. If yours are not very time-sensitive, try your luck with a regular search engine. Some questions require a naive dataset, representative of the full information flow of Twitter. In that case, your best bet is to collaborate with a research team that frequently analyses Twitter content: they probably keep a stream running to collect 1% of the daily Twitter data. Other questions require a dataset already filtered for specific keywords or users. If that is your case, try your luck with a regular search engine. If you do not need historical data, you could also start your own Twitter stream to collect exactly the data you need.
We are working to support several standard input formats in our import module. If your data format is not supported, please get in touch and we will do our best to either add support for it in CRANE Toolbox or guide you in transforming it into one of our supported formats.
Maybe we don't know about it, maybe we haven't had time to implement it yet, or maybe we chose not to include it for a given reason. Please get in touch to tell us about it. (Unless it's a proprietary method with a usage fee.)