Please use the script provided at [download_data.sh](https://github.com/shauli-ravfogel/nullspace_projection/blob/master/download_data.sh)
-
Download the dataset of Su Lin Blodgett et al., described in "Demographic Dialectal Variation in Social Media: A Case Study of African-American English."
```shell
wget http://slanglab.cs.umass.edu/TwitterAAE/TwitterAAE-full-v1.zip
```
Alternatively, download it directly from the site: http://slanglab.cs.umass.edu/TwitterAAE/
-
Follow Elazar et al. in preprocessing the dataset to get race and sentiment labels.
```shell
python make_data.py /path/to/downloaded/twitteraae_all /path/to/project/data/processed/sentiment_race sentiment race
```
See this doc for details about the scripts.
Note that the scripts require `python==2.7`, and that they map tokens to ids in place, i.e., the original tokens are not stored. To also save the raw texts, patch the following function:

```python
def to_file(output_dir, voc2id, vocab, pos_pos, pos_neg, neg_pos, neg_neg):
    if output_dir[-1] != '/':
        output_dir += '/'
    if not os.path.isdir(output_dir):
        os.makedirs(output_dir)
    with open(output_dir + 'vocab', 'w') as f:
        f.writelines('\n'.join(vocab))
    for data, name in zip([pos_pos, pos_neg, neg_pos, neg_neg],
                          ['pos_pos', 'pos_neg', 'neg_pos', 'neg_neg']):
        with open(output_dir + name, 'w') as f:
            for sen in data:
                ids = map(lambda x: str(voc2id[x]), sen)
                f.write(' '.join(ids) + '\n')
        with open(output_dir + name + "_text", 'w') as f:
            for sen in data:
                ids = map(lambda x: str(x), sen)
                f.write(' '.join(ids) + '\n')
```
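As a sanity check, the patched `to_file` can be exercised with a toy vocabulary and made-up sentences (the data below is purely illustrative); it should write both the id-encoded files and the corresponding `_text` files with the raw tokens preserved:

```python
import os
import tempfile

# Patched to_file: writes id-encoded files plus "_text" files with raw tokens.
def to_file(output_dir, voc2id, vocab, pos_pos, pos_neg, neg_pos, neg_neg):
    if output_dir[-1] != '/':
        output_dir += '/'
    if not os.path.isdir(output_dir):
        os.makedirs(output_dir)
    with open(output_dir + 'vocab', 'w') as f:
        f.writelines('\n'.join(vocab))
    for data, name in zip([pos_pos, pos_neg, neg_pos, neg_neg],
                          ['pos_pos', 'pos_neg', 'neg_pos', 'neg_neg']):
        with open(output_dir + name, 'w') as f:
            for sen in data:
                ids = map(lambda x: str(voc2id[x]), sen)
                f.write(' '.join(ids) + '\n')
        with open(output_dir + name + "_text", 'w') as f:
            for sen in data:
                ids = map(lambda x: str(x), sen)
                f.write(' '.join(ids) + '\n')

# Toy example (invented data, for illustration only).
vocab = ['good', 'bad', 'day']
voc2id = {w: i for i, w in enumerate(vocab)}
out = tempfile.mkdtemp()
to_file(out, voc2id, vocab, [['good', 'day']], [['bad', 'day']], [], [])

print(open(os.path.join(out, 'pos_pos')).read())       # id-encoded sentence
print(open(os.path.join(out, 'pos_pos_text')).read())  # raw tokens preserved
```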
-
Encode the texts with torchMoji. We provide an example of extracting text representations at src/Moji.
Please use the script provided at [download_data.sh](https://github.com/shauli-ravfogel/nullspace_projection/blob/master/download_data.sh)
-
Download the dataset as described in "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting."
See https://github.com/microsoft/biosbias for instructions for downloading and processing all bio records as a single file.
-
Create splits and get BERT encodings. We follow Ravfogel et al. in creating data splits and extracting BERT encodings. Please see create_dataset_biasbios.py and encode_bert_states.py.
We provide an example for dataset splits.
-
Augmented economy labels. TODO