# Moji

## Download processed data

Please use the script provided at [download_data.sh](https://github.com/shauli-ravfogel/nullspace_projection/blob/master/download_data.sh).

## From scratch

  1. Download the dataset of Su Lin Blodgett et al., described in "Demographic Dialectal Variation in Social Media: A Case Study of African-American English."

    wget http://slanglab.cs.umass.edu/TwitterAAE/TwitterAAE-full-v1.zip
    
    # or directly from the site: http://slanglab.cs.umass.edu/TwitterAAE/
  2. Follow Elazar et al. in preprocessing the dataset to get race and sentiment labels.

    python make_data.py /path/to/downloaded/twitteraae_all /path/to/project/data/processed/sentiment_race sentiment race

    See the demog-text-removal repository for details about the preprocessing scripts.

    Note that the scripts require python==2.7, and the original make_data.py maps tokens to ids in place, i.e., the original tokens are not stored. To also save the raw texts, patch the following function:

    https://github.com/yanaiela/demog-text-removal/blob/f11b243c3f2f24f2179348c468b2caf76e7a3b23/src/data/make_data.py#L59

    # Patched to_file: besides the original id files, also write a *_text
    # file per split so the raw tokens survive for the torchMoji step.
    def to_file(output_dir, voc2id, vocab, pos_pos, pos_neg, neg_pos, neg_neg):
        if output_dir[-1] != '/':
            output_dir += '/'

        if not os.path.isdir(output_dir):
            os.makedirs(output_dir)

        # vocabulary, one token per line
        with open(output_dir + 'vocab', 'w') as f:
            f.write('\n'.join(vocab))

        for data, name in zip([pos_pos, pos_neg, neg_pos, neg_neg],
                              ['pos_pos', 'pos_neg', 'neg_pos', 'neg_neg']):
            # token ids, one sentence per line (the original behaviour)
            with open(output_dir + name, 'w') as f:
                for sen in data:
                    ids = map(lambda x: str(voc2id[x]), sen)
                    f.write(' '.join(ids) + '\n')

            # raw tokens, one sentence per line (the added part)
            with open(output_dir + name + "_text", 'w') as f:
                for sen in data:
                    tokens = map(str, sen)
                    f.write(' '.join(tokens) + '\n')
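    With this patch, each id file (e.g., pos_pos) gains a companion pos_pos_text file holding the original tokens, which the torchMoji step below can consume.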
  3. Encode the texts with torchMoji. We provide an example of extracting text representations at src/Moji; a minimal sketch is also given below.
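    As a rough sketch, modeled on torchMoji's examples/encode_texts.py; the maxlen value, file names, and output path below are placeholder assumptions:

        import json

        import numpy as np
        from torchmoji.global_variables import PRETRAINED_PATH, VOCAB_PATH
        from torchmoji.model_def import torchmoji_feature_encoding
        from torchmoji.sentence_tokenizer import SentenceTokenizer

        maxlen = 30  # assumed fixed token budget per tweet

        with open(VOCAB_PATH, 'r') as f:
            vocabulary = json.load(f)

        tokenizer = SentenceTokenizer(vocabulary, maxlen)
        model = torchmoji_feature_encoding(PRETRAINED_PATH)  # penultimate-layer encoder

        # read one of the *_text files written by the patched to_file above
        with open('data/processed/sentiment_race/pos_pos_text') as f:
            sentences = [line.strip() for line in f]

        tokenized, _, _ = tokenizer.tokenize_sentences(sentences)
        encodings = model(tokenized)  # one 2304-d DeepMoji feature vector per sentence
        np.save('pos_pos.npy', np.asarray(encodings))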

# Bios

## Download processed data without economy labels

Please use the script provided at [download_data.sh](https://github.com/shauli-ravfogel/nullspace_projection/blob/master/download_data.sh).

## From scratch

  1. Download the dataset as described in "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting."

    See https://github.com/microsoft/biosbias for instructions on downloading and processing all bio records into a single file; a quick way to inspect the result is sketched below.
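    A hedged sketch (the BIOS.pkl filename follows the biosbias README; check the keys rather than trusting any fixed field list):

        import pickle

        with open('BIOS.pkl', 'rb') as f:
            bios = pickle.load(f)  # a list of dicts, one per biography

        print(len(bios))
        print(bios[0].keys())  # confirm the available fields for your download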

  2. Create splits and extract BERT encodings. We follow Ravfogel et al. in creating the data splits and extracting BERT encodings; please see create_dataset_biasbios.py and encode_bert_states.py.

    We provide an example of the dataset splits; a hedged sketch of both steps is given below.
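    A minimal sketch of both steps, reusing bios from the snippet above. The 65/10/25 train/dev/test ratio and the [CLS]/mean-pooled features follow Ravfogel et al., but the seed, field names, and truncation length are assumptions; defer to create_dataset_biasbios.py and encode_bert_states.py for the exact procedure.

        import random

        import numpy as np
        import torch
        from transformers import BertModel, BertTokenizer

        # 65/10/25 train/dev/test split; the seed is arbitrary
        random.Random(0).shuffle(bios)
        n = len(bios)
        train = bios[:int(0.65 * n)]
        dev = bios[int(0.65 * n):int(0.75 * n)]
        test = bios[int(0.75 * n):]

        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        model = BertModel.from_pretrained('bert-base-uncased').eval()

        @torch.no_grad()
        def encode(texts):
            """Return [CLS] and mean-pooled last-layer states for each text."""
            cls_states, avg_states = [], []
            for text in texts:
                inputs = tokenizer(text, truncation=True, max_length=128,
                                   return_tensors='pt')
                hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
                cls_states.append(hidden[0].numpy())
                avg_states.append(hidden.mean(dim=0).numpy())
            return np.stack(cls_states), np.stack(avg_states)

        # 'hard_text' (the bio with the leading name/title span removed) is an
        # assumed field name -- confirm against bios[0].keys() above
        cls_train, avg_train = encode([b['hard_text'] for b in train])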

  3. Augmented economy labels. TODO