M2D2: A Massively Multi-domain Language Modeling Dataset (EMNLP 2022) by Machel Reid, Victor Zhong, Suchin Gururangan, Luke Zettlemoyer

machelreid/m2d2

M2D2: A Massively Multi-domain Language Modeling Dataset

Scripts and data links for M2D2: A Massively Multi-domain Language Modeling Dataset (EMNLP 2022) by Machel Reid, Victor Zhong, Suchin Gururangan, and Luke Zettlemoyer.

m2d2_image.png

Data

Update: The data is now hosted on the Hugging Face Hub as machelreid/m2d2.

To load the dataset use the following steps:

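# Shell: install or upgrade the Hugging Face datasets library
pip install --upgrade datasets

# Python: load one domain and print the first training example
import datasets

dataset = datasets.load_dataset("machelreid/m2d2", "cs.CL")  # replace cs.CL with the domain of your choice

print(dataset['train'][0]['text'])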
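
To work with more than one domain, the loading call above can be wrapped in a small helper. This is a minimal sketch rather than part of the repository: `load_domains` and its `loader` parameter are hypothetical names, and the only configuration name known from this README is `cs.CL`.

```python
from typing import Callable, Dict, Iterable, Optional


def load_domains(domains: Iterable[str], loader: Optional[Callable] = None) -> Dict[str, object]:
    """Load several M2D2 domain configurations, keyed by domain name.

    `loader` is a hypothetical hook for substituting a different loading
    function; by default datasets.load_dataset is used.
    """
    if loader is None:
        # Imported lazily so the helper can be defined without the library installed.
        import datasets
        loader = datasets.load_dataset
    # One load_dataset call per domain, using the same repository id as above.
    return {name: loader("machelreid/m2d2", name) for name in domains}
```

Calling `load_domains(["cs.CL"])` would return a dictionary whose `"cs.CL"` entry is the same object as `dataset` in the snippet above.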

We're currently exploring ways to host this large amount of data online in an accessible manner, so please stay tuned! If you would like access sooner, feel free to reach out at {machelreid}-{at}-{google-dot-com}.

Evaluation Sets

The test sets for all domains are available at this Google Drive link, or via gdown:

#!/bin/bash
# install and/or upgrade gdown with pip
pip install --upgrade gdown
# Download M2D2 test sets
gdown "1U5wki_V-IFQy733HC6NO5ZuM2jaOaw8y"
tar -xvzf m2d2_test_sets.tar.gz
# File structure
# m2d2_test_sets/
# ├─ DOMAIN_AA/
# │  ├─ test.txt
# ├─ DOMAIN_AB/
# │  ├─ test.txt
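
Once the archive is extracted, the per-domain layout shown above can be walked programmatically. The following is a sketch under the assumption that every domain directory sits directly under `m2d2_test_sets/` and holds a single `test.txt`; `find_test_files` is a hypothetical helper, not a script from the repository.

```python
from pathlib import Path
from typing import Dict


def find_test_files(root: str = "m2d2_test_sets") -> Dict[str, Path]:
    """Map each domain directory under `root` to its test.txt file."""
    files = {}
    for domain_dir in sorted(Path(root).iterdir()):
        test_file = domain_dir / "test.txt"
        # Skip stray files and directories that lack a test.txt.
        if domain_dir.is_dir() and test_file.is_file():
            files[domain_dir.name] = test_file
    return files
```

The returned mapping can then be used to evaluate a model one domain at a time.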

Reproduction Scripts for Modeling

Scripts for finetuning language models are in lm_scripts/adapt.sh. We also provide a meta-script, lm_scripts/generate_multiple.sh, that generates scripts for multiple domains from an input file listing directories of domain-specific data (each directory should contain train.txt and valid.txt). Instructions and parameters are included in each file.
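
Before generating per-domain scripts, it can help to confirm that every listed directory actually contains both split files. The sketch below assumes the input file lists one directory per line, which this README does not specify; `missing_split_files` is a hypothetical helper and not part of lm_scripts.

```python
from pathlib import Path
from typing import Dict, List

# Split files each domain directory is expected to contain.
REQUIRED_FILES = ("train.txt", "valid.txt")


def missing_split_files(list_file: str) -> Dict[str, List[str]]:
    """Return {directory: [missing file names]} for the directories listed in `list_file`."""
    missing = {}
    for line in Path(list_file).read_text().splitlines():
        directory = line.strip()
        if not directory:
            continue  # ignore blank lines
        absent = [name for name in REQUIRED_FILES if not (Path(directory) / name).is_file()]
        if absent:
            missing[directory] = absent
    return missing
```

An empty result means every listed directory has both train.txt and valid.txt.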

For validation on multiple files, we also include lm_scripts/validate_on_multiple_files.py, which calculates perplexity given a model checkpoint and a file containing a list of evaluation text files.
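
For reference, token-level perplexity over several files is commonly computed as the exponential of the total negative log-likelihood divided by the total token count. The sketch below illustrates that standard definition only; it is not taken from validate_on_multiple_files.py, which may aggregate differently, and `corpus_perplexity` is a hypothetical name.

```python
import math
from typing import Iterable, Tuple


def corpus_perplexity(stats: Iterable[Tuple[float, int]]) -> float:
    """Perplexity from (summed negative log-likelihood in nats, token count) pairs.

    Each pair would typically come from evaluating one text file.
    """
    total_nll = 0.0
    total_tokens = 0
    for nll_sum, n_tokens in stats:
        total_nll += nll_sum
        total_tokens += n_tokens
    if total_tokens == 0:
        raise ValueError("no tokens to evaluate")
    # exp of the average per-token negative log-likelihood
    return math.exp(total_nll / total_tokens)
```

Pooling the sums before exponentiating weights each file by its token count, whereas averaging per-file perplexities would weight all files equally.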

Helper Scripts for Wikipedia Data Collection

For Wikipedia data collection, we include scripts for data dump processing (data_scripts/wiki/get_data), ontology gathering (data_scripts/wiki/ontology), and generating splits (data_scripts/wiki/split_generation).

Helper Scripts for S2ORC Data Collection

To be uploaded with documentation

Scripts to reproduce analyses in the paper

To be uploaded with documentation
