Here we provide instructions for downloading the data used in the DEMix paper. Note that downloading most datasets involves getting approval from the dataset hosts.
Download the 1B Words corpus from here: https://opensource.google/projects/lm-benchmark
Create an account and download the data here: https://case.law/
Follow instructions here to download papers: https://github.com/allenai/s2orc
Once the papers are downloaded, you can extract them using the script at domains/s2orc/extract_papers.py.
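If you want to preview what extraction involves, the S2ORC releases ship as JSON lines, one paper record per line. A minimal sketch of filtering records by field of study, assuming the metadata uses the `mag_field_of_study` list documented for S2ORC (check the release you download, as field names may differ):

```python
import json

def filter_papers(lines, field):
    """Yield paper records whose metadata lists the given field of study."""
    for line in lines:
        paper = json.loads(line)
        # "mag_field_of_study" is assumed from the S2ORC metadata schema;
        # a missing or null value is treated as no match.
        if field in (paper.get("mag_field_of_study") or []):
            yield paper

# Inline sample lines standing in for a downloaded metadata shard.
sample = [
    '{"paper_id": "1", "mag_field_of_study": ["Medicine"]}',
    '{"paper_id": "2", "mag_field_of_study": ["Computer Science"]}',
    '{"paper_id": "3", "mag_field_of_study": null}',
]
cs_papers = list(filter_papers(sample, "Computer Science"))
```

In practice you would stream the gzipped shards line by line rather than loading them whole.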
Download OpenWebText from here: https://skylion007.github.io/OpenWebTextCorpus/. Use the script at domains/openwebtext/unpack_openwebtext.py to unpack the data.
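The unpacking step boils down to extracting plain-text documents from compressed tar archives. A small sketch of that idea, assuming xz-compressed tars of `.txt` members (the demo builds a tiny in-memory archive so it is self-contained; the repo script handles the real shard layout):

```python
import io
import tarfile

def extract_texts(tar_bytes):
    """Return {member_name: text} for every .txt member of a tar archive."""
    texts = {}
    # mode "r:*" lets tarfile auto-detect the compression (xz, gz, ...).
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".txt"):
                texts[member.name] = tar.extractfile(member).read().decode("utf-8")
    return texts

# Build a tiny xz-compressed archive standing in for a downloaded shard.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:xz") as tar:
    data = b"hello openwebtext"
    info = tarfile.TarInfo(name="doc0.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

texts = extract_texts(buf.getvalue())
```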
Download the dataset from here: https://docs.google.com/forms/d/1LMAUeUtHNPXO9koyAIlDpvyKsLSYlrBj3rYhC30a7Ak/viewform?edit_requested=true
Download the raw review data from here: http://deepyeti.ucsd.edu/jianmo/amazon/index.html
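The raw Amazon dumps are JSON lines, one review per line. A minimal parsing sketch, assuming the `overall` (rating) and `reviewText` field names from the dataset's documentation (some records lack `reviewText` and are skipped):

```python
import json

def iter_reviews(lines):
    """Yield (rating, text) pairs from raw Amazon review JSON lines."""
    for line in lines:
        record = json.loads(line)
        # Field names assumed from the raw review dumps; records without
        # review text are skipped.
        if "reviewText" in record:
            yield record["overall"], record["reviewText"]

# Inline sample lines standing in for a downloaded (gzipped) review file.
sample = [
    '{"overall": 5.0, "asin": "B000001", "reviewText": "Great product."}',
    '{"overall": 2.0, "asin": "B000002"}',
]
reviews = list(iter_reviews(sample))
```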
Follow instructions here to download the data: https://github.com/aparrish/gutenberg-dammit
Download data here: https://console.cloud.google.com/marketplace/product/github/github-repos, under the contents table.
Download data here: https://allenai.org/data/qasper
Download data here: https://www.atticusprojectai.org/cuad
Download dataset here: https://www.semanticscholar.org/cord19/download
Sign up for the Twitter Academic API and download tweets in jsonl format.
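Whatever client you use to fetch tweets, the downstream expectation is one JSON object per line. A small sketch of serializing fetched tweets to jsonl (the tweet dicts here are placeholders for whatever your API client returns):

```python
import io
import json

def tweets_to_jsonl(tweets, fh):
    """Write one JSON object per line, the format expected downstream."""
    for tweet in tweets:
        # ensure_ascii=False keeps non-ASCII tweet text readable in the file.
        fh.write(json.dumps(tweet, ensure_ascii=False) + "\n")

# Placeholder tweets standing in for API responses.
tweets = [{"id": "1", "text": "hello"}, {"id": "2", "text": "world"}]
buf = io.StringIO()
tweets_to_jsonl(tweets, buf)
lines = buf.getvalue().splitlines()
```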
Use domain/scripts/fetch_articles.py to crawl breaking news articles. We use the URLs associated with high factuality in https://github.com/ramybaly/News-Media-Reliability/blob/master/data/acl2020/corpus.tsv.

python -m domain.scripts.fetch_articles --num-articles-per-source 100 --path-to-output news.jsonl
Download the dataset here: https://www.yelp.com/dataset
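The Yelp archive includes a review file with one JSON object per line. A quick sketch of inspecting it, assuming the `stars` field name from the Yelp Open Dataset documentation:

```python
import json
from collections import Counter

def star_histogram(lines):
    """Count reviews per star rating from Yelp review JSON lines."""
    # "stars" is assumed from the Yelp Open Dataset review schema.
    return Counter(json.loads(line)["stars"] for line in lines)

# Inline sample lines standing in for yelp_academic_dataset_review.json.
sample = [
    '{"review_id": "a", "stars": 5.0, "text": "Loved it."}',
    '{"review_id": "b", "stars": 5.0, "text": "Great."}',
    '{"review_id": "c", "stars": 1.0, "text": "Bad."}',
]
hist = star_histogram(sample)
```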