
Adding Google 1 Billion Benchmark Dataset to PyTorch dataset #645

Open
h56cho opened this issue Nov 19, 2019 · 1 comment · May be fixed by #688

Comments


h56cho commented Nov 19, 2019

Hello,

Currently, for language modelling, PyTorch has three built-in datasets (WikiText103, WikiText2, and Penn Treebank). Would it be possible to add the Google 1 Billion Benchmark dataset as one of PyTorch's built-in language modelling datasets?

The link to the Google 1 Billion Benchmark dataset is below:

http://www.statmt.org/lm-benchmark/

Thanks,

@zhangguanheng66 (Contributor) commented

Just for future contributors: the raw files are quite large compared with the current built-in language modelling datasets. However, it is still useful to include this benchmark dataset for torchtext users. My suggestion is to start with the "News Crawl corpus (2011 only)" subset.
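Since the raw corpus is distributed as many sharded plain-text files (one sentence per line), a loader for it would likely need to stream lines rather than read everything into memory. Below is a minimal sketch of that idea in plain Python; it is not torchtext's actual API, and the helper name and shard filenames are hypothetical, chosen only to resemble the benchmark's distribution format.

```python
import os
import tempfile


def iter_corpus_lines(shard_paths):
    """Yield one sentence (line) at a time across shard files,
    so the full corpus never has to fit in memory.

    NOTE: hypothetical helper for illustration, not part of torchtext.
    """
    for path in shard_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield line


# Demo with tiny stand-in shards mimicking the benchmark's naming scheme.
tmp_dir = tempfile.mkdtemp()
shard_paths = []
for i, text in enumerate(["hello world\n", "a second shard\nanother line\n"]):
    path = os.path.join(tmp_dir, f"news.en-{i:05d}-of-00100")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    shard_paths.append(path)

lines = list(iter_corpus_lines(shard_paths))
print(lines)  # → ['hello world', 'a second shard', 'another line']
```

Streaming per line (or per shard) would also make it easy to expose only a subset such as the 2011 News Crawl corpus first, as suggested above.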
