
Adding Google 1 Billion Benchmark Dataset to PyTorch dataset #645

Open
h56cho opened this issue Nov 19, 2019 · 1 comment · May be fixed by #688

Comments


h56cho commented Nov 19, 2019

Hello,

Currently, for language modelling, PyTorch has three built-in datasets (WikiText103, WikiText2, and Penn Treebank). Would it be possible to add the Google 1 Billion Benchmark dataset as one of PyTorch's built-in language modelling datasets?

The link to the Google 1 Billion Benchmark dataset is below:

http://www.statmt.org/lm-benchmark/

Thanks,

@zhangguanheng66 (Contributor) commented

Just for future contributors: the raw files are quite large compared with the current built-in language modelling datasets. However, it is still useful to include this benchmark dataset for torchtext users. My suggestion is to start with the "News Crawl corpus (2011 only)" subset.
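Since the raw corpus is distributed as many sharded plain-text files (one sentence per line), a loader for it would likely need to stream lines rather than read everything into memory. Below is a minimal sketch of that idea in plain Python; it is not torchtext's actual API, and the helper name and shard filenames are hypothetical, chosen only to resemble the benchmark's distribution format.

```python
import os
import tempfile


def iter_corpus_lines(shard_paths):
    """Yield one sentence (line) at a time across shard files,
    so the full corpus never has to fit in memory.

    NOTE: hypothetical helper for illustration, not part of torchtext.
    """
    for path in shard_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield line


# Demo with tiny stand-in shards mimicking the benchmark's naming scheme.
tmp_dir = tempfile.mkdtemp()
shard_paths = []
for i, text in enumerate(["hello world\n", "a second shard\nanother line\n"]):
    path = os.path.join(tmp_dir, f"news.en-{i:05d}-of-00100")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    shard_paths.append(path)

lines = list(iter_corpus_lines(shard_paths))
print(lines)  # → ['hello world', 'a second shard', 'another line']
```

Streaming per line (or per shard) would also make it easy to expose only a subset such as the 2011 News Crawl corpus first, as suggested above.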
