Added WMT News Crawl Dataset for language modeling #688

Open: wants to merge 22 commits into main from feature/news_crawl

Changes from all commits (22):
4f5a3d3  Added WMT News Crawl Dataset for language modeling (anmolsjoshi, Feb 3, 2020)
40a0b3e  Removed WMT from torchtext.datasets (anmolsjoshi, Feb 4, 2020)
b498950  Fixed error related to tar files (anmolsjoshi, Feb 4, 2020)
1a016d5  Added WMT to experimental dataset (anmolsjoshi, Feb 4, 2020)
e113246  Working tests and text classification dataset (anmolsjoshi, Feb 5, 2020)
71eeaf3  Updated docstrings (anmolsjoshi, Feb 5, 2020)
4c807a9  Added line to join root with extracted files (anmolsjoshi, Feb 5, 2020)
8b2e372  Merge branch 'master' of https://github.com/pytorch/text into feature… (anmolsjoshi, Feb 24, 2020)
960ab99  Incorporated comments (anmolsjoshi, Feb 24, 2020)
3b39db2  Reverted files to master version (anmolsjoshi, Feb 24, 2020)
061fff1  Revered test files to master version (anmolsjoshi, Feb 24, 2020)
04f497c  spacing (anmolsjoshi, Feb 24, 2020)
d259c7d  Fixed arguments (anmolsjoshi, Feb 24, 2020)
a7d720c  fixed flake8 errors (anmolsjoshi, Feb 24, 2020)
e6308d1  Added test for WMTNewsCrawl (anmolsjoshi, Feb 25, 2020)
27fb2d3  fixed flake8 issues (anmolsjoshi, Feb 25, 2020)
5be9f51  Added a test for incorrect option for data_select (anmolsjoshi, Feb 25, 2020)
62b20e9  Merge branch 'master' into feature/news_crawl (anmolsjoshi, Feb 25, 2020)
de521be  Added option for year for WMT, included information by year (anmolsjoshi, Feb 26, 2020)
c228e80  Updated Example code (anmolsjoshi, Feb 26, 2020)
b5e8a02  Added validation for language and year with tests (anmolsjoshi, Feb 26, 2020)
2a2cbea  Merge branch 'master' into feature/news_crawl (anmolsjoshi, Apr 15, 2020)

30 changes: 30 additions & 0 deletions test/data/test_builtin_datasets.py
@@ -51,6 +51,36 @@ def test_wikitext2(self):
datafile = os.path.join(self.project_root, ".data", "wikitext-2-v1.zip")
conditional_remove(datafile)

def test_wmtnewscrawl(self):
from torchtext.experimental.datasets import WMTNewsCrawl
# smoke test to ensure WMT News Crawl works properly
train_dataset, = WMTNewsCrawl(data_select='train')
self.assertEqual(len(train_dataset), 54831406)

vocab = train_dataset.get_vocab()
tokens_ids = [vocab[token] for token in 'the player characters rest'.split()]
self.assertEqual(tokens_ids, [3, 1009, 2920, 1135])

# Delete the dataset after we're done to save disk space on CI
datafile = os.path.join(self.project_root, ".data", "training-monolingual")
conditional_remove(datafile)
datafile = os.path.join(self.project_root,
".data",
"training-monolingual-news-2011.tgz")
conditional_remove(datafile)

# Raises ValueError for incorrect option for data_select
with self.assertRaises(ValueError):
train_dataset, = WMTNewsCrawl(data_select='valid')

# Raises ValueError for incorrect option for year
with self.assertRaises(ValueError):
train_dataset, = WMTNewsCrawl(data_select='train', year=2005)

# Raises ValueError for incorrect option for language
with self.assertRaises(ValueError):
train_dataset, = WMTNewsCrawl(data_select='train', language='jp')

@slow
def test_penntreebank_legacy(self):
from torchtext.datasets import PennTreebank
3 changes: 2 additions & 1 deletion torchtext/experimental/datasets/__init__.py
@@ -1,8 +1,9 @@
from .language_modeling import LanguageModelingDataset, WikiText2, WikiText103, PennTreebank # NOQA
from .language_modeling import LanguageModelingDataset, WikiText2, WikiText103, PennTreebank, WMTNewsCrawl # NOQA
from .text_classification import IMDB

__all__ = ['LanguageModelingDataset',
'WikiText2',
'WikiText103',
'PennTreebank',
'WMTNewsCrawl',
'IMDB']
90 changes: 84 additions & 6 deletions torchtext/experimental/datasets/language_modeling.py
@@ -15,7 +15,8 @@
'PennTreebank':
['https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.train.txt',
'https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.test.txt',
'https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.valid.txt']
'https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.valid.txt'],
'WMTNewsCrawl': 'http://www.statmt.org/wmt11/training-monolingual-news-{}.tgz'
}


@@ -26,7 +27,7 @@ class LanguageModelingDataset(torch.utils.data.Dataset):
- WikiText2
- WikiText103
- PennTreebank

- WMTNewsCrawl
"""

def __init__(self, data, vocab):
@@ -73,7 +74,7 @@ def _get_datafile_path(key, extracted_files):

def _setup_datasets(dataset_name, tokenizer=get_tokenizer("basic_english"),
root='.data', vocab=None, removed_tokens=[],
data_select=('train', 'test', 'valid')):
data_select=('train', 'test', 'valid'), **kwargs):
Review comment (Contributor):
I'm not a fan of adding a generic **kwargs to this function, especially since it seems to serve the function of expanding data_select, which is already a well-defined parameter. This runs the risk of making the APIs inconsistent. We should think about how to extend data_select to support cases like this.

The obvious choice is to expand it by using the cross-product, but that'll yield a lot of potential arguments. The other idea could be to have a NamedTuple object as a sort of argument object. Also, this seems to only create the training dataset. This means there is no predefined split, so the distinction between training, validation and test is arbitrary / not defined, so we might as well drop it. We could create a WMTNewsCrawlOptions namedtuple that is constructed by giving it a Year and a Language and passed to data_select. I'm sure there are some other options here as well.
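
A minimal sketch of that options-object idea, for illustration only; WMTNewsCrawlOptions and passing it through data_select are hypothetical, not part of this PR or of torchtext:

from collections import namedtuple

# Hypothetical argument object from the review suggestion; not a real torchtext API.
WMTNewsCrawlOptions = namedtuple('WMTNewsCrawlOptions', ['year', 'language'])

# The caller would describe the single split it wants with a structured value
# instead of passing free-form **kwargs alongside data_select:
opts = WMTNewsCrawlOptions(year=2011, language='en')
# train_dataset, = WMTNewsCrawl(data_select=opts)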


if isinstance(data_select, str):
data_select = [data_select]
@@ -85,6 +86,31 @@ def _setup_datasets(dataset_name, tokenizer=get_tokenizer("basic_english"),
select_to_index = {'train': 0, 'test': 1, 'valid': 2}
extracted_files = [download_from_url(URLS['PennTreebank'][select_to_index[key]],
root=root) for key in data_select]
elif dataset_name == 'WMTNewsCrawl':
if not set(data_select).issubset({'train'}):
raise ValueError("Invalid option for data_select, got {}. "
"WMTNewsCrawl only creates a training dataset. "
"data_select should be 'train' "
"or ('train',).".format(data_select))

year = kwargs.get('year', 2011)
if str(year) not in ['2007', '2008', '2009', '2010', '2011']:
raise ValueError("Invalid option for year, {}. "
"WMTNewsCrawl dataset is only available for "
"years between 2007-2011.".format(year))

language = kwargs.get('language', 'en')
if language not in ['cs', 'de', 'en', 'es', 'fr']:
raise ValueError("Invalid option for language, {}. "
"WMTNewsCrawl dataset is only available for "
"cs, de, en, es, fr.".format(language))

download_url = URLS[dataset_name].format(year)
fname = 'news.{year}.{language}.shuffled'.format(year=year, language=language)

dataset_tar = download_from_url(download_url, root=root)
extracted_files = extract_archive(dataset_tar)
extracted_files = [f for f in extracted_files if fname in f]
else:
dataset_tar = download_from_url(URLS[dataset_name], root=root)
extracted_files = extract_archive(dataset_tar)
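
For illustration only (not part of the diff): this is how the WMTNewsCrawl branch above resolves the download URL and the target file name for the defaults year=2011 and language='en'.

# Standalone illustration mirroring the WMTNewsCrawl branch above.
url_template = 'http://www.statmt.org/wmt11/training-monolingual-news-{}.tgz'
year, language = 2011, 'en'
download_url = url_template.format(year)
# -> http://www.statmt.org/wmt11/training-monolingual-news-2011.tgz
fname = 'news.{year}.{language}.shuffled'.format(year=year, language=language)
# -> news.2011.en.shuffled, used to pick the matching file out of the archive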
@@ -139,7 +165,7 @@ def WikiText2(*args, **kwargs):
vocab: Vocabulary used for dataset. If None, it will generate a new
vocabulary based on the train data set.
removed_tokens: removed tokens from output dataset (Default: [])
data_select: a string or tuple for the returned datasets
data_select: a string or tuple for the returned datasets
(Default: ('train', 'test','valid'))
By default, all the three datasets (train, test, valid) are generated. Users
could also choose any one or two of them, for example ('train', 'test') or
@@ -181,7 +207,7 @@ def WikiText103(*args, **kwargs):
If 'train' is not in the tuple, a vocab object should be provided which will
be used to process valid and/or test data.
removed_tokens: removed tokens from output dataset (Default: [])
data_select: a string or tuple for the returned datasets
data_select: a string or tuple for the returned datasets
(Default: ('train', 'test','valid'))
By default, all the three datasets (train, test, valid) are generated. Users
could also choose any one or two of them, for example ('train', 'test') or
@@ -218,7 +244,7 @@ def PennTreebank(*args, **kwargs):
vocab: Vocabulary used for dataset. If None, it will generate a new
vocabulary based on the train data set.
removed_tokens: removed tokens from output dataset (Default: [])
data_select: a string or tuple for the returned datasets
data_select: a string or tuple for the returned datasets
(Default: ('train', 'test','valid'))
By default, all the three datasets (train, test, valid) are generated. Users
could also choose any one or two of them, for example ('train', 'test') or
@@ -238,3 +264,55 @@ def PennTreebank(*args, **kwargs):
"""

return _setup_datasets(*(("PennTreebank",) + args), **kwargs)


def WMTNewsCrawl(*args, **kwargs):
""" Defines WMT News Crawl.
Create language modeling dataset: WMTNewsCrawl
Creates a training set only
Arguments:
tokenizer: the tokenizer used to preprocess raw text data.
The default one is the basic_english tokenizer in fastText. The spacy tokenizer
is supported as well (see example below). A custom tokenizer is a callable
that takes a string as input and returns a list of tokens.
root: Directory where the datasets are saved. Default: ".data"
vocab: Vocabulary used for dataset. If None, it will generate a new
vocabulary based on the train data set.
removed_tokens: removed tokens from output dataset (Default: [])
data_select: a string or tuple for the returned datasets.
(Default: 'train')
Only a training dataset is available for the News Crawl corpus (selected by year).
year: year of dataset to use. Choices are 2007-2011.
See details below for memory and token size.
(Default: 2011)
language: language for dataset. Choices are cs, de, en, es, fr.
(Default: 'en')

News Crawl Corpus Dataset Details from `Statistical Machine Translation`__:
+------+--------+--------------------+------------+----------+
| Year | Memory | len(train_dataset) | len(vocab) | lines |
+======+========+====================+============+==========+
| 2007 | 1.1 GB | 338142548 | 573176 | 13984262 |
+------+--------+--------------------+------------+----------+
| 2008 | 3.4 GB | 838296018 | 1091099 | 34737842 |
+------+--------+--------------------+------------+----------+
| 2009 | 3.7 GB | 1027839909 | 1236276 | 44041422 |
+------+--------+--------------------+------------+----------+
| 2010 | 1.4 GB | 399857558 | 680765 | 17676013 |
+------+--------+--------------------+------------+----------+
| 2011 | 229 MB | 54831406 | 208279 | 2466169 |
+------+--------+--------------------+------------+----------+

NOTE: Memory refers to the size of the downloaded tar.gz file.
NOTE: The other metrics are for the English datasets.

Examples:
>>> from torchtext.experimental.datasets import WMTNewsCrawl
>>> train_dataset, = WMTNewsCrawl(data_select='train', language='en', year=2011)
>>> vocab = train_dataset.get_vocab()

__ http://www.statmt.org/wmt11/translation-task.html#download
"""
return _setup_datasets(*(("WMTNewsCrawl",) + args), **kwargs)