Added WMT News Crawl Dataset for language modeling #688

Open · wants to merge 22 commits into main
Conversation

anmolsjoshi (Contributor)

Fixes #645

  • Added WMT News Crawl dataset for language modeling

@anmolsjoshi (Contributor, Author)

@zhangguanheng66 I added the 2011 dataset per your instructions - was unable to find the validation and test sets. What are your thoughts on this?

@zhangguanheng66 (Contributor) left a comment

Thanks @anmolsjoshi for the contribution. As you may know, we introduced a new dataset abstraction in the 0.5.0 release (link), which is more compatible with torch.utils.data. We plan to retire all the legacy code after we rewrite it. For WLM datasets, we have new datasets in experimental (link). Could you follow the new abstraction for the News Crawl dataset?

Regarding the valid/test datasets, it's fine to have only a train dataset if the original data comes in a single file. Users can later split the data into train/valid/test themselves.
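A user-side split of a single train file can be sketched as follows; the helper name and split fractions are hypothetical, and torch.utils.data.random_split offers the same idea for Dataset objects:

```python
import random

def split_dataset(lines, valid_frac=0.05, test_frac=0.05, seed=42):
    """Split a single list of examples into train/valid/test subsets."""
    rng = random.Random(seed)
    indices = list(range(len(lines)))
    rng.shuffle(indices)
    n_valid = int(len(lines) * valid_frac)
    n_test = int(len(lines) * test_frac)
    valid = [lines[i] for i in indices[:n_valid]]
    test = [lines[i] for i in indices[n_valid:n_valid + n_test]]
    train = [lines[i] for i in indices[n_valid + n_test:]]
    return train, valid, test

# 100 lines -> 90 train / 5 valid / 5 test with the defaults above
train, valid, test = split_dataset([f"line {i}" for i in range(100)])
```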

@anmolsjoshi (Contributor, Author)

anmolsjoshi commented Feb 4, 2020

@zhangguanheng66 I noticed an issue while writing this PR: zip and tar files are handled differently.

Assuming both .zip and .tar files are stored in .data, the filenames returned by extract_archive differ:

.zip outputs files such as "en-us/train.txt"

.tar outputs files such as ".data/en-us/train.txt"

This causes an inconsistency in the _setup_datasets function, where root is joined to the returned filenames.

Is this by design? In my opinion, it should be corrected.
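One way to paper over the inconsistency on the caller's side can be sketched as follows; the helper name is hypothetical, not part of torchtext:

```python
import os

def normalize_extracted_paths(filenames, root=".data"):
    """Report every extracted filename under the same root, regardless of
    whether extract_archive produced zip-style or tar-style paths."""
    normalized = []
    for name in filenames:
        # tar-style output already carries the root prefix; zip-style does not.
        if not name.startswith(root + os.sep) and name != root:
            name = os.path.join(root, name)
        normalized.append(name)
    return normalized

# zip-style output gains the root prefix; tar-style output is unchanged
normalize_extracted_paths(["en-us/train.txt"])        # ['.data/en-us/train.txt']
normalize_extracted_paths([".data/en-us/train.txt"])  # ['.data/en-us/train.txt']
```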

If you feel there is a mistake in the download function, you could open a separate issue and/or PR to fix it. It's better to do that separately, for a clean record.

@anmolsjoshi (Contributor, Author)

@zhangguanheng66 as a note, this PR is able to download files correctly and set up the dataset just fine. But it takes a very long time to create the dataset, given that the News Crawl dataset has over 2 million rows; tokenizing the entire dataset takes a while.

@zhangguanheng66 (Contributor)

> @zhangguanheng66 as a note, this PR is able to download files correctly and set up the dataset just fine. But it takes a very long time to create the dataset, given that there are over 2 million rows in the News Crawl dataset - it takes a while to tokenize the entire dataset.

Thanks for the contribution. I understand; it's a huge dataset.

How do you feel about the new abstraction, compared with the old pattern? Any feedback?

@anmolsjoshi (Contributor, Author)

@zhangguanheng66 this PR should be good to go. Let me know if you have any comments!

@anmolsjoshi (Contributor, Author)

@zhangguanheng66 any thoughts on overloading the iter method for language modeling?

@zhangguanheng66 (Contributor)

> @zhangguanheng66 any thoughts on overloading the iter method for language modeling?

Ideally, iteration should be handled by DataLoader rather than by torchtext. We want to eventually retire that legacy code, including batch and split.
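The intended division of labor (the dataset holds indexed examples; batching and iteration live in DataLoader) can be sketched without torch as follows; all names here are illustrative stand-ins, not torchtext APIs:

```python
class LanguageModelingDataset:
    """Map-style dataset: holds tokenized data, no batching logic of its own."""
    def __init__(self, data):
        self.data = data

    def __getitem__(self, i):
        return self.data[i]

    def __len__(self):
        return len(self.data)

def batches(dataset, batch_size):
    """Stand-in for the batching role of torch.utils.data.DataLoader."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]

ds = LanguageModelingDataset(list(range(10)))
list(batches(ds, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```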

torchtext/utils.py (outdated review thread, resolved)
test/test_utils.py (outdated review thread, resolved)
@@ -90,6 +87,12 @@ def _setup_datasets(dataset_name, tokenizer=get_tokenizer("basic_english"),
dataset_tar = download_from_url(URLS[dataset_name], root=root)
extracted_files = [os.path.join(root, d) for d in extract_archive(dataset_tar)]

if dataset_name == "WMTNewsCrawl":
data_select = ('train',)
Contributor

Why is there a hardcoded data_select?

Contributor Author

There is no validation or test set provided. Would you prefer I add it as a default argument?

Contributor

If this is the case, you should error out, say it doesn't exist, and have the user explicitly pass in this option. There are a few arguments in favor of this:

  • the number of expected return values depends on this option.
  • the user might not be aware that there is no validation and no test dataset.
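A check along these lines could look like the following sketch; the registry and helper name are hypothetical, not the PR's actual code:

```python
def check_data_select(dataset_name, data_select):
    """Fail loudly when a requested split does not exist for the dataset."""
    # WMTNewsCrawl ships a single file, so only 'train' is available.
    available = {"WMTNewsCrawl": ("train",)}
    allowed = available.get(dataset_name)
    if allowed is not None:
        missing = set(data_select) - set(allowed)
        if missing:
            raise ValueError(
                f"{dataset_name} only provides {allowed}; "
                f"requested unavailable split(s): {sorted(missing)}")

check_data_select("WMTNewsCrawl", ("train",))            # fine
# check_data_select("WMTNewsCrawl", ("train", "valid"))  # raises ValueError
```

Erroring out here means the number of returned datasets always matches what the user explicitly asked for.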

Contributor Author

Agreed! I will account for that and push changes later today.

@anmolsjoshi (Contributor, Author)

anmolsjoshi commented Feb 5, 2020

The main reason for these changes is the code below.

I can revert all my code to the original and include an if/else in the _setup_datasets function for non-zip files:

import os
from torchtext.utils import extract_archive, download_from_url

os.mkdir('.data')
download_from_url('http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/validation.tar.gz')
# '.data/validation.tar.gz'

download_from_url('https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip')
# '.data/wikitext-2-v1.zip'

extract_archive('.data/validation.tar.gz')
# ['.data/val.de', '.data/val.en']

extract_archive('.data/wikitext-2-v1.zip')
# ['wikitext-2/', 'wikitext-2/wiki.test.tokens', 'wikitext-2/wiki.valid.tokens', 'wikitext-2/wiki.train.tokens']

@cpuhrsch @zhangguanheng66

@cpuhrsch (Contributor)

cpuhrsch commented Feb 5, 2020

@anmolsjoshi - I think it makes sense to fix this separately and before merging this.

@anmolsjoshi (Contributor, Author)

@zhangguanheng66 @cpuhrsch - I'll push a branch up later today fixing this tar/zip issue. And we can move forward with the WMT dataset after.

Thanks for your review!

@anmolsjoshi (Contributor, Author)

@zhangguanheng66 @cpuhrsch - a quick question before I proceed: are the filenames returned for zip or for tar correct, i.e. should the root folder be prepended to the path?

I think the tar behavior (root prepended) might be better, as it is consistent with the download_from_url function

@cpuhrsch (Contributor)

cpuhrsch commented Feb 6, 2020

@anmolsjoshi - I think we should always return the full path, because it gives more flexibility and the user knows exactly what's going on. So, if prepending root gives us that, that's what we should do.

@anmolsjoshi (Contributor, Author)

anmolsjoshi commented Feb 24, 2020

Closing in favor of #700

@anmolsjoshi anmolsjoshi reopened this Feb 24, 2020
@anmolsjoshi (Contributor, Author)

@zhangguanheng66 @cpuhrsch I have incorporated changes requested in an earlier review and made some additional changes. Here is a summary:

  • Removed all code related to the extract_archive fix and moved it to [Bug Fixing][BC Breaking] Unify tar and zip handling with extract_archive #692
  • Added a check on data_select for WMT News Crawl; only train is allowed, since the other splits are not provided
  • Added tests that load the dataset and check that errors are raised for incorrect data_select
  • Corrected spelling errors, mainly tupel to tuple

@zhangguanheng66 (Contributor) left a comment

Regarding your earlier question: I just ran the dataset locally, and it turns out most of the time is spent downloading the file. Other than that, I don't think it's unreasonably slow.

One question regarding WMTNewsCrawl: its size is only about half of WikiText103's. Is this the largest Crawl dataset we could have?

@zhangguanheng66 (Contributor)

It seems there is an even larger Wikitext dataset, like this one:
https://dl.fbaipublicfiles.com/fairseq/data/wikipedia.en_filtered.gz
Any thoughts?

@anmolsjoshi (Contributor, Author)

Reading from the website, 2009 is the largest dataset.

  • From Europarl (403MB) md5 sha1
  • From the News Commentary corpus (41MB) md5 sha1
  • From the News Crawl corpus (2007 only) (1.1 GB) md5
  • From the News Crawl corpus (2008 only) (3.4 GB) md5
  • From the News Crawl corpus (2009 only) (3.7 GB) md5
  • From the News Crawl corpus (2010 only) (1.4 GB) md5
  • From the News Crawl corpus (2011 only) (229 MB) md5 sha1

The reason I picked 2011 was due to your comment on #645

@zhangguanheng66 (Contributor)

> Reading from the website, 2009 is the largest dataset.
>
> • From Europarl (403MB) md5 sha1
> • From the News Commentary corpus (41MB) md5 sha1
> • From the News Crawl corpus (2007 only) (1.1 GB) md5
> • From the News Crawl corpus (2008 only) (3.4 GB) md5
> • From the News Crawl corpus (2009 only) (3.7 GB) md5
> • From the News Crawl corpus (2010 only) (1.4 GB) md5
> • From the News Crawl corpus (2011 only) (229 MB) md5 sha1
>
> The reason I picked 2011 was due to your comment on #645

I think it could be interesting to include at least one more, larger Crawl corpus so users have the option to try both.

@anmolsjoshi (Contributor, Author)

Is the idea to have multiple functions for different years' datasets or provide an argument for the year?

@zhangguanheng66 (Contributor)

> Is the idea to have multiple functions for different years' datasets or provide an argument for the year?

Correct me if I'm wrong, @anmolsjoshi @cpuhrsch, but I don't think there are significant differences between the years. We could just provide different Crawl corpora of different sizes; users may want to train their models on different datasets depending on their memory.

@anmolsjoshi (Contributor, Author)

Should I update the current dataset to 2009?

Which other datasets would you want to provide?

@zhangguanheng66 (Contributor)

Maybe we could add one more argument (as you did for language) so users can explicitly choose the corpus they like. And in the docs, we clearly mark the number of tokens and the memory size for each dataset.

I'm using the 2010 English one, 1.4 GB in total, for my BERT model. The pipeline you built is flexible enough to switch between years/languages. Thanks a lot.

@anmolsjoshi (Contributor, Author)

anmolsjoshi commented Feb 26, 2020

@zhangguanheng66 thanks for the comments.

I've added an option where users can pass the year, plus a table in the docstrings with details about the News Crawl datasets by year.

Let me know what you think.

What are your thoughts on adding a check whether the provided year and language are valid?
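Such a validity check might be sketched as follows; the year/size registry mirrors the list quoted earlier in the thread, while the language tuple and the function name are purely illustrative:

```python
# Hypothetical registry; years and sizes are taken from the corpus list above.
NEWS_CRAWL_YEARS = {2007: "1.1 GB", 2008: "3.4 GB", 2009: "3.7 GB",
                    2010: "1.4 GB", 2011: "229 MB"}
SUPPORTED_LANGUAGES = ("en", "de")  # illustrative subset, not the real list

def validate_news_crawl_args(year, language):
    """Fail early with a clear message when year/language are unavailable."""
    if year not in NEWS_CRAWL_YEARS:
        raise ValueError(
            f"year must be one of {sorted(NEWS_CRAWL_YEARS)}, got {year}")
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"language must be one of {SUPPORTED_LANGUAGES}, got {language!r}")

validate_news_crawl_args(2010, "en")  # passes silently
```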

@anmolsjoshi (Contributor, Author)

@zhangguanheng66 I saw the discussion in #691 and #690, and the code in #696. Is there value in decoupling vocab and LanguageModelingDataset as well?

@zhangguanheng66 (Contributor)

@anmolsjoshi We want to decouple the vocab object from the dataset but are not yet sure of the design. I will work on some cases and pull you in for a look.

@anmolsjoshi (Contributor, Author)

Thanks! Let me know if any other changes are needed on this PR!

@@ -73,7 +74,7 @@ def _get_datafile_path(key, extracted_files):

def _setup_datasets(dataset_name, tokenizer=get_tokenizer("basic_english"),
root='.data', vocab=None, removed_tokens=[],
data_select=('train', 'test', 'valid')):
data_select=('train', 'test', 'valid'), **kwargs):
Contributor

I'm not a fan of adding a generic **kwargs to this function, especially since it seems to serve the purpose of expanding data_select, which is already a well-defined parameter. This runs the risk of making the APIs inconsistent. We should think about how to extend data_select to support cases like this.

The obvious choice is to expand it by using the cross-product, but that would yield a lot of potential arguments. The other idea could be to have a NamedTuple object as a sort of argument object. Also, this seems to create only the training dataset, which means there is no predefined split; the distinction between training, validation, and test is arbitrary/undefined, so we might as well drop it. We could create a WMTNewsCrawlOptions namedtuple that is constructed from a Year and a Language and passed to data_select. I'm sure there are other options here as well.
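The NamedTuple idea could be sketched like this; WMTNewsCrawlOptions is the class proposed above, while resolve_corpus_name and its naming scheme are invented purely for illustration:

```python
from typing import NamedTuple

class WMTNewsCrawlOptions(NamedTuple):
    """Argument object bundling dataset-specific knobs, instead of **kwargs."""
    year: int
    language: str

def resolve_corpus_name(selection):
    """Turn a plain split name or an options object into a concrete corpus id."""
    if isinstance(selection, WMTNewsCrawlOptions):
        # Hypothetical naming scheme, for illustration only.
        return f"news.{selection.year}.{selection.language}"
    return selection

resolve_corpus_name(WMTNewsCrawlOptions(year=2011, language="en"))
# 'news.2011.en'
```

Because NamedTuple fields are typed and positional, the options object keeps the data_select signature stable while still letting each dataset define its own knobs.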

@facebook-github-bot (Contributor)

Hi @anmolsjoshi!

Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but we do not have a signature on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

Development

Successfully merging this pull request may close these issues.

Adding Google 1 Billion Benchmark Dataset to PyTorch dataset
4 participants