Added WMT News Crawl Dataset for language modeling #688

Open · wants to merge 22 commits into main
Conversation

anmolsjoshi (Contributor)

Fixes #645

  • Added WMT News Crawl dataset for language modeling

@anmolsjoshi (Contributor, Author)

@zhangguanheng66 I added the 2011 dataset per your instructions - was unable to find the validation and test sets. What are your thoughts on this?

@zhangguanheng66 (Contributor) left a comment

Thanks @anmolsjoshi for the contribution. As you may know, we introduced a new dataset abstraction in the 0.5.0 release (link), which is more compatible with torch.utils.data. We plan to retire all the legacy code after we rewrite it. For WLM datasets, we have new datasets in experimental (link). Could you follow the new abstraction for the News Crawl dataset?

Regarding the valid/test datasets, it's fine to have only a train dataset if the original data comes in a single file. Users can later split the data into train/valid/test themselves.
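A user-side split of a single train file can be sketched as follows; the helper name and split fractions are hypothetical, and torch.utils.data.random_split offers the same idea for Dataset objects:

```python
import random

def split_dataset(lines, valid_frac=0.05, test_frac=0.05, seed=42):
    """Split a single list of examples into train/valid/test subsets."""
    rng = random.Random(seed)
    indices = list(range(len(lines)))
    rng.shuffle(indices)
    n_valid = int(len(lines) * valid_frac)
    n_test = int(len(lines) * test_frac)
    valid = [lines[i] for i in indices[:n_valid]]
    test = [lines[i] for i in indices[n_valid:n_valid + n_test]]
    train = [lines[i] for i in indices[n_valid + n_test:]]
    return train, valid, test

# 100 lines -> 90 train / 5 valid / 5 test with the defaults above
train, valid, test = split_dataset([f"line {i}" for i in range(100)])
```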

@anmolsjoshi (Contributor, Author)

anmolsjoshi commented Feb 4, 2020

@zhangguanheng66 I noticed an issue while writing this PR: zip and tar files are handled differently.

Assuming both .zip and .tar files are stored in .data, the filenames returned by extract_archive differ:

.zip outputs files such as "en-us/train.txt"

.tar outputs files such as ".data/en-us/train.txt"

This causes an inconsistency in the _setup_datasets function, where root is joined to the returned filenames.

Is this by design? In my opinion, it should be corrected.
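One way to paper over the inconsistency on the caller's side can be sketched as follows; the helper name is hypothetical, not part of torchtext:

```python
import os

def normalize_extracted_paths(filenames, root=".data"):
    """Report every extracted filename under the same root, regardless of
    whether extract_archive produced zip-style or tar-style paths."""
    normalized = []
    for name in filenames:
        # tar-style output already carries the root prefix; zip-style does not.
        if not name.startswith(root + os.sep) and name != root:
            name = os.path.join(root, name)
        normalized.append(name)
    return normalized

# zip-style output gains the root prefix; tar-style output is unchanged
normalize_extracted_paths(["en-us/train.txt"])        # ['.data/en-us/train.txt']
normalize_extracted_paths([".data/en-us/train.txt"])  # ['.data/en-us/train.txt']
```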

If you feel there is a mistake in the download function, you could open a separate issue and/or PR to fix it. It's better to do that separately, for a clean record.

@anmolsjoshi (Contributor, Author)

@zhangguanheng66 as a note, this PR is able to download files correctly and set up the dataset just fine. But it takes a very long time to create the dataset, given that the News Crawl dataset has over 2 million rows; tokenizing the entire dataset takes a while.

@zhangguanheng66 (Contributor)

> @zhangguanheng66 as a note, this PR is able to download files correctly and set up the dataset just fine. But it takes a very long time to create the dataset, given that there are over 2 million rows in the News Crawl dataset - it takes a while to tokenize the entire dataset.

Thanks for the contribution. I understand; it's a huge dataset.

How do you feel about the new abstraction, compared with the old pattern? Any feedback?

@anmolsjoshi (Contributor, Author)

@zhangguanheng66 this PR should be good to go. Let me know if you have any comments!

@anmolsjoshi (Contributor, Author)

@zhangguanheng66 any thoughts on overloading the iter method for language modeling?

@zhangguanheng66 (Contributor)

> @zhangguanheng66 any thoughts on overloading the iter method for language modeling?

Ideally, iteration should be handled by DataLoader rather than by torchtext. We want to eventually retire that legacy code, including batch and split.
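The intended division of labor (the dataset holds indexed examples; batching and iteration live in DataLoader) can be sketched without torch as follows; all names here are illustrative stand-ins, not torchtext APIs:

```python
class LanguageModelingDataset:
    """Map-style dataset: holds tokenized data, no batching logic of its own."""
    def __init__(self, data):
        self.data = data

    def __getitem__(self, i):
        return self.data[i]

    def __len__(self):
        return len(self.data)

def batches(dataset, batch_size):
    """Stand-in for the batching role of torch.utils.data.DataLoader."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]

ds = LanguageModelingDataset(list(range(10)))
list(batches(ds, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```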

torchtext/utils.py (outdated review thread, resolved)
test/test_utils.py (outdated review thread, resolved)
@@ -90,6 +87,12 @@ def _setup_datasets(dataset_name, tokenizer=get_tokenizer("basic_english"),
dataset_tar = download_from_url(URLS[dataset_name], root=root)
extracted_files = [os.path.join(root, d) for d in extract_archive(dataset_tar)]

if dataset_name == "WMTNewsCrawl":
data_select = ('train',)
Contributor

Why is there a hardcoded data_select?

Contributor Author

There is no validation or test set provided. Would you prefer I add it as a default argument?

Contributor

If this is the case, you should error out, say it doesn't exist, and have the user explicitly pass in this option. There are a few arguments in favor of this:

  • the number of expected return values depends on this option.
  • the user might not be aware that there is no validation and no test dataset.
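A check along these lines could look like the following sketch; the registry and helper name are hypothetical, not the PR's actual code:

```python
def check_data_select(dataset_name, data_select):
    """Fail loudly when a requested split does not exist for the dataset."""
    # WMTNewsCrawl ships a single file, so only 'train' is available.
    available = {"WMTNewsCrawl": ("train",)}
    allowed = available.get(dataset_name)
    if allowed is not None:
        missing = set(data_select) - set(allowed)
        if missing:
            raise ValueError(
                f"{dataset_name} only provides {allowed}; "
                f"requested unavailable split(s): {sorted(missing)}")

check_data_select("WMTNewsCrawl", ("train",))            # fine
# check_data_select("WMTNewsCrawl", ("train", "valid"))  # raises ValueError
```

Erroring out here means the number of returned datasets always matches what the user explicitly asked for.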

Contributor Author

Agreed! I will account for that and push changes later today.

@anmolsjoshi (Contributor, Author)

anmolsjoshi commented Feb 5, 2020

The main reason for these changes is the code below.

I can revert all my code to the original and include an if/else in the _setup_datasets function for non-zip files:

import os
from torchtext.utils import extract_archive, download_from_url

os.mkdir('.data')
download_from_url('http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/validation.tar.gz')
# '.data/validation.tar.gz'

download_from_url('https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip')
# '.data/wikitext-2-v1.zip'

extract_archive('.data/validation.tar.gz')
# ['.data/val.de', '.data/val.en']

extract_archive('.data/wikitext-2-v1.zip')
# ['wikitext-2/', 'wikitext-2/wiki.test.tokens', 'wikitext-2/wiki.valid.tokens', 'wikitext-2/wiki.train.tokens']

@cpuhrsch @zhangguanheng66

@cpuhrsch (Contributor)

cpuhrsch commented Feb 5, 2020

@anmolsjoshi - I think it makes sense to fix this separately and before merging this.

@anmolsjoshi (Contributor, Author)

@zhangguanheng66 @cpuhrsch - I'll push a branch up later today fixing this tar/zip issue. And we can move forward with the WMT dataset after.

Thanks for your review!

@anmolsjoshi (Contributor, Author)

@zhangguanheng66 @cpuhrsch - a quick question before I proceed: are the filenames returned for zip or for tar correct, i.e. should the root folder be prepended to the path?

I think the tar behavior (root prepended) might be better, as it is consistent with the download_from_url function

@cpuhrsch (Contributor)

cpuhrsch commented Feb 6, 2020

@anmolsjoshi - I think we should always return the full path, because it gives more flexibility and the user knows exactly what's going on. So, if prepending root gives us that, that's what we should do.

@anmolsjoshi (Contributor, Author)

anmolsjoshi commented Feb 24, 2020

Closing in favor of #700

@anmolsjoshi anmolsjoshi reopened this Feb 24, 2020
@anmolsjoshi (Contributor, Author)

@zhangguanheng66 @cpuhrsch I have incorporated changes requested in an earlier review and made some additional changes. Here is a summary:

  • Removed all code related to the extract_archive fix and moved it to [Bug Fixing][BC Breaking] Unify tar and zip handling with extract_archive #692
  • Added a check on data_select for WMT News Crawl; only train is allowed, since the other splits are not provided
  • Added tests that load the dataset and check that errors are raised for incorrect data_select
  • Corrected spelling errors, mainly tupel to tuple

@zhangguanheng66 (Contributor) left a comment

Regarding your earlier question: I just ran the dataset locally, and it turns out most of the time is spent downloading the file. Other than that, I don't think it's unreasonably slow.

One question regarding WMTNewsCrawl: its size is only about half of WikiText103's. Is this the largest Crawl dataset we could have?

@zhangguanheng66 (Contributor)

It seems there is an even larger Wikitext dataset, like this one:
https://dl.fbaipublicfiles.com/fairseq/data/wikipedia.en_filtered.gz
Any thoughts?

@anmolsjoshi (Contributor, Author)

Reading from the website, 2009 is the largest dataset.

  • From Europarl (403MB) md5 sha1
  • From the News Commentary corpus (41MB) md5 sha1
  • From the News Crawl corpus (2007 only) (1.1 GB) md5
  • From the News Crawl corpus (2008 only) (3.4 GB) md5
  • From the News Crawl corpus (2009 only) (3.7 GB) md5
  • From the News Crawl corpus (2010 only) (1.4 GB) md5
  • From the News Crawl corpus (2011 only) (229 MB) md5 sha1

The reason I picked 2011 was due to your comment on #645

@zhangguanheng66 (Contributor)

> Reading from the website, 2009 is the largest dataset.
>
> • From Europarl (403MB) md5 sha1
> • From the News Commentary corpus (41MB) md5 sha1
> • From the News Crawl corpus (2007 only) (1.1 GB) md5
> • From the News Crawl corpus (2008 only) (3.4 GB) md5
> • From the News Crawl corpus (2009 only) (3.7 GB) md5
> • From the News Crawl corpus (2010 only) (1.4 GB) md5
> • From the News Crawl corpus (2011 only) (229 MB) md5 sha1
>
> The reason I picked 2011 was due to your comment on #645

I think it could be interesting to include at least one more, larger Crawl corpus so users have the option to try both.

@anmolsjoshi (Contributor, Author)

Is the idea to have multiple functions for different years' datasets or provide an argument for the year?

@zhangguanheng66 (Contributor)

> Is the idea to have multiple functions for different years' datasets or provide an argument for the year?

Correct me if I'm wrong, @anmolsjoshi @cpuhrsch, but I don't think there are significant differences between the years. We could just provide different Crawl corpora of different sizes; users may want to train their models on different datasets depending on their memory.

@anmolsjoshi (Contributor, Author)

Should I update the current dataset to 2009?

Which other datasets would you want to provide?

@zhangguanheng66 (Contributor)

Maybe we could add one more argument (as you did for language) so users can explicitly choose the corpus they like. And in the docs, we clearly mark the number of tokens and the memory size for each dataset.

I'm using the 2010 English one, 1.4 GB in total, for my BERT model. The pipeline you built is flexible enough to switch between years/languages. Thanks a lot.

@anmolsjoshi (Contributor, Author)

anmolsjoshi commented Feb 26, 2020

@zhangguanheng66 thanks for the comments.

I've added an option where users can pass the year, plus a table in the docstrings with details about the News Crawl datasets by year.

Let me know what you think.

What are your thoughts on adding a check whether the provided year and language are valid?
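Such a validity check might be sketched as follows; the year/size registry mirrors the list quoted earlier in the thread, while the language tuple and the function name are purely illustrative:

```python
# Hypothetical registry; years and sizes are taken from the corpus list above.
NEWS_CRAWL_YEARS = {2007: "1.1 GB", 2008: "3.4 GB", 2009: "3.7 GB",
                    2010: "1.4 GB", 2011: "229 MB"}
SUPPORTED_LANGUAGES = ("en", "de")  # illustrative subset, not the real list

def validate_news_crawl_args(year, language):
    """Fail early with a clear message when year/language are unavailable."""
    if year not in NEWS_CRAWL_YEARS:
        raise ValueError(
            f"year must be one of {sorted(NEWS_CRAWL_YEARS)}, got {year}")
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"language must be one of {SUPPORTED_LANGUAGES}, got {language!r}")

validate_news_crawl_args(2010, "en")  # passes silently
```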

@anmolsjoshi (Contributor, Author)

@zhangguanheng66 I saw the discussion in #691 and #690, and the code in #696. Is there value in decoupling vocab and LanguageModelingDataset as well?

@zhangguanheng66 (Contributor)

@anmolsjoshi We want to decouple the vocab object from the dataset but are not yet sure of the design. I will work on some cases and pull you in for a look.

@anmolsjoshi (Contributor, Author)

Thanks! Let me know if any other changes are needed on this PR!

@@ -73,7 +74,7 @@ def _get_datafile_path(key, extracted_files):

def _setup_datasets(dataset_name, tokenizer=get_tokenizer("basic_english"),
root='.data', vocab=None, removed_tokens=[],
data_select=('train', 'test', 'valid')):
data_select=('train', 'test', 'valid'), **kwargs):
Contributor

I'm not a fan of adding a generic **kwargs to this function, especially since it seems to serve the purpose of expanding data_select, which is already a well-defined parameter. This runs the risk of making the APIs inconsistent. We should think about how to extend data_select to support cases like this.

The obvious choice is to expand it by using the cross-product, but that would yield a lot of potential arguments. The other idea could be to have a NamedTuple object as a sort of argument object. Also, this seems to create only the training dataset, which means there is no predefined split; the distinction between training, validation, and test is arbitrary/undefined, so we might as well drop it. We could create a WMTNewsCrawlOptions namedtuple that is constructed from a Year and a Language and passed to data_select. I'm sure there are other options here as well.
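The NamedTuple idea could be sketched like this; WMTNewsCrawlOptions is the class proposed above, while resolve_corpus_name and its naming scheme are invented purely for illustration:

```python
from typing import NamedTuple

class WMTNewsCrawlOptions(NamedTuple):
    """Argument object bundling dataset-specific knobs, instead of **kwargs."""
    year: int
    language: str

def resolve_corpus_name(selection):
    """Turn a plain split name or an options object into a concrete corpus id."""
    if isinstance(selection, WMTNewsCrawlOptions):
        # Hypothetical naming scheme, for illustration only.
        return f"news.{selection.year}.{selection.language}"
    return selection

resolve_corpus_name(WMTNewsCrawlOptions(year=2011, language="en"))
# 'news.2011.en'
```

Because NamedTuple fields are typed and positional, the options object keeps the data_select signature stable while still letting each dataset define its own knobs.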

@facebook-github-bot (Contributor)

Hi @anmolsjoshi!

Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but we do not have a signature on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

Development

Successfully merging this pull request may close these issues.

Adding Google 1 Billion Benchmark Dataset to PyTorch dataset
4 participants