Added WMT News Crawl Dataset for language modeling #688
base: main
Conversation
@zhangguanheng66 I added the 2011 dataset per your instructions, but I was unable to find the validation and test sets. What are your thoughts on this?
Thanks @anmolsjoshi for the contribution. As you may know, we introduced a new dataset abstraction in the 0.5.0 release (link), which is more compatible with torch.utils.data. We plan to retire all the legacy code after we rewrite it. For the WLM datasets, we have new datasets in the experimental folder (link). Could you follow the new abstraction for the News Crawl dataset?
Regarding the valid/test datasets, it's fine to have only a train dataset if the original data comes all together in one file. Users can later split the data into train/valid/test themselves.
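A minimal sketch of the user-side split suggested above (the fractions, seed, and function name are assumptions for illustration, not part of the PR):

```python
import random

def split_dataset(lines, valid_frac=0.05, test_frac=0.05, seed=0):
    # Shuffle a copy so the caller's list is untouched, then slice
    # into valid/test/train portions by the given fractions.
    lines = list(lines)
    random.Random(seed).shuffle(lines)
    n_valid = int(len(lines) * valid_frac)
    n_test = int(len(lines) * test_frac)
    valid = lines[:n_valid]
    test = lines[n_valid:n_valid + n_test]
    train = lines[n_valid + n_test:]
    return train, valid, test

train, valid, test = split_dataset([f"sentence {i}" for i in range(100)])
```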
@zhangguanheng66 I noticed an issue while writing this PR: zip and tar files are handled differently. Assuming both .zip and .tar files are stored in .data, the filenames returned by extract_archive differ. A .zip outputs files such as "en-us/train.txt", while a .tar outputs files such as ".data/en-us/train.txt". This causes an inconsistency in the _setup_datasets function, since root is joined to the returned filenames. Is this by design? In my opinion, this should be corrected.
If you feel there is a mistake in the download function, you could open a separate issue and/or PR to fix it. It's better to do it separately for a good record.
@zhangguanheng66 as a note, this PR downloads files correctly and sets up the dataset just fine. However, it takes a very long time to create the dataset, given that the News Crawl dataset has over 2 million rows; tokenizing the entire dataset takes a while.
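One common way to mitigate the slow setup described above is to tokenize lazily instead of materializing every tokenized row up front. This is only a sketch of that idea, not the PR's code; the function name, path, and tokenizer are placeholders:

```python
def lazy_tokenize(path, tokenizer):
    # Yield tokenized lines one at a time, so the ~2 million rows
    # are never all held in memory as token lists at once.
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield tokenizer(line.strip())
```

The trade-off is that iteration re-tokenizes on every pass, which suits one-shot preprocessing better than repeated epochs.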
Thanks for the contribution. I understand; it's a huge dataset. How do you feel about the new abstraction compared with the old pattern? Any feedback?
@zhangguanheng66 this PR should be good to go. Let me know if you have any comments!
@zhangguanheng66 any thoughts on overloading the __iter__ method for language modeling?
Ideally, the |
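For context on the __iter__ question above, a toy sketch of what an iterable language-modeling dataset could look like (the class name, bptt chunking, and integer token ids are all hypothetical):

```python
class NewsCrawlLM:
    """Toy iterable dataset: yields fixed-length chunks of token ids."""

    def __init__(self, token_ids, bptt=3):
        self.token_ids = token_ids  # flat stream of token ids
        self.bptt = bptt            # tokens per yielded chunk

    def __iter__(self):
        # Step through the token stream in bptt-sized windows,
        # dropping any trailing remainder shorter than bptt.
        for i in range(0, len(self.token_ids) - self.bptt + 1, self.bptt):
            yield self.token_ids[i:i + self.bptt]

chunks = list(NewsCrawlLM(list(range(9))))
```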
@@ -90,6 +87,12 @@ def _setup_datasets(dataset_name, tokenizer=get_tokenizer("basic_english"),
     dataset_tar = download_from_url(URLS[dataset_name], root=root)
     extracted_files = [os.path.join(root, d) for d in extract_archive(dataset_tar)]

+    if dataset_name == "WMTNewsCrawl":
+        data_select = ('train',)
Why is there a hardcoded data_select?
There are no validation and test sets provided. Would you prefer I added it as a default argument?
If this is the case you should error out and say it doesn't exist and have the user explicitly pass in this option. There are a few arguments to be made in favor of this:
- the number of expected return values depends on this option.
- the user might not be aware of the fact that there is no validation and no test dataset.
Agreed! I will account for that and push changes later today
The main reason for the changes is the code below. I can revert all my code to the original and include an if/else in the _setup_datasets function for non-zip files:

import os
from torchtext.utils import extract_archive, download_from_url

os.mkdir('.data')

download_from_url('http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/validation.tar.gz')
# '.data/validation.tar.gz'

download_from_url('https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip')
# '.data/wikitext-2-v1.zip'

extract_archive('.data/validation.tar.gz')
# ['.data/val.de', '.data/val.en']

extract_archive('.data/wikitext-2-v1.zip')
# ['wikitext-2/', 'wikitext-2/wiki.test.tokens', 'wikitext-2/wiki.valid.tokens', 'wikitext-2/wiki.train.tokens']
@anmolsjoshi - I think it makes sense to fix this separately, and before merging this.
@zhangguanheng66 @cpuhrsch - I'll push a branch up later today fixing this tar/zip issue, and we can move forward with the WMT dataset after. Thanks for your review!
@zhangguanheng66 @cpuhrsch - quick question before I proceed: are the filenames returned from zip or tar correct, i.e. should the root folder be prepended to the path? I think the filenames from zip (root prepended) might be better, as that is consistent with the download_from_url function.
@anmolsjoshi - I think we should always return the full path, because it gives more flexibility and the user knows exactly what's going on. So, if prepending root gives us that, that's what we should do.
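A sketch of the "always return the full path" behavior agreed on above (the helper name is hypothetical; it prepends root only when the path does not already start with it, so zip-style and tar-style outputs end up identical):

```python
import os

def to_full_paths(filenames, root='.data'):
    # Normalize each extracted filename so that zip-style
    # ("en-us/train.txt") and tar-style (".data/en-us/train.txt")
    # outputs both come back rooted under `root`.
    full = []
    for name in filenames:
        norm = os.path.normpath(name)
        if not norm.startswith(os.path.normpath(root) + os.sep):
            norm = os.path.normpath(os.path.join(root, norm))
        full.append(norm)
    return full

to_full_paths(["en-us/train.txt", ".data/en-us/train.txt"])
```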
@zhangguanheng66 @cpuhrsch I have incorporated the changes requested in an earlier review and made some additional changes. Here is a summary:
Regarding your earlier question, I just ran the dataset locally, and it turns out most of the time is spent downloading the file. Other than that, I don't think it's unreasonably slow.
One question regarding WMTNewsCrawl: its size is only about half of WikiText103's. Is this the largest Crawl dataset we could have?
It seems that there is an even larger dataset, like this one,
Reading from the website, 2009 is the largest dataset.
I think it could be interesting to include at least one more, larger Crawl corpus so users have the option to try both.
Is the idea to have multiple functions for different years' datasets, or to provide an argument for the year?
Correct me if I'm wrong @anmolsjoshi @cpuhrsch, but I don't think there are significant differences between years. We could just provide different Crawl corpora of different sizes. Users may want to train models on different datasets depending on their memory.
Should I update the current dataset to 2009? Which other datasets would you want to provide?
Maybe we could add one more argument (as you did for language) so users can explicitly choose the one they like. And in the docs, we clearly mark the number of tokens and the memory size for the corresponding dataset. I'm using the 2010 English one, 1.4 GB in total, for my BERT model. The pipeline you built is flexible enough to switch between years/languages. Thanks a lot.
@zhangguanheng66 thanks for the comments. I've added an option where users can pass the year, and a table in the docstrings with details about the News Crawl datasets by year. Let me know what you think. What are your thoughts on adding a check that the provided year and language are valid?
@zhangguanheng66 I saw the discussion in #691 and #690, and the code in #696. Is there value in decoupling vocab and LanguageModelingDataset as well?
@anmolsjoshi We want to decouple the vocab object from the dataset but are not yet sure about the design. I will work on some cases and pull you guys in for a look.
Thanks! Let me know if any other changes are needed on this PR!
@@ -73,7 +74,7 @@ def _get_datafile_path(key, extracted_files):

 def _setup_datasets(dataset_name, tokenizer=get_tokenizer("basic_english"),
                     root='.data', vocab=None, removed_tokens=[],
-                    data_select=('train', 'test', 'valid')):
+                    data_select=('train', 'test', 'valid'), **kwargs):
I'm not a fan of adding a generic **kwargs to this function, especially since it seems to serve the purpose of expanding data_select, which is already a well-defined parameter. This runs the risk of making the APIs inconsistent. We should think about how to extend data_select to support cases like this.
The obvious choice is to expand it using the cross-product, but that would yield a lot of potential arguments. Another idea is a NamedTuple object as a sort of argument object. Also, this seems to only create the training dataset. That means there is no predefined split, so the distinction between training, validation, and test is arbitrary / not defined, and we might as well drop it. We could create a WMTNewsCrawlOptions namedtuple that is constructed from a Year and a Language and passed to data_select. I'm sure there are some other options here as well.
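The argument-object idea above might look like this (a sketch only; the field names follow the review comment, and nothing here is actual torchtext API):

```python
from typing import NamedTuple

class WMTNewsCrawlOptions(NamedTuple):
    # Bundle the dataset-specific knobs in one typed object
    # instead of passing a loose **kwargs.
    year: int
    language: str

opts = WMTNewsCrawlOptions(year=2010, language="en")
```

Being a NamedTuple, the options object is immutable, comparable, and unpackable, which makes it easy to thread through existing tuple-oriented code like data_select.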
Hi @anmolsjoshi! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but we do not have a signature on file. In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
Fixes #645