Added WMT News Crawl Dataset for language modeling #688
Open: anmolsjoshi wants to merge 22 commits into pytorch:main from anmolsjoshi:feature/news_crawl
+116
−7
4f5a3d3 Added WMT News Crawl Dataset for language modeling (anmolsjoshi)
40a0b3e Removed WMT from torchtext.datasets (anmolsjoshi)
b498950 Fixed error related to tar files. (anmolsjoshi)
1a016d5 Added WMT to experimental dataset (anmolsjoshi)
e113246 Working tests and text classification dataset (anmolsjoshi)
71eeaf3 Updated docstrings (anmolsjoshi)
4c807a9 Added line to join root with extracted files (anmolsjoshi)
8b2e372 Merge branch 'master' of https://github.com/pytorch/text into feature… (anmolsjoshi)
960ab99 Incorporated comments (anmolsjoshi)
3b39db2 Reverted files to master version (anmolsjoshi)
061fff1 Revered test files to master version (anmolsjoshi)
04f497c spacing (anmolsjoshi)
d259c7d Fixed arguments (anmolsjoshi)
a7d720c fixed flake8 errors (anmolsjoshi)
e6308d1 Added test for WMTNewsCrawl (anmolsjoshi)
27fb2d3 fixed flake8 issues (anmolsjoshi)
5be9f51 Added a test for incorrect option for data_select (anmolsjoshi)
62b20e9 Merge branch 'master' into feature/news_crawl (anmolsjoshi)
de521be Added option for year for WMT, included information by year (anmolsjoshi)
c228e80 Updated Example code (anmolsjoshi)
b5e8a02 Added validation for language and year with tests (anmolsjoshi)
2a2cbea Merge branch 'master' into feature/news_crawl (anmolsjoshi)
@@ -1,8 +1,9 @@
-from .language_modeling import LanguageModelingDataset, WikiText2, WikiText103, PennTreebank  # NOQA
+from .language_modeling import LanguageModelingDataset, WikiText2, WikiText103, PennTreebank, WMTNewsCrawl  # NOQA
 from .text_classification import IMDB
 
 __all__ = ['LanguageModelingDataset',
            'WikiText2',
            'WikiText103',
            'PennTreebank',
+           'WMTNewsCrawl',
            'IMDB']
I'm not a fan of adding a generic **kwargs to this function, especially since it seems to serve the purpose of expanding data_select, which is already a well-defined parameter. This risks making the APIs inconsistent. We should think about how to extend data_select to support cases like this.
The obvious choice is to expand it using the cross-product of years and languages, but that would yield a lot of potential arguments. Another idea is to have a NamedTuple object as a sort of argument object. Also, this seems to create only the training dataset. That means there is no predefined split, so the distinction between training, validation and test is arbitrary / not defined, and we might as well drop it. We could create a WMTNewsCrawlOptions namedtuple that is constructed from a Year and a Language and passed to data_select. I'm sure there are other options here as well.
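To make the namedtuple suggestion concrete, here is a rough sketch of what such an argument object could look like. Everything in it is an assumption for discussion: the field names year and language, the helper check_data_select, and the valid_years / valid_languages parameters are hypothetical and are not part of this PR or of torchtext.

```python
from typing import NamedTuple


class WMTNewsCrawlOptions(NamedTuple):
    """Hypothetical argument object bundling the choices for data_select."""
    year: int
    language: str


def check_data_select(data_select, valid_years, valid_languages):
    """Validate a WMTNewsCrawlOptions value (illustrative helper, not torchtext API).

    valid_years and valid_languages are supplied by the caller; the sets
    actually supported by the dataset are not assumed here.
    """
    if not isinstance(data_select, WMTNewsCrawlOptions):
        raise TypeError("data_select must be a WMTNewsCrawlOptions instance")
    if data_select.year not in valid_years:
        raise ValueError("unsupported year: {}".format(data_select.year))
    if data_select.language not in valid_languages:
        raise ValueError("unsupported language: {}".format(data_select.language))
    return data_select


# Placeholder values, chosen only to exercise the helper.
options = WMTNewsCrawlOptions(year=2010, language="en")
checked = check_data_select(options, valid_years={2010}, valid_languages={"en"})
```

With something like this, data_select stays a single well-defined parameter, and there is no need to enumerate every year/language combination or to accept **kwargs.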