Revamp TorchText Dataset Testing Strategy #1493
Comments
I plan to pick up SQuAD1, SQuAD2, and SST2 tests next!
I have pending feature branches for
I will give Multi30K a try.
I can pick up the IWSLTs.
Planning on picking up YahooAnswers, SogouNews, and PennTreebank next!
Planning on picking up cc-100.
Just a quick update: this issue can be closed once #1608 is merged in. All other dataset testing is complete.
Thanks @parmeet, @abhinavarora, @erip, and @VirgileHlav for all your help with designing, implementing, and iterating on the mock dataset tests. I'm going to go ahead and close this now that all tasks within the backlog are complete!
🚀 Feature
Revamp our dataset testing strategy to reduce the amount of time spent waiting on tests to complete before merging PRs in torchtext.
Motivation
TorchText dataset testing currently relies on downloading and caching the datasets daily and then running CircleCI tests on the cached data. This can be slow and unreliable for the first PR that kicks off the dataset download and caching. In addition, dataset extraction can be time consuming for some of the larger datasets within torchtext, and this extraction happens every time the dataset tests run on a PR. For these reasons, CircleCI tests can take up to an hour to run for each PR, whereas the vision/audio tests run in mere minutes. We want to revamp our dataset testing strategy in order to reduce the amount of time we spend waiting on tests to complete before merging PRs in torchtext.
Pitch
We need to update the legacy dataset tests within torchtext. Currently we test for things including the MD5 hash of the downloaded dataset and the number of lines in the dataset.
Going forward it doesn't make sense to test the MD5 hash or the number of lines in the dataset. Instead, we should test each dataset against small, locally mocked data so that the tests no longer depend on downloading and extracting the full datasets.
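To make the direction concrete, here is a minimal sketch of a content-based mocked test. The `load_samples` helper and the CSV layout are hypothetical stand-ins, not the actual torchtext dataset or test harness; the point is only that assertions target sample content rather than hashes or line counts.

```python
import csv
import os
import tempfile
import unittest


def load_samples(root):
    """Placeholder loader standing in for the torchtext dataset under test."""
    with open(os.path.join(root, "train.csv"), encoding="utf-8") as f:
        return [(label, text) for label, text in csv.reader(f)]


class MockedDatasetTest(unittest.TestCase):
    def test_samples_match_mocked_data(self):
        with tempfile.TemporaryDirectory() as root:
            # Write a tiny mocked file in the on-disk format the real dataset expects.
            expected = [("1", "mocked headline one"), ("2", "mocked headline two")]
            with open(os.path.join(root, "train.csv"), "w", encoding="utf-8", newline="") as f:
                csv.writer(f).writerows(expected)

            # Assert on sample content rather than on MD5 hashes or line counts.
            self.assertEqual(load_samples(root), expected)


if __name__ == "__main__":
    unittest.main()
```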
Backlog of Dataset Tests
Contributing
We have already implemented a dataset test for AmazonReviewPolarity (#1532) as an example to follow when implementing future dataset tests. Please leave a message below if you plan to work on a particular dataset test to avoid duplication of effort, and please link to the corresponding PRs.
Follow-Up Items
Encode strings to utf8 before writing to file when creating mocked data (see Multi30k mocked testing #1554 (comment))
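As a reference for future mocked datasets, a small sketch of what this looks like in plain Python; the file names and sample strings here are illustrative, not the exact ones used in the Multi30k test.

```python
import os
import tempfile

# Open mocked data files with an explicit utf-8 encoding (or encode to bytes
# manually) so non-ASCII samples round-trip correctly on every platform.
root = tempfile.mkdtemp()
mocked_lines = ["ein Beispielsatz", "un exemple de phrase"]

with open(os.path.join(root, "train.de"), "w", encoding="utf-8") as f:
    for line in mocked_lines:
        f.write(line + "\n")

# Equivalent byte-level form:
with open(os.path.join(root, "train.fr"), "wb") as f:
    for line in mocked_lines:
        f.write((line + "\n").encode("utf-8"))
```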
Additional Context
We explored how other domains implement testing for datasets and summarize their approaches below. We will implement our new testing strategy by taking inspiration from TorchAudio and TorchVision.
Possible Approaches
TorchAudio Approach
TestBaseMixin
andPytorchTestCase
(link)TestBaseMixin
base class provide consistent way to define device/dtype/backend aware TestCaseget_mock_dataset()
method which is responsible for creating the mocked data and saving it to a file in a temp dir (link)TorchVision Approach
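If torchtext adopts this pattern, a `get_mock_dataset()`-style helper could look roughly like the sketch below. The file name and sample format are assumptions for illustration, not TorchAudio's actual implementation.

```python
import os


def get_mock_dataset(root_dir):
    """Illustrative helper in the spirit of TorchAudio's pattern: write mocked
    samples to disk and return the in-memory copies for the test to compare
    against whatever the dataset under test yields."""
    mocked_samples = [
        ("pos", "this mocked review is great"),
        ("neg", "this mocked review is terrible"),
    ]
    with open(os.path.join(root_dir, "reviews.tsv"), "w", encoding="utf-8") as f:
        for label, text in mocked_samples:
            f.write(f"{label}\t{text}\n")
    return mocked_samples
```

A test would call this once per temp directory and then assert that the dataset constructed from `root_dir` yields exactly the returned samples.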
TorchVision Approach
- All dataset tests live in a single `test_datasets.py` file. This file is really long (~1300 lines) and a little hard to read as opposed to separating the tests for each dataset into its own file
- `CocoDetectionTestCase` for the COCO dataset extends the `DatasetTestCase` base class (link)
- `DatasetTestCase` is the abstract base class for all dataset test cases and expects child classes to overwrite class attributes such as `DATASET_CLASS` and `FEATURE_TYPES` (link); see the sketch below
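For illustration, a stripped-down version of this pattern applied to text could look like the sketch below. Only the `DATASET_CLASS` / `FEATURE_TYPES` attribute names mirror TorchVision; the base-class body and the fake dataset are simplified assumptions, not TorchVision's actual code.

```python
import unittest


class DatasetTestCase(unittest.TestCase):
    """Simplified stand-in for TorchVision's abstract base class."""

    DATASET_CLASS = None   # child classes point this at the dataset under test
    FEATURE_TYPES = None   # child classes list the expected per-sample types

    def test_feature_types(self):
        if type(self) is DatasetTestCase:
            self.skipTest("abstract base class")
        # Every sample yielded by the dataset should match the declared types.
        for sample in self.DATASET_CLASS():
            for value, expected_type in zip(sample, self.FEATURE_TYPES):
                self.assertIsInstance(value, expected_type)


class FakeTextClassificationDataset:
    """Hypothetical dataset used only to keep the sketch self-contained."""

    def __iter__(self):
        yield 1, "a mocked sentence"
        yield 2, "another mocked sentence"


class FakeTextClassificationTestCase(DatasetTestCase):
    DATASET_CLASS = FakeTextClassificationDataset
    FEATURE_TYPES = (int, str)


if __name__ == "__main__":
    unittest.main()
```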