Migrate datasets to build on top of torchdata datapipes #1494
This sounds great! Logistically, how do we want to organize the PRs? Should we have a branch that adds torchdata as a dependency, then branch from it and target it with our PRs?
@erip That's a good point. Right now we have torchdata as an optional dependency since it is not yet available on PyPI (torchdata is in the prototype phase until the next PyTorch release). I think going forward we still want to keep it as an optional dependency until torchdata is released as beta. What do you think @parmeet @abhinavarora?
Great, so individual PRs instead of one big one -- got it! I'll pick a couple soon and start digging.
I guess it will become necessary to use some conditional imports. Here is the update I applied recently to avoid that: pytorch/audio#2127
I'll keep a moving list here. To start, I'll take...
Hey @erip, thank you for picking the datasets :).
Since we already have a backlog checklist in the issue, you may constrain this list to the datasets you pick up, and check-mark them directly in the backlog once each PR lands.
You may skip the mention for this dataset. I will check-mark it once I land the PR :)
AmazonReviewPolarity [WIP] #1490. Note that this PR should also resolve issues with the existing CI dataset tests.
Actually, in this case the problem goes beyond unit testing. The package builds (conda, wheels, docs) also rely on the availability of torchdata, because it is imported inside the modules. As of now, the import is conditioned on availability. For unit testing, we are installing torchdata from source. Perhaps we could do the same for the package builds?
Isn't torchdata an optional runtime dependency? I looked into the build process and CI configuration, but I do not see any mention of torchdata, nor any build/packaging logic that changes based on its presence.
I plan to pick up CC-100 next.
Yes, it is, as long as users do not use the migrated datasets. But they would need to install it to use them, since the underlying implementation of the migrated datasets does not fall back to the previous (non-datapipes-based) implementation if the torchdata package is not found. Edit: We are going to make torchdata a hard dependency with the next release. I think the question is rather what we should be doing during this migration phase. To be on the safe side, I think keeping it optional makes sense, so that users working directly from source/nightly do not break their workflow unless they depend on the migrated datasets (in which case they would need to install it anyway).
If we remove the conditional import, it breaks the CI build/packaging-related tests. I guess we would need to install torchdata in the appropriate locations for them to pass, like we are doing for unit testing here.
I second this. Before our release, TorchData is theoretically an unstable library, which means our API could change a lot. Nightly users of TorchText could then hit errors during the installation phase. That should not be expected, as some users won't use anything from TorchData.
Okay, this is what I wanted to verify. As long as it is the intended move (and is going to be mentioned in the README), it sounds good to me.
nit: In my understanding, the packaging job should not be importing torchtext, so I am not entirely convinced that it's required at build time (especially since torchdata is a pure-Python package). Maybe it failed at the smoke-test step that happens right after the packaging step in the same CI job.
Sorry for the confusion. Yes, you are right that it is not needed for packaging, but rather for testing in the same CI job (basically, importing torchtext would fail because it would implicitly import datasets that fail due to the missing torchdata package).
Thanks @abhinavarora and @Nayef211 for your thoughts on the BC issue. Honestly, I am of a similar impression and cannot immediately think of any obvious BC issues (which is why I wrote "unexpected" :) ). One of the things the new API doesn't support is … @Nayef211, regarding your question:
I don't think we want to do that. One way to keep both implementations is to switch to the new one through some flag. By default, users would still use the existing implementation and, along with that, get a user warning with instructions for switching to the new implementation. This surely adds some maintenance overhead, which @abhinavarora has already raised concerns about.
I'd like to take a stab at the IMDB dataset
Replying to the comment from #1513:
If we use a flag, I recommend making it part of the constructor (it should not be a big deal if the interface is the same), not a global flag. torchaudio has a global flag for the I/O backend, and it has not been a good maintenance experience. It turned the codebase stateful, and in a large code base it becomes impossible to tell which downstream project sets/expects which flag state.
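A constructor-level flag of the kind suggested here could be sketched as below. The class, flag, and method names are hypothetical stand-ins, not the actual torchtext API.

```python
import warnings


class IMDB:  # hypothetical stand-in for a torchtext dataset class
    def __init__(self, split="train", use_datapipes=False):
        self.split = split
        self.use_datapipes = use_datapipes
        if not use_datapipes:
            # Nudge users toward the new implementation without breaking them.
            warnings.warn(
                "Using the legacy implementation; pass use_datapipes=True "
                "to opt into the torchdata-based datapipes version.",
                UserWarning,
            )

    def __iter__(self):
        impl = self._iter_datapipes if self.use_datapipes else self._iter_legacy
        return impl()

    def _iter_legacy(self):
        yield from ()  # placeholder for the existing implementation

    def _iter_datapipes(self):
        yield from ()  # placeholder for the datapipes-based implementation
```

Because the switch lives on each instance rather than in a module-level global, two downstream projects can mix old and new behavior without stepping on each other, which avoids the stateful-codebase problem described above.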
Quick update:
In the meantime, I'll take a stab at Multi30k -- it looks quite straightforward. I can't seem to find CC100 and SST2 impls -- are these new datasets for torchtext?
Done :)
Could you please help provide more details as to where this is required?
Great, thanks!
SST-2 is in the experimental folder. @Nayef211 is working on it and is currently blocked on some internal fbcode dependency. I am looking at CC100. It is already added as a notebook example here. Unfortunately, the host seems to be quite unresponsive, and hence I am currently blocked on adding this one. I am trying to contact some internal folks to figure out a path forward.
Yes, it's here. We extract contents from a tarball, and inside the tarball are other tarballs. This requires something roughly like:

```python
# outer tarball yields inner tarballs: [1.tgz, 2.tgz, ..., n.tgz]
inner_tars = outer_tar.read_from_tar()
# Yuck, an IterDataPipe of IterDataPipes... flatmap to the rescue
# [1_1.txt, 1_2.txt, 2_1.txt, 2_2.txt, 3_1.txt, ...]
inner_inner_files = inner_tars.flatmap(lambda inner_tar: FileOpener(inner_tar).read_from_tar())
```
I see, but wouldn't the tar datapipe already do this recursively and return all the inner files?
@parmeet oh, I'm not sure... I didn't think so, but I guess I hadn't tried yet so I don't know. I'll give it a shot tomorrow. |
@parmeet looking at the implementation of the datapipe associated with the
Awesome! Thanks so much @erip for taking this forward. I guess these are the only two datasets left (apart from CC-100 :) ). Please feel free to post here if you hit any blockers :).
Summary:

## Summary
- Updated datamodule to work with any torchtext dataset
- No longer checking whether the dataset is an instance of `SST2Dataset`
- Updated the `DocClassificationDataModuleConf` dataclass to take in user-provided `columns` and `label_column` fields, since different datasets have different column orderings
- Updated tests to use patching for testing with mocked datasets, similar to what is done in OSS for the [AmazonReviewPolarity dataset test](pytorch/text#1532)
- Removed the dataset test from torchrecipe, since the torchtext repo unit tests provide adequate coverage

## Followup Items
- [ ] Update the instantiation call for datasets to work with the functional API as opposed to the class API once the SST2 dataset has been migrated out of experimental ([reference GH issue](pytorch/text#1494))

Reviewed By: abhinavarora, mthrok, parmeet

Differential Revision: D33775443

fbshipit-source-id: 1e6545949808ec5bd0e13cf3f9e7aaea08d68a59
This issue is officially closed now. For any improvements, bug fixes, CI dependencies, or new datasets, we can follow up in separate threads. Thank you everyone for your contributions in realizing this issue sooner than I expected :). cc: @erip, @Nayef211, @abhinavarora, @ejguan, @VirgileHlav
Thanks for leading the efforts on this migration @parmeet! It was great to see how fast we worked together to get this done 😄 |
🚀 Feature
Motivation
https://github.com/pytorch/data#why-composable-data-loading
User experience: TorchData datasets enable a new functional API, auto-sharding, and snapshotting support out of the box. They also enable standard flow control like batching, collation, shuffling, bucketing, and mapping/transformation using user-defined functions (UDFs) and transforms.
Maintenance: By relying on TorchData, we no longer have to maintain low-level functionality like downloading, extracting, caching, file/stream parsing, etc.
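To give a rough feel for the composable flow-control style mentioned above, here is a plain-Python sketch; the `shuffle` and `batch` helpers are illustrative stand-ins for what torchdata's datapipes provide, not its actual API.

```python
import random


def shuffle(pipe, seed=0):
    # Buffer-and-shuffle stand-in for a shuffling datapipe
    # (seeded here so the example is reproducible).
    items = list(pipe)
    random.Random(seed).shuffle(items)
    yield from items


def batch(pipe, batch_size):
    # Group a stream into fixed-size batches; the last batch may be short.
    buf = []
    for item in pipe:
        buf.append(item)
        if len(buf) == batch_size:
            yield buf
            buf = []
    if buf:
        yield buf


# Stages compose by wrapping one another, like datapipes chain together.
batches = list(batch(shuffle(range(10)), batch_size=4))
# 10 items -> batches of sizes 4, 4, 2 (in shuffled order)
```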
Reference
Examples: https://github.com/facebookexternal/torchdata/tree/main/examples/text
TorchData: https://github.com/facebookexternal/torchdata
Backlog of datasets
Contributing
Please leave a message below if you plan to work on particular dataset(s) to avoid duplication of efforts. Also please link to the corresponding PRs.
cc: @Nayef211 , @abhinavarora , @erip , @ejguan , @VitalyFedyunin