Laion5b dataset example #1017
Summary: `fsspec` doesn't support python 3.11 for now. In order to enable TorchData conda release for python 3.11, I have to remove `fsspec` from `meta.yml` to prevent installation during building process. See: https://anaconda.org/anaconda/fsspec/files You can find failing release workflow in https://github.com/pytorch/data/actions/runs/4183212525 Pull Request resolved: pytorch#1015 Reviewed By: atalman Differential Revision: D43307022 Pulled By: ejguan fbshipit-source-id: 2da8022705d503ff8a75ec3e5e8766d7d5592d82
Summary: Fix the mypy Error after `types_requests` is released to `2.28.11.13` Pull Request resolved: pytorch#1018 Reviewed By: NivekT Differential Revision: D43309767 Pulled By: ejguan fbshipit-source-id: 1990ec809156c961ebf32d6f3b16b67071cc1303
Summary: Pull Request resolved: pytorch#1019 Test Plan: Imported from OSS --- Ran this on devvm: ``` buck2 test mode/dev-nosan //caffe2/torch/fb/trainer/data_modules/tests:test_full_sync_data_module_data_loader_v2_compatibility -- --exact 'caffe2/torch/fb/trainer/data_modules/tests:test_full_sync_data_module_data_loader_v2_compatibility - test_full_sync_data_module_data_reading_checkpoint_read_beyond_data_length (caffe2.torch.fb.trainer.data_modules.tests.test_full_sync_data_module_data_loader_v2_compatibility.FullSyncDataModuleDataLoaderV2CompatibilityTest)' ``` Reviewed By: ejguan Differential Revision: D43312918 Pulled By: NivekT fbshipit-source-id: f4766c4aec1adfd23f2b5aadf4a26905ba8716c9
Summary: When a worker thread failed in PrototypeMultiProcessingReadingService, the exception was silently ignored and the thread hung. This PR fixes that by catching the exception and propagating it to the response queue, showing it to the user, and exiting the process instead of hanging. Pull Request resolved: pytorch#1003 Reviewed By: ejguan Differential Revision: D43327742 Pulled By: priyaramani fbshipit-source-id: 7a1f6bed2d3baab04912df2b8456dbb940c3e663
Overall LGTM with two minor comments.
This reverts commit 1e60010.
Summary: Pull Request resolved: pytorch#1022 Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D43367475 Pulled By: NivekT fbshipit-source-id: 3087326aab04efbc4e7dde3232302903d3e4eb9e
Summary: Pull Request resolved: pytorch#1028 Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D43370795 Pulled By: NivekT fbshipit-source-id: f052b80c0b51c65b8735a381e8d4d239be9e5dba
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
examples/vision/laion5b.py
Outdated
i = 0
for batch in laion2b_en():
Forgot to mention: can you please use DataLoader2 as well?
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
dp = laion2b_en()
rs = MultiProcessingReadingService(num_workers=4)
dl = DataLoader2(dp, reading_service=rs)
for batch in dl:
    ...
See #1034 (comment)
Summary: Fixes pytorch#1013 ## Changes - Simplify the control flow of prefetcher - Delay Exception raised from thread worker to main thread in `__iter__` - Stop prefetching whenever Exception is received - As long as `stop_iteration` is not turned on or `buffer` is not empty, continue yielding data from `__iter__`. - Add serialization test - Add `PinMemory` DataPipe - `is_replicable() -> False` to keep it in the main process - Add unit tests - Update `test_proto_multi_rs.py` to `test_mprs.py` Pull Request resolved: pytorch#1014 Reviewed By: NivekT Differential Revision: D43329696 Pulled By: ejguan fbshipit-source-id: da4326dbe2388f4e23b9a1a3a5c43da09d29185a
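The prefetcher behaviour listed in this commit (delay the worker's exception until the main thread iterates, stop prefetching once an exception arrives, and keep yielding while the buffer is not empty) can be illustrated with a small standalone sketch. This is not the TorchData implementation; `ThreadPrefetcher`, `buffer_size`, and the internal `state` dictionary are hypothetical names chosen for illustration, and only the standard library is used.

```python
import threading
from collections import deque


class ThreadPrefetcher:
    """Hypothetical sketch of a thread-based prefetcher (not TorchData's)."""

    def __init__(self, source, buffer_size=10):
        self.source = source
        self.buffer_size = buffer_size

    def __iter__(self):
        buffer = deque()
        cond = threading.Condition()
        state = {"done": False, "error": None, "stop": False}

        def worker():
            try:
                for item in self.source:
                    with cond:
                        # Block while the buffer is full, unless asked to stop.
                        cond.wait_for(
                            lambda: len(buffer) < self.buffer_size or state["stop"]
                        )
                        if state["stop"]:
                            return
                        buffer.append(item)
                        cond.notify_all()
            except Exception as exc:
                # Do not raise in the worker: hand the exception to the consumer.
                with cond:
                    state["error"] = exc
            finally:
                # Prefetching ends on exhaustion, on error, or on a stop request.
                with cond:
                    state["done"] = True
                    cond.notify_all()

        thread = threading.Thread(target=worker, daemon=True)
        thread.start()
        try:
            while True:
                with cond:
                    cond.wait_for(lambda: buffer or state["done"])
                    if buffer:
                        # Buffered items are yielded before any stored error.
                        item = buffer.popleft()
                        cond.notify_all()
                    elif state["error"] is not None:
                        raise state["error"]
                    else:
                        return
                yield item
        finally:
            # Runs on normal exit, on error, and when the consumer stops early.
            with cond:
                state["stop"] = True
                cond.notify_all()
            thread.join()
```

Iterating `ThreadPrefetcher(source)` yields every item produced before the failure and only then re-raises the worker's exception in the consuming thread, so the failure is neither lost nor left hanging.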
Sorry, I made a mistake. I will have to reopen this.
Summary: This is an example that uses Datapipes to download and preprocess the [laion5b](https://laion.ai/blog/laion-5b/)-dataset (to be more precise [this subset](https://huggingface.co/datasets/laion/laion2B-en-joined)). Also uses Dataloader2 for multiprocessing. ### Changes - Load metadata from Huggingface and filter - Load images from the urls - access metadata of image and print out label and copyright information Unfortunately I made a mistake while rebasing in #1017 so I had to reopen the PR. Pull Request resolved: #1034 Reviewed By: NivekT Differential Revision: D43463022 Pulled By: ejguan fbshipit-source-id: 2f1f2b8bcb3abee15a1935431a497532b95b1c8d
This is an example that uses DataPipes to download and preprocess the laion5b dataset (more precisely, the laion2B-en-joined subset).
Changes
Unfortunately, loading images is still very slow because it is not done in parallel. Fixing this would need a new DataPipe that supports multithreading or `asyncio`.
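As a rough illustration of the parallel loading described above, here is a minimal sketch built on a standard-library thread pool. `load_all` and `fetch` are hypothetical names and not part of TorchData; `fetch` stands in for whatever downloads and decodes a single image. Skipping failed URLs is an assumption made for this sketch, not behaviour taken from the example script.

```python
from concurrent.futures import ThreadPoolExecutor


def load_all(urls, fetch, max_workers=16):
    """Yield (url, result) pairs in input order, fetching concurrently.

    Hypothetical helper, not a TorchData API. URLs whose fetch raises
    are skipped (an assumption of this sketch).
    """

    def safe_fetch(url):
        # Capture the exception so one failing URL does not abort the rest.
        try:
            return url, fetch(url), None
        except Exception as exc:
            return url, None, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps input order while the fetches overlap in time.
        for url, result, error in pool.map(safe_fetch, urls):
            if error is None:
                yield url, result
```

Because `pool.map` submits every URL up front, a helper like this is better applied to one batch of URLs at a time than to the full metadata stream.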