Laion5b dataset example #1017

Closed
wants to merge 17 commits into from

@SvenDS9 SvenDS9 commented Feb 15, 2023

This is an example that uses DataPipes to download and preprocess the laion5b dataset (more precisely, the laion2B-en-joined subset: https://huggingface.co/datasets/laion/laion2B-en-joined).

Changes

  • Load metadata from Huggingface and filter
  • Load images from the urls
  • Access the metadata of each image and print out label and copyright information

Unfortunately, loading images is still very slow because it is not done in parallel. For this we would need a new DataPipe that supports multithreading or asyncio.
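
To illustrate the kind of parallel loading the paragraph above asks for, here is a minimal sketch that uses only the Python standard library rather than a DataPipe. The names `fetch_bytes` and `load_in_parallel` are hypothetical and are not part of this PR or of torchdata; the thread-pool approach is an assumption about one way to overlap the network waits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, Optional, Tuple
from urllib.request import urlopen


def fetch_bytes(url: str, timeout: float = 5.0) -> bytes:
    """Download the raw bytes behind a URL (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def load_in_parallel(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = fetch_bytes,
    max_workers: int = 16,
) -> Iterator[Tuple[str, Optional[bytes]]]:
    """Yield (url, payload) pairs in input order; payload is None on failure."""

    def safe_fetch(url: str) -> Tuple[str, Optional[bytes]]:
        try:
            return url, fetch(url)
        except Exception:
            # Dead links are common in web-scraped datasets, so report
            # the failure as None instead of aborting the whole run.
            return url, None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map submits every task up front and yields results
        # in the same order as the input URLs.
        yield from pool.map(safe_fetch, urls)
```

Because `Executor.map` consumes its whole input eagerly, a real pipeline over billions of URLs would likely feed it bounded batches instead of the full list.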

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 15, 2023
Summary:
`fsspec` doesn't support python 3.11 for now. In order to enable TorchData conda release for python 3.11, I have to remove `fsspec` from `meta.yml` to prevent installation during building process.
See: https://anaconda.org/anaconda/fsspec/files

You can find failing release workflow in https://github.com/pytorch/data/actions/runs/4183212525

Pull Request resolved: pytorch#1015

Reviewed By: atalman

Differential Revision: D43307022

Pulled By: ejguan

fbshipit-source-id: 2da8022705d503ff8a75ec3e5e8766d7d5592d82
@ejguan ejguan added this to the 0.6.0 milestone Feb 15, 2023
Summary:
Fix the mypy Error after `types_requests` is released to `2.28.11.13`

Pull Request resolved: pytorch#1018

Reviewed By: NivekT

Differential Revision: D43309767

Pulled By: ejguan

fbshipit-source-id: 1990ec809156c961ebf32d6f3b16b67071cc1303
NivekT and others added 3 commits February 15, 2023 14:23
Summary: Pull Request resolved: pytorch#1019

Test Plan:
Imported from OSS

 ---
Ran this on devvm:

```
buck2 test mode/dev-nosan //caffe2/torch/fb/trainer/data_modules/tests:test_full_sync_data_module_data_loader_v2_compatibility -- --exact 'caffe2/torch/fb/trainer/data_modules/tests:test_full_sync_data_module_data_loader_v2_compatibility - test_full_sync_data_module_data_reading_checkpoint_read_beyond_data_length (caffe2.torch.fb.trainer.data_modules.tests.test_full_sync_data_module_data_loader_v2_compatibility.FullSyncDataModuleDataLoaderV2CompatibilityTest)'
```

Reviewed By: ejguan

Differential Revision: D43312918

Pulled By: NivekT

fbshipit-source-id: f4766c4aec1adfd23f2b5aadf4a26905ba8716c9
Summary:
When a worker thread fails in PrototypeMultiProcessingReadingService, the exception was silently ignored and the thread hung. This PR fixes that by catching the exception and propagating it to the response queue, showing it to the user, and exiting the process instead of hanging.
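
The pattern described above (catch in the worker, forward through the response queue, re-raise in the consumer) can be sketched with the standard library alone. This is not the torchdata implementation; `WorkerFailure`, `_worker`, and `run_in_worker` are hypothetical names chosen for illustration.

```python
import queue
import threading
from typing import Any, Callable, Iterable, Iterator


class WorkerFailure:
    """Wraps an exception so it can travel through the response queue."""

    def __init__(self, exc: BaseException) -> None:
        self.exc = exc


_DONE = object()  # sentinel marking normal completion


def _worker(items: Iterable[Any], fn: Callable[[Any], Any], responses: queue.Queue) -> None:
    try:
        for item in items:
            responses.put(fn(item))
        responses.put(_DONE)
    except Exception as exc:
        # Forward the failure instead of dying silently, so the
        # consumer is never left waiting on an empty queue.
        responses.put(WorkerFailure(exc))


def run_in_worker(items: Iterable[Any], fn: Callable[[Any], Any]) -> Iterator[Any]:
    responses: queue.Queue = queue.Queue()
    thread = threading.Thread(target=_worker, args=(items, fn, responses), daemon=True)
    thread.start()
    while True:
        response = responses.get()
        if response is _DONE:
            break
        if isinstance(response, WorkerFailure):
            # Surface the worker's exception in the calling thread.
            raise response.exc
        yield response
    thread.join()
```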

Pull Request resolved: pytorch#1003

Reviewed By: ejguan

Differential Revision: D43327742

Pulled By: priyaramani

fbshipit-source-id: 7a1f6bed2d3baab04912df2b8456dbb940c3e663

@ejguan ejguan left a comment

Overall LGTM with two minor comments.

examples/vision/laion5b.py (resolved)
examples/vision/laion5b.py (outdated, resolved)
examples/vision/laion5b.py (outdated, resolved)
SvenDS9 and others added 4 commits February 16, 2023 15:51
Summary: Pull Request resolved: pytorch#1022

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D43367475

Pulled By: NivekT

fbshipit-source-id: 3087326aab04efbc4e7dde3232302903d3e4eb9e
Summary: Pull Request resolved: pytorch#1028

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D43370795

Pulled By: NivekT

fbshipit-source-id: f052b80c0b51c65b8735a381e8d4d239be9e5dba
@facebook-github-bot

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Comment on lines 74 to 75
i = 0
for batch in laion2b_en():

Forgot to mention, can you please use DataLoader2 as well?

from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService

dp = laion2b_en()
rs = MultiProcessingReadingService(num_workers=4)
dl = DataLoader2(dp, reading_service=rs)
for batch in dl:
    ...

Summary:
Fixes pytorch#1013

## Changes

- Simplify the control flow of prefetcher
  - Delay Exception raised from thread worker to main thread in `__iter__`
  - Stop prefetching whenever Exception is received
  - As long as `stop_iteration` is not turned on or `buffer` is not empty, continue yielding data from `__iter__`.
  - Add serialization test
- Add `PinMemory` DataPipe
  - `is_replicable() -> False` to keep it in the main process
  - Add unit tests
- Update `test_proto_multi_rs.py` to `test_mprs.py`
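
The prefetcher control flow listed above can be sketched roughly as follows. This is a simplified illustration, not the torchdata code: `SimplePrefetcher` is a hypothetical class, and the polling with `time.sleep` is an assumption made to keep the example short.

```python
import threading
import time
from collections import deque
from typing import Any, Deque, Iterable, Iterator, List


class SimplePrefetcher:
    """Fills a bounded buffer from a background thread (illustrative only)."""

    def __init__(self, source: Iterable[Any], buffer_size: int = 4) -> None:
        self.source = source
        self.buffer_size = buffer_size

    def __iter__(self) -> Iterator[Any]:
        buffer: Deque[Any] = deque()
        stop_iteration = threading.Event()
        errors: List[BaseException] = []

        def produce() -> None:
            try:
                for item in self.source:
                    # Wait for free space, but give up once asked to stop.
                    while len(buffer) >= self.buffer_size:
                        if stop_iteration.is_set():
                            return
                        time.sleep(0.001)
                    if stop_iteration.is_set():
                        return
                    buffer.append(item)
            except Exception as exc:
                # Record the error; the main thread re-raises it later.
                errors.append(exc)
            finally:
                # Prefetching ends on exhaustion, error, or early exit.
                stop_iteration.set()

        thread = threading.Thread(target=produce, daemon=True)
        thread.start()
        try:
            # Keep yielding while the producer is alive or data remains.
            while not stop_iteration.is_set() or buffer:
                if buffer:
                    yield buffer.popleft()
                else:
                    time.sleep(0.001)
            if errors:
                # Delayed re-raise in the consuming thread.
                raise errors[0]
        finally:
            stop_iteration.set()
            thread.join()
```

Items buffered before a failure are still delivered, and the exception surfaces only after the buffer drains.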

Pull Request resolved: pytorch#1014

Reviewed By: NivekT

Differential Revision: D43329696

Pulled By: ejguan

fbshipit-source-id: da4326dbe2388f4e23b9a1a3a5c43da09d29185a
@NivekT NivekT mentioned this pull request Feb 17, 2023
10 tasks

SvenDS9 commented Feb 20, 2023

Sorry I made a mistake. I will have to reopen this.

@SvenDS9 SvenDS9 closed this Feb 20, 2023
facebook-github-bot pushed a commit that referenced this pull request Feb 21, 2023
Summary:
This is an example that uses Datapipes to download and preprocess the [laion5b](https://laion.ai/blog/laion-5b/)-dataset (to be more precise [this subset](https://huggingface.co/datasets/laion/laion2B-en-joined)). Also uses Dataloader2 for multiprocessing.

### Changes
- Load metadata from Huggingface and filter
- Load images from the urls
- access metadata of image and print out label and copyright information

Unfortunately I made a mistake while rebasing in #1017 so I had to reopen the PR.

Pull Request resolved: #1034

Reviewed By: NivekT

Differential Revision: D43463022

Pulled By: ejguan

fbshipit-source-id: 2f1f2b8bcb3abee15a1935431a497532b95b1c8d
NivekT pushed a commit that referenced this pull request Feb 21, 2023
ejguan pushed a commit that referenced this pull request Feb 22, 2023