Adding Cloud Storage Provider tutorial section #812
Conversation
[ghstack-poisoned]
ghstack-source-id: d286edc0fc9de44ed43dec3aba38f4ac5a21c92f Pull Request resolved: #812
[ghstack-poisoned]
ghstack-source-id: e17389a9dd199503b6b1f2a8214867bccb30f3dc Pull Request resolved: #812
@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Adding example of DataPipe usage with cloud storage providers (AWS S3 and GCS) via `fsspec`. Feel free to run the examples to verify that they work. Let me know what additional information I should provide as well.

![Screen Shot 2022-10-06 at 7 12 13 PM](https://user-images.githubusercontent.com/4935152/194434873-c8edb939-0e43-4622-883b-c1ce88f767f2.png)
![Screen Shot 2022-10-06 at 7 12 25 PM](https://user-images.githubusercontent.com/4935152/194434886-8b730522-0881-4e53-a5b9-75459304c061.png)

Differential Revision: [D40162645](https://our.internmc.facebook.com/intern/diff/D40162645)

[ghstack-poisoned]
ghstack-source-id: d7a20ed7fabc0f0a4804cc7c9d70830fe8d9bae2 Pull Request resolved: #812
@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Very nice, probably the single highest-leverage PR I've seen in this repo so far. Some minor nits, but I believe they can be addressed in a future PR.
.. code:: python

    from torchdata.datapipes.iter import IterableWrapper

    dp = IterableWrapper(["s3://BUCKET_NAME/DIRECTORY/1.tar"])
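For context, a slightly fuller sketch of how such a pipeline might continue, chaining the fsspec opener with the tar loader (hypothetical bucket and key; assumes `fsspec` and `s3fs` are installed and that S3 credentials or anonymous access are available):

```python
from torchdata.datapipes.iter import IterableWrapper

# Hypothetical bucket/key from the quoted snippet; requires `fsspec` and `s3fs`
dp = IterableWrapper(["s3://BUCKET_NAME/DIRECTORY/1.tar"])
# Open the remote object as a binary stream, then iterate over the tar members
dp = dp.open_files_by_fsspec(mode="rb").load_from_tar()
for path, stream in dp:
    print(path)  # path of each file inside the archive
```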
Might be worth hosting a toy dataset in a public S3 bucket (can be a follow-up PR) so we can actually run this code and have a test for it.
Working with Cloud Storage Providers
---------------------------------------------

Accessing AWS S3 with ``fsspec`` DataPipes
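For reference, the quoted section covers listing objects in an S3 bucket; a minimal sketch of what that usage looks like (hypothetical bucket name; assumes `fsspec` and `s3fs` are installed, with AWS credentials or public access configured in the environment):

```python
from torchdata.datapipes.iter import IterableWrapper

# Hypothetical bucket; requires `fsspec` and `s3fs`
dp = IterableWrapper(["s3://BUCKET_NAME"]).list_files_by_fsspec()
for file_name in dp:
    print(file_name)  # full s3:// path of each object in the bucket
```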
Would it be worth having fsspec in requirements.txt? It just seems so broadly useful.
I still support making `fsspec` an optional requirement to maintain the minimum requirements to install `torchdata`. And even with `fsspec` installed, users still need to install a specific library for each cloud vendor, like `s3fs`. So I feel the extra effort for users is not that significant.
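To illustrate the point about vendor-specific libraries, a minimal GCS sketch under the same pattern (hypothetical bucket; assumes `fsspec` and `gcsfs` are installed and GCP credentials are configured):

```python
from torchdata.datapipes.iter import IterableWrapper

# Hypothetical bucket; requires `fsspec` plus `gcsfs` (the GCS backend for fsspec)
dp = IterableWrapper(["gcs://BUCKET_NAME/DIRECTORY"]).list_files_by_fsspec()
for file_name in dp:
    print(file_name)
```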
LGTM, thank you
Adding example of DataPipe usage with cloud storage providers (AWS S3 and GCS) via `fsspec`. Feel free to run the examples to verify that they work. Let me know what additional information I should provide as well.

![Screen Shot 2022-10-06 at 7 18 11 PM](https://user-images.githubusercontent.com/4935152/194435408-4bbec176-4176-4153-921c-cfba603b4374.png)
![Screen Shot 2022-10-06 at 7 12 25 PM](https://user-images.githubusercontent.com/4935152/194434886-8b730522-0881-4e53-a5b9-75459304c061.png)

Differential Revision: [D40162645](https://our.internmc.facebook.com/intern/diff/D40162645)

[ghstack-poisoned]
ghstack-source-id: e3ebeef23d37b0c403c6dade1d0cea84f2f2a0e7 Pull Request resolved: #812
Updated based on feedback and wrote a bit extra about potentially using other cloud storage providers. I will land this version soon.
@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Summary:
Pull Request resolved: #812

Adding example of DataPipe usage with cloud storage providers (AWS S3 and GCS) via `fsspec`. Feel free to run the examples to verify that they work. Let me know what additional information I should provide as well.

![Screen Shot 2022-10-06 at 7 12 13 PM](https://user-images.githubusercontent.com/4935152/194434873-c8edb939-0e43-4622-883b-c1ce88f767f2.png)
![Screen Shot 2022-10-06 at 7 12 25 PM](https://user-images.githubusercontent.com/4935152/194434886-8b730522-0881-4e53-a5b9-75459304c061.png)

Test Plan: Imported from OSS

Reviewed By: ejguan, msaroufim

Differential Revision: D40162645

fbshipit-source-id: 6a7c156714b30bfed2da900b6d47d64f745333e2
Summary:

### Changes

Adding an example of DataPipe usage with Azure Blob storage via `fsspec`, similar to #812. The example is placed into a new section in `docs/source/tutorial.rst`.

Here is the screenshot showing that the code snippets in the tutorial work as expected:

<img width="1569" alt="Screenshot 2022-10-18 at 19 33 49" src="https://user-images.githubusercontent.com/23200558/196503562-034162c0-6dde-4749-adc7-5e081ff2c19f.png">

#### Minor note

Technically, `fsspec` [allows both path prefixes `abfs://` and `az://`](https://github.com/fsspec/adlfs/blob/f15c37a43afd87a04f01b61cd90294dd57181e1d/README.md?plain=1#L33) for Azure Blob storage Gen2 as synonyms. However, only `abfs://` works for us, for the following reason:

- If a path starts with `az`, the variable `fs.protocol` [here](https://github.com/pytorch/data/blob/768ecdae8b56af640a78e29f82864dc4f65df371/torchdata/datapipes/iter/load/fsspec.py#L82) is still `abfs`
- So the condition `root.startswith(protocol)` is false, and `is_local` is true
- As a result, the path "doubles" in [this line](https://github.com/pytorch/data/blob/768ecdae8b56af640a78e29f82864dc4f65df371/torchdata/datapipes/iter/load/fsspec.py#L95), like on this screenshot:

<img width="754" alt="Screenshot 2022-10-18 at 19 50 56" src="https://user-images.githubusercontent.com/23200558/196506965-697eb2d7-8f84-4536-972b-7081e55e1ff5.png">

This won't have any effect for users, however, as long as they use the `abfs://` prefix recommended in the tutorial.

Pull Request resolved: #836

Reviewed By: NivekT

Differential Revision: D40483505

Pulled By: sgrigory

fbshipit-source-id: f03373aa4b376af8ea2ac3480fc133067caaa0ce
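For illustration, a minimal sketch of the recommended `abfs://` usage described above (hypothetical container and directory; assumes `fsspec` and `adlfs` are installed and that Azure credentials are supplied via environment variables or fsspec configuration):

```python
from torchdata.datapipes.iter import IterableWrapper

# Hypothetical container/path; note the `abfs://` prefix rather than `az://`,
# which avoids the path-doubling issue described above. Requires `fsspec` and `adlfs`.
dp = IterableWrapper(["abfs://CONTAINER/DIRECTORY"]).list_files_by_fsspec()
for file_name in dp:
    print(file_name)
```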
Stack from ghstack:
Adding example of DataPipe usage with cloud storage providers (AWS S3 and GCS) via `fsspec`. Feel free to run the examples to verify that they work. Let me know what additional information I should provide as well.
Note that these screenshots are up-to-date:
Differential Revision: D40162645