Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Cloud Storage Provider tutorial section #812

Closed
wants to merge 4 commits into from

Conversation

NivekT
Copy link
Contributor

@NivekT NivekT commented Oct 6, 2022

Stack from ghstack:

Adding example of DataPipe usage with cloud storage providers (AWS S3 and GCS) via fsspec.

Feel free to run the examples to verify that they work. Let me know what additional information I should provide as well.

Note that these screenshots are up-to-date:

Screen Shot 2022-10-07 at 2 45 36 PM

Screen Shot 2022-10-07 at 2 46 01 PM

Differential Revision: D40162645

NivekT added a commit that referenced this pull request Oct 6, 2022
ghstack-source-id: d286edc0fc9de44ed43dec3aba38f4ac5a21c92f
Pull Request resolved: #812
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 6, 2022
NivekT added a commit that referenced this pull request Oct 6, 2022
ghstack-source-id: e17389a9dd199503b6b1f2a8214867bccb30f3dc
Pull Request resolved: #812
@NivekT
Copy link
Contributor Author

NivekT commented Oct 6, 2022

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Adding example of DataPipe usage with cloud storage providers (AWS S3 and GCS) via `fsspec`.

Feel free to run the examples to verify that they work. Let me know what additional information I should provide as well.

![Screen Shot 2022-10-06 at 7 12 13 PM](https://user-images.githubusercontent.com/4935152/194434873-c8edb939-0e43-4622-883b-c1ce88f767f2.png)

![Screen Shot 2022-10-06 at 7 12 25 PM](https://user-images.githubusercontent.com/4935152/194434886-8b730522-0881-4e53-a5b9-75459304c061.png)

Differential Revision: [D40162645](https://our.internmc.facebook.com/intern/diff/D40162645)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Oct 6, 2022
ghstack-source-id: d7a20ed7fabc0f0a4804cc7c9d70830fe8d9bae2
Pull Request resolved: #812
@NivekT
Copy link
Contributor Author

NivekT commented Oct 6, 2022

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Member

@msaroufim msaroufim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice probably the single highest leverage PR I've seen in this repo so far. Some minor nits but I believe they can be addressed in a future PR

docs/source/tutorial.rst Outdated Show resolved Hide resolved
docs/source/tutorial.rst Outdated Show resolved Hide resolved
.. code:: python

from torchdata.datapipes.iter import IterableWrapper
dp = IterableWrapper(["s3://BUCKET_NAME/DIRECTORY/1.tar"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth hosting a toy dataset in a public S3 bucket (can be a followup PR) so we can actually run this code and have a test for it

Working with Cloud Storage Providers
---------------------------------------------

Accessing AWS S3 with ``fsspec`` DataPipes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be worth having fsspec in requirements.txt? It jsut seems so broadly useful

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still support making fsspec an optional requirement to maintain the minimum requirements to install torchdata. And, even with fsspec installed, users still need to install specific lib for each cloud vendor like s3fs. So, I feel the extra effort for users are not that significant.

Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you

Working with Cloud Storage Providers
---------------------------------------------

Accessing AWS S3 with ``fsspec`` DataPipes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still support making fsspec an optional requirement to maintain the minimum requirements to install torchdata. And, even with fsspec installed, users still need to install specific lib for each cloud vendor like s3fs. So, I feel the extra effort for users are not that significant.

docs/source/tutorial.rst Show resolved Hide resolved
Adding example of DataPipe usage with cloud storage providers (AWS S3 and GCS) via `fsspec`.

Feel free to run the examples to verify that they work. Let me know what additional information I should provide as well.

![Screen Shot 2022-10-06 at 7 18 11 PM](https://user-images.githubusercontent.com/4935152/194435408-4bbec176-4176-4153-921c-cfba603b4374.png)

![Screen Shot 2022-10-06 at 7 12 25 PM](https://user-images.githubusercontent.com/4935152/194434886-8b730522-0881-4e53-a5b9-75459304c061.png)

Differential Revision: [D40162645](https://our.internmc.facebook.com/intern/diff/D40162645)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Oct 7, 2022
ghstack-source-id: e3ebeef23d37b0c403c6dade1d0cea84f2f2a0e7
Pull Request resolved: #812
@NivekT
Copy link
Contributor Author

NivekT commented Oct 7, 2022

Updated based on feedback and wrote a bit extra about potentially using other fsspec libraries to access other providers. Also did some minor doc fixes.

I will land this version soon.

@NivekT
Copy link
Contributor Author

NivekT commented Oct 7, 2022

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@NivekT NivekT mentioned this pull request Oct 7, 2022
NivekT added a commit that referenced this pull request Oct 7, 2022
Summary:
Pull Request resolved: #812

Adding example of DataPipe usage with cloud storage providers (AWS S3 and GCS) via `fsspec`.

Feel free to run the examples to verify that they work. Let me know what additional information I should provide as well.

![Screen Shot 2022-10-06 at 7 12 13 PM](https://user-images.githubusercontent.com/4935152/194434873-c8edb939-0e43-4622-883b-c1ce88f767f2.png)

![Screen Shot 2022-10-06 at 7 12 25 PM](https://user-images.githubusercontent.com/4935152/194434886-8b730522-0881-4e53-a5b9-75459304c061.png)

Test Plan: Imported from OSS

Reviewed By: ejguan, msaroufim

Differential Revision: D40162645

fbshipit-source-id: 6a7c156714b30bfed2da900b6d47d64f745333e2
ejguan pushed a commit that referenced this pull request Oct 7, 2022
Summary:
Pull Request resolved: #812

Adding example of DataPipe usage with cloud storage providers (AWS S3 and GCS) via `fsspec`.

Feel free to run the examples to verify that they work. Let me know what additional information I should provide as well.

![Screen Shot 2022-10-06 at 7 12 13 PM](https://user-images.githubusercontent.com/4935152/194434873-c8edb939-0e43-4622-883b-c1ce88f767f2.png)

![Screen Shot 2022-10-06 at 7 12 25 PM](https://user-images.githubusercontent.com/4935152/194434886-8b730522-0881-4e53-a5b9-75459304c061.png)

Test Plan: Imported from OSS

Reviewed By: ejguan, msaroufim

Differential Revision: D40162645

fbshipit-source-id: 6a7c156714b30bfed2da900b6d47d64f745333e2
@facebook-github-bot facebook-github-bot deleted the gh/NivekT/96/head branch October 11, 2022 14:19
facebook-github-bot pushed a commit that referenced this pull request Oct 19, 2022
Summary:
### Changes

Adding an example of DataPipe usage with Azure Blob storage via `fsspec`, similar to #812. The example is placed into a new section in `docs/source/tutorial.rst`

Here is the screenshot showing that code snippets in the tutorial work as expected:

<img width="1569" alt="Screenshot 2022-10-18 at 19 33 49" src="https://user-images.githubusercontent.com/23200558/196503562-034162c0-6dde-4749-adc7-5e081ff2c19f.png">

####  Minor note

Technically, `fsspec` [allows both path prefixes `abfs://` or `az://`](https://github.com/fsspec/adlfs/blob/f15c37a43afd87a04f01b61cd90294dd57181e1d/README.md?plain=1#L33) for Azure Blob storage Gen2 as synonyms. However, only `abfs://` works for us for the following reason:
- If a path starts with `az`, the variable `fs.protocol` [here](https://github.com/pytorch/data/blob/768ecdae8b56af640a78e29f82864dc4f65df371/torchdata/datapipes/iter/load/fsspec.py#L82) is still `abfs`
- So the condition `root.startswith(protocol)` is false, and `is_local` is true
- As a result the path "doubles" in [this line](https://github.com/pytorch/data/blob/768ecdae8b56af640a78e29f82864dc4f65df371/torchdata/datapipes/iter/load/fsspec.py#L95), like on this screenshot:
<img width="754" alt="Screenshot 2022-10-18 at 19 50 56" src="https://user-images.githubusercontent.com/23200558/196506965-697eb2d7-8f84-4536-972b-7081e55e1ff5.png">

This won't have any effect for the users, however, as long as they use `abfs://` prefix recommended in the tutorial

Pull Request resolved: #836

Reviewed By: NivekT

Differential Revision: D40483505

Pulled By: sgrigory

fbshipit-source-id: f03373aa4b376af8ea2ac3480fc133067caaa0ce
ejguan pushed a commit to ejguan/data that referenced this pull request Oct 21, 2022
Summary:
### Changes

Adding an example of DataPipe usage with Azure Blob storage via `fsspec`, similar to pytorch#812. The example is placed into a new section in `docs/source/tutorial.rst`

Here is the screenshot showing that code snippets in the tutorial work as expected:

<img width="1569" alt="Screenshot 2022-10-18 at 19 33 49" src="https://user-images.githubusercontent.com/23200558/196503562-034162c0-6dde-4749-adc7-5e081ff2c19f.png">

####  Minor note

Technically, `fsspec` [allows both path prefixes `abfs://` or `az://`](https://github.com/fsspec/adlfs/blob/f15c37a43afd87a04f01b61cd90294dd57181e1d/README.md?plain=1#L33) for Azure Blob storage Gen2 as synonyms. However, only `abfs://` works for us for the following reason:
- If a path starts with `az`, the variable `fs.protocol` [here](https://github.com/pytorch/data/blob/768ecdae8b56af640a78e29f82864dc4f65df371/torchdata/datapipes/iter/load/fsspec.py#L82) is still `abfs`
- So the condition `root.startswith(protocol)` is false, and `is_local` is true
- As a result the path "doubles" in [this line](https://github.com/pytorch/data/blob/768ecdae8b56af640a78e29f82864dc4f65df371/torchdata/datapipes/iter/load/fsspec.py#L95), like on this screenshot:
<img width="754" alt="Screenshot 2022-10-18 at 19 50 56" src="https://user-images.githubusercontent.com/23200558/196506965-697eb2d7-8f84-4536-972b-7081e55e1ff5.png">

This won't have any effect for the users, however, as long as they use `abfs://` prefix recommended in the tutorial

Pull Request resolved: pytorch#836

Reviewed By: NivekT

Differential Revision: D40483505

Pulled By: sgrigory

fbshipit-source-id: f03373aa4b376af8ea2ac3480fc133067caaa0ce
ejguan pushed a commit that referenced this pull request Oct 21, 2022
Summary:
### Changes

Adding an example of DataPipe usage with Azure Blob storage via `fsspec`, similar to #812. The example is placed into a new section in `docs/source/tutorial.rst`

Here is the screenshot showing that code snippets in the tutorial work as expected:

<img width="1569" alt="Screenshot 2022-10-18 at 19 33 49" src="https://user-images.githubusercontent.com/23200558/196503562-034162c0-6dde-4749-adc7-5e081ff2c19f.png">

####  Minor note

Technically, `fsspec` [allows both path prefixes `abfs://` or `az://`](https://github.com/fsspec/adlfs/blob/f15c37a43afd87a04f01b61cd90294dd57181e1d/README.md?plain=1#L33) for Azure Blob storage Gen2 as synonyms. However, only `abfs://` works for us for the following reason:
- If a path starts with `az`, the variable `fs.protocol` [here](https://github.com/pytorch/data/blob/768ecdae8b56af640a78e29f82864dc4f65df371/torchdata/datapipes/iter/load/fsspec.py#L82) is still `abfs`
- So the condition `root.startswith(protocol)` is false, and `is_local` is true
- As a result the path "doubles" in [this line](https://github.com/pytorch/data/blob/768ecdae8b56af640a78e29f82864dc4f65df371/torchdata/datapipes/iter/load/fsspec.py#L95), like on this screenshot:
<img width="754" alt="Screenshot 2022-10-18 at 19 50 56" src="https://user-images.githubusercontent.com/23200558/196506965-697eb2d7-8f84-4536-972b-7081e55e1ff5.png">

This won't have any effect for the users, however, as long as they use `abfs://` prefix recommended in the tutorial

Pull Request resolved: #836

Reviewed By: NivekT

Differential Revision: D40483505

Pulled By: sgrigory

fbshipit-source-id: f03373aa4b376af8ea2ac3480fc133067caaa0ce
ejguan pushed a commit to ejguan/data that referenced this pull request Oct 23, 2022
Summary:
### Changes

Adding an example of DataPipe usage with Azure Blob storage via `fsspec`, similar to pytorch#812. The example is placed into a new section in `docs/source/tutorial.rst`

Here is the screenshot showing that code snippets in the tutorial work as expected:

<img width="1569" alt="Screenshot 2022-10-18 at 19 33 49" src="https://user-images.githubusercontent.com/23200558/196503562-034162c0-6dde-4749-adc7-5e081ff2c19f.png">

####  Minor note

Technically, `fsspec` [allows both path prefixes `abfs://` or `az://`](https://github.com/fsspec/adlfs/blob/f15c37a43afd87a04f01b61cd90294dd57181e1d/README.md?plain=1#L33) for Azure Blob storage Gen2 as synonyms. However, only `abfs://` works for us for the following reason:
- If a path starts with `az`, the variable `fs.protocol` [here](https://github.com/pytorch/data/blob/768ecdae8b56af640a78e29f82864dc4f65df371/torchdata/datapipes/iter/load/fsspec.py#L82) is still `abfs`
- So the condition `root.startswith(protocol)` is false, and `is_local` is true
- As a result the path "doubles" in [this line](https://github.com/pytorch/data/blob/768ecdae8b56af640a78e29f82864dc4f65df371/torchdata/datapipes/iter/load/fsspec.py#L95), like on this screenshot:
<img width="754" alt="Screenshot 2022-10-18 at 19 50 56" src="https://user-images.githubusercontent.com/23200558/196506965-697eb2d7-8f84-4536-972b-7081e55e1ff5.png">

This won't have any effect for the users, however, as long as they use `abfs://` prefix recommended in the tutorial

Pull Request resolved: pytorch#836

Reviewed By: NivekT

Differential Revision: D40483505

Pulled By: sgrigory

fbshipit-source-id: f03373aa4b376af8ea2ac3480fc133067caaa0ce
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants