Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataPipe] Adding kwargs for fs.open() in fsspec DataPipes #804

Closed
wants to merge 2 commits into from

Conversation

NivekT
Copy link
Contributor

@NivekT NivekT commented Oct 3, 2022

Stack from ghstack:

Fixes #803

I left FSSpecFileLister untouched since I don't think it will be useful for fs.ls() to accept kwargs.

Differential Revision: D40038331

NivekT added a commit that referenced this pull request Oct 3, 2022
ghstack-source-id: c1b6d39078d855e3f04cbe0d617353b5901fa54a
Pull Request resolved: #804
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 3, 2022
@NivekT NivekT changed the title [DataPipe] Adding kwargs for fs.open() in fsspec DataPipes [DataPipe] Adding kwargs for fs.open() in fsspec DataPipes Oct 3, 2022
@NivekT NivekT requested a review from ejguan October 3, 2022 21:26
@NivekT
Copy link
Contributor Author

NivekT commented Oct 3, 2022

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you. And, could you please add corresponding tests, if possible?

torchdata/datapipes/iter/load/fsspec.py Outdated Show resolved Hide resolved
torchdata/datapipes/iter/load/fsspec.py Show resolved Hide resolved
Fixes #803

I left `FSSpecFileLister` untouched since I don't think it will be useful for `fs.ls()` to accept kwargs.

Differential Revision: [D40038331](https://our.internmc.facebook.com/intern/diff/D40038331)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Oct 4, 2022
ghstack-source-id: cafc1d14ee42d1d753987cb301617331d24f218c
Pull Request resolved: #804
@NivekT
Copy link
Contributor Author

NivekT commented Oct 4, 2022

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@NivekT
Copy link
Contributor Author

NivekT commented Oct 5, 2022

@dorbittonn This PR should allow you to pass version_id as part of kwargs_for_open to FSSpecFileOpener. You will need to install from source or wait for the PR to land into the nightly release.

Note that the usage of these fsspec DataPipes require the installation of fsspec and s3fs. This documentation provides more details on bucket version awareness.

We are looking into the extending the same options for S3 DataPipes but it will be low priority for us if the fsspec solution works.

@dorbittonn
Copy link

@dorbittonn This PR should allow you to pass version_id as part of kwargs_for_open to FSSpecFileOpener. You will need to install from source or wait for the PR to land into the nightly release.

Note that the usage of these fsspec DataPipes require the installation of fsspec and s3fs. This documentation provides more details on bucket version awareness.

We are looking into the extending the same options for S3 DataPipes but it will be low priority for us if the fsspec solution works.

Thanks for the tag @NivekT , but using s3fs library is not a good alternative for me, since this library is too slow.
You can observe this post on towardsdatascience examining different ways for loading from s3 using pytorch: Training in pytorch from s3 bucket

@NivekT
Copy link
Contributor Author

NivekT commented Oct 6, 2022

@dorbittonn We have recently ran our own benchmark on AWS EC2 to load data from S3, we actually found that fsspec is faster than the S3 DataPipe implementation, especially if you use open tar archives via streaming .load_from_tar(mode="r|"). Can you please try it and let us know if you got something different?

@dorbittonn
Copy link

@dorbittonn We have recently ran our own benchmark on AWS EC2 to load data from S3, we actually found that fsspec is faster than the S3 DataPipe implementation, especially if you use open tar archives via streaming .load_from_tar(mode="r|"). Can you please try it and let us know if you got something different?

Do you have any example for streaming data from s3 using fsspec using load from tar?

@NivekT
Copy link
Contributor Author

NivekT commented Oct 6, 2022

This is what I have used, let me know if you need more info.

dp = IterableWrapper(s3_paths)  # Replace this with a DataPipe that gives up paths to S3 archives
dp = dp.open_files_by_fsspec(mode="rb", anon=True).load_from_tar(mode="r|")  # Note that it is r| instead of r:, | enables streaming of uncompressed archives

@dorbittonn
Copy link

dorbittonn commented Oct 6, 2022

This is what I have used, let me know if you need more info.

dp = IterableWrapper(s3_paths)  # Replace this with a DataPipe that gives up paths to S3 archives
dp = dp.open_files_by_fsspec(mode="rb", anon=True).load_from_tar(mode="r|")  # Note that it is r| instead of r:, | enables streaming of uncompressed archives

I get
from torch.utils.data.graph import traverse_dps ImportError: cannot import name 'traverse_dps' from 'torch.utils.data.graph'

when trying to import torchdata after installing this PR

@NivekT
Copy link
Contributor Author

NivekT commented Oct 6, 2022

In order to install TorchData from source, you will need to install PyTorch from source as well.

@ejguan
Copy link
Contributor

ejguan commented Oct 6, 2022

Maybe try nightly release of torchdata? It should include nightly pytorch as well.

@ejguan
Copy link
Contributor

ejguan commented Oct 7, 2022

BTW, I will cherry-pick this PR to the release branch by the end of today.

@ejguan ejguan mentioned this pull request Oct 7, 2022
@dorbittonn
Copy link

In order to install TorchData from source, you will need to install PyTorch from source as well.

Which version of pytorch should I install? the last commit? last release of 1.12.1 is enough?
I tried to use torch 1.12.1 and installed the last commit of torchdata from source, and the same error I've mentioned appears

@ejguan
Copy link
Contributor

ejguan commented Oct 7, 2022

@dorbittonn Please check this link. There is a Preview (nightly) version of PyTorch. You can install torch and torchdata preview releases.
For linux cpu: pip3 install --pre torch torhcdata --extra-index-url https://download.pytorch.org/whl/nightly/cpu

@dorbittonn
Copy link

dorbittonn commented Oct 7, 2022

@dorbittonn Please check this link. There is a Preview (nightly) version of PyTorch. You can install torch and torchdata preview releases. For linux cpu: pip3 install --pre torch torhcdata --extra-index-url https://download.pytorch.org/whl/nightly/cpu

I tried it (with cpu and with nightly gpu by replacing the url you gave with https://download.pytorch.org/whl/nightly/torchrec_nightly_3.8_cu11.whl/ )
And it seems that the preview(nightly) version of pytorch doesnt include this PR

thanks for your help!

@ejguan
Copy link
Contributor

ejguan commented Oct 7, 2022

I tried it (with cpu and with nightly gpu by replacing the url you gave with https://download.pytorch.org/whl/nightly/torchrec_nightly_3.8_cu11.whl/ )
And it seems that the preview(nightly) version of pytorch doesnt include this PR

Emmm. If you need to install torch and torchdata nightly with cuda11. You can install via
pip install --pre torch torhcdata --extra-index-url https://download.pytorch.org/whl/nightly/cu117

Edit: And, this commit should be present in torchdata nightly. Could you provide the log when you execute pip install?

@dorbittonn
Copy link

dorbittonn commented Oct 7, 2022

Here's the log :
TDLOG.txt
And I'm still getting
AttributeError: 'IterableWrapperIterDataPipe' object has no attribute 'openfiles_by_fsspec

Thank you again

edit: Sorry, my typo @NivekT

@NivekT
Copy link
Contributor Author

NivekT commented Oct 7, 2022

Here's the log : TDLOG.txt And I'm still getting AttributeError: 'IterableWrapperIterDataPipe' object has no attribute 'openfiles_by_fsspec

Thank you again

Are you missing an underscore? It should be open_files_by_fsspec

NivekT added a commit that referenced this pull request Oct 7, 2022
Summary:
Pull Request resolved: #804

Fixes #803

I left `FSSpecFileLister` untouched since I don't think it will be useful for `fs.ls()` to accept kwargs.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D40038331

Pulled By: NivekT

fbshipit-source-id: 45232b938693690bc0906fc6240a104e80ef51f9
ejguan pushed a commit that referenced this pull request Oct 7, 2022
Summary:
Pull Request resolved: #804

Fixes #803

I left `FSSpecFileLister` untouched since I don't think it will be useful for `fs.ls()` to accept kwargs.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D40038331

Pulled By: NivekT

fbshipit-source-id: 45232b938693690bc0906fc6240a104e80ef51f9
@facebook-github-bot facebook-github-bot deleted the gh/NivekT/95/head branch October 9, 2022 14:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants