Does torchdata already work with GCP and Azure blob storage #794

Closed
msaroufim opened this issue Sep 27, 2022 · 7 comments

Comments

@msaroufim
Member

msaroufim commented Sep 27, 2022

🚀 The feature

We already have an S3 integration, and it seems the S3 API already works with both GCP and Azure Blob Storage.

Motivation, pitch

So ideally we can already support Azure and GCP without much extra work.

Alternatives

Build a new integration for each of Azure and GCP using their native APIs.

h/t: @chauhang for the idea

@ejguan
Contributor

ejguan commented Sep 28, 2022

Technically speaking, with the fsspec DataPipes, torchdata already works with these cloud vendors.
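
For readers landing here later, a minimal sketch of what that could look like. It assumes a torchdata release that still ships `torchdata.datapipes`, plus the matching fsspec backend (`gcsfs` for `gcs://`, `adlfs` for `abfs://`, `s3fs` for `s3://`), with credentials already configured in the environment. The bucket and container names are placeholders, and `cloud_url` and `open_cloud_files` are hypothetical helpers written for this example, not torchdata APIs.

```python
# fsspec protocol prefixes: "s3" is registered by s3fs, "gcs" by gcsfs, "abfs" by adlfs.
_PROTOCOLS = {"aws": "s3", "gcp": "gcs", "azure": "abfs"}


def cloud_url(provider: str, container: str, prefix: str = "") -> str:
    """Build an fsspec-style URL such as gcs://my-bucket/train (hypothetical helper)."""
    protocol = _PROTOCOLS[provider]
    return f"{protocol}://{container}/{prefix}".rstrip("/")


def open_cloud_files(root_url: str, masks: str = "", mode: str = "rb"):
    """Return a DataPipe yielding (path, stream) pairs for the files under root_url."""
    # Imported lazily so the URL helper can be used without torchdata installed.
    from torchdata.datapipes.iter import FSSpecFileLister, FSSpecFileOpener

    paths = FSSpecFileLister(root=root_url, masks=masks)  # lists the remote files
    return FSSpecFileOpener(paths, mode=mode)  # opens each listed file on iteration


# Usage (needs network access and credentials; names are placeholders):
# for path, stream in open_cloud_files(cloud_url("gcp", "my-bucket", "train"), masks="*.tar"):
#     header = stream.read(512)
```

Only the URL prefix changes between vendors here; the listing and opening steps stay the same.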

@msaroufim
Member Author

Have you by any chance observed any perf impact from using fsspec vs. the S3 integration? If not, then agreed: fsspec is a good option, and we just need to spend some time authoring a tutorial.

@ejguan
Contributor

ejguan commented Sep 28, 2022

After we observed the performance regression last time, I didn't get a chance to take a deeper look at the culprit. But I discussed it with @ydaiming earlier, and he claimed that, compared to boto3, the S3 integration works better on archive files but not on many small files (boto3 is the internal implementation of fsspec).

Overall, in some cases, fsspec does provide a benefit to our users. So adding more detailed instructions for fsspec, and describing how the perf impact depends on the type of files, might be a good step for now.
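
A rough sketch of how such a comparison could be timed. This is not the benchmark used in this thread: `measure_read_throughput` is a hypothetical helper, and the in-memory buffers only stand in for the `(path, stream)` pairs that a file-opening DataPipe would yield. Against real storage, the same function could be pointed at each backend for a many-small-files dataset and for a few-large-archives dataset.

```python
import io
import time
from typing import BinaryIO, Dict, Iterable, Tuple


def measure_read_throughput(
    files: Iterable[Tuple[str, BinaryIO]], chunk_size: int = 1 << 20
) -> Dict[str, float]:
    """Read every stream to the end and report file count, byte count, and timing."""
    n_files = 0
    n_bytes = 0
    start = time.perf_counter()
    for _path, stream in files:
        # Read in fixed-size chunks so large files are never held in memory at once.
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            n_bytes += len(chunk)
        n_files += 1
    elapsed = time.perf_counter() - start
    mb_per_s = n_bytes / 1e6 / elapsed if elapsed > 0 else float("inf")
    return {"files": n_files, "bytes": n_bytes, "seconds": elapsed, "mb_per_s": mb_per_s}


# Same total volume, split two ways: many small files vs. one large file.
small = [(f"small_{i}", io.BytesIO(b"x" * 1024)) for i in range(1024)]
large = [("large_0", io.BytesIO(b"x" * (1024 * 1024)))]
small_stats = measure_read_throughput(small)
large_stats = measure_read_throughput(large)
```

With in-memory buffers the timings are meaningless; the point is only the shape of the measurement, so that per-file overhead and raw bandwidth can be told apart.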

@NivekT
Contributor

NivekT commented Sep 28, 2022

I am going to take a quick look into fsspec vs s3 performance in my benchmark

@NivekT
Contributor

NivekT commented Sep 30, 2022

My benchmark shows that using FSSpecFileOpener is faster and it also provides the ability to stream (rather than downloading a whole archive into memory before reading).
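
To illustrate the streaming point without needing a bucket, here is a minimal stdlib-only sketch. An in-memory buffer stands in for the remote object, and `CountingStream`, `build_archive`, and `stream_members` are made-up names for this example. It says nothing about how S3Handler or fsspec work internally; it only shows that a tar archive can be consumed member by member from a forward-only stream, so the first sample is available long before the whole archive has been transferred.

```python
import io
import tarfile


class CountingStream:
    """Forward-only reader that counts how many bytes have been pulled from the source."""

    def __init__(self, raw):
        self._raw = raw
        self.bytes_read = 0

    def read(self, size=-1):
        data = self._raw.read(size)
        self.bytes_read += len(data)
        return data


def build_archive(n_members: int, member_size: int) -> bytes:
    """Create an uncompressed tar archive in memory (stand-in for a remote object)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for i in range(n_members):
            payload = bytes([i % 256]) * member_size
            info = tarfile.TarInfo(name=f"sample_{i}.bin")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def stream_members(stream):
    """Yield (name, payload) one member at a time; mode "r|" never seeks backwards."""
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        for member in tar:
            extracted = tar.extractfile(member)
            if extracted is not None:
                yield member.name, extracted.read()


archive = build_archive(n_members=8, member_size=64 * 1024)
remote = CountingStream(io.BytesIO(archive))
members = stream_members(remote)
first_name, first_payload = next(members)
bytes_after_first = remote.bytes_read  # only a fraction of the archive so far
members.close()
```

Comparing `bytes_after_first` with `len(archive)` shows how much of the archive had to be read before the first member was usable.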

@ejguan
Contributor

ejguan commented Sep 30, 2022

> My benchmark shows that using FSSpecFileOpener is faster and it also provides the ability to stream (rather than downloading a whole archive into memory before reading).

Our benchmarking results show that even for archives (large files), fsspec performs better than the current implementation of S3Handler. I suspect this is caused by the downloading behavior. See: #800
cc: @ydaiming

@NivekT
Contributor

NivekT commented Oct 20, 2022

Since #812 and #836 have landed, I believe users should be able to use GCP and Azure Blob storage. Please feel free to re-open this issue or open a new issue if additional features are required. Thanks!

@NivekT NivekT closed this as completed Oct 20, 2022