
AIStore Datapipes (AISFileListerIterDataPipe and AISFileLoaderIterDataPipe) #517

Closed
gaikwadabhishek opened this issue Jun 14, 2022 · 7 comments

Comments

@gaikwadabhishek
Contributor

🚀 The feature

Add a new file loader and lister for AIStore, similar to s3io and fsspec.

For introduction and background, see PyTorch Blog “Efficient PyTorch I/O library for Large Datasets, Many Files, Many GPUs.”
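To make the proposal concrete, here is a minimal sketch of how a lister/loader pair would compose. It uses plain Python generators and an in-memory dict in place of the actual torchdata `IterDataPipe` classes and a live AIS bucket; the `ais://bucket` URL, the `FAKE_BUCKET` store, and the function names are all illustrative, not the proposed API itself.

```python
from io import BytesIO

# Stand-in for an AIS bucket: object name -> raw bytes.
FAKE_BUCKET = {
    "train/0001.jpg": b"fake-jpeg-bytes-1",
    "train/0002.jpg": b"fake-jpeg-bytes-2",
}

def ais_file_lister(source_url):
    """Yield full object URLs under a bucket prefix (FileLister role)."""
    for name in sorted(FAKE_BUCKET):
        yield f"{source_url}/{name}"

def ais_file_loader(urls):
    """Yield (url, byte-stream) pairs for each listed object (FileLoader role)."""
    for url in urls:
        # "ais://bucket/train/0001.jpg" -> "train/0001.jpg"
        name = url.split("/", 3)[3]
        yield url, BytesIO(FAKE_BUCKET[name])

# Lister feeds the loader, just as the proposed DataPipes would chain.
for url, stream in ais_file_loader(ais_file_lister("ais://bucket")):
    print(url, len(stream.read()))
```

The real DataPipes would follow the same shape but speak to an AIS gateway instead of a dict, mirroring how the existing s3io lister/loader pair chains.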

Motivation, pitch

AIStore (AIS for short) is a highly available, lightweight object storage system designed specifically for petascale deep learning. As reliable, redundant storage, AIS supports n-way mirroring and erasure coding. But it is not purely – or not only – a storage system: it’ll shuffle user datasets and run custom extract-transform-load workloads.

AIS is an elastic cluster that can grow and shrink at runtime and can be ad-hoc deployed, with or without Kubernetes, anywhere from a single Linux machine to a bare-metal cluster of any size.

AIS fully supports Amazon S3, Google Cloud, and Microsoft Azure backends, providing a unified namespace across multiple connected backends and/or other AIS clusters, and more. Getting started with AIS takes only a few minutes (the prerequisites boil down to a Linux machine with a disk) and can be done either by running a prebuilt all-in-one Docker image or by building directly from the open-source code.

We are hoping that, once integrated with TorchData, AIS will prove useful to the community.

Alternatives

There are numerous ways to load data into TorchData pipelines: s3io (for objects on Amazon S3), fsspec (for files), and many more. What AIS can contribute is that it can be used both as standalone, reliable, scalable storage and/or deployed in front of any of the supported backends, including (but not limited to) Amazon S3.

AIS consistently shows balanced I/O distribution and linear scalability across arbitrary numbers of clustered servers. The ability to scale linearly with each added disk was, and remains, one of the major incentives. Much of the development is also driven by the idea of offloading dataset transformations.

Additional context

This feature will be taken up by the AIStore team at NVIDIA. Here is the initial commit introducing AISFileListerIterDataPipe and AISFileLoaderIterDataPipe:

Question to the PyTorch Team:
For end-to-end integration testing with PyTorch, we would need to have a running AIStore instance. Currently, we advise users to run AIS using any of the many documented ways - for instance, using a minimal (one gateway, one storage node) AIS cluster in a single docker image.

What would be our options in the context of submitting a patch? If running a custom container as part of the test pipeline is not supported and/or not feasible, then the question is: could we maybe mock AIS and its APIs? Would this level of (unit) testing be considered acceptable?
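As a rough illustration of the mocking option raised above, the HTTP layer can be patched so that client-side parsing logic runs without a live cluster. The `list_objects` helper and the JSON shape below are hypothetical, invented for this sketch rather than taken from the real AIS API:

```python
import json
from io import BytesIO
from unittest import mock
from urllib import request

def list_objects(endpoint, bucket):
    """Fetch object names in a bucket from an AIS gateway (hypothetical API)."""
    with request.urlopen(f"{endpoint}/v1/buckets/{bucket}") as resp:
        body = json.load(resp)
    return [entry["name"] for entry in body["entries"]]

# Canned gateway response, standing in for a running AIS cluster.
fake_body = json.dumps(
    {"entries": [{"name": "train/0001.jpg"}, {"name": "train/0002.jpg"}]}
).encode()

with mock.patch.object(request, "urlopen") as fake_urlopen:
    # The mocked urlopen is used as a context manager, so patch __enter__.
    fake_urlopen.return_value.__enter__.return_value = BytesIO(fake_body)
    names = list_objects("http://localhost:8080", "imagenet")

print(names)
```

Tests of this style exercise URL construction and response parsing, though not the wire protocol itself, which is why an end-to-end option with a real container would still be valuable.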

@ejguan
Contributor

ejguan commented Jun 14, 2022

Thank you for opening the issue. I think it sounds great to have AIStore clients integrated with TorchData, and the implementation of your DataPipes looks great.

I do have a few questions regarding AIStore, and please correct me if I am wrong. Does AIS support carrying out data manipulations on the server side? If so, how should users benefit from that? The current implementation of FileLister and FileLoader would only provide users access to data via AIS. And I do see your comment about

it’ll shuffle user datasets and run custom extract-transform-load workloads.

I would imagine users would get more benefit from a DataPipe that holds an AIS client and delegates all operations back to the AIS server. Whenever an operation cannot be forwarded to the AIS server, the data should be sent to the client side, returning a normal DataPipe to carry out the operation.
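A minimal sketch of that delegation pattern, under stated assumptions: `FakeAISClient`, its `SERVER_OPS` set, and `AISPipe` are all invented here to illustrate the control flow (run an operation server-side when supported, fall back to client-side otherwise); they are not actual AIS or torchdata APIs.

```python
class FakeAISClient:
    """Stand-in for an AIS client; 'shuffle' is the one server-side op here."""
    SERVER_OPS = {"shuffle"}

    def __init__(self, objects):
        self.objects = objects

    def run_server_op(self, op, objects):
        if op == "shuffle":
            return list(reversed(objects))  # deterministic stand-in for a shuffle
        raise NotImplementedError(op)

class AISPipe:
    """Pipe that owns a client and decides where each operation runs."""

    def __init__(self, client):
        self.client = client
        self.items = list(client.objects)

    def apply(self, op, fn=None):
        if op in self.client.SERVER_OPS:
            # Delegate the whole operation to the server.
            self.items = self.client.run_server_op(op, self.items)
        else:
            # Fall back: pull data client-side and transform locally.
            self.items = [fn(x) for x in self.items]
        return self

    def __iter__(self):
        return iter(self.items)

client = FakeAISClient(["a", "b", "c"])
pipe = AISPipe(client).apply("shuffle").apply("upper", fn=str.upper)
print(list(pipe))  # -> ['C', 'B', 'A']
```

The "shuffle" runs on the (fake) server, while "upper" has no server-side equivalent and is applied on the client, which is the split the comment above describes.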

@gaikwadabhishek
Contributor Author

Hi @ejguan

Yes, AIS does support carrying out data manipulations on the server-side.

In fact, we were planning a sequence of PRs to bring these features of AIS into Torch data pipelines incrementally, so that each patch is self-sufficient, testable, and improves/extends the previous ones.

But of course, right now we'll be focusing on the basic FileLister/FileLoader to see how it goes and learn the process.

@NivekT
Contributor

NivekT commented Jun 14, 2022

In fact, we were planning a sequence of PRs to bring these features of AIS into Torch data pipelines incrementally, so that each patch is self-sufficient, testable, and improves/extends the previous ones.

That makes sense to me, and I'm curious: what are some potential AIS features that may be added? I am also not very familiar with AIS.

@gaikwadabhishek
Contributor Author

Hi @NivekT
As we are using AIS as a backend, we will already be leveraging many of its core capabilities.

But going forward, it'd be great to discuss perhaps less conventional features, which broadly include I/O-intensive operations currently (conventionally) executed on the client side. ETL is one example, but there are more.

@ejguan
Contributor

ejguan commented Jun 15, 2022

For end-to-end integration testing with PyTorch, we would need to have a running AIStore instance. Currently, we advise users to run AIS using any of the many documented ways - for instance, using a minimal (one gateway, one storage node) AIS cluster in a single docker image.

Hi @gaikwadabhishek

In terms of testing, you should be able to create a GHA workflow that runs in a Docker image. For example, in TorchData, we use the PyTorch container for releases:

container: ${{ startsWith( matrix.os, 'ubuntu' ) && 'pytorch/manylinux-cpu' || null }}

Note this means the server and client will be executed on a single CI machine. Besides, we can't really test the behavior on macOS or Windows.
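For reference, a hedged sketch of what such a workflow job might look like; this is a config fragment, and the `aistorage/cluster-minimal` image name, the port, and the test file path are assumptions that should be checked against the AIStore documentation rather than taken as the actual CI setup:

```yaml
# Sketch of a GHA job that starts a minimal AIS cluster before the tests.
jobs:
  test-aistore:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Start minimal AIS cluster (assumed image name and port)
        run: docker run -d -p 51080:51080 aistorage/cluster-minimal:latest
      - name: Run AIS DataPipe tests
        run: pytest test/test_aistore.py
```

Because the container and the tests share one runner, this matches the single-CI-machine caveat above.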

@gaikwadabhishek
Contributor Author

Hi @ejguan
Thanks for the update. This looks like a good option; I think it covers the testing part. Can we proceed with the development?

@ejguan
Contributor

ejguan commented Jun 15, 2022

For sure, please go ahead and open a PR.

facebook-github-bot pushed a commit that referenced this issue Jul 6, 2022
Summary:
Fixes #517

### Changes
- Added `aisio.py` (Iterable Datapipe for AIStore backends)
- Added unit tests in ~~`test/test_local_io.py`~~ `test/test_aistore.py`
- Added GitHub action for running AIStore in ~~`CI.yml`~~ `.github/aistore_ci.yml` workflow

### Questions to maintainers
- We are unsure of the documentation URL that PyTorch will generate, which we need to reference in `README.md`, so for now we have tentatively added a URL patterned after the s3io functions (see `torchdata/datapipes/iter/load/README.md`).

Signed-off-by: Abhishek Gaikwad <gaikwadabhishek1997@gmail.com>

Pull Request resolved: #545

Reviewed By: VitalyFedyunin

Differential Revision: D37620194

Pulled By: msaroufim

fbshipit-source-id: 9df099586dd39d47f8fdf2b760b17503f8a9822d
3 participants