AIStore Datapipes (AISFileListerIterDataPipe and AISFileLoaderIterDataPipe) #517
Comments
Thank you for opening the issue. I think it sounds great to have AIStore clients integrated with TorchData, and the implementation looks reasonable to me. I do have a few questions regarding it: does AIS support carrying out data manipulations on the server side? I would imagine users would get more benefit by creating a DataPipe that offloads such operations to the server.
Hi @ejguan. Yes, AIS does support carrying out data manipulations on the server side. In fact, we were planning a sequence of PRs to bring these AIS features into Torch data pipelines incrementally, so that each patch is self-sufficient, testable, and improves/extends the previous ones. But of course, right now we'll be focusing on the basic FileLister/FileLoader to see how it goes and learn the process.
That makes sense to me, and I'm curious: what are some potential features of AIS that may be added? I am also not very familiar with AIS.
Hi @NivekT. Going forward, it'd be great to discuss maybe less conventional features, which broadly include I/O-intensive operations currently (conventionally) executed on the client side. ETL is one example, but there's more.
In terms of testing, you should be able to create a GHA workflow that runs in a Docker image. For example, in TorchData we use a PyTorch container for releases. This means the server and client will be executed on a single CI machine. Besides, we can't really test the behavior on Mac or Windows.
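A minimal sketch of how such a CI-gated test could look, assuming the workflow brings up a one-container AIS cluster and exports its gateway URL via an environment variable (the `AIS_ENDPOINT` variable name, the bucket name, and the use of the datapipe names that eventually landed in the PR below are all illustrative assumptions, not TorchData's actual setup):

```python
import os
import unittest

from torchdata.datapipes.iter import AISFileLister, AISFileLoader, IterableWrapper

# Illustrative assumption: the CI workflow starts a single-container AIS
# cluster and exports its gateway URL as AIS_ENDPOINT.
AIS_URL = os.environ.get("AIS_ENDPOINT", "")


@unittest.skipUnless(AIS_URL, "requires a running AIS cluster")
class TestAISDataPipes(unittest.TestCase):
    def test_list_and_load(self):
        # "ais://test-bucket/" is a placeholder bucket assumed to be
        # pre-populated by the CI workflow.
        source = IterableWrapper(["ais://test-bucket/"])
        lister = AISFileLister(source_datapipe=source, url=AIS_URL)
        loader = AISFileLoader(source_datapipe=lister, url=AIS_URL)
        for url, stream in loader:
            self.assertTrue(url.startswith("ais://"))
            stream.close()


if __name__ == "__main__":
    unittest.main()
```

On a machine without a running cluster (or without the environment variable set), the whole class is skipped rather than failing, which keeps the suite green on Mac and Windows runners.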
Hi @ejguan, thanks for the pointers. Should we go ahead and open a PR?
For sure, please go ahead and open a PR.
Summary: Fixes #517

### Changes
- Added `aisio.py` (Iterable Datapipe for AIStore backends)
- Added unit tests in ~~`test/test_local_io.py`~~ `test/test_aistore.py`
- Added a GitHub action for running AIStore in the ~~`CI.yml`~~ `.github/aistore_ci.yml` workflow

### Questions to maintainers
- We are unsure about the documentation generated on PyTorch to refer to in `README.md`, so I have tentatively added a URL similar to the s3io functions (see `torchdata/datapipes/iter/load/README.md`).

Signed-off-by: Abhishek Gaikwad <gaikwadabhishek1997@gmail.com>

Pull Request resolved: #545

Reviewed By: VitalyFedyunin

Differential Revision: D37620194

Pulled By: msaroufim

fbshipit-source-id: 9df099586dd39d47f8fdf2b760b17503f8a9822d
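For reference, a minimal usage sketch of the datapipes added here, assuming a locally running AIS gateway (the endpoint URL, port, and bucket names below are placeholders):

```python
from torchdata.datapipes.iter import AISFileLister, AISFileLoader, IterableWrapper

# Placeholder endpoint of a locally deployed AIS gateway.
ais_url = "http://localhost:51080"

# List objects under one or more bucket prefixes.
prefixes = IterableWrapper(["ais://my-bucket/train/"])
lister = AISFileLister(source_datapipe=prefixes, url=ais_url)

# Load each listed object as a (url, byte-stream) pair.
loader = AISFileLoader(source_datapipe=lister, url=ais_url)

for obj_url, stream in loader:
    payload = stream.read()  # raw object bytes
```

The lister/loader split mirrors the existing s3io datapipes, so the two stages can be composed with other IterDataPipes (shuffling, sharding, decoding) in the usual way.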
🚀 The feature
Add a new file loader and lister for AIStore, similar to s3io and fsspec.
For an introduction and background, see the PyTorch blog post "Efficient PyTorch I/O library for Large Datasets, Many Files, Many GPUs."
Motivation, pitch
AIStore (AIS for short) is a highly available, lightweight object storage system that specifically focuses on petascale deep learning. As reliable, redundant storage, AIS supports n-way mirroring and erasure coding. But it is not purely – or not only – a storage system: it'll shuffle user datasets and run custom extract-transform-load workloads.
AIS is an elastic cluster that can grow and shrink at runtime and can be ad-hoc deployed, with or without Kubernetes, anywhere from a single Linux machine to a bare-metal cluster of any size.
AIS fully supports Amazon S3, Google Cloud, and Microsoft Azure backends, providing a unified namespace across multiple connected backends and/or other AIS clusters, and more. Getting started with AIS takes only a few minutes (prerequisites boil down to having a Linux machine with a disk) and can be done either by running a prebuilt all-in-one Docker image or directly from the open-source code.
We are hoping that once integrated with TorchData, AIS can prove to be useful to the community.
Alternatives
There are numerous ways to load data into TorchData pipelines: s3io (for objects on Amazon S3), fsspec (for files), and many more. The difference AIS can contribute is that it can be used both as a standalone reliable/scalable storage system and/or be deployed in front of any of the supported backends, including (but not limited to) Amazon S3.
AIS consistently shows balanced I/O distribution and linear scalability across arbitrary numbers of clustered servers. The ability to scale linearly with each added disk was, and remains, one of the major incentives. Much of the development is also driven by the idea to offload dataset transformations.
Additional context
This feature will be taken up by the AIStore team at NVIDIA. Here's the initial commit that introduces `AISFileListerIterDataPipe` and `AISFileLoaderIterDataPipe`.
Question to the PyTorch Team:
For end-to-end integration testing with PyTorch, we would need to have a running AIStore instance. Currently, we advise users to run AIS in any of the many documented ways - for instance, as a minimal (one gateway, one storage node) AIS cluster in a single Docker image.
What would be our options in the context of submitting a patch? If running a custom container as part of the test pipeline is not supported and/or not feasible, then the question is: could we maybe mock AIS and its APIs? Would this level of (unit) testing be considered acceptable?
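On the mocking option, a purely illustrative sketch of what unit-level coverage could look like, built around a hypothetical thin client wrapper (`AISClientWrapper` below is invented for illustration and is not TorchData's or AIS's actual structure; it only demonstrates the `unittest.mock` pattern in question):

```python
import unittest
from typing import List
from unittest import mock


class AISClientWrapper:
    """Hypothetical thin wrapper over the AIS REST API (illustration only)."""

    def list_objects(self, bucket: str) -> List[str]:
        raise NotImplementedError  # would issue an HTTP request to the AIS gateway

    def get_object(self, bucket: str, name: str) -> bytes:
        raise NotImplementedError  # would fetch the object body


class TestWithMockedAIS(unittest.TestCase):
    def test_list_then_load(self):
        # Replace the wrapper with an autospec'd mock: no cluster needed.
        client = mock.create_autospec(AISClientWrapper, instance=True)
        client.list_objects.return_value = ["a.tar", "b.tar"]
        client.get_object.return_value = b"payload"

        # A datapipe built on top of such a client could be exercised the same way.
        names = client.list_objects("test-bucket")
        blobs = [client.get_object("test-bucket", n) for n in names]

        self.assertEqual(blobs, [b"payload", b"payload"])
        client.list_objects.assert_called_once_with("test-bucket")


if __name__ == "__main__":
    unittest.main()
```

A mock of this kind exercises the pipeline logic but not the wire protocol, which is why the question of whether mock-only coverage is acceptable matters here.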