
AIStore Datapipes (AISFileListerIterDataPipe and AISFileLoaderIterDataPipe) #517

Closed
gaikwadabhishek opened this issue Jun 14, 2022 · 7 comments

Comments

@gaikwadabhishek
Contributor

🚀 The feature

Add a new file loader and lister for AIStore, similar to s3io and fsspec.

For introduction and background, see PyTorch Blog “Efficient PyTorch I/O library for Large Datasets, Many Files, Many GPUs.”
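To make the proposal concrete, here is a minimal sketch of how a lister/loader pair would compose. It uses plain Python generators and an in-memory dict in place of the actual torchdata `IterDataPipe` classes and a live AIS bucket; the `ais://bucket` URL, the `FAKE_BUCKET` store, and the function names are all illustrative, not the proposed API itself.

```python
from io import BytesIO

# Stand-in for an AIS bucket: object name -> raw bytes.
FAKE_BUCKET = {
    "train/0001.jpg": b"fake-jpeg-bytes-1",
    "train/0002.jpg": b"fake-jpeg-bytes-2",
}

def ais_file_lister(source_url):
    """Yield full object URLs under a bucket prefix (FileLister role)."""
    for name in sorted(FAKE_BUCKET):
        yield f"{source_url}/{name}"

def ais_file_loader(urls):
    """Yield (url, byte-stream) pairs for each listed object (FileLoader role)."""
    for url in urls:
        # "ais://bucket/train/0001.jpg" -> "train/0001.jpg"
        name = url.split("/", 3)[3]
        yield url, BytesIO(FAKE_BUCKET[name])

# Lister feeds the loader, just as the proposed DataPipes would chain.
for url, stream in ais_file_loader(ais_file_lister("ais://bucket")):
    print(url, len(stream.read()))
```

The real DataPipes would follow the same shape but speak to an AIS gateway instead of a dict, mirroring how the existing s3io lister/loader pair chains.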

Motivation, pitch

AIStore (AIS for short) is a highly available, lightweight object storage system designed specifically for petascale deep learning. As reliable, redundant storage, AIS supports n-way mirroring and erasure coding. But it is not purely – or not only – a storage system: it’ll shuffle user datasets and run custom extract-transform-load workloads.

AIS is an elastic cluster that can grow and shrink at runtime and can be ad-hoc deployed, with or without Kubernetes, anywhere from a single Linux machine to a bare-metal cluster of any size.

AIS fully supports Amazon S3, Google Cloud, and Microsoft Azure backends, providing a unified namespace across multiple connected backends and/or other AIS clusters, and more. Getting started with AIS takes only a few minutes (the prerequisites boil down to a Linux machine with a disk) and can be done either by running a prebuilt all-in-one Docker image or by building directly from the open-source code.

We are hoping that, once integrated with TorchData, AIS will prove useful to the community.

Alternatives

There are numerous ways to load data into TorchData pipelines: s3io (for objects on Amazon S3), fsspec (for files), and many more. What AIS can contribute is that it can be used both as standalone, reliable, scalable storage and/or deployed in front of any of the supported backends, including (but not limited to) Amazon S3.

AIS consistently shows balanced I/O distribution and linear scalability across arbitrary numbers of clustered servers. The ability to scale linearly with each added disk was, and remains, one of the major incentives. Much of the development is also driven by the idea of offloading dataset transformations.

Additional context

This feature will be taken up by the AIStore team at NVIDIA. Here is the initial commit introducing AISFileListerIterDataPipe and AISFileLoaderIterDataPipe:

Question to the PyTorch Team:
For end-to-end integration testing with PyTorch, we would need to have a running AIStore instance. Currently, we advise users to run AIS using any of the many documented ways - for instance, using a minimal (one gateway, one storage node) AIS cluster in a single docker image.

What would be our options in the context of submitting a patch? If running a custom container as part of the test pipeline is not supported and/or not feasible, then the question is: could we maybe mock AIS and its APIs? Would this level of (unit) testing be considered acceptable?
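As a rough illustration of the mocking option raised above, the HTTP layer can be patched so that client-side parsing logic runs without a live cluster. The `list_objects` helper and the JSON shape below are hypothetical, invented for this sketch rather than taken from the real AIS API:

```python
import json
from io import BytesIO
from unittest import mock
from urllib import request

def list_objects(endpoint, bucket):
    """Fetch object names in a bucket from an AIS gateway (hypothetical API)."""
    with request.urlopen(f"{endpoint}/v1/buckets/{bucket}") as resp:
        body = json.load(resp)
    return [entry["name"] for entry in body["entries"]]

# Canned gateway response, standing in for a running AIS cluster.
fake_body = json.dumps(
    {"entries": [{"name": "train/0001.jpg"}, {"name": "train/0002.jpg"}]}
).encode()

with mock.patch.object(request, "urlopen") as fake_urlopen:
    # The mocked urlopen is used as a context manager, so patch __enter__.
    fake_urlopen.return_value.__enter__.return_value = BytesIO(fake_body)
    names = list_objects("http://localhost:8080", "imagenet")

print(names)
```

Tests of this style exercise URL construction and response parsing, though not the wire protocol itself, which is why an end-to-end option with a real container would still be valuable.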

@ejguan
Contributor

ejguan commented Jun 14, 2022

Thank you for opening the issue. I think it sounds great to have AIStore clients integrated with TorchData, and the implementation of your DataPipes looks great.

I do have a few questions regarding AIStore, and please correct me if I am wrong. Does AIS support carrying out data manipulations on the server side? If so, how should users benefit from that? The current implementation of FileLister and FileLoader would only provide users access to data via AIS. And I do see your comment about

it’ll shuffle user datasets and run custom extract-transform-load workloads.

I would imagine users would get more benefit from a DataPipe that holds an AIS client and delegates all operations back to the AIS server. Whenever an operation cannot be forwarded to the AIS server, the data should be sent to the client side, returning a normal DataPipe to carry out the operation.
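A minimal sketch of that delegation pattern, under stated assumptions: `FakeAISClient`, its `SERVER_OPS` set, and `AISPipe` are all invented here to illustrate the control flow (run an operation server-side when supported, fall back to client-side otherwise); they are not actual AIS or torchdata APIs.

```python
class FakeAISClient:
    """Stand-in for an AIS client; 'shuffle' is the one server-side op here."""
    SERVER_OPS = {"shuffle"}

    def __init__(self, objects):
        self.objects = objects

    def run_server_op(self, op, objects):
        if op == "shuffle":
            return list(reversed(objects))  # deterministic stand-in for a shuffle
        raise NotImplementedError(op)

class AISPipe:
    """Pipe that owns a client and decides where each operation runs."""

    def __init__(self, client):
        self.client = client
        self.items = list(client.objects)

    def apply(self, op, fn=None):
        if op in self.client.SERVER_OPS:
            # Delegate the whole operation to the server.
            self.items = self.client.run_server_op(op, self.items)
        else:
            # Fall back: pull data client-side and transform locally.
            self.items = [fn(x) for x in self.items]
        return self

    def __iter__(self):
        return iter(self.items)

client = FakeAISClient(["a", "b", "c"])
pipe = AISPipe(client).apply("shuffle").apply("upper", fn=str.upper)
print(list(pipe))  # -> ['C', 'B', 'A']
```

The "shuffle" runs on the (fake) server, while "upper" has no server-side equivalent and is applied on the client, which is the split the comment above describes.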

@gaikwadabhishek
Contributor Author

Hi @ejguan

Yes, AIS does support carrying out data manipulations on the server-side.

In fact, we were planning a sequence of PRs to bring these features of AIS into Torch data pipelines incrementally, so that each patch is self-sufficient, testable, and improves/extends the previous ones.

But of course, right now we'll be focusing on the basic FileLister/FileLoader to see how it goes and learn the process.

@NivekT
Contributor

NivekT commented Jun 14, 2022

In fact, we were planning a sequence of PRs to bring these features of AIS into Torch data pipelines incrementally, so that each patch is self-sufficient, testable, and improves/extends the previous ones.

That makes sense to me, and I'm curious: what are some potential AIS features that may be added? I am also not very familiar with AIS.

@gaikwadabhishek
Contributor Author

Hi @NivekT
As we are using AIS as a backend, we will already be leveraging many of its core capabilities.

But going forward, it'd be great to discuss perhaps less conventional features, which broadly include I/O-intensive operations currently (conventionally) executed on the client side. ETL is one example, but there are more.

@ejguan
Contributor

ejguan commented Jun 15, 2022

For end-to-end integration testing with PyTorch, we would need to have a running AIStore instance. Currently, we advise users to run AIS using any of the many documented ways - for instance, using a minimal (one gateway, one storage node) AIS cluster in a single docker image.

Hi @gaikwadabhishek

In terms of testing, you should be able to create a GHA workflow that runs in a Docker image. For example, in TorchData, we use the PyTorch container for releases:

container: ${{ startsWith( matrix.os, 'ubuntu' ) && 'pytorch/manylinux-cpu' || null }}

Note this means the server and client will be executed on a single CI machine. Besides, we can't really test the behavior on macOS or Windows.
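For reference, a hedged sketch of what such a workflow job might look like; this is a config fragment, and the `aistorage/cluster-minimal` image name, the port, and the test file path are assumptions that should be checked against the AIStore documentation rather than taken as the actual CI setup:

```yaml
# Sketch of a GHA job that starts a minimal AIS cluster before the tests.
jobs:
  test-aistore:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Start minimal AIS cluster (assumed image name and port)
        run: docker run -d -p 51080:51080 aistorage/cluster-minimal:latest
      - name: Run AIS DataPipe tests
        run: pytest test/test_aistore.py
```

Because the container and the tests share one runner, this matches the single-CI-machine caveat above.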

@gaikwadabhishek
Contributor Author

Hi @ejguan
Thanks for the update. This looks like a good option; I think it covers the testing part. Can we proceed with the development?

@ejguan
Contributor

ejguan commented Jun 15, 2022

For sure, please go ahead and open a PR.

facebook-github-bot pushed a commit that referenced this issue Jul 6, 2022
Summary:
Fixes #517

### Changes
- Added `aisio.py` (Iterable Datapipe for AIStore backends)
- Added unit tests in ~~`test/test_local_io.py`~~ `test/test_aistore.py`
- Added GitHub action for running AIStore in ~~`CI.yml`~~ `.github/aistore_ci.yml` workflow

### Questions to maintainers
- We are unsure of the documentation URL that PyTorch will generate, which we need to reference in `README.md`, so for now we have tentatively added a URL patterned after the s3io functions (see `torchdata/datapipes/iter/load/README.md`).

Signed-off-by: Abhishek Gaikwad <gaikwadabhishek1997@gmail.com>

Pull Request resolved: #545

Reviewed By: VitalyFedyunin

Differential Revision: D37620194

Pulled By: msaroufim

fbshipit-source-id: 9df099586dd39d47f8fdf2b760b17503f8a9822d
3 participants