Future of torchdata and dataloading #1196

Open
laurencer opened this issue Jul 17, 2023 · 55 comments

Comments

@laurencer
Contributor

laurencer commented Jul 17, 2023

As of July 2023, we have paused active development on TorchData and have paused new releases. We have learnt a lot from building it and hearing from users, but also believe we need to re-evaluate the technical design and approach given how much the industry has changed since we began the project. During the rest of 2023 we will be re-evaluating our plans in this space.

We want to hear from users on their use-cases and the pain-points they have (with data loading in general or torchdata specifically). Please reply on this issue to help inform our future roadmap.

@laurencer laurencer pinned this issue Jul 17, 2023
@erip
Contributor

erip commented Jul 17, 2023

Thanks for the update, @laurencer. Does this mean torchdata as a domain library or the entire concept of datapipes/DataLoader2 is under review?

@laurencer
Contributor Author

The short answer is we need to look at both. More holistically, there are lots of benefits to datapipes & DataLoader2, however we've seen some limitations in a few use-cases which indicate we may need to tweak them a bit (or they're not the one-stop solution that we were hoping for). Overall the data loading space is really important and we hear about a lot of pain-points, so we want to make sure we get the core abstractions right.

@josiahls

I've personally liked the idea behind data pipes and the newer data loader. It would be helpful to know some examples of use cases where this concept/API broke down.

I think this would be helpful for people about to jump into torchdata to know the breaking limitations, because it is not obvious to me at least.

@andrew-bydlon

I too would like to hear what limitations you are referencing.

If it is performance oriented, I believe there's an argument. You could make something compatible with a compiled framework, especially something like torch.compile. Someone recently talked about speedups loading tar files with rust as an example.

My question is what you recommend as an alternative platform to torchdata for flexible and fast dataloading in pytorch?

@nairbv

nairbv commented Jul 18, 2023

Does the design re-evaluation apply only to TorchData, or also to the portion of the datapipes API that was upstreamed to PyTorch core?

https://github.com/pytorch/pytorch/tree/main/torch/utils/data/datapipes

@BlueskyFR

IMO the first thing that comes to mind is TensorFlow's tf.data.Dataset API, which is super cool to use. It is deeply integrated in the framework and with Keras, and the different operations you "pipe"/chain together can be fused at runtime so that the input pipeline is more optimized.
However there are still some solutions such as NVIDIA DALI that are waaay faster if they apply to your use case.

I don't know TorchData in much detail, but I'd say building a nice-looking pipeline is easy, while making a pipeline optimized for high performance and still making it look cool is the real challenge.
By cool-looking I mean nice/easy to read as code, but also easily extensible, like users could share "databricks" together or something.

I hope this helps!

@BlueskyFR

Also, do you still recommend using torchdata in the meantime, or will compatibility with torch break at some point, so we should avoid using it?

@rbavery

rbavery commented Jul 21, 2023

Our ML team have been avid users of torchdata. We have used it to build datapipes that fetch large raster datasets from cloud providers to support training and inference. Currently we have a couple of projects using torchdata; the most robust is zen3geo (https://zen3geo.readthedocs.io/en/latest/walkthrough.html), a library managed by my colleague @weiji14 for fetching and batching large satellite images, with steps organized as functional datapipe ops.

While the API makes it easier to reuse custom data operations, we've been running into some consistent pain points when integrating datapipes with Dataloader V1 or Dataloader V2. We've had to switch back to Dataloader V1 and Datasets.

It doesn't seem like there is a clear set of documented rules for prefetching, shuffling, buffer sizes, and memory pinning that results in good performance, or even any performance gains that beat single-process dataloading when using torchdata with either Dataloader. All the configurations we have tried result in hangups, out of memory errors, or slower performance than a single process. It's also unclear how these parameters interact with different reading services.

I would love to see better docs and functionality for setting prefetching, shuffling, etc. with different kinds of reading services. Being able to profile datapipes and inspect RAM and cpu consumption of each operation would also be invaluable.
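
To make the profiling request above concrete, here is a minimal sketch of per-stage timing for a chain of plain Python iterators. It does not use any torchdata API; the helper name `timed` and the stage names are hypothetical, and the time recorded for a stage includes the time spent in everything upstream of it.

```python
import time
from typing import Dict, Iterable, Iterator, TypeVar

T = TypeVar("T")


def timed(name: str, source: Iterable[T], stats: Dict[str, float]) -> Iterator[T]:
    """Yield items from `source`, adding the time spent waiting for each one to stats[name].

    Hypothetical helper: the measured time is cumulative, i.e. it includes upstream stages.
    """
    it = iter(source)
    while True:
        start = time.perf_counter()
        try:
            item = next(it)
        except StopIteration:
            return
        finally:
            # Runs for both the normal and the exhausted case.
            stats[name] = stats.get(name, 0.0) + (time.perf_counter() - start)
        yield item


# Usage: wrap each stage of a toy two-stage pipeline.
stats: Dict[str, float] = {}
read = timed("read", range(1000), stats)
squared = timed("square", (x * x for x in read), stats)
total = sum(squared)
```

Subtracting the upstream figure from a stage's figure gives a rough estimate of that stage's own cost. Memory accounting would need a separate tool and is not covered by this sketch.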

@biphasic

I started to prototype a new version of my event data library on top of torchdata. The API is very clean and easy to understand, which is a strong plus point, even if it has a minor performance impact (I didn't verify that). I remember struggling with DataLoader2 and making multithreading work.

@sehoffmann

Hey, thanks for the update. Does that mean that torchdata will become obsolete in the future?

As I already indicated in older issues, what I see as the biggest weakness right now is the lack of control and flexibility with regard to:

  • Shuffling
  • Sharding
  • Multiprocessing (dispatching to other processes etc.)

These things are right now tightly integrated into the torchdata core, and not easily accessible from user code. Giving user code the same power and flexibility is paramount in my opinion to facilitate more complex pipelines than the vanilla CV pipeline. Or, conversely, these functionalities should be implemented in user land without privileged handling from torchdata.

You can have a look at my repository (sehoffmann/atmodata) for some monkey patches that were necessary to facilitate my pipeline.
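
As an illustration of what "implemented in user land" could look like for sharding and shuffling, here is a minimal sketch on plain Python iterators. It is not how torchdata implements either feature; the function names `shard` and `buffer_shuffle` are hypothetical, and `rank`/`world_size` are assumed to be supplied by the caller.

```python
import itertools
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shard(source: Iterable[T], rank: int, world_size: int) -> Iterator[T]:
    """Keep every `world_size`-th item, starting at offset `rank` (round-robin sharding)."""
    return itertools.islice(source, rank, None, world_size)


def buffer_shuffle(source: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Approximate shuffle using a bounded buffer; deterministic for a given seed."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in source:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Swap a random element to the end and emit it.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Drain whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


# Usage: worker 1 of 3 sees its own shard, shuffled with a small buffer.
items = list(buffer_shuffle(shard(range(30), rank=1, world_size=3), buffer_size=4, seed=0))
```

Because both steps are ordinary generator functions, user code decides where in the pipeline they run and with which seed, which is the kind of control the comment asks for.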

@npuichigo

npuichigo commented Jul 26, 2023

As for me, the ideal data pipeline should be ergonomic, flexible and efficient.

Chainable iterators already show their power in iterator algorithm libraries like itertools and more-itertools. torchdata chose to enhance that with a functional programming API, which is good. Rust's Iterator algorithms actually provide a good list of commonly used pipeline operations: besides Filter and Map, there are also FilterMap, FlatMap and so on.
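
For readers unfamiliar with the two less common combinators, here is a minimal Python sketch of them using only the standard library. The names `filter_map` and `flat_map` are hypothetical helpers modelled on the Rust iterator methods, with `None` standing in for Rust's `Option::None`; they are not torchdata functions.

```python
from itertools import chain
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def filter_map(fn: Callable[[T], Optional[U]], source: Iterable[T]) -> Iterator[U]:
    """Apply `fn` to each item and drop the results that are None."""
    for item in source:
        out = fn(item)
        if out is not None:
            yield out


def flat_map(fn: Callable[[T], Iterable[U]], source: Iterable[T]) -> Iterator[U]:
    """Apply `fn` to each item and flatten the resulting iterables into one stream."""
    return chain.from_iterable(map(fn, source))


def parse_int(text: str) -> Optional[int]:
    """Return the parsed integer, or None when `text` is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        return None


numbers = list(filter_map(parse_int, ["1", "x", "3"]))  # [1, 3]
tokens = list(flat_map(str.split, ["a b", "c"]))        # ['a', 'b', 'c']
```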

But flexibility still has room for improvement. At the least, torchdata should be comparable to pypeln, which has good flexibility to switch between or mix thread/process/coroutine based tasks.

For performance, can torchdata be comparable with huggingface datasets? Can we easily leverage arrow or something else to build a high-performance data pipeline? It would be good to show more benchmarks on production data.

@BarclayII

The DGL team is currently studying what the UX should be for scaling deep learning on graphs, namely the sampling strategies. More specifically, we want to support customization of

  • Different graph storages (in-memory, disk, graph databases, etc.)
  • Different node/edge feature storages (in-memory, disk, etc.)
  • Sampling algorithms (online neighbor/subgraph sampling, offline sampling, etc.)
  • Downstream tasks (node classification, link prediction with negative sampling, graph classification, etc.)
  • Orchestration (whether to put sampling and feature fetching in multiprocessing/multithreading, how to schedule different stages, etc.).

So our current design depends on the composability of torchdata's DataPipes to allow for maximum extensibility, expressing the graph storages/feature storages/sampling algorithms/etc. as a composition of iterables and their transforms.

That being said, we are currently not pursuing active usages on DataLoader2 due to concerns about compatibility with existing packages that depend on the PyTorch DataLoader (e.g. PyTorch Lightning). We did, however, borrow some ideas from ReadingService (namely the in-place editing of DataPipes).

We already have some demo in https://github.com/dmlc/dgl/tree/master/tests/python/pytorch/graphbolt.

Happy to discuss further.

@nairbv

nairbv commented Jul 27, 2023

@npuichigo pypeln looks interesting. Based on there being multiple single-threaded queues between stages I assume this is designed for a single-node setup? PyTorch users would need multi-node support. To insert a queue between stages of a pipeline with multi-node stages, presumably we'd want to use some kind of purpose-built stand alone message queue. I'm not sure if that kind of setup is desirable -- once the training data reaches GPU hosts, I'd think we usually don't want to send it back elsewhere, so that architecture might make more sense for pre-processing.
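
To illustrate the single-node "queue between stages" idea discussed above, here is a minimal sketch built on the standard library's `queue` and `threading` modules. It is not pypeln's API; the name `threaded_stage` is hypothetical, and error handling is deliberately simplified (an exception in the worker just ends the stream).

```python
import queue
import threading
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
U = TypeVar("U")

_SENTINEL = object()  # Marks the end of the stream on the queue.


def threaded_stage(source: Iterable[T], fn: Callable[[T], U], maxsize: int = 8) -> Iterator[U]:
    """Run `fn` over `source` in a background thread, handing results over a bounded queue."""
    q: "queue.Queue[object]" = queue.Queue(maxsize=maxsize)

    def worker() -> None:
        try:
            for item in source:
                q.put(fn(item))  # Blocks when the consumer falls behind (back-pressure).
        finally:
            q.put(_SENTINEL)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        result = q.get()
        if result is _SENTINEL:
            return
        yield result  # type: ignore[misc]


# Usage: two stages, each with its own thread and queue, order preserved.
decoded = threaded_stage(range(5), lambda x: x + 1)
scaled = list(threaded_stage(decoded, lambda x: x * 10))
```

Everything here lives in one process on one host, which matches the single-node assumption in the comment; a multi-node variant would need an external message queue, as noted above.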

@BarclayII

> our current design depends on the composability of torchdata's DataPipes
>
> not pursuing active usages on DataLoader2

I'd like to use DataPipes for some NLP problems for similar reasons, and have some prototypes. I'd like to get confirmation, but from what I can tell it seems like it may only be TorchData that has paused development, whereas the DataPipes API is already part of PyTorch core.

For my use-cases I don't need DataLoader2 or readingservice/adapter. I think there are other ways to solve the problems addressed by those additional APIs -- I think there are ways to do it with just DataPipes that would address the concerns @sehoffmann raises around shuffling/sharding being "tightly integrated into the torchdata core, and not easily accessible from user code."

I also wonder whether we actually need a separate API focused on composability of datasets. The original dataset API could be used with composition too, and I'm not sure exactly what challenges we'd face in doing so. I know we wouldn't have the functional helper functions but that seems minor, and not sure what else we'd be missing.
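
As a rough illustration of composing map-style datasets, here is a minimal sketch. To stay dependency-free it does not subclass `torch.utils.data.Dataset`; it only follows the `__len__`/`__getitem__` protocol that map-style datasets use. The class names `MappedDataset` and `FilteredDataset` are hypothetical.

```python
from typing import Any, Callable, List, Sequence


class MappedDataset:
    """Wraps a map-style dataset and applies `fn` lazily on each lookup."""

    def __init__(self, base: Sequence[Any], fn: Callable[[Any], Any]) -> None:
        self.base = base
        self.fn = fn

    def __len__(self) -> int:
        return len(self.base)

    def __getitem__(self, index: int) -> Any:
        return self.fn(self.base[index])


class FilteredDataset:
    """Wraps a map-style dataset and keeps only the items accepted by `predicate`.

    The kept indices are computed eagerly, which requires one full pass over `base`.
    """

    def __init__(self, base: Sequence[Any], predicate: Callable[[Any], bool]) -> None:
        self.base = base
        self.indices: List[int] = [i for i in range(len(base)) if predicate(base[i])]

    def __len__(self) -> int:
        return len(self.indices)

    def __getitem__(self, index: int) -> Any:
        return self.base[self.indices[index]]


# Usage: keep the even numbers, then scale them.
dataset = MappedDataset(FilteredDataset(list(range(10)), lambda x: x % 2 == 0), lambda x: x * 10)
values = [dataset[i] for i in range(len(dataset))]
```

The eager pass in `FilteredDataset` hints at one challenge with this approach: operations that change the number of items are awkward for map-style datasets, whereas they are natural for iterators.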

@bryant1410

bryant1410 commented Jul 28, 2023

> however we've seen some limitations in a few use-cases which indicate we may need to tweak them a bit

@laurencer can you elaborate on which are those use cases?

@BarclayII

BarclayII commented Aug 3, 2023

@npuichigo I checked pypeln as well. It seems that the user needs to specify how to organize the queues at a low level (e.g. multiprocessing, multithreading, asyncio, etc.). Normally our UX shouldn't involve such a low-level specification unless the developers want to implement their own pipeline scheduling.

> The original dataset API could be used with composition too, and I'm not sure exactly what challenges we'd face in doing so. I know we wouldn't have the functional helper functions but that seems minor, and not sure what else we'd be missing.

@nairbv Other than the functional helper functions, I find the in-place editing of DataPipe (namely the torchdata.dataloader2.graph namespace) useful. For instance, with a single-process DataPipe, I can have a DataLoader that changes the DataPipe for multi-processing, and the process will be transparent to users. We also intend to apply the same idea to coordinate graph sampling, feature prefetching, and CPU-GPU transfer. https://github.com/dmlc/dgl/blob/master/python/dgl/graphbolt/dataloader.py#L58 shows an example.

Happy to discuss further.

@hhoeflin

As for a data pipelining solution, it would be nice if this could be developed without a dependency on a deep learning framework (torch, tensorflow etc).

@vincenttran-msft

vincenttran-msft commented Sep 5, 2023

Thanks for the update @laurencer Laurence.

This is unfortunate news to hear as we over here at Microsoft's Developer Experience team have seen a lot of interest in cloud computing and seamless integration between Azure Storage and PyTorch from our customers. Our summer intern's project was building out a custom FileLoader and FileLister DataPipe that allowed easily interacting with datasets that are stored on Azure Storage, and so the news of a halt in development and an uncertain future of the torchdata repo makes for a difficult situation in regard to planning our future in terms of continuing development of integration with PyTorch.

With that being said, I am hopeful that this is a necessary step back in order to re-strategize and refine the future roadmap to ultimately end up with a better user experience for all. As the field of AI/ML continues to develop in the near future, it is really a matter of when (and not if) we will revisit building direct support for Azure Storage with PyTorch workflows, and so we will likely reach out sometime when the future of torchdata and dataloading is clearer. However, there are still some questions (and many were raised by previous posters above) that I would like to echo which would greatly help us developers in the interim of no new releases:

  1. What are the expectations of compatibility / should we even consider continuing to build out torchdata support, or is that something that is becoming obsolete in the near future?
  2. Any ETAs for when we will be updated on the status of the future of torchdata? While I understand this may be difficult, it would be great if any information could be shared in regard to timing to keep any developers from continuing to build out their infrastructure with torchdata, or if they should begin the migration process etc.

Thanks in advance, and please feel free to reach out if necessary!

@rgtjf

rgtjf commented Jul 4, 2024

StatefulDataLoader may be a prerequisite for stateful data pipes.
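
For context on what "stateful" means here, the following is a minimal sketch of an iterator that can be checkpointed and resumed mid-epoch. The class `StatefulIterator` is hypothetical and is not the StatefulDataLoader implementation; only the `state_dict`/`load_state_dict` method names follow the usual PyTorch convention.

```python
from typing import Any, Dict, Sequence


class StatefulIterator:
    """Iterates over a sequence and can save/restore its position."""

    def __init__(self, data: Sequence[Any]) -> None:
        self.data = data
        self.position = 0

    def __iter__(self) -> "StatefulIterator":
        return self

    def __next__(self) -> Any:
        if self.position >= len(self.data):
            raise StopIteration
        item = self.data[self.position]
        self.position += 1
        return item

    def state_dict(self) -> Dict[str, int]:
        """Return everything needed to resume from the current position."""
        return {"position": self.position}

    def load_state_dict(self, state: Dict[str, int]) -> None:
        """Restore a position previously produced by `state_dict`."""
        self.position = state["position"]


# Usage: consume two items, checkpoint, then resume in a fresh iterator.
first = StatefulIterator(list("abcdef"))
consumed = [next(first), next(first)]
checkpoint = first.state_dict()

resumed = StatefulIterator(list("abcdef"))
resumed.load_state_dict(checkpoint)
remaining = list(resumed)
```

A real pipeline would also have to capture shuffle RNG state and any buffered items in each stage, which is presumably why the loader-level piece would need to come first.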

@andrewkho
Contributor

Hi everyone, our roadmap for the second half of 2024 is available publicly alongside many other PyTorch efforts and I wanted to share this here as well in case you haven't seen it yet. The KRs are subject to change as we learn more, and there is still more work to do to get clarity on the overall effort. If you have feedback or thoughts, we're very interested to hear from you folks :)

Also another update, we'll be at PyTorch Conference in San Francisco this September, if anyone is attending we'd be happy to meet up and talk about anything!

@josiahls

josiahls commented Jul 10, 2024

@andrewkho Thanks for the update! So if I am understanding correctly:

The replacement for torchdata datapipes themselves might look something like:

  • huggingface datasets (which are actually built on Apache Arrow, from what it looks like?) and others such as WebDatasets, Mosaic, Ray, Nvidia DALI.
  • torchdata will attempt to use learnings from what huggingface et al. have implemented and move the basic functionality into a core API.
  • The problem with the original datapipe API was that it did not get buy-in from the larger OSS space.

Using huggingface API as an example (there could be other examples to pull from though):

I have only worked a little with huggingface datasets, so the following comes from going through their documentation and my memory of working with them. Maybe someone more knowledgeable can enlighten me:

Pros:

  • Huggingface dataset API is intended as a scaffold for custom pipelines.
  • Pretty good docs.
  • Core methods / functions / processes being just methods of a common object are easier to step through when debugging compared to datapipes.
  • Custom functions such as transforms are just input / output callable.

Cons (could be my ignorance of what hg datasets can do):

  • Custom core behaviors for shuffling / sharding / dataloading / mapping can only be done via inheritance.
    • Personal taste: inheritance is hard to reason about when objects have many abstractions.
    • Was a fan of the datapipe API being very horizontal. The question of "what does this pipe do?" could be answered (generally) on a single page and a single object, as opposed to jumping through several hierarchies or scrolling through 1000s of lines of code.
  • Transforms with internal state / params would need to be separate callable objects that are instantiated separately and then passed to the .map. This is as opposed to just using a custom datapipe.
    • Additionally transforms that need multiple inputs or generate multiple outputs would need a dataset subclass
  • In general "scaffolds" can be limiting, and trying to customize them isn't fun. Datapipes in the end were just iterators, so doing "weird" stuff was more straightforward.
  • Played around with making an RL framework in torchdata which was doable since the API itself was pretty horizontal. Trying to make more hierarchical pipeline frameworks work for RL was difficult.
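
To make the point about transforms with internal state concrete, here is a minimal sketch of a callable object that is instantiated separately and then passed to a map call. The built-in `map` stands in for a dataset's `.map` method, and the class `RunningMeanCenter` is hypothetical. With multiple worker processes each copy would keep its own state, which this sketch does not address.

```python
from typing import List


class RunningMeanCenter:
    """Stateful transform: subtracts the running mean of all values seen so far."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def __call__(self, value: float) -> float:
        # Incremental mean update, then center the current value.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return value - self.mean


# Usage: the transform object carries its state across calls.
center = RunningMeanCenter()
centered: List[float] = list(map(center, [2.0, 4.0, 6.0]))
```

After the three calls, `center.count` is 3 and `center.mean` is 4.0, so the state lives on the callable rather than on the dataset object.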

These cons aren't a condemnation of the decision to look into HuggingFace Datasets, WebDatasets, Mosaic, Ray, Nvidia DALI though, since they are popular frameworks for a reason. I'm looking forward to seeing how torchdata leverages their work and makes it more flexible. As for me personally, I'll also look at HG and try out their API (for transforms / multimodel / RL training) to see if the cons above aren't really cons at all.

edited: Make it more clear in my comment that torchdata plans to generally look at the space of dataloading / datasets, not just huggingface specifically.

@andrewkho
Contributor

andrewkho commented Jul 10, 2024

@josiahls we definitely do not want to force users to take dependencies on HuggingFace, which is more NLP focused. We do want to ensure we're compatible and working well together, since they depend on PyTorch and LLM fine-tuning is such an important use case today. However, the use cases we're supporting are broader.

@npuichigo

@andrewkho Please also take litdata into consideration

@mfbalin

mfbalin commented Jul 29, 2024

@andrewkho Are the datapipes in torch.utils.data going to be removed as well?

@frozenbugs

Are you going to remove the datapipe related code from torch.utils.data? If so, can you announce the plan in advance and give us several months to take action?

@KBlansit

KBlansit commented Jul 31, 2024

I'd like to chime in and say that it is very unfortunate that the official online documentation makes almost no mention of this planned deprecation. In fact, I was only made aware of this issue this morning when I updated my version of torchdata.

Adding a banner warning of deprecation and linking to this github issue would better communicate to ML developers the current status of this project. Keeping the official documents up to date with the current project status is important, as otherwise developers may be unaware and waste crucial time and effort on projects that will not be supported in the future.

@keunwoochoi

keunwoochoi commented Aug 4, 2024

as a user of torchdata, i was very happy to see the resurrection of the project.

i have a question about the development plan. from the README, i see:

torchdata repo to be an iterative enhancement of torch.utils.data.DataLoader

this is somewhat surprising. although the current Datapipes seem to have various issues underneath the shell, so far, Datapipes ARE torchdata. the current API reference:

API Reference:

Stateful DataLoader
Iterable-style DataPipes
Map-style DataPipes
Utility Functions
DataLoader2
ReadingService

and this is it; i.e., until ver 0.7, torchdata == the datapipes and other necessary utilities (dataloader2 and reading service).

that's why it is surprising for me, that while the development of torchdata has re-started, it is being done in a way it discards everything it had.

so, can i ask for a bit more details about what the new direction (enhancement of torch.utils.data.DataLoader) is? or am i missing something here?

thanks.

@keunwoochoi

to add more context: i love datapipes and want them improved, not deleted.


although alternatives seem to exist, none of these (HuggingFace Datasets, WebDatasets, Mosaic, Ray, Nvidia DALI) are a general-purpose, scalable data loading solution that is native to torch i.e., expected to be compatible with other libraries such as Accelerator, Lightning, etc. E.g., MosaicML dataset is not compatible with Lightning's multi-node training. in short, none of them enabled us to load multimodal data from remote (s3 buckets) with multi-node training. datapipes are the only working solution. even if the alternatives are somehow compatible in some cases, they are not expected to be. they can always become incompatible with something they're not responsible for.

why then the low adoption of datapipes? i believe, perhaps for these reasons, based on my experience.

  • lack of documentation and examples.
  • difficulty in debugging and profiling.
    • when it's slow, it's really hard to figure out why
    • with some unexpected timeout, i had to monitor & re-start the training quite often.
  • lack of automatic optimization such as tf.data.AUTOTUNE (yes.. tensorflow..)

even so, i still believe

  • the API is great! functional datapipes for loading and .map() with a distributed setup.. what else?
  • 100% that this should be done on torch, natively
  • improving it would lead to much happier and wider adoption.

@andrewkho
Contributor

> I'd like to chime in and say that it is very unfortunate that the official online documentation makes almost no mention of this planned deprecation. In fact, I was only made aware of this issue this morning when I updated my version of torchdata.
>
> Adding a banner warning of deprecation and linking to this github issue would better communicate to ML developers the current status of this project. Keeping the official documents up to date with the current project status is important, as otherwise developers may be unaware and waste crucial time and effort on projects that will not be supported in the future.

Thanks @KBlansit for calling out the gap in notice in that document, we'll be updating that this week. cc @gokulavasan

@andrewkho
Contributor

> as a user of torchdata, i was very happy to see the resurrection of the project.
>
> i have a question about the development plan. from the README, i see:
>
> torchdata repo to be an iterative enhancement of torch.utils.data.DataLoader
>
> this is somewhat surprising. although the current Datapipes seem to have various issues underneath the shell, so far, Datapipes ARE torchdata. the current API reference:
>
> API Reference:
> Stateful DataLoader
> Iterable-style DataPipes
> Map-style DataPipes
> Utility Functions
> DataLoader2
> ReadingService
>
> and this is it; i.e., until ver 0.7, torchdata == the datapipes and other necessary utilities (dataloader2 and reading service).
>
> that's why it is surprising for me, that while the development of torchdata has re-started, it is being done in a way it discards everything it had.
>
> so, can i ask for a bit more details about what the new direction (enhancement of torch.utils.data.DataLoader) is? or am i missing something here?
>
> thanks.

Hi @keunwoochoi , thanks for the feedback! Unfortunately we aren't quite ready to share any more details about the new direction quite yet (we are actively iterating on this), but we do really appreciate the feedback and will update this thread as soon as we can.

@ClementMaliet

ClementMaliet commented Aug 9, 2024

I side with @keunwoochoi in feeling quite enthusiastic when I discovered that torchdata was finally getting some updates only to discover the planned disappearance of datapipes.

While not perfect (and I share the grievances around how difficult debugging and profiling can get with datapipes), I did not find an equivalent composable solution with near unlimited flexibility (notably through the zip/unzip and map operators). This allows one to prototype and test new ideas while virtually ignoring the implementation side and instead focus on actual data science work, which is why I am getting paid in the first place.

For me, DataPipes are an asset of the PyTorch ecosystem, not a historical glitch that should be discreetly covered up.

For what it's worth, I do believe that a significant reason for the lack of adoption was the relatively confidential nature of the API. Although it enjoyed some publicity in release messages, it kept getting advertised as "beta", thus discouraging users from investing in learning the solution. Paired with the scarce documentation and examples mentioned before, this made moving an existing code base from the monolithic Dataset API to DataPipe quite the endeavor, but a rewarding one if you managed to finally get there.

With that in mind I can only press you to improve and double down on DataPipes rather than removing them entirely.

@BarclayII

BarclayII commented Aug 9, 2024

I share the same feeling as @keunwoochoi and @ClementMaliet .

DataPipe's composability is why the GraphBolt subpackage from DGL (https://github.com/dmlc/dgl) chose it in the first place. Scalable training with graph neural networks can involve very diverse graph storage, feature storage, algorithms, and computing devices. DataPipe's composability allows users and developers to easily adapt a graph sampling algorithm or deploy optimizations by focusing on the most important part only, whether that be one particular device, one algorithm, or optimizing the entire DataPipe graph execution with different scheduling and parallelism strategies.

Personally, discarding DataPipe altogether does not really make sense to me. If the PyTorch team is short of bandwidth to continue developing it, could we ask to at least keep it as it is rather than throwing it away altogether?

cc @frozenbugs .

@npuichigo

npuichigo commented Aug 9, 2024

Same here as @keunwoochoi, @ClementMaliet and @BarclayII.

One more thing. DataPipe as a DAG maps well to a structured config, and gives users full flexibility to define data loading logic without having to write boilerplate code.

We heavily depend on DataPipe and hydra to dynamically construct data pipelines for end users. For example, if users want to construct a balanced datapipe for training, it's simple to use a hydra yaml config like:

dataset:
  _target_: torchdata.datapipes.iter.Batcher
  batch_size: 32
  datapipe:
    _target_: torchdata.datapipes.iter.Multiplexer
    _args_:
    - name: animal
      _target_: CustomSampleMultiplexer
      _args_:
      - weight: 0.7
        pipe:
          _target_: CustomImageFolder
          path: /path/to/dog
      - weight: 0.3
        pipe:
          _target_: CustomImageFolder
          path: /path/to/cat
    - name: plant
      _target_: CustomSampleMultiplexer
      _args_:
      - weight: 0.5
        pipe:
          _target_: CustomImageFolder
          path: /path/to/fern
      - weight: 0.5
        pipe:
          _target_: CustomImageFolder
          path: /path/to/moss
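
The config above refers to a `CustomSampleMultiplexer` whose implementation is not shown in the thread. As a rough guess at what such a weighted multiplexer might do, here is a minimal sketch on plain Python iterables; the function name `weighted_multiplex`, its signature, and the `seed` parameter are all hypothetical and are not the commenter's code.

```python
import random
from typing import Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def weighted_multiplex(
    pipes: Sequence[Iterable[T]], weights: Sequence[float], seed: int = 0
) -> Iterator[T]:
    """Draw the next sample from one of `pipes`, chosen at random in proportion to `weights`.

    Exhausted pipes are dropped; iteration ends when every pipe is exhausted.
    Weights are assumed to be positive.
    """
    rng = random.Random(seed)
    iterators: List[Iterator[T]] = [iter(p) for p in pipes]
    active_weights: List[float] = list(weights)
    while iterators:
        idx = rng.choices(range(len(iterators)), weights=active_weights, k=1)[0]
        try:
            item = next(iterators[idx])
        except StopIteration:
            # This pipe is done: remove it and pick again.
            del iterators[idx]
            del active_weights[idx]
            continue
        yield item


# Usage: roughly 70% dog / 30% cat while both sources still have samples.
dogs = ["dog"] * 70
cats = ["cat"] * 30
mixed = list(weighted_multiplex([dogs, cats], weights=[0.7, 0.3], seed=0))
```

In the hydra setup described above, a class with this behaviour would be the `_target_`, and each `weight`/`pipe` pair would supply one entry of `weights` and `pipes`.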

@andrewkho
Contributor

Thanks everyone for the comments and feedback on datapipes, we're planning to have something that will be composable (eg with Hydra configs) and will provide migration guides for those coming from datapipes. I'll share more details for feedback as soon as I am able.

@andrewkho
Contributor

Hi everyone, I've just posted this issue with details on what we've been thinking about as the successor to DataPipes. Please leave feedback specific to the RFC on that issue's comments: #1318. Also a few of us will be at PyTorch Conference this week, hoping to connect with as many of you as we can!
