-
Hi all, I’m looking to train a PyTorch model in a distributed setting using Daft. However, I’ve run into an issue with `DaftTorchDataset`: it calls `df.collect()`, which causes the training job to run out of memory because of the size of the dataset. Would it be possible to implement a “lazy” version of this `torch.utils.data.Dataset` class that doesn’t load the entire dataset into memory? If so, what approach would you recommend? Cheers!
-
Hi! Have you considered using the iterator version of this? That should give you lazy materialization.
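To make the eager vs. lazy distinction concrete, here is a stdlib-only sketch (no Daft or PyTorch required). `df.collect()` corresponds to the eager pattern, while an iterator-based dataset follows the lazy one; the row contents here are made up for illustration:

```python
def eager_rows(n):
    # Eager: the whole dataset is allocated up front,
    # analogous to calling df.collect().
    return [{"id": i, "value": i * i} for i in range(n)]

def lazy_rows(n):
    # Lazy: rows are produced one at a time as the consumer
    # asks for them, so peak memory stays constant in n.
    for i in range(n):
        yield {"id": i, "value": i * i}

eager = eager_rows(1_000)   # all 1,000 dicts live in memory now
lazy = lazy_rows(1_000)     # nothing materialized yet
first = next(lazy)          # only now is the first row built
```

An iterable-style torch dataset wraps the lazy pattern: the training loop pulls batches as it needs them instead of holding the full table in memory.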
For instance, if your data is a CSV and you need to read rows in the order 100, 1, 2, 3, ..., Daft has to scan the file until it reaches row 100 before it can do any work; in the worst case, it may have to read the entire file just to find the last row. Certain storage formats can alleviate this, and in practice storing the data on an NVMe SSD and memory-mapping it helps quite a bit, but the general case of efficient random sampling is fairly non-trivial.
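A stdlib-only illustration of why row-oriented text formats make random access expensive: there is no index, so reaching row *k* means scanning past rows 0..k-1 (the CSV text and helper here are made up for the example):

```python
import csv
import io

def read_row(csv_text, k):
    # Row-oriented text has no index: to reach row k we must
    # parse and discard every row before it, so a single random
    # lookup costs O(k) reads.
    reader = csv.reader(io.StringIO(csv_text))
    for i, row in enumerate(reader):
        if i == k:
            return row
    raise IndexError(k)
```

Columnar formats with row-group metadata (e.g. Parquet) can skip ahead instead of scanning, which is part of why they behave better here.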
What we often recommend is storing "cheap" metadata (labels, paths, URLs) so that it can be loaded into memory efficiently. Then you can use the URL download expression to lazily materialize expensive data such as images.
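Roughly, that pattern looks like this: a stdlib-only sketch where `fetch_bytes` is a hypothetical stand-in for Daft's URL download expression, and the metadata rows are made up:

```python
# Cheap metadata: small enough to hold fully in memory.
metadata = [
    {"id": 0, "label": "cat", "url": "s3://bucket/img0.jpg"},
    {"id": 1, "label": "dog", "url": "s3://bucket/img1.jpg"},
]

def fetch_bytes(url):
    # Hypothetical loader: in practice this would hit S3/HTTP;
    # here it just fabricates a payload for the sketch.
    return f"<bytes of {url}>".encode()

def lazy_samples(rows):
    # The expensive column is materialized per row, inside the
    # loop, so only one payload is resident at a time.
    for row in rows:
        yield {**row, "bytes": fetch_bytes(row["url"])}
```

The training loop then iterates over `lazy_samples(metadata)`: memory holds the full metadata table plus one (or one batch's worth of) downloaded payloads, never the whole image set.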
Hope this helps!