-
Hi all, I’m looking to train a PyTorch model in a distributed setting using Daft. However, I’ve run into an issue with `DaftTorchDataset`: it calls `df.collect()`, which causes the training job to run out of memory because of the size of the dataset. Would it be possible to implement a “lazy” version of this `torch.utils.data.Dataset` class that doesn’t load the entire dataset into memory? If so, what approach would you recommend? Cheers!
-
Hi! Have you considered using the iterator version of this? That should give you lazy materialization.
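To make the eager vs. lazy distinction concrete, here is a stdlib-only sketch (no Daft or PyTorch required). `df.collect()` corresponds to the eager pattern, while an iterator-based dataset follows the lazy one; the row contents here are made up for illustration:

```python
def eager_rows(n):
    # Eager: the whole dataset is allocated up front,
    # analogous to calling df.collect().
    return [{"id": i, "value": i * i} for i in range(n)]

def lazy_rows(n):
    # Lazy: rows are produced one at a time as the consumer
    # asks for them, so peak memory stays constant in n.
    for i in range(n):
        yield {"id": i, "value": i * i}

eager = eager_rows(1_000)   # all 1,000 dicts live in memory now
lazy = lazy_rows(1_000)     # nothing materialized yet
first = next(lazy)          # only now is the first row built
```

An iterable-style torch dataset wraps the lazy pattern: the training loop pulls batches as it needs them instead of holding the full table in memory.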
For instance, if your data is a CSV and you need to read rows in the order 100, 1, 2, 3, ..., Daft has to scan the file until it reaches row 100 before it can do any work; in the worst case, it may have to read the entire file just to find the last row. Certain storage formats can alleviate this, and in practice storing the data on an NVMe SSD and memory-mapping it helps quite a bit, but the general case of efficient random sampling is fairly non-trivial.
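A stdlib-only illustration of why row-oriented text formats make random access expensive: there is no index, so reaching row *k* means scanning past rows 0..k-1 (the CSV text and helper here are made up for the example):

```python
import csv
import io

def read_row(csv_text, k):
    # Row-oriented text has no index: to reach row k we must
    # parse and discard every row before it, so a single random
    # lookup costs O(k) reads.
    reader = csv.reader(io.StringIO(csv_text))
    for i, row in enumerate(reader):
        if i == k:
            return row
    raise IndexError(k)
```

Columnar formats with row-group metadata (e.g. Parquet) can skip ahead instead of scanning, which is part of why they behave better here.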
What we often recommend is storing "cheap" metadata (labels, paths, URLs) so that it can be loaded into memory efficiently. Then you can use the URL download expression to lazily materialize expensive data such as images.
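Roughly, that pattern looks like this: a stdlib-only sketch where `fetch_bytes` is a hypothetical stand-in for Daft's URL download expression, and the metadata rows are made up:

```python
# Cheap metadata: small enough to hold fully in memory.
metadata = [
    {"id": 0, "label": "cat", "url": "s3://bucket/img0.jpg"},
    {"id": 1, "label": "dog", "url": "s3://bucket/img1.jpg"},
]

def fetch_bytes(url):
    # Hypothetical loader: in practice this would hit S3/HTTP;
    # here it just fabricates a payload for the sketch.
    return f"<bytes of {url}>".encode()

def lazy_samples(rows):
    # The expensive column is materialized per row, inside the
    # loop, so only one payload is resident at a time.
    for row in rows:
        yield {**row, "bytes": fetch_bytes(row["url"])}
```

The training loop then iterates over `lazy_samples(metadata)`: memory holds the full metadata table plus one (or one batch's worth of) downloaded payloads, never the whole image set.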
Hope this helps!