DaftTorchDataset lazy loading #3115

Answered by jaychia
mplemay asked this question in Q&A
Oct 24, 2024 · 1 comment · 7 replies

For instance, if your data is a CSV and a shuffled sampler asks you to read rows in the order 100, 1, 2, 3, ...

Daft has to scan the file until it reaches row 100 before it can do any work; in the worst case, it may have to read almost all of the data to find the (n-1)th row. Certain storage formats can alleviate this, and in practice storing the data on an NVMe SSD and memory-mapping it helps quite a bit, but the general case of efficient random sampling is fairly non-trivial.
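
To make that concrete, here is a minimal sketch in plain Python (not Daft internals) of why positional access into a CSV is linear in the row index: the rows are variable-length and there is no offset index, so every preceding row must be scanned first.

```python
import csv

def read_row(path: str, index: int) -> list[str]:
    """Read a single row from a CSV by position.

    CSV rows have no offset index, so the only way to reach row
    `index` is to scan every row before it: O(n) per lookup.
    """
    with open(path, newline="") as f:
        for i, row in enumerate(csv.reader(f)):
            if i == index:
                return row
    raise IndexError(index)
```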

What we often recommend is storing "cheap" metadata (IDs, labels, URLs) separately so that it can all be loaded into memory up front. You can then use the URL download expression to lazily materialize expensive data such as images.
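
A rough sketch of that pattern is below; the file name and column names are placeholders, while `daft.read_csv`, `daft.col`, and the `.url.download()` expression are real Daft APIs.

```python
import daft

# Cheap metadata (ids, labels, URLs): small enough to hold in memory.
# "metadata.csv" and the column names are illustrative.
df = daft.read_csv("metadata.csv")

# Declare the expensive column lazily; nothing is downloaded yet.
df = df.with_column("image_bytes", daft.col("image_url").url.download())

# Bytes are only fetched when rows are actually materialized,
# e.g. while iterating during training.
for row in df.iter_rows():
    ...
```

From there, recent Daft versions can hand the same lazy stream to PyTorch via `df.to_torch_iter_dataset()`, so downloads happen on the fly as the training loop consumes batches.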

Hope this helps!
