Better caching #4591
There is some initial work along these lines here (used in a dataset reader here). I think we definitely want to figure out how to include this in our main repo. @epwalsh, do we have an issue open already for this? If not, I'll keep this open and add it to either the 2.0 or the performance milestone (maybe 2.0?).
@matt-gardner I don't think we had a separate issue tracking caching until now, so adding this to 2.0 sounds good to me.
Should I create a .md page or a Google Doc?
I am leaning towards (2) since we are reading the instances in memory sequentially, which might be better than a key:value solution like LMDB.
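A minimal sketch of the sequential option being suggested here (the use of plain `pickle` and a single cache file are my assumptions, not anything decided in this thread): instances are appended to one file and streamed back in write order, with no key lookup as an LMDB-style store would require.

```python
import pickle
from typing import Any, Iterable, Iterator

def write_cache(instances: Iterable[Any], path: str) -> None:
    with open(path, "wb") as f:
        for instance in instances:
            pickle.dump(instance, f)  # records are simply concatenated

def read_cache(path: str) -> Iterator[Any]:
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)  # yields records back in write order
            except EOFError:
                return
```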
Hey @OhadRubin, I've already started a design document, which I'll link to on this thread once it's ready to be seen.
There is a good discussion about this here. I'm working on an API design that would be agnostic to the backend and the serialization/deserialization method we use, so we can decide on that later.
So it seems there are different use cases for caching. Maybe I can write some code that extends `DatasetReader`?
@OhadRubin that would be great if you got started on that! Right now I'm just focusing on the overall API and how that would integrate into our data pipeline, so I don't think that collides with what you want to work on. Once I'm finished with the API skeleton we should be able to plug that in.
So I should inherit from `DatasetReader` and override `_instances_to_cache_file`, correct?
@OhadRubin actually no, we're working off of the
Ok, so I'll assume I am iterating over objects with a `serialize` method, is that ok?
Yes.
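A hypothetical sketch of the kind of interface being discussed, objects that know how to serialize themselves to bytes. The names here (`Serializable`, `serialize`, `cache_all`) are illustrative assumptions, not the final AllenNLP API.

```python
from typing import Iterable, Protocol

class Serializable(Protocol):
    def serialize(self) -> bytes: ...

def cache_all(objects: Iterable[Serializable], path: str) -> None:
    with open(path, "wb") as f:
        for obj in objects:
            data = obj.serialize()
            # length-prefix each record so it can be read back exactly
            f.write(len(data).to_bytes(8, "little"))
            f.write(data)
```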
Hey @epwalsh, after looking a bit more into
Hey @OhadRubin, after talking with the team a little more yesterday, we decided it would probably be more beneficial to cache tensor dicts instead of actual `Instance`s. So in most cases, this is a dictionary mapping field names to tensors.
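A minimal sketch of what that could look like, assuming plain `torch.save`/`torch.load` and one file per batch; the dict layout here is illustrative, not the exact AllenNLP tensor-dict format.

```python
import torch

tensor_dict = {
    "tokens": torch.tensor([[2, 5, 9, 0], [4, 7, 0, 0]]),  # padded token ids
    "label": torch.tensor([1, 0]),
}

torch.save(tensor_dict, "batch0.pt")  # cache to disk
restored = torch.load("batch0.pt")    # loads straight back into tensors
assert torch.equal(restored["label"], tensor_dict["label"])
```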
There is an overhead of around 760 bytes for saving a tensor this way.
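The original measurement code and output were not preserved, but a minimal way to reproduce this kind of measurement (my sketch, not the original comment's code; the exact overhead varies with the torch version) is to serialize a tiny tensor into an in-memory buffer and compare sizes:

```python
import io
import torch

tensor = torch.zeros(1, dtype=torch.float32)  # 4 bytes of raw data

buffer = io.BytesIO()
torch.save(tensor, buffer)
serialized_size = buffer.getbuffer().nbytes

raw_size = tensor.element_size() * tensor.nelement()
print(f"raw: {raw_size} bytes, serialized: {serialized_size} bytes")
print(f"overhead: {serialized_size - raw_size} bytes")
```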
API design document (work in progress): https://docs.google.com/document/d/1EBAcPF19NM7bYuwDKmHN4Ws361p0Eo4vjIpSA67l1rQ/edit?usp=sharing
I would consider using
Since reading is so much faster now, the case for caching is pretty weak. We're abandoning this issue for now.
Something along the lines of what fairseq has. For example, in `MMapIndexedDataset`, they calculate the sizes for each field and then read exactly that size into an array.
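A rough sketch of that idea (not fairseq's actual implementation; the `int32` dtype, flat single-buffer layout, and class name are my assumptions): store the size of each record, derive offsets, and read exactly that many elements from a memory-mapped buffer.

```python
import numpy as np

class FlatIndexedDataset:
    def __init__(self, data_path: str, sizes: np.ndarray) -> None:
        self.sizes = sizes
        # offsets[i] is where record i starts in the flat buffer
        self.offsets = np.concatenate(([0], np.cumsum(sizes)[:-1]))
        self.data = np.memmap(data_path, dtype=np.int32, mode="r")

    def __getitem__(self, i: int) -> np.ndarray:
        start = self.offsets[i]
        # read exactly sizes[i] elements; nothing else is touched on disk
        return np.asarray(self.data[start : start + self.sizes[i]])

    def __len__(self) -> int:
        return len(self.sizes)
```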
Here, @matt-gardner mentions that this is the type of caching you guys might implement: a database that just stores tensors directly.