This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Better caching #4591

Closed
OhadRubin opened this issue Aug 24, 2020 · 17 comments

@OhadRubin
Contributor

OhadRubin commented Aug 24, 2020

Something along the lines of what fairseq has:
Here
For example, in MMapIndexedDataset, they calculate the size of each field and then read exactly that many bytes into an array.
Here, @matt-gardner mentions that this is the type of caching that you might implement: a database that just stores tensors directly.

@matt-gardner
Contributor

There is some initial work along these lines here (used in a dataset reader here). I think we definitely want to figure out how to include this in our main Instance creation pipeline, but that'll come a little bit later. If you're interested in helping to design / work on this, contributions welcome. This is a big enough piece of work that some kind of design document (or detailed github description) should come before any code is written.

@epwalsh, do we have an issue open already for this? If not, I'll keep this open and add it to either the 2.0 or the performance milestone (maybe 2.0?).

@matt-gardner added this to the 2.0 milestone Aug 24, 2020
@epwalsh
Member

epwalsh commented Aug 24, 2020

@matt-gardner I don't think we had a separate issue tracking caching until now, so adding this to 2.0 sounds good to me.

@OhadRubin
Contributor Author

OhadRubin commented Aug 24, 2020

Should I create a .md page or a Google Doc?
Anyway, most of the logic will use the to_tensor method of each field, right?
For fields that don't have to_tensor, like MetadataField, we could use jsonpickle and write that into the file.
Regarding the way we serialize the tensors, we can either:

  1. use TensorCache, which uses lmdb.
  2. use something similar to MMapIndexedDataset, where each tensor is converted into a numpy array and written to / read from the file sequentially.

I am leaning towards (2), since we read the instances into memory sequentially, which might be a better fit than a key-value solution like lmdb (see the sketch below).
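
For concreteness, a minimal sketch of option (2), assuming a made-up SequentialTensorCache class (not an existing AllenNLP or fairseq API): arrays are appended to one flat file, and a side index of (dtype, shape, offset) lets each one be memory-mapped back later.

import numpy as np


class SequentialTensorCache:
    """Write numpy arrays back-to-back into one flat file and keep a
    side index of (dtype, shape, offset) so each can be memory-mapped later."""

    def __init__(self, path: str):
        self.path = path
        self.index = []  # one (dtype string, shape, byte offset) entry per array

    def write(self, arrays) -> None:
        offset = 0
        with open(self.path, "wb") as f:
            for arr in arrays:
                self.index.append((str(arr.dtype), arr.shape, offset))
                f.write(arr.tobytes(order="C"))
                offset += arr.nbytes

    def read(self, i: int) -> np.ndarray:
        dtype, shape, offset = self.index[i]
        count = int(np.prod(shape))
        # np.memmap touches only the pages we actually use, so sequential
        # reads stay cheap and nothing is loaded up front.
        flat = np.memmap(self.path, dtype=dtype, mode="r", offset=offset, shape=(count,))
        return flat.reshape(shape)


cache = SequentialTensorCache("/tmp/tensors.bin")
cache.write([np.arange(10, dtype=np.int64), np.ones((2, 3), dtype=np.float32)])
print(cache.read(1))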

@epwalsh
Member

epwalsh commented Aug 25, 2020

Hey @OhadRubin, I've already started a design document which I'll link to on this thread once it's ready to be seen.

@epwalsh
Member

epwalsh commented Aug 26, 2020

There is a good discussion here about lmdb: #4578 (comment).

I'm working on an API design that would be agnostic to the backend and the serialization/deserialization method we use, so we can decide on that later.

@OhadRubin
Contributor Author

So it seems there are different use cases for caching. Maybe I can write some code that turns TensorCache into a Registrable and implement an MMapIndexedDataset-style TensorCache.
@epwalsh, what do you think?

@OhadRubin reopened this Aug 26, 2020
@epwalsh
Member

epwalsh commented Aug 26, 2020

@OhadRubin that would be great if you got started on that!

Right now I'm just focusing on the overall API and how that would integrate into our data pipeline, so I don't think that collides with what you want to work on. Once I'm finished with the API skeleton, we should be able to plug in the TensorCache / MMapIndexedDataset / whatever we go with.

@OhadRubin
Contributor Author

So I should inherit from DatasetReader and override _instances_to_cache_file, correct?
I understand there are plans to move caching to DataLoader, but for now, that should be ok, right?

@epwalsh
Member

epwalsh commented Aug 26, 2020

@OhadRubin actually no, we're working off of the vision branch now, which will become AllenNLP 2.0. In that branch the caching mechanism in the DatasetReader has been removed. Maybe just start by implementing this in a standalone file?

@OhadRubin
Contributor Author

Ok, so I'll assume I am iterating over objects with a serialize method. Is that ok?

@epwalsh
Member

epwalsh commented Aug 26, 2020

Yes, but we can assume these objects are indexed Instances.

@OhadRubin
Contributor Author

OhadRubin commented Aug 27, 2020

Hey @epwalsh, after looking a bit more into MMapIndexedDataset, much of its benefit comes from knowing the structure and sizes of everything up front. Can I assume serialize returns a Dict[str, Union[numpy.array, str]]?

@epwalsh
Member

epwalsh commented Aug 27, 2020

Hey @OhadRubin, after talking with the team a little more yesterday, we decided it would probably be more beneficial to cache tensor dicts instead of actual Instance objects. In other words, we want to be caching the result of Instance.as_tensor_dict().

So in most cases this is a Dict[str, torch.Tensor], which would play nicely with the MMapIndexedDataset approach. However, there are some exceptions. In particular, when an Instance contains a MetadataField, Instance.as_tensor_dict() could contain pretty much anything.
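
For illustration, a minimal sketch of caching as_tensor_dict() output with torch.save, which pickles non-tensor values such as MetadataField contents; the helper functions and file path here are hypothetical, not the planned API.

# Illustration only: cache the output of Instance.as_tensor_dict().
# torch.save pickles non-tensor values (e.g. MetadataField contents) as-is,
# while tensor fields stay as torch.Tensor objects.
from typing import Any, Dict

import torch


def save_tensor_dict(tensor_dict: Dict[str, Any], path: str) -> None:
    torch.save(tensor_dict, path)


def load_tensor_dict(path: str) -> Dict[str, Any]:
    return torch.load(path)


# A typical entry: tensors for indexed fields, arbitrary objects for metadata.
example = {
    "tokens": torch.tensor([2, 15, 7, 3]),
    "label": torch.tensor(1),
    "metadata": {"source": "example.txt"},
}
save_tensor_dict(example, "/tmp/instance0.pt")
print(load_tensor_dict("/tmp/instance0.pt")["tokens"])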

@OhadRubin
Contributor Author

OhadRubin commented Aug 27, 2020

There is an overhead of around 760 bytes for saving a tensor with torch.save, compared to numpy.tobytes.
So I'll use numpy.

import io
import pickle

import torch

# Compare the serialized size of torch.save against the raw numpy bytes
# of the same tensor, for a few dtypes and tensor lengths.
for dtype in [torch.bool, torch.float16, torch.float32, torch.int16, torch.int32, torch.int64]:
    print(dtype)
    for size in range(2, 5):
        in_list = list(range(10 ** size))
        torch_tensor = torch.tensor(in_list, dtype=dtype).detach()
        # Serialize with torch.save into an in-memory buffer.
        buffer = io.BytesIO()
        torch.save(torch_tensor, buffer, pickle_protocol=pickle.HIGHEST_PROTOCOL)
        torch_res = buffer.getbuffer().tobytes()
        # Raw bytes of the underlying numpy array (no metadata, no pickle framing).
        np_res = torch_tensor.numpy().tobytes(order="C")
        print(f"size: {10 ** size} - torch_res: {len(torch_res)}, np_res: {len(np_res)}")

Output:

torch.bool
size: 100 - torch_res: 824, np_res: 100
size: 1000 - torch_res: 1720, np_res: 1000
size: 10000 - torch_res: 10744, np_res: 10000
torch.float16
size: 100 - torch_res: 952, np_res: 200
size: 1000 - torch_res: 2744, np_res: 2000
size: 10000 - torch_res: 20728, np_res: 20000
torch.float32
size: 100 - torch_res: 1144, np_res: 400
size: 1000 - torch_res: 4728, np_res: 4000
size: 10000 - torch_res: 40760, np_res: 40000
torch.int16
size: 100 - torch_res: 952, np_res: 200
size: 1000 - torch_res: 2744, np_res: 2000
size: 10000 - torch_res: 20728, np_res: 20000
torch.int32
size: 100 - torch_res: 1144, np_res: 400
size: 1000 - torch_res: 4728, np_res: 4000
size: 10000 - torch_res: 40760, np_res: 40000
torch.int64
size: 100 - torch_res: 1528, np_res: 800
size: 1000 - torch_res: 8760, np_res: 8000
size: 10000 - torch_res: 80760, np_res: 80000

Edit:
@dirkgr, for images this is not negligible, correct? The image features are 2000 x float32 (about 8 KB per vector), so it is roughly a 10% improvement in space.
And since torch.from_numpy and Tensor.numpy() are zero-copy operations, you get ~10% less space usage for free.
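
A quick sanity check of the zero-copy claim (nothing AllenNLP-specific here):

# Tensor.numpy() and torch.from_numpy() share the same underlying buffer,
# so converting between the two costs no extra memory or copying.
import numpy as np
import torch

t = torch.arange(6, dtype=torch.float32)
a = t.numpy()        # a view over the tensor's storage, not a copy
a[0] = 42.0
print(t[0].item())   # 42.0 -- the tensor sees the change

b = np.zeros(4, dtype=np.float32)
t2 = torch.from_numpy(b)  # also a view; writing through t2 changes b
t2[1] = 7.0
print(b[1])               # 7.0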

@epwalsh
Member

epwalsh commented Aug 27, 2020

@dirkgr
Member

dirkgr commented Sep 11, 2020

numpy.tobytes() does not remember the data type, size, or value order of the tensor, and there might be other metadata as well. torch.save() does, and we need that metadata to be preserved.

I would consider using numpy.tobytes() and torch.from_numpy() anyway, because it allows us to create a torch tensor that's backed by memory-mapped bytes, with no reading required. But we have some other hurdles to clear before we can do that. So for now, let's stick with torch.save().
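
To illustrate both points, a small sketch assuming we track the dtype and shape ourselves; the file path and layout are made up for the example.

# Raw bytes alone lose the tensor's metadata, so we must store (dtype, shape)
# ourselves; a memory-mapped file can then back a torch tensor without a read.
import numpy as np
import torch

arr = np.arange(12, dtype=np.float32).reshape(3, 4)

# 1) Round-trip through raw bytes: frombuffer needs the dtype and shape we saved.
raw = arr.tobytes(order="C")
restored = np.frombuffer(raw, dtype=np.float32).reshape(3, 4)

# 2) A tensor backed by memory-mapped bytes: pages are only read when accessed.
arr.tofile("/tmp/feats.bin")
mapped = np.memmap("/tmp/feats.bin", dtype=np.float32, mode="r+", shape=(3, 4))
t = torch.from_numpy(mapped)  # zero-copy view over the mapped file
print(restored[0, 0], t[2, 3].item())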

@dirkgr modified the milestones: 2.0, 2.1 Nov 9, 2020
@dirkgr
Member

dirkgr commented Feb 22, 2021

Since reading is so much faster now, the case for caching is pretty weak. We're abandoning this issue for now.

@dirkgr closed this as completed Feb 22, 2021