This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Better caching #4591

Closed
OhadRubin opened this issue Aug 24, 2020 · 17 comments

@OhadRubin
Contributor

OhadRubin commented Aug 24, 2020

Something along the lines of what fairseq has:
Here
For example, in MMapIndexedDataset, they calculate the size of each field and then read exactly that many bytes into an array.
Here, @matt-gardner mentions that this is the type of caching that you might implement: a database that just stores tensors directly.

@matt-gardner
Contributor

There is some initial work along these lines here (used in a dataset reader here). I think we definitely want to figure out how to include this in our main Instance creation pipeline, but that'll come a little bit later. If you're interested in helping to design / work on this, contributions welcome. This is a big enough piece of work that some kind of design document (or detailed github description) should come before any code is written.

@epwalsh, do we have an issue open already for this? If not, I'll keep this open and add it to either the 2.0 or the performance milestone (maybe 2.0?).

@matt-gardner added this to the 2.0 milestone Aug 24, 2020
@epwalsh
Member

epwalsh commented Aug 24, 2020

@matt-gardner I don't think we had a separate issue tracking caching until now, so adding this to 2.0 sounds good to me.

@OhadRubin
Contributor Author

OhadRubin commented Aug 24, 2020

Should I create a .md page or a Google Doc?
Anyway, most of the logic will use the to_tensor method of each field, right?
For fields that don't have to_tensor, like MetadataField, we could use jsonpickle and write that into the file.
Regarding the way we serialize the tensors, we can either:

  1. use TensorCache, which uses lmdb.
  2. use something similar to MMapIndexedDataset, where each tensor is converted into a numpy array and written to / read from the file sequentially.

I am leaning towards (2), since we read the instances into memory sequentially, which might be a better fit than a key-value solution like lmdb (see the sketch below).
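
For concreteness, a minimal sketch of option (2), assuming a made-up SequentialTensorCache class (not an existing AllenNLP or fairseq API): arrays are appended to one flat file, and a side index of (dtype, shape, offset) lets each one be memory-mapped back later.

import numpy as np


class SequentialTensorCache:
    """Write numpy arrays back-to-back into one flat file and keep a
    side index of (dtype, shape, offset) so each can be memory-mapped later."""

    def __init__(self, path: str):
        self.path = path
        self.index = []  # one (dtype string, shape, byte offset) entry per array

    def write(self, arrays) -> None:
        offset = 0
        with open(self.path, "wb") as f:
            for arr in arrays:
                self.index.append((str(arr.dtype), arr.shape, offset))
                f.write(arr.tobytes(order="C"))
                offset += arr.nbytes

    def read(self, i: int) -> np.ndarray:
        dtype, shape, offset = self.index[i]
        count = int(np.prod(shape))
        # np.memmap touches only the pages we actually use, so sequential
        # reads stay cheap and nothing is loaded up front.
        flat = np.memmap(self.path, dtype=dtype, mode="r", offset=offset, shape=(count,))
        return flat.reshape(shape)


cache = SequentialTensorCache("/tmp/tensors.bin")
cache.write([np.arange(10, dtype=np.int64), np.ones((2, 3), dtype=np.float32)])
print(cache.read(1))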

@epwalsh
Member

epwalsh commented Aug 25, 2020

Hey @OhadRubin, I've already started a design document which I'll link to on this thread once it's ready to be seen.

@epwalsh
Member

epwalsh commented Aug 26, 2020

There is a good discussion here about lmdb: #4578 (comment).

I'm working on an API design that would be agnostic to the backend and the serialization/deserialization method we use, so we can decide on that later.

@OhadRubin
Contributor Author

So it seems there are different use cases for caching. Maybe I can write some code that turns TensorCache into a Registrable and implement an MMapIndexedDataset-style TensorCache.
@epwalsh, what do you think?

@OhadRubin reopened this Aug 26, 2020
@epwalsh
Member

epwalsh commented Aug 26, 2020

@OhadRubin that would be great if you got started on that!

Right now I'm just focusing on the overall API and how that would integrate into our data pipeline, so I don't think that collides with what you want to work on. Once I'm finished with the API skeleton, we should be able to plug in the TensorCache / MMapIndexedDataset / whatever we go with.

@OhadRubin
Contributor Author

So I should inherit from DatasetReader and override _instances_to_cache_file, correct?
I understand there are plans to move caching to DataLoader, but for now, that should be ok, right?

@epwalsh
Member

epwalsh commented Aug 26, 2020

@OhadRubin actually no, we're working off of the vision branch now, which will become AllenNLP 2.0. In that branch the caching mechanism in the DatasetReader has been removed. Maybe just start by implementing this in a standalone file?

@OhadRubin
Contributor Author

Ok, so I'll assume I am iterating over objects with a serialize method. Is that ok?

@epwalsh
Member

epwalsh commented Aug 26, 2020

Yes, but we can assume these objects are indexed Instances.

@OhadRubin
Contributor Author

OhadRubin commented Aug 27, 2020

Hey @epwalsh, after looking a bit more into MMapIndexedDataset, much of its benefit comes from knowing the structure and sizes of everything up front. Can I assume serialize returns a Dict[str, Union[numpy.array, str]]?

@epwalsh
Member

epwalsh commented Aug 27, 2020

Hey @OhadRubin, after talking with the team a little more yesterday, we decided it would probably be more beneficial to cache tensor dicts instead of actual Instance objects. In other words, we want to be caching the result of Instance.as_tensor_dict().

So in most cases this is a Dict[str, torch.Tensor], which would play nicely with the MMapIndexedDataset approach. However, there are some exceptions. In particular, when an Instance contains a MetadataField, Instance.as_tensor_dict() could contain pretty much anything.
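
For illustration, a minimal sketch of caching as_tensor_dict() output with torch.save, which pickles non-tensor values such as MetadataField contents; the helper functions and file path here are hypothetical, not the planned API.

# Illustration only: cache the output of Instance.as_tensor_dict().
# torch.save pickles non-tensor values (e.g. MetadataField contents) as-is,
# while tensor fields stay as torch.Tensor objects.
from typing import Any, Dict

import torch


def save_tensor_dict(tensor_dict: Dict[str, Any], path: str) -> None:
    torch.save(tensor_dict, path)


def load_tensor_dict(path: str) -> Dict[str, Any]:
    return torch.load(path)


# A typical entry: tensors for indexed fields, arbitrary objects for metadata.
example = {
    "tokens": torch.tensor([2, 15, 7, 3]),
    "label": torch.tensor(1),
    "metadata": {"source": "example.txt"},
}
save_tensor_dict(example, "/tmp/instance0.pt")
print(load_tensor_dict("/tmp/instance0.pt")["tokens"])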

@OhadRubin
Contributor Author

OhadRubin commented Aug 27, 2020

There is an overhead of around 760 bytes for saving a tensor with torch.save, compared to numpy.tobytes.
So I'll use numpy.

import io
import pickle

import torch

# Compare the serialized size of torch.save against the raw numpy bytes
# of the same tensor, for a few dtypes and tensor lengths.
for dtype in [torch.bool, torch.float16, torch.float32, torch.int16, torch.int32, torch.int64]:
    print(dtype)
    for size in range(2, 5):
        in_list = list(range(10 ** size))
        torch_tensor = torch.tensor(in_list, dtype=dtype).detach()
        # Serialize with torch.save into an in-memory buffer.
        buffer = io.BytesIO()
        torch.save(torch_tensor, buffer, pickle_protocol=pickle.HIGHEST_PROTOCOL)
        torch_res = buffer.getbuffer().tobytes()
        # Raw bytes of the underlying numpy array (no metadata, no pickle framing).
        np_res = torch_tensor.numpy().tobytes(order="C")
        print(f"size: {10 ** size} - torch_res: {len(torch_res)}, np_res: {len(np_res)}")

Output:

torch.bool
size: 100 - torch_res: 824, np_res: 100
size: 1000 - torch_res: 1720, np_res: 1000
size: 10000 - torch_res: 10744, np_res: 10000
torch.float16
size: 100 - torch_res: 952, np_res: 200
size: 1000 - torch_res: 2744, np_res: 2000
size: 10000 - torch_res: 20728, np_res: 20000
torch.float32
size: 100 - torch_res: 1144, np_res: 400
size: 1000 - torch_res: 4728, np_res: 4000
size: 10000 - torch_res: 40760, np_res: 40000
torch.int16
size: 100 - torch_res: 952, np_res: 200
size: 1000 - torch_res: 2744, np_res: 2000
size: 10000 - torch_res: 20728, np_res: 20000
torch.int32
size: 100 - torch_res: 1144, np_res: 400
size: 1000 - torch_res: 4728, np_res: 4000
size: 10000 - torch_res: 40760, np_res: 40000
torch.int64
size: 100 - torch_res: 1528, np_res: 800
size: 1000 - torch_res: 8760, np_res: 8000
size: 10000 - torch_res: 80760, np_res: 80000

Edit:
@dirkgr, for images this is not negligible, correct? The image features are 2000 x float32 (about 8 KB per vector), so it is roughly a 10% improvement in space.
And since torch.from_numpy and Tensor.numpy() are zero-copy operations, you get ~10% less space usage for free.
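
A quick sanity check of the zero-copy claim (nothing AllenNLP-specific here):

# Tensor.numpy() and torch.from_numpy() share the same underlying buffer,
# so converting between the two costs no extra memory or copying.
import numpy as np
import torch

t = torch.arange(6, dtype=torch.float32)
a = t.numpy()        # a view over the tensor's storage, not a copy
a[0] = 42.0
print(t[0].item())   # 42.0 -- the tensor sees the change

b = np.zeros(4, dtype=np.float32)
t2 = torch.from_numpy(b)  # also a view; writing through t2 changes b
t2[1] = 7.0
print(b[1])               # 7.0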

@epwalsh
Member

epwalsh commented Aug 27, 2020

@dirkgr
Member

dirkgr commented Sep 11, 2020

numpy.tobytes() does not remember the data type, size, or value order of the tensor, and there might be other metadata as well. torch.save() does, and we need that metadata to be preserved.

I would consider using numpy.tobytes() and torch.from_numpy() anyway, because it allows us to create a torch tensor that's backed by memory-mapped bytes, with no reading required. But we have some other hurdles to clear before we can do that. So for now, let's stick with torch.save().
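
To illustrate both points, a small sketch assuming we track the dtype and shape ourselves; the file path and layout are made up for the example.

# Raw bytes alone lose the tensor's metadata, so we must store (dtype, shape)
# ourselves; a memory-mapped file can then back a torch tensor without a read.
import numpy as np
import torch

arr = np.arange(12, dtype=np.float32).reshape(3, 4)

# 1) Round-trip through raw bytes: frombuffer needs the dtype and shape we saved.
raw = arr.tobytes(order="C")
restored = np.frombuffer(raw, dtype=np.float32).reshape(3, 4)

# 2) A tensor backed by memory-mapped bytes: pages are only read when accessed.
arr.tofile("/tmp/feats.bin")
mapped = np.memmap("/tmp/feats.bin", dtype=np.float32, mode="r+", shape=(3, 4))
t = torch.from_numpy(mapped)  # zero-copy view over the mapped file
print(restored[0, 0], t[2, 3].item())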

@dirkgr modified the milestones: 2.0, 2.1 Nov 9, 2020
@dirkgr
Member

dirkgr commented Feb 22, 2021

Since reading is so much faster now, the case for caching is pretty weak. We're abandoning this issue for now.

@dirkgr closed this as completed Feb 22, 2021