
LlamaDiskCache: needs a RO / 'static' disk cache for RAG use cases #1737

Open
@tc-wolf

Description


Is your feature request related to a problem? Please describe.
I have a dataset that I'm using for RAG. The user's question determines the lookup of the top N most relevant documents, which are then used to build a prompt that looks like:

<system>
You are an assistant for domain <x>, you summarize information and blah blah blah.<eot>
<user>
Document 1: Some information that is relevant
Document 2: Other information
Document 3: Final information

User question: "How do I do <y>?"<eot>

To minimize latency, I've developed a "static" disk cache that, for every document in my dataset, contains the system prompt + that document (as the first document) as pre-processed context. (An example script for doing this, though an older one, is also in my branch.)

This way, I only need to ingest the remaining documents + the user question during prompt processing, which saves a lot of time-to-first-token for this use case.
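
For illustration, the offline build step looks roughly like the sketch below (simplified; SYSTEM_PROMPT, documents, and the cache directory are placeholders, and it goes through the existing LlamaDiskCache / save_state() interface rather than the proposed class):

# Rough sketch of the offline build: for each document, prompt-process
# "system prompt + document" once and persist the resulting state,
# keyed by its token prefix.
from llama_cpp import Llama
from llama_cpp.llama_cache import LlamaDiskCache

SYSTEM_PROMPT = "<system>\nYou are an assistant for domain <x>, ...<eot>\n<user>\n"
documents = ["Document 1: Some information that is relevant", "..."]

llm = Llama(model_path="model.gguf", n_ctx=4096)
cache = LlamaDiskCache(cache_dir="./static_prefix_cache")

for doc in documents:
    prefix_tokens = llm.tokenize((SYSTEM_PROMPT + doc).encode("utf-8"))
    llm.reset()                              # clear any previous context
    llm.eval(prefix_tokens)                  # pay prompt processing once, offline
    cache[prefix_tokens] = llm.save_state()  # state keyed by the token prefix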

Describe the solution you'd like
I'd like to upstream the LlamaStaticDiskCache class from my branch. It's very similar to the existing LlamaDiskCache, but:

  • The cache is not mutable once built
    • (Does not pop in __getitem__)
  • It uses a trie for finding the longest matching prefix (if any) in the cache (see the sketch below)
  • It has a convenience factory method for building the cache from a list of prompts

So it's well-suited for use cases where you want to build the cache once (for a given model + context size + batch size) and then reload at inference time based on matching the prefix of the prompt.
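
To make the trie lookup concrete, here's a minimal sketch of the longest-matching-prefix search over token keys (illustrative only, not the actual code from the branch):

from typing import Dict, Optional, Sequence, Tuple


class _TokenTrieNode:
    __slots__ = ("children", "key")

    def __init__(self) -> None:
        self.children: Dict[int, "_TokenTrieNode"] = {}
        self.key: Optional[Tuple[int, ...]] = None  # set where a stored key ends


class TokenPrefixTrie:
    """Trie over token ids for longest-matching-prefix lookup."""

    def __init__(self) -> None:
        self.root = _TokenTrieNode()

    def insert(self, tokens: Sequence[int]) -> None:
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, _TokenTrieNode())
        node.key = tuple(tokens)

    def longest_prefix(self, tokens: Sequence[int]) -> Optional[Tuple[int, ...]]:
        """Return the longest stored key that is a prefix of `tokens`, or None."""
        node, best = self.root, None
        for tok in tokens:
            node = node.children.get(tok)
            if node is None:
                break
            if node.key is not None:
                best = node.key
        return best

At inference time, __getitem__ would use longest_prefix() to decide which saved state to load, and (unlike LlamaDiskCache) would not pop the entry.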

Complications / details with this approach

I've found that, when running locally (macOS + Metal GPU) and deploying on different hardware (Linux + CPU), I had to make a minor change to llama.cpp to avoid serializing / deserializing the RNG state.

I.e., skip loading the RNG state and set the seed for reproducibility: tc-wolf/llama.cpp@ea43d92

I don't think this will be a factor anymore, because ggml-org/llama.cpp#9294 removed serializing / deserializing the RNG state when saving.

Describe alternatives you've considered

  • Use lower-level state-saving functions (rather than pickling llama.save_state()) to save less on disk than the full model file
  • Use a more efficient strategy for saving - right now, if every key shares the same system prompt (for example), that prompt's state is saved independently for every stored prompt. A lot of space could be saved by deduplicating and only saving each shared prefix once, but that complicates the saving/loading logic.
    • Could also allow partial matches when checking the cache - right now a key has to be a full prefix of the input tokens, but we could look for a partial match to allow more graceful degradation (see the sketch after this list).
    • This also complicates the __getitem__ logic.
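
As a rough sketch of the partial-match idea (placeholder names; not implemented in my branch), the lookup could pick the stored key sharing the longest common token prefix with the prompt instead of requiring a full prefix match:

from typing import Iterable, Optional, Sequence, Tuple


def _common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def best_partial_match(
    prompt_tokens: Sequence[int],
    cache_keys: Iterable[Tuple[int, ...]],
    min_overlap: int = 16,  # arbitrary threshold for this sketch
) -> Optional[Tuple[Tuple[int, ...], int]]:
    """Return (key, overlap_length) for the best partially-matching key, if any."""
    best_key, best_len = None, 0
    for key in cache_keys:
        overlap = _common_prefix_len(prompt_tokens, key)
        if overlap > best_len:
            best_key, best_len = key, overlap
    if best_key is None or best_len < min_overlap:
        return None
    # The loaded KV cache would still have to be rolled back to `best_len`
    # tokens, and everything after the overlap re-evaluated.
    return best_key, best_len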
