
LlamaDiskCache: needs a RO / 'static' disk cache for RAG use cases #1737

Open
@tc-wolf

Description


Is your feature request related to a problem? Please describe.
I have a dataset that I'm using for RAG. The user's question determines the lookup of the top N most relevant documents, which are then used to build a prompt that looks like:

<system>
You are an assistant for domain <x>, you summarize information and blah blah blah.<eot>
<user>
Document 1: Some information that is relevant
Document 2: Other information
Document 3: Final information

User question: "How do I do <y>?"<eot>

To minimize latency, I've developed a "static" disk cache that, for every document in my dataset, contains the system prompt + that document (as the first document) as pre-processed context. (An example script for doing this, though an older one, is also in my branch.)

This way, I only need to ingest the remaining documents + the user question during prompt processing, which saves a lot of time-to-first-token for this use case.
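
For illustration, the offline build step looks roughly like the sketch below (simplified; SYSTEM_PROMPT, documents, and the cache directory are placeholders, and it goes through the existing LlamaDiskCache / save_state() interface rather than the proposed class):

# Rough sketch of the offline build: for each document, prompt-process
# "system prompt + document" once and persist the resulting state,
# keyed by its token prefix.
from llama_cpp import Llama
from llama_cpp.llama_cache import LlamaDiskCache

SYSTEM_PROMPT = "<system>\nYou are an assistant for domain <x>, ...<eot>\n<user>\n"
documents = ["Document 1: Some information that is relevant", "..."]

llm = Llama(model_path="model.gguf", n_ctx=4096)
cache = LlamaDiskCache(cache_dir="./static_prefix_cache")

for doc in documents:
    prefix_tokens = llm.tokenize((SYSTEM_PROMPT + doc).encode("utf-8"))
    llm.reset()                              # clear any previous context
    llm.eval(prefix_tokens)                  # pay prompt processing once, offline
    cache[prefix_tokens] = llm.save_state()  # state keyed by the token prefix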

Describe the solution you'd like
I'd like to upstream the LlamaStaticDiskCache class from my branch. It's very similar to the existing LlamaDiskCache, but:

  • The cache is not mutable once built
    • (Does not pop in __getitem__)
  • It uses a trie for finding the longest matching prefix (if any) in the cache (see the sketch below)
  • It has a convenience factory method for building the cache from a list of prompts

So it's well-suited for use cases where you want to build the cache once (for a given model + context size + batch size) and then reload at inference time based on matching the prefix of the prompt.
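
To make the trie lookup concrete, here's a minimal sketch of the longest-matching-prefix search over token keys (illustrative only, not the actual code from the branch):

from typing import Dict, Optional, Sequence, Tuple


class _TokenTrieNode:
    __slots__ = ("children", "key")

    def __init__(self) -> None:
        self.children: Dict[int, "_TokenTrieNode"] = {}
        self.key: Optional[Tuple[int, ...]] = None  # set where a stored key ends


class TokenPrefixTrie:
    """Trie over token ids for longest-matching-prefix lookup."""

    def __init__(self) -> None:
        self.root = _TokenTrieNode()

    def insert(self, tokens: Sequence[int]) -> None:
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, _TokenTrieNode())
        node.key = tuple(tokens)

    def longest_prefix(self, tokens: Sequence[int]) -> Optional[Tuple[int, ...]]:
        """Return the longest stored key that is a prefix of `tokens`, or None."""
        node, best = self.root, None
        for tok in tokens:
            node = node.children.get(tok)
            if node is None:
                break
            if node.key is not None:
                best = node.key
        return best

At inference time, __getitem__ would use longest_prefix() to decide which saved state to load, and (unlike LlamaDiskCache) would not pop the entry.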

Complications / details with this approach

I've found that, when running locally (macOS + Metal GPU) and deploying on different hardware (Linux + CPU), I had to make a minor change to llama.cpp to avoid serializing / deserializing the RNG state.

I.e., skip loading the RNG state and set the seed for reproducibility: tc-wolf/llama.cpp@ea43d92

I don't think this will be a factor anymore, because ggml-org/llama.cpp#9294 removed serializing / deserializing the RNG state when saving.

Describe alternatives you've considered

  • Use lower-level state-saving functions (rather than pickling llama.save_state()) to save less on disk than the full model file
  • Use a more efficient strategy for saving - right now, if every key shares the same system prompt (for example), that prompt's state is saved independently for every stored prompt. A lot of space could be saved by deduplicating and only saving each shared prefix once, but that complicates the saving/loading logic.
    • Could also allow partial matches when checking the cache - right now a key has to be a full prefix of the input tokens, but we could look for a partial match to allow more graceful degradation (see the sketch after this list).
    • This also complicates the __getitem__ logic.
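
As a rough sketch of the partial-match idea (placeholder names; not implemented in my branch), the lookup could pick the stored key sharing the longest common token prefix with the prompt instead of requiring a full prefix match:

from typing import Iterable, Optional, Sequence, Tuple


def _common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def best_partial_match(
    prompt_tokens: Sequence[int],
    cache_keys: Iterable[Tuple[int, ...]],
    min_overlap: int = 16,  # arbitrary threshold for this sketch
) -> Optional[Tuple[Tuple[int, ...], int]]:
    """Return (key, overlap_length) for the best partially-matching key, if any."""
    best_key, best_len = None, 0
    for key in cache_keys:
        overlap = _common_prefix_len(prompt_tokens, key)
        if overlap > best_len:
            best_key, best_len = key, overlap
    if best_key is None or best_len < min_overlap:
        return None
    # The loaded KV cache would still have to be rolled back to `best_len`
    # tokens, and everything after the overlap re-evaluated.
    return best_key, best_len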
