
Implement Llama longest prefix cache #158

Closed
abetlen opened this issue May 5, 2023 · 4 comments
Labels
enhancement New feature or request

Comments


abetlen commented May 5, 2023

Opening this up to track the development of the new caching behaviour I'm planning to implement. This will leverage two significant improvements:

  • Reduced llama state size, which is now a function of the number of evaluated tokens
  • Improved efficiency of Llama.generate, which now only evals prompt tokens that are not already in the context window (see the sketch after this list)
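
For context, the lookup itself boils down to finding, among previously cached token sequences, the one that shares the longest prefix with the incoming prompt, and only evaluating the tokens past that point. Here is a rough Python sketch of that idea; the names cached_states and find_longest_prefix_state are illustrative, not the library's actual internals:

from typing import Dict, Optional, Sequence, Tuple

# Illustrative only: maps a tuple of already-evaluated token ids to a saved llama state.
CachedStates = Dict[Tuple[int, ...], bytes]

def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def find_longest_prefix_state(cache: CachedStates, prompt_tokens: Sequence[int]) -> Tuple[int, Optional[bytes]]:
    """Return (matched_len, state) for the cached entry sharing the longest
    prefix with prompt_tokens, or (0, None) if nothing matches."""
    best_len, best_state = 0, None
    for key, state in cache.items():
        n = common_prefix_len(key, prompt_tokens)
        if n > best_len:
            best_len, best_state = n, state
    return best_len, best_state

The reduced state size is what makes keeping several of these saved states around practical.
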
abetlen added the enhancement (New feature or request) label on May 5, 2023
abetlen pinned this issue on May 5, 2023
abetlen changed the title from "Implement Llama longest_prefix_cache" to "Implement Llama longest prefix cache" on May 7, 2023
abetlen closed this as completed in 0e94a70 on May 7, 2023

abetlen commented May 7, 2023

This looks like it's good to go now. The cache is in-memory currently, so you'll need to set CACHE_SIZE or --cache-size (in bytes) to the maximum cache size for your available RAM. If anyone wants to test this out, let me know if it works as expected. I'll probably push this release later tonight.
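
From the Python API the same cache can be attached directly. A minimal sketch, assuming the in-memory LlamaCache class and a placeholder model path (the server instead reads the CACHE_SIZE / --cache-size setting mentioned above):

from llama_cpp import Llama, LlamaCache

llm = Llama(model_path="./models/ggml-model.bin")  # placeholder path
llm.set_cache(LlamaCache())  # in-memory cache of saved llama states

Once the cache is set, completions whose prompts share a prefix with a cached entry skip re-evaluating the matched tokens.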

abetlen unpinned this issue on May 7, 2023

ghost commented May 8, 2023

Tried this out in oobabooga and it's a game-changer for chats with frequent editing. Before this, the model had to spend a lot of time re-ingesting the entire prompt even after a small edit.


abetlen commented May 9, 2023

@eiery glad to hear! I should point out, though, that text-generation-webui is just using the new generate rewind behaviour and not the full state cache. Basically, "rewind" is useful for a single long conversation or for generating multiple responses to a single prompt, while the full state cache lets you restore completely different conversations.
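
To make the distinction concrete, here's a toy illustration with a placeholder model path and prompts; without a cache, only a single growing conversation benefits from the rewind, while set_cache also covers switching back to an earlier, unrelated conversation:

from llama_cpp import Llama, LlamaCache

llm = Llama(model_path="./models/ggml-model.bin")  # placeholder path

# Rewind only: reuse happens while one prompt keeps growing in place.
convo_a = "#User: Hi!\n#Assistant:"
r = llm(convo_a, max_tokens=16)
convo_a += r["choices"][0]["text"] + "\n#User: Tell me more.\n#Assistant:"
llm(convo_a, max_tokens=16)  # only the appended tokens are evaluated

# Full state cache: switching to a different conversation and back still hits.
llm.set_cache(LlamaCache())
llm(convo_a, max_tokens=16)                                    # state saved to the cache
llm("#User: Unrelated question.\n#Assistant:", max_tokens=16)  # different conversation
llm(convo_a, max_tokens=16)                                    # restored from the cache instead of re-evaluated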


ibehnam commented Sep 1, 2024

@abetlen I'm curious about using this feature in the Python API. I tried the approach below, but re-running the same llm call always reports "... remaining 1 prompt tokens to eval", even though it should not eval anything because the prompt hasn't changed. 🤔

from llama_cpp import Llama, LlamaDiskCache

# On-disk cache of llama states, keyed by prompt tokens
cache = LlamaDiskCache(cache_dir=".../.cache")

model_path_2 = "...Phi-3-mini-4k-instruct-q4.gguf"
llm = Llama(
    model_path=model_path_2,
    n_gpu_layers=-1,
    flash_attn=True,
    n_threads=10,
    use_mlock=False,
    seed=42,
    verbose=True,
)
llm.reset()
llm.set_cache(cache)  # attach the disk cache to the model

# %%

prompt = "#User:\nFrom now on, your name is Behnam.\n#Assistant:\n"

r = llm(
    prompt=prompt,
    max_tokens=12,
    temperature=0.0,
    stop=[
        "\n\n",
        "#User:",
        "#Assistant:",
    ],
)

print(r["choices"][0]["text"])

Which gives:

Llama._create_completion: cache hit
**Llama.generate: 21 prefix-match hit, remaining 1 prompt tokens to eval**

llama_print_timings:        load time =     125.70 ms
llama_print_timings:      sample time =       1.61 ms /    12 runs   (    0.13 ms per token,  7453.42 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (     nan ms per token,      nan tokens per second)
llama_print_timings:        eval time =     358.08 ms /    12 runs   (   29.84 ms per token,    33.51 tokens per second)
llama_print_timings:       total time =     395.63 ms /    12 tokens
Hello! I'm Behnam, at your service
Llama._create_completion: cache save
Llama.save_state: saving llama state
Llama.save_state: got state size: 13112306
Llama.save_state: allocated state
Llama.save_state: copied llama state: 13112306
Llama.save_state: saving 13112306 bytes of llama state
LlamaDiskCache.__setitem__: called
LlamaDiskCache.__setitem__: set
LlamaDiskCache.__setitem__: trim

I also tried appending the model output to the previous prompt and sending it to the llm again:

...
agg_prompt = prompt + r["choices"][0]["text"] + "#User: What is your name now?\n#Assistant:"

r = llm(
    prompt=agg_prompt,
    max_tokens=32,
    temperature=0.0,
    stop=["\n\n", "#User:", "#Assistant:"]
)

print(r["choices"][0]["text"])

But I again got similar results:

**Llama.generate: 33 prefix-match hit, remaining 15 prompt tokens to eval**

llama_print_timings:        load time =     131.47 ms
llama_print_timings:      sample time =       2.94 ms /    18 runs   (    0.16 ms per token,  6122.45 tokens per second)
llama_print_timings: prompt eval time =     165.82 ms /    15 tokens (   11.05 ms per token,    90.46 tokens per second)
llama_print_timings:        eval time =     407.14 ms /    17 runs   (   23.95 ms per token,    41.75 tokens per second)
llama_print_timings:       total time =     589.39 ms /    32 tokens
 
My name is Behnam. How can I assist you today?
Llama._create_completion: cache save
Llama.save_state: saving llama state
Llama.save_state: got state size: 25695602
Llama.save_state: allocated state
Llama.save_state: copied llama state: 25695602
Llama.save_state: saving 25695602 bytes of llama state
LlamaDiskCache.__setitem__: called
LlamaDiskCache.__setitem__: set
LlamaDiskCache.__setitem__: trim
