This repository has been archived by the owner on Feb 2, 2025. It is now read-only.

Add prompt caching #14

Closed
wants to merge 8 commits

Conversation


@irthomasthomas irthomasthomas commented Sep 9, 2024

  • Introduced new caching options for user and system prompts in the ClaudeOptions class.

Files changed:

  • llm_claude_3.py:
    • Added cache_prompt and cache_system options to ClaudeOptions.
    • Modified message handling in ClaudeMessages to support caching functionality.
    • Updated response processing to include cache control mechanisms.
  • README.md: Added new section on caching options and usage.
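
As a rough sketch of what that could look like in llm_claude_3.py, assuming the plugin keeps LLM's usual pydantic Options convention (the field names follow the PR; descriptions and layout here are illustrative):

```python
from typing import Optional

import llm
from pydantic import Field


class ClaudeOptions(llm.Options):
    # ... existing options (max_tokens, temperature, ...) stay as they are ...

    cache_prompt: Optional[bool] = Field(
        description="Mark the user prompt for Anthropic prompt caching",
        default=None,
    )
    cache_system: Optional[bool] = Field(
        description="Mark the system prompt for Anthropic prompt caching",
        default=None,
    )
```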

I do not think I like the interface, tbh, but see what you think. It is a hard trade-off between features and complexity.
I did not think caching should be turned on by default, due to the 25% premium on cache writes.
This first implementation is less flexible than the API allows: the API lets you include both cached and non-cached system and user prompts in a single request. I was not sure how to do that without editing the main project's cli.py.

I think ideally the interface would be:
llm --system "not cached system prompt" --cached-system "this system prompt will be cached" --cached-user "a user prompt to be cached" "non-cached prompt"
But how do we handle the case of requesting ONLY a cached user prompt? I think the current CLI demands a prompt, so that would need updating.

Anyway, here is the current implementation.
To use it, pass the prompt and --system as normal, then choose whether to cache either one by adding -o cache_prompt 1 and/or -o cache_system 1.
The cache metadata is then returned in the JSON usage block.

Added options cache_prompt and cache_system

To use:

llm -m claude-3.5-sonnet --system "$(cat long-system.txt)" "prompt" -o cache_system 1

"usage": {"input_tokens": 10000, "output_tokens": 500, "cache_creation_input_tokens": 10000, "cache_read_input_tokens": 0}}

This first prompt requests caching of the system prompt. I used a file of 10,000 tokens, and as this is the first request we see "cache_creation_input_tokens": 10000 in the JSON. This means we paid 25% more for those tokens, but future requests using the same system prompt will be discounted 90%, as long as we re-prompt within the cache TTL, currently 5 minutes (refreshed on each use).
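
As a back-of-the-envelope illustration of those percentages (the per-token rate below is an assumption based on Claude 3.5 Sonnet's published input price at the time, not a figure from this PR):

```python
# Rough cost of the 10,000-token system prompt above under the three regimes.
BASE_INPUT = 3.00 / 1_000_000      # assumed regular input rate, USD per token
CACHE_WRITE = BASE_INPUT * 1.25    # 25% premium when the cache entry is created
CACHE_READ = BASE_INPUT * 0.10     # 90% discount on later cache hits

tokens = 10_000
print(f"uncached input: ${tokens * BASE_INPUT:.4f}")   # $0.0300 on every request
print(f"cache write:    ${tokens * CACHE_WRITE:.4f}")  # $0.0375 on the first request
print(f"cache read:     ${tokens * CACHE_READ:.4f}")   # $0.0030 on subsequent requests
```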

llm -m claude-3.5-sonnet --system "$(cat long-system.txt)" "a new non-cached user prompt." -o cache_system 1

"usage": {"input_tokens": 10001, "output_tokens": 500, "cache_creation_input_tokens": 0, "cache_read_input_tokens":10000 }}

This sends a new user prompt but the same system prompt, along with the cache_system flag, so the cached system prompt from the previous command is reused. We can see we hit the cache in the usage response: "cache_read_input_tokens": 10000.

llm -m claude-3.5-sonnet --system "$(cat long-system.txt)" "$(cat long-prompt.txt)" -o cache_system 1 -o cache_prompt 1

"usage": {"input_tokens": 15000 "output_tokens": 500, "cache_creation_input_tokens": 5000, "cache_read_input_tokens":10000 }}

This time we CREATE a USER prompt cache, and also READ a SYSTEM cache. Hence:
"cache_creation_input_tokens": 5000, "cache_read_input_tokens":10000

llm -m claude-3.5-sonnet --system "$(cat long-system.txt)" "$(cat long-prompt.txt)" -o cache_system 1 -o cache_prompt 1

"usage": {"input_tokens": 15000 "output_tokens": 500, "cache_creation_input_tokens": 0, "cache_read_input_tokens" :15000

Finally, running the same command again causes both system and user prompt cache reads.
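
Under the hood, the options map onto cache_control markers on the system and user content blocks of the Messages API request. Very roughly, the last command above would produce a payload shaped like this (a sketch; the placeholder text and model ID are illustrative, not the plugin's literal output):

```python
request = {
    "model": "claude-3-5-sonnet-20240620",
    "max_tokens": 4096,
    "system": [
        {
            "type": "text",
            "text": "<contents of long-system.txt>",
            "cache_control": {"type": "ephemeral"},  # from -o cache_system 1
        }
    ],
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "<contents of long-prompt.txt>",
                    "cache_control": {"type": "ephemeral"},  # from -o cache_prompt 1
                }
            ],
        }
    ],
}
```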

- Introduced new caching options for user and system prompts in the ClaudeOptions class.
- Implemented caching logic within ClaudeMessages to facilitate prompt caching based on user input.
- Updated README.md to reflect new caching options with usage instructions.

Files changed:
- README.md: Added new section on caching options and usage.
- llm_claude_3.py:
  - Added cache_prompt and cache_system options to ClaudeOptions.
  - Modified message handling in ClaudeMessages to support caching functionality.
  - Updated response processing to include cache control mechanisms.
@irthomasthomas irthomasthomas changed the title Add prompt caching functionality and update README documentation Add prompt caching Sep 9, 2024
- Added support for Anthropic's Prompt Caching feature to improve performance and reduce costs.
- Updated the README to include detailed instructions and benefits of the new caching functionality.
- Modified `llm_claude_3.py` to implement caching options for user and system prompts in the `ClaudeOptions` class.
- Adjusted message-building logic to conditionally apply caching based on user turns.

Changes Summary:
- .gitignore: Added `.artefacts/` to the ignore list.
- README.md: Expanded documentation on prompt caching with usage examples and performance benefits.
- llm_claude_3.py:
  - Updated cache_prompt and cache_system defaults to None.
  - Revamped `build_messages` method for enhanced user turn handling.
  - Modified client initialization to include caching headers.
  - Adjusted message processing when streaming responses.
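
The "caching headers" mentioned above could be attached to the client roughly like this (a sketch; it assumes the Anthropic beta header that was required for prompt caching at the time and the SDK's default_headers argument):

```python
import anthropic

client = anthropic.Anthropic(
    api_key="...",  # normally resolved via llm's key store or ANTHROPIC_API_KEY
    default_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
)
```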
@irthomasthomas irthomasthomas marked this pull request as ready for review September 14, 2024 09:39
@irthomasthomas

I made some improvements. I've been using this version for a few days and it seems to work well. It automatically caches continued conversations if they were flagged for caching.

Here is a long conversation with prompt caching enabled. It compares the cached price to what it would have cost without.
[Screenshot: Screenshot_20240913_144918]


irthomasthomas commented Sep 16, 2024

Now that I have played with it for a week, I think I want a menu to manage the cache options:

llm cache on/off - toggle always-on caching
llm cache TTL [minutes] - default time to expire caches
llm cache list - list active keep-alive caches

This will also need a prompt option to specify a keep-alive time:
llm -o cache_prompt 1 -o keep_alive 60 (cache the prompt and keep it active for 60 minutes)

Also, I'd prefer to use flags like --cache, rather than -o cache_prompt 1, if possible.

Currently we can independently choose to cache the system prompt or the user prompt. Is this useful? Are there cases where we would want to cache one but not the other?
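
A cache menu along those lines could plausibly be wired up through LLM's register_commands plugin hook; here is a hypothetical sketch (the commands only echo for now, and persisting the settings is deliberately left out):

```python
import click
import llm


@llm.hookimpl
def register_commands(cli):
    @cli.group()
    def cache():
        "Manage prompt-caching defaults (hypothetical)"

    @cache.command(name="on")
    def cache_on():
        "Toggle always-on caching on"
        click.echo("Always-on caching enabled")

    @cache.command(name="off")
    def cache_off():
        "Toggle always-on caching off"
        click.echo("Always-on caching disabled")

    @cache.command()
    @click.argument("minutes", type=int)
    def ttl(minutes):
        "Set the default cache time-to-live in minutes"
        click.echo(f"Default cache TTL set to {minutes} minutes")

    @cache.command(name="list")
    def cache_list():
        "List active keep-alive caches"
        click.echo("No active caches tracked yet")
```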

irthomasthomas and others added 6 commits November 20, 2024 17:32
- Add Claude 3.5 Haiku model
- Implement prompt caching for conversation history
- Refactor message building logic to support caching
- Limit cache control blocks to maintain efficiency
- Update message structure to align with new Anthropic API
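
"Limit cache control blocks" matters because the API only accepts a small number of cache breakpoints per request (four at the time of writing). One way to enforce that, as an illustrative sketch rather than the PR's actual code:

```python
MAX_CACHE_BLOCKS = 4  # Anthropic's per-request limit on cache breakpoints

def limit_cache_control(content_blocks):
    """Keep cache_control markers only on the last MAX_CACHE_BLOCKS blocks."""
    marked = [block for block in content_blocks if "cache_control" in block]
    for block in marked[:-MAX_CACHE_BLOCKS]:
        del block["cache_control"]
    return content_blocks
```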
Add registration for claude-3.5-sonnet-20241022 model with aliases and update Haiku model configuration for better version management.
…vention

- Renamed `llm_claude_3.py` to `llm_claude_3_cache.py` to indicate caching focus
- Updated model registration aliases to include `-cache` suffix
- Adjusted project name in `pyproject.toml` to reflect caching capability
- Removed beta prompt caching method to use standard messages create

simonw commented Feb 2, 2025

I'm working on an alternative system which should enable this kind of caching using a new feature in LLM core itself - details here:

@simonw simonw closed this Feb 2, 2025