Conversation
- Introduced new caching options for user and system prompts in the `ClaudeOptions` class.
- Implemented caching logic within `ClaudeMessages` to facilitate prompt caching based on user input.
- Updated README.md to reflect the new caching options with usage instructions.

Files changed:

- README.md: Added a new section on caching options and usage.
- llm_claude_3.py:
  - Added `cache_prompt` and `cache_system` options to `ClaudeOptions`.
  - Modified message handling in `ClaudeMessages` to support caching functionality.
  - Updated response processing to include cache control mechanisms.
- Added support for Anthropic's Prompt Caching feature to improve performance and reduce costs.
- Updated the README to include detailed instructions and the benefits of the new caching functionality.
- Modified `llm_claude_3.py` to implement caching options for user and system prompts in the `ClaudeOptions` class.
- Adjusted message-building logic to conditionally apply caching based on user turns.

Changes Summary:

- .gitignore: Added `.artefacts/` to the ignore list.
- README.md: Expanded documentation on prompt caching with usage examples and performance benefits.
- llm_claude_3.py:
  - Updated `cache_prompt` and `cache_system` defaults to None.
  - Revamped the `build_messages` method for enhanced user-turn handling.
  - Modified client initialization to include caching headers.
  - Adjusted message processing when streaming responses.
Now that I have played with it for a week, I think I want a menu to manage the cache options:

llm cache on/off - toggle always-on caching

This will also need a prompt option to specify a keep-alive time. Also, I'd prefer to use flags like --cache rather than -o cache_prompt 1, if possible.

Currently we can independently choose to cache the system prompt or the user prompt. Is this useful? Are there cases where we would want to cache one but not the other?
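Roughly, the interface I have in mind would look like the sketch below; nothing here is implemented, and the subcommand and flag names are only the ones floated above:

```bash
# Proposed interface, not implemented; names are only suggestions.
llm cache on     # turn always-on caching on for subsequent prompts
llm cache off    # turn it back off

# Per-prompt flag instead of -o cache_prompt 1
# (a keep-alive / TTL option would also be needed).
llm --cache --system "reused system prompt" "question"
```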
- Add Claude 3.5 Haiku model
- Implement prompt caching for conversation history
- Refactor message building logic to support caching
- Limit cache control blocks to maintain efficiency
- Update message structure to align with the new Anthropic API
Add registration for claude-3.5-sonnet-20241022 model with aliases and update Haiku model configuration for better version management.
…vention

- Renamed `llm_claude_3.py` to `llm_claude_3_cache.py` to indicate the caching focus
- Updated model registration aliases to include a `-cache` suffix
- Adjusted the project name in `pyproject.toml` to reflect the caching capability
- Removed the beta prompt caching method to use the standard messages create
I'm working on an alternative system which should enable this kind of caching using a new feature in LLM core itself - details here:
I'm not sure I like the interface, to be honest, but see what you think. It's hard to trade off features against complexity.
I didn't think caching should be turned on by default, due to the 25% premium on cache writes.
This first implementation is less flexible than the API allows: the API lets you include both cached and non-cached system and user prompts in a single request. I wasn't sure how to do that without editing the main project's cli.py.
I think ideally the interface would be:

llm --system "not cached system prompt" --cached-system "this system prompt will be cached" --cached-user "a user prompt to be cached" "non-cached prompt"
But how do we handle the case of requesting ONLY a cached user prompt? I think the current CLI demands a prompt, so that would need updating.
Anyway, here is the current implementation.
To use it, you pass the prompt and --system as normal, and then choose whether to cache either one by adding -o cache_prompt 1 and/or -o cache_system 1.

You then get cache metadata returned in the JSON response.
Added options `cache_prompt` and `cache_system`.

To use:
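For illustration, the first request looked something like this (the model alias, file name and prompt text are placeholders: I'm assuming one of the `-cache` aliases, and `big-doc.txt` stands in for my ~10,000-token system prompt file):

```bash
# Illustrative command: model alias, file name and prompt text are placeholders.
# Caches the large system prompt; the user prompt itself is not cached.
llm -m claude-3.5-sonnet-cache \
  --system "$(cat big-doc.txt)" \
  -o cache_system 1 \
  "Summarise the key points of this document."
```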
"usage": {"input_tokens": 10000, "output_tokens": 500, "cache_creation_input_tokens": 10000, "cache_read_input_tokens": 0}}
This first prompt requests to cache the system prompt. I used a file of 10,000 tokens, and as this is the first prompt we see "cache_creation_input_tokens": 10000 returned in the JSON. This means we paid 25% more for those tokens, but future requests using the same system prompt will be discounted 90%, as long as we re-prompt within the cache TTL, currently 5 minutes (refreshed each time).
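The second request re-sends the same system prompt with a new user prompt and the same cache_system flag, along these lines (placeholders as before):

```bash
# Same system prompt as before, new user prompt: reads the system-prompt cache.
llm -m claude-3.5-sonnet-cache \
  --system "$(cat big-doc.txt)" \
  -o cache_system 1 \
  "Now list any open questions the document raises."
```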
"usage": {"input_tokens": 10001, "output_tokens": 500, "cache_creation_input_tokens": 0, "cache_read_input_tokens":10000 }}
This sends a new user prompt but the same system prompt, along with the cache_system flag, which means we use the cached system prompt from the previous command. We can see that we hit the cache from the usage response: "cache_read_input_tokens": 10000
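The third request keeps the cached system prompt and also sends a long user prompt (around 5,000 tokens) with cache_prompt set, so it gets cached too. Something like this, where `long-question.txt` is again a placeholder:

```bash
# Keep reading the system-prompt cache, and also create a user-prompt cache.
llm -m claude-3.5-sonnet-cache \
  --system "$(cat big-doc.txt)" \
  -o cache_system 1 \
  -o cache_prompt 1 \
  "$(cat long-question.txt)"
```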
"usage": {"input_tokens": 15000 "output_tokens": 500, "cache_creation_input_tokens": 5000, "cache_read_input_tokens":10000 }}
This time we CREATE a USER prompt cache, and also READ a SYSTEM cache. Hence:
"cache_creation_input_tokens": 5000, "cache_read_input_tokens":10000
"usage": {"input_tokens": 15000 "output_tokens": 500, "cache_creation_input_tokens": 0, "cache_read_input_tokens" :15000
Finally, running the same command again causes both system and user prompt cache reads.
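That last run is literally the previous command repeated within the five-minute TTL (placeholders as before):

```bash
# Identical command re-run within the TTL: both caches are read, nothing new is created.
llm -m claude-3.5-sonnet-cache \
  --system "$(cat big-doc.txt)" \
  -o cache_system 1 \
  -o cache_prompt 1 \
  "$(cat long-question.txt)"
```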