💾 feat: Anthropic Prompt Caching #3670
Conversation
To understand: this is always enabled by default and there is no UI switch?
Adding a UI switch as we speak.
@danny-avila Wow, you're fast. It was just a question; I don't know if it's really needed. Discussion below.

Discussion: If I understand correctly, according to this page there is a minimum of 1024 tokens for caching to work. Also according to that page, cache writes cost 3.75 and reads 0.30, compared to 3 for normal input. So the worst case would be 4.05 compared to 3, meaning it should always pay off if the cache is used at least once. When a chat already has 1024 tokens, it has probably already been a multi-turn conversation, which makes further turns likely. The only exception is a chat that is only used from time to time (beyond the 5-minute cache lifetime), but even then multi-turn is the more probable case. So statistically, after one cache read it should already pay off, and it would be faster too (the cache makes responses faster). I don't know about images in the cache, though.

The crucial question is: is it true that in, e.g., a 4-minute chat, every part of the chat apart from the last response is only written once to the cache (only one cache-write price for each text part)? In other words, can a cache effectively be "extended", and how does this work for a chat? If that is the case, images are no problem, and the implementation works without issues, I would lean towards no UI switch being needed. But there should probably be an info note somewhere (which a UI switch, on by default, would effectively be, so it would be a good solution even for that, if it is not too much maintenance work).

Tl;dr: If it's not much maintenance work, a UI switch is good one way or the other. What's important is the question above, because it decides whether cache use is recommendable by default.
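For concreteness, here is a minimal sketch of the break-even arithmetic above, using the per-million-token prices quoted in this comment (3.00 normal input, 3.75 cache write, 0.30 cache read). The helper names are illustrative only, and the sketch ignores that each turn's prompt grows.

```ts
// Prices per million input tokens, as quoted above.
const INPUT = 3.0;        // normal input
const CACHE_WRITE = 3.75; // writing a prefix to the cache (1.25x)
const CACHE_READ = 0.3;   // reading a cached prefix (0.1x)

// Cost in dollars of sending the same prompt prefix of `tokens` tokens over `turns` requests.
const withoutCache = (tokens: number, turns: number) =>
  (tokens / 1e6) * INPUT * turns;
const withCache = (tokens: number, turns: number) =>
  (tokens / 1e6) * (CACHE_WRITE + CACHE_READ * (turns - 1));

console.log(withoutCache(1_000_000, 1), withCache(1_000_000, 1)); // 3.00 vs 3.75 (worst case: prefix never reused)
console.log(withoutCache(1_000_000, 2), withCache(1_000_000, 2)); // 6.00 vs 4.05 (a single cache read already wins)
```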
Yes, I agree; that's why I decided to make it true by default.
Images are not cached in my implementation. This is due to the limit of 5 cache "blocks", and each image costs one block. It's difficult to wrap my head around, and maintaining image caches adds complexity. But I'm also not sure, I wonder if they do get cached once they are no longer the most recent user messages (the current user message and the one prior, which act as cache "stamps").
It's nice to have, so for that reason I will include it; it's a quick implementation.
Can't do this, it's an API thing on their end. From my understanding, the cache seems to be kept active as long as you use it in 5-minute intervals. Something programmatic to keep the cache "warm" is a bit complex and costly, and defeats the purpose (it's only kept warm if you keep the turns going). Honestly, I speak with some confidence here, but it needs some rigorous testing. The debug logs offer information to this end, but I have not had time to evaluate them. From my own experience, and judging from the transaction records, it is saving me a lot of money on those longer-context conversations.
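To make the cache "stamp" idea concrete, the request body implied by this approach looks roughly like the following. This is a sketch based on Anthropic's documented cache_control block format (with the prompt-caching beta header of that period), not necessarily the exact payload this PR builds.

```ts
// Sketch of an Anthropic Messages API request with incremental prompt caching.
// Beta header at the time: "anthropic-beta: prompt-caching-2024-07-31".
// The two most recent user messages carry cache breakpoints ("stamps"):
// the older one reads the prefix cached on the previous turn,
// the newer one writes the extended prefix for the next turn.
const request = {
  model: 'claude-3-5-sonnet-20240620',
  max_tokens: 1024,
  messages: [
    { role: 'user', content: [{ type: 'text', text: 'First question' }] },
    { role: 'assistant', content: [{ type: 'text', text: 'First answer' }] },
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Second question', cache_control: { type: 'ephemeral' } },
      ],
    },
    { role: 'assistant', content: [{ type: 'text', text: 'Second answer' }] },
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Third (current) question', cache_control: { type: 'ephemeral' } },
      ],
    },
  ],
};
```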
Caching mechanism and implementation
I'm currently using Claude Pro, and 5 images (and attachments in general) are also the maximum there. (After reading the message again: 4 cache blocks are the limit, so that has nothing to do with it.)

"Prompt Caching references the entire prompt - tools, system, and messages (in that order) up to and including the block designated with cache_control."

According to that understanding of the sentence above, I also find this sentence: "Note that changes to tool_choice or the presence/absence of images anywhere in the prompt will invalidate the cache, requiring a new cache entry to be created."

Tl;dr: To my understanding, marking the new user prompt (for the new cache write) and the last user prompt (for the cache read), and maybe, as a bonus, the system prompt if someone uses a large one, is completely sufficient (and will cache images as well).
So they probably are cached already :).

Logging cache activity

Do you log the cache activity?

Holding Cache alive

Keeping the cache "warm" is exactly what I meant in the other response. It would work by sending exactly the same last request again, but with […]

Edit: When the user has already sent a new message, of course no request at 4 min 48 s for the old one is needed, as the newly sent prompt already keeps the cache alive.
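Purely as an illustration of that keep-alive idea (not something this PR implements), a refresh scheduled just before the 5-minute expiry could look like this; note that every refresh is still a billed request that reads the cache, which is part of why it was called costly above.

```ts
// Hypothetical keep-warm helper: replay the last request shortly before the
// 5-minute cache TTL expires, unless the user has already sent a new message
// (in which case the new prompt refreshes the cache anyway).
const CACHE_TTL_MS = 5 * 60 * 1000;
const REFRESH_AT_MS = CACHE_TTL_MS - 12 * 1000; // ~4 min 48 s

function scheduleCacheRefresh(
  replayLastRequest: () => Promise<void>, // assumed callback that re-sends the last request
  userHasNewMessage: () => boolean,
) {
  return setTimeout(async () => {
    if (userHasNewMessage()) return; // nothing to do, the cache is already warm
    await replayLastRequest();       // billed: cache read plus a new completion
  }, REFRESH_AT_MS);
}
```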
Commits

* wip: initial cache control implementation, add typing for transactions handling
* feat: first pass of Anthropic Prompt Caching
* feat: standardize stream usage as pass in when calculating token counts
* feat: Add getCacheMultiplier function to calculate cache multiplier for different valueKeys and cacheTypes
* chore: imports order
* refactor: token usage recording in AnthropicClient, no need to "correct" as we have the correct amount
* feat: more accurate token counting using stream usage data
* feat: Improve token counting accuracy with stream usage data
* refactor: ensure more accurate than not token estimations if custom instructions or files are not being resent with every request
* refactor: cleanup updateUserMessageTokenCount to allow transactions to be as accurate as possible even if we shouldn't update user message token counts
* ci: fix tests
Summary
Closes #3661
This PR introduces prompt caching functionality for the Anthropic API, significantly reducing costs for longer, multi-turn conversations using Claude 3.5 Sonnet and Claude 3 Haiku models (Anthropic states that Opus will be added later).
The implementation is an adaptation of Anthropic's cookbook for Multi-turn Conversation with Incremental Caching.
With this change comes much more accurate transaction recording for the Anthropic API, as the stream usage metadata is leveraged for provider-reported counts of input tokens, cache input tokens, and output tokens.

This PR also introduces a more accurate token counting method for message payload building, which is likewise made possible by the stream usage metadata, with some caveats highlighted below.
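For illustration, the usage metadata involved looks roughly like the shape below; the field names are the ones the Anthropic API reports when caching is active, while the spend helper is a simplified sketch in the spirit of the getCacheMultiplier commit, not the PR's actual transaction logic.

```ts
// Usage reported by the Anthropic API (per request) when prompt caching is active.
interface AnthropicStreamUsage {
  input_tokens: number;                  // uncached input tokens
  cache_creation_input_tokens?: number;  // tokens written to the cache (billed at ~1.25x)
  cache_read_input_tokens?: number;      // tokens read from the cache (billed at ~0.1x)
  output_tokens: number;
}

// Simplified prompt spend using the multipliers discussed in this thread.
function promptSpend(usage: AnthropicStreamUsage, inputPricePerMTok: number): number {
  const write = usage.cache_creation_input_tokens ?? 0;
  const read = usage.cache_read_input_tokens ?? 0;
  return ((usage.input_tokens + write * 1.25 + read * 0.1) * inputPricePerMTok) / 1e6;
}
```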
Stream usage should only be used for user message token count re-calculation if:

* files are not being resent with every request (false, with no attachments), and
* promptPrefix (custom instructions) is not set (also default behavior, unless the user is using presets); see the sketch below.

In these cases, the legacy token estimations would be more accurate. The reason is that instructions and files can be removed if these conditions are met, which would cause token counts to be inaccurate earlier in the thread.
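A guard along these lines is what the caveats imply; the option names here are illustrative assumptions (only promptPrefix appears verbatim above), not the actual settings keys used in the codebase.

```ts
// Hypothetical check: only trust stream usage for re-calculating the user
// message token count when nothing besides the messages themselves is in the payload.
function shouldUseStreamUsageForUserTokens(opts: {
  resendFiles: boolean;   // assumed name for the file-resend setting
  hasAttachments: boolean;
  promptPrefix?: string;  // custom instructions
}): boolean {
  return !opts.resendFiles && !opts.hasAttachments && !opts.promptPrefix;
}
```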
Revisiting conversations created before this change was also considered, and it was reasoned that this is fine, since the stream usage metadata would make those conversations more accurate, provided the above caveats do not apply.
In any case, transaction data is more accurate than before for all Anthropic models, and none of the token accounting affects prompt caching; caching only depends on the models being used and on the messages payload meeting the requirements of the Anthropic API.
TODO:
In light of the caveats, adding system messages to the orderedMessages token accounting could help us create accurate counts in these cases, potentially as a separate message in the UI. ChatGPT does this through "hidden" system messages, and we could add an option to toggle them as well as view their current placements.

More details
* Added an addCacheControl function to handle the addition of cache control to user messages (a sketch follows below).
* Updated AnthropicClient to support prompt caching for compatible models.
* Added tests for the addCacheControl function to ensure proper functionality.
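A rough sketch of what such a helper could look like, following the two-user-message "stamp" approach described earlier in this thread; the actual implementation may differ (for example, it reportedly does not cache images).

```ts
type ContentBlock = {
  type: 'text';
  text: string;
  cache_control?: { type: 'ephemeral' };
};
type Message = { role: 'user' | 'assistant'; content: string | ContentBlock[] };

// Hypothetical sketch: mark the two most recent user messages as cache
// breakpoints so the prefix cached on the previous turn is read and the
// current turn extends it.
function addCacheControl(messages: Message[]): Message[] {
  const result = messages.map((m) => ({ ...m }));
  let remaining = 2;
  for (let i = result.length - 1; i >= 0 && remaining > 0; i--) {
    if (result[i].role !== 'user') continue;
    const content = result[i].content;
    const blocks: ContentBlock[] =
      typeof content === 'string'
        ? [{ type: 'text', text: content }]
        : content.map((b) => ({ ...b }));
    if (blocks.length === 0) continue;
    blocks[blocks.length - 1].cache_control = { type: 'ephemeral' };
    result[i].content = blocks;
    remaining -= 1;
  }
  return result;
}
```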
Change Type
Checklist