
💾 feat: Anthropic Prompt Caching #3670

Merged: 11 commits merged into main from feat/anthropic-cache-control on Aug 17, 2024

Conversation

@danny-avila (Owner) commented Aug 17, 2024

Summary

Closes #3661

This PR introduces prompt caching functionality for the Anthropic API, significantly reducing costs for longer, multi-turn conversations using Claude 3.5 Sonnet and Claude 3 Haiku models (Anthropic states that Opus will be added later).

The implementation is an adaptation of Anthropic's cookbook for Multi-turn Conversation with Incremental Caching.

This change also brings much more accurate transaction recording for the Anthropic API, as the stream usage metadata is leveraged for provider-reported counts of input tokens, cache input tokens, and output tokens.

Furthermore, this PR introduces a more accurate token counting method for building the message payload, which is also made possible by the stream usage metadata, with some caveats highlighted below.

Stream usage should only be used for user message token count re-calculation if (see the sketch after this list):

  • The stream usage is available, with input tokens greater than 0,
  • the current provider client (currently Anthropic only) provides a function to calculate the current token count,
  • files are being resent with every message (default behavior; or if false, with no attachments),
  • the promptPrefix (custom instructions) is not set (also default behavior, unless the user is using presets).

When these conditions are not met, the legacy token estimations would be more accurate. The reason is that custom instructions and file attachments can be dropped from later requests, which would make the stream-based counts recorded earlier in the thread inaccurate.
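A condensed sketch of that gating check follows. The names (`streamUsage`, `calculateCurrentTokenCount`, `resendFiles`, `attachments`, `promptPrefix`) are illustrative placeholders, not the exact identifiers used in the codebase:

```js
// Hypothetical helper: decide whether stream usage metadata may replace
// the legacy estimation for the user message token count.
function shouldUseStreamUsage({ streamUsage, client, resendFiles, attachments, promptPrefix }) {
  return (
    streamUsage != null &&
    streamUsage.input_tokens > 0 &&
    // the provider client must expose a way to derive the current token count
    typeof client.calculateCurrentTokenCount === 'function' &&
    // files must be part of every request, or there must be none at all
    (resendFiles === true || (attachments?.length ?? 0) === 0) &&
    // custom instructions would skew the count, so they must be absent
    !promptPrefix
  );
}
```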

Conversations from before this change were also considered, and it was reasoned that this is fine, since the stream usage metadata will make those conversations more accurate provided the above caveats do not apply.

In any case, transaction data is more accurate than before for all Anthropic models, and none of the token accounting affects prompt caching; caching only depends on the models being used and on the messages payload meeting the requirements of the Anthropic API.

TODO:

In light of the caveats, adding system messages to the orderedMessages token accounting could help us create accurate counts in these cases, potentially as a separate message in the UI. ChatGPT does this through "hidden" system messages, and we could add an option to toggle them as well as view their current placements.


More details

  • Added a new addCacheControl function to handle the addition of cache control to user messages (see the sketch after this list).
  • Updated the AnthropicClient to support prompt caching for compatible models.
  • Implemented logic to determine if a model supports cache control and apply it when appropriate.
  • Added new types and updated existing ones to support structured token transactions.
  • Modified the token spending logic to account for cache write and read operations.
  • Created unit tests for the addCacheControl function to ensure proper functionality.
  • Updated transaction handling to support structured token usage recording.
  • Added logic for calculating the correct user message token count from stream usage metadata, allowing us to discard estimations.
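As a rough illustration of the incremental-caching pattern the addCacheControl helper follows: the two most recent user messages are marked with an ephemeral cache_control block, so the previous turn acts as a cache read and the current turn sets the next cache breakpoint. The function name is from this PR, but the body below is a simplified sketch under that assumption, not the exact implementation:

```js
// Simplified sketch: mark the last two user messages with cache_control
// so the prompt prefix up to those points can be written to / read from
// Anthropic's prompt cache.
function addCacheControl(messages) {
  let marked = 0;
  for (let i = messages.length - 1; i >= 0 && marked < 2; i--) {
    const message = messages[i];
    if (message.role !== 'user') {
      continue;
    }
    if (typeof message.content === 'string') {
      message.content = [{ type: 'text', text: message.content }];
    }
    // Anthropic reads the cache breakpoint from the marked content block
    message.content[message.content.length - 1].cache_control = { type: 'ephemeral' };
    marked++;
  }
  return messages;
}
```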

Change Type

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Checklist

  • My code adheres to this project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

@danny-avila danny-avila merged commit a45b384 into main Aug 17, 2024
2 checks passed
@danny-avila danny-avila deleted the feat/anthropic-cache-control branch August 17, 2024 07:24
danny-avila added a commit that referenced this pull request Aug 17, 2024
* wip: initial cache control implementation, add typing for transactions handling

* feat: first pass of Anthropic Prompt Caching

* feat: standardize stream usage as pass in when calculating token counts

* feat: Add getCacheMultiplier function to calculate cache multiplier for different valueKeys and cacheTypes

* chore: imports order

* refactor: token usage recording in AnthropicClient, no need to "correct" as we have the correct amount

* feat: more accurate token counting using stream usage data

* feat: Improve token counting accuracy with stream usage data

* refactor: ensure more accurate than not token estimations if custom instructions or files are not being resent with every request

* refactor: cleanup updateUserMessageTokenCount to allow transactions to be as accurate as possible even if we shouldn't update user message token counts

* ci: fix tests
@DoS007 commented Aug 26, 2024

Just to understand: this is always enabled by default and there is no UI switch?

@danny-avila (Owner, Author)

Just to understand: this is always enabled by default and there is no UI switch?

Adding a UI switch as we speak

@DoS007 commented Aug 26, 2024

@danny-avila Wow, you're fast. It was just a question for discussion; I don't know if it's really needed.

Discussion: If I understand correctly, according to this page there is a minimum of 1024 tokens for caching to work. Also according to that page, cache writes cost 3.75 and reads 0.30, compared to 3 for normal input. So the worst case would be 4.05 compared to 3, and it should always be worth it if the cache is read at least once. When a chat already has 1024 tokens, it is probable that there has already been a multi-turn conversation, which makes further multi-turn use probable too. There is only one exception: a chat that is only used from time to time (beyond the 5-minute cache lifetime). But even then, multi-turn use is more likely than not. So statistically, after one cache hit it should already pay off, and it would be faster too (the cache makes responses faster).
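For reference, a quick back-of-the-envelope check of that break-even argument, using the published Claude 3.5 Sonnet prices (base input $3, cache write $3.75, cache read $0.30 per million tokens):

```js
// Cost per million prompt tokens (Claude 3.5 Sonnet, USD)
const base = 3.0;        // normal input
const cacheWrite = 3.75; // 1.25x base
const cacheRead = 0.3;   // 0.1x base

// Worst case: the cached prefix is never read again -> 25% overhead
console.log(cacheWrite / base); // 1.25

// One write followed by a single read of the same prefix,
// versus sending that prefix uncached twice
console.log((cacheWrite + cacheRead) / (base + base)); // 0.675 -> already cheaper
```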

I don't know about images in cache though.

The crucial question is: in a chat of, e.g., 4 minutes, is every chat part apart from the last response written to the cache only once (i.e., the cache-write price is paid only once for each text part)? In other words, can a cache effectively be "extended", and how does this work for a chat?

If this is the case, images are no problem, and the implementation works without problems, I would lean towards no UI switch being needed. But there should probably be some info about it somewhere (which a UI switch, on by default, would effectively be, so it would be a good solution even for that, if it is not too much maintenance work).

Tl;dr: if it is not much maintenance work, a UI switch is good one way or the other. What's important is the question above, because it decides whether cache use is recommendable by default.
_
Another thing about the caching functionality:
What would be very nice would be an option to extend the caching time, e.g. "Keep cache alive for a further x times" or "keep cache alive longer", with a choice of 10, 15, ... minutes (a cache read issued 15 seconds or so before the cache time ends). E.g. I might use 1 or 2 extra reads; the last one would add 10 minutes (plus the basic 5 minutes, 15 minutes total). Because a cache read costs 1/10 of normal input, this is a good heuristic.

@danny-avila (Owner, Author)

But even then, multi-turn use is more likely than not. So statistically, after one cache hit it should already pay off, and it would be faster too (the cache makes responses faster).

Yes, I agree; that's why I decided to make it true by default.

I don't know about images in cache though.

Images are not cached in my implementation. This is due to the limit of 5 cache "blocks", and each image costs one block. It's difficult to wrap my head around, and maintaining image caches adds complexity.

But I'm also not sure, I wonder if they do get cached once they are no longer the most recent user messages (the current user message and the one prior, which act as cache "stamps").

a UI switch is good

It's nice to have, so for that reason I will include it; it's a quick implementation.

What would be very nice would be an option to extend the caching time, e.g. "Keep cache alive for a further x times" or "keep cache alive longer", with a choice of 10, 15, ... minutes (a cache read issued 15 seconds or so before the cache time ends). E.g. I might use 1 or 2 extra reads; the last one would add 10 minutes (plus the basic 5 minutes, 15 minutes total). Because a cache read costs 1/10 of normal input, this is a good heuristic.

Can't do this, it's an API thing on their end. From my understanding, it seems the cache is kept active as long as you use it within 5-minute intervals. Something programmatic to keep the cache "warm" is a bit complex and costly, and defeats the purpose (it's only kept warm if you keep the turns going).

Honestly, I speak with some confidence here, but this needs some rigorous testing. The debug logs offer information to this end, but I haven't had time to evaluate them. From my own experience, and judging from the transaction records, it is saving me a lot of money on those longer-context conversations.

@DoS007 commented Aug 27, 2024

Caching mechanism and implementation

Images are not cached in my implementation. This is due to the limit of 5 cache "blocks", and each image costs one block. It's difficult to wrap my head around, and maintaining image caches adds complexity.

I'm currently using Claude Pro, and 5 images (and attachments in general) are also the maximum there. (After reading the message again: 4 cache blocks are the limit, so that has nothing to do with it.)

But I'm also not sure, I wonder if they do get cached once they are no longer the most recent user messages (the current user message and the one prior, which act as cache "stamps").

"Prompt Caching references the entire prompt - tools, system, and messages (in that order) up to and including the block designated with cache_control."
In the chat example in their docs they use an additional cache mark for the system prompt which, according to the sentence above, isn't needed for the single example conversation they show: the cache mark in the messages would implicitly cache the system prompt as well. I read it as being useful mainly for new chats with the same system prompt, which is why they used it there (I think they tried to outline that with "... long system prompt").

Given that understanding of the sentence above, I find the sentence
"Each of these elements [Tools/system messages/messages/images/tool use and tool results] can be marked with cache_control to enable caching for that portion of the request."
a bit misleading.

"Note that changes to tool_choice or the presence/absence of images anywhere in the prompt will invalidate the cache, requiring a new cache entry to be created."
This can also be a bit misleading, because they mean changes to the images, not that the mere presence of images invalidates the cache.

Tl;dr: to my understanding, marking the new user prompt (for the new cache write) and the last user prompt (for the cache read), plus maybe, as a bonus, the system prompt if someone uses a large one, is completely sufficient (and will also cache images).

"Images are not cached in my implementation"

So they probably are cached already :).
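A hedged sketch of what that reading implies for the request body (illustrative payload, not LibreChat's exact one): a single cache_control breakpoint on the latest user turn covers the tools, system prompt, and all earlier messages, images included, because the cached prefix runs up to and including the marked block.

```js
const body = {
  model: 'claude-3-5-sonnet-20240620',
  max_tokens: 1024,
  system: [{ type: 'text', text: 'You are a helpful assistant.' }],
  messages: [
    {
      role: 'user',
      content: [
        { type: 'image', source: { type: 'base64', media_type: 'image/png', data: '...' } },
        { type: 'text', text: 'What is in this image?' },
      ],
    },
    { role: 'assistant', content: 'A diagram of the caching flow.' },
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: 'And how is it priced?',
          // Everything above this breakpoint (tools, system, prior messages,
          // including the image) is part of the cached prefix.
          cache_control: { type: 'ephemeral' },
        },
      ],
    },
  ],
};
```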

Logging cache activity

Honestly, I speak with some confidence here, but this needs some rigorous testing. The debug logs offer information to this end, but I haven't had time to evaluate them. From my own experience, and judging from the transaction records, it is saving me a lot of money on those longer-context conversations.

Do you log cache_read_input_tokens and cache_creation_input_tokens?

Holding Cache alive

What would be very nice would be an option to extend the caching time, e.g. "Keep cache alive for a further x times" or "keep cache alive longer", with a choice of 10, 15, ... minutes (a cache read issued 15 seconds or so before the cache time ends). E.g. I might use 1 or 2 extra reads; the last one would add 10 minutes (plus the basic 5 minutes, 15 minutes total). Because a cache read costs 1/10 of normal input, this is a good heuristic.

Can't do this, it's an API thing on their end. From my understanding, it seems the cache is kept active as long as you use it within 5-minute intervals. Something programmatic to keep the cache "warm" is a bit complex and costly, and defeats the purpose (it's only kept warm if you keep the turns going).

Keeping the cache "warm" is exactly what I meant in the other response. It would work by sending exactly the same last request again but with "max_tokens": 1 (e.g. 4 min 48 s after the last request went idle). The rationale is similar to the rationale for why using prompt caching is recommendable in the first place: one cache read costs far less than one cache write (read cost is only 8% of the write cost). So when someone gets a response, thinks about it, searches for something, and returns after 7 minutes, the cache is still warm and cheap. Even if this only happens in 1/10 of all cases, it's still cheaper overall. (How many times someone will want to keep the cache alive is individual, which is why I spoke about a settings option for how many times the cache should be kept alive.)

Edit: when the user has already sent a new message, of course no request at 4 min 48 s is needed for the old one, as the newly sent prompt already keeps the cache alive.
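A minimal sketch of what such a keep-alive ping could look like, assuming a plain call to Anthropic's Messages API with the prompt-caching beta header; the scheduling, timing, and payload reuse would be up to LibreChat, and nothing like this exists in the PR:

```js
// Hypothetical keep-warm ping: re-send the last request with max_tokens: 1
// shortly before the 5-minute cache window would expire.
async function keepCacheWarm(lastRequestBody, apiKey) {
  const response = await fetch('https://api.anthropic.com/v1/messages', {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      'x-api-key': apiKey,
      'anthropic-version': '2023-06-01',
      'anthropic-beta': 'prompt-caching-2024-07-31',
    },
    // Same prompt as before, but only one output token is requested,
    // so the call mostly costs cheap cache reads.
    body: JSON.stringify({ ...lastRequestBody, max_tokens: 1, stream: false }),
  });
  const data = await response.json();
  // usage.cache_read_input_tokens / cache_creation_input_tokens confirm a cache hit
  return data.usage;
}

// e.g. schedule roughly 12 seconds before the 5-minute window closes:
// setTimeout(() => keepCacheWarm(lastRequestBody, process.env.ANTHROPIC_API_KEY), (5 * 60 - 12) * 1000);
```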

kenshinsamue pushed a commit to intelequia/LibreChat that referenced this pull request Sep 17, 2024