feat: enable prompt cache for anthropic #631
Conversation
Desktop App for this PR: The following build is available for testing. The app is signed and notarized for macOS. After downloading, unzip the file and drag Goose.app to your Applications folder. This link is provided by nightly.link and will work even if you're not logged into GitHub.
Can you do a quick back of the envelope estimation of cost and savings for turning on prompt caching based on the Anthropic pricing?
Sure, I added one screenshot from the Anthropic post, along with my estimations.
Awesome, looks like we can expect some big savings here!
this is really cool @yingjiehe-xyz and I think people will appreciate this a lot.
I wonder if something similar exists for OpenRouter (as people like to use Anthropic that way), but yes, very nice!
Yes, it is available in OpenRouter: https://openrouter.ai/docs/prompt-caching. I am planning on this for the next step.
Enable prompt cache for Anthropic following https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching#how-prompt-caching-works.
Generally, we add

```json
"cache_control": {"type": "ephemeral"}
```

into the tool, system, and message sections. A cache hit can be verified via the usage fields in the response. Currently, "ephemeral" is the only supported cache type, and it corresponds to a 5-minute cache lifetime.
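For illustration, here is a minimal sketch of such a request using the Anthropic Python SDK (not the Rust code from this PR); the model name and prompt text are placeholder assumptions:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Note: older API versions required the beta header
# "anthropic-beta: prompt-caching-2024-07-31"; current versions support caching natively.
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # assumed model; any cache-capable model works
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are Goose, a developer agent...",  # hypothetical system prompt
            # Mark this block as cacheable; "ephemeral" (5-minute TTL) is the
            # only supported cache type today.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Hello"}],
)

# Usage reports cached tokens separately: tokens written to the cache on this
# call vs. tokens served from the cache at the discounted rate.
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```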
Cost saving: From the Anthropic pricing, cache writes cost 1.25× the base input token price and cache reads cost 0.1×. Assume a conversation of `N` turns, let `S` denote the system prompt length in tokens, and let `M` be the average number of new tokens (user inputs + outputs) per turn.
Before cache: our estimated cost is around

```
S + (S + M) + (S + M * 2) + ... + (S + M * (N - 1)) = S * N + (N - 1) * N / 2 * M
```

After cache: each turn's new tokens are written to the cache once (at 1.25× the base input price) and all previously sent tokens are read back from the cache (at 0.1×), so the estimated cost is around

```
(S + M * (N - 1)) * 1.25 + (S + (S + M) + ... + (S + M * (N - 1))) * 0.1
  = (S + M * (N - 1)) * 1.25 + (S * N + (N - 1) * N / 2 * M) * 0.1
```
To compare the two results, we need to compare `(S + M * (N - 1)) * 1.25` against `(S * N + (N - 1) * N / 2 * M) * 0.9`: if `(S + M * (N - 1)) * 1.25` is greater, then caching costs more, and vice versa. Normally our `S` is greater than 1000 tokens and `M` is large as well; since the uncached cost grows quadratically with `N` while the cache-write term grows only linearly, `(S + M * (N - 1)) * 1.25` should be much smaller, which means caching reduces our cost (see the back-of-the-envelope script below).

Some estimations from the Anthropic post are attached as a screenshot.
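To make the comparison concrete, here is a small back-of-the-envelope script over the formulas above; the values of `S`, `M`, and `N` are illustrative assumptions, not measurements from this PR:

```python
# Back-of-the-envelope comparison of the two cost formulas above.
S = 3000   # system prompt tokens (assumed)
M = 1000   # average new tokens per turn (assumed)
N = 10     # number of turns (assumed)

before = S * N + (N - 1) * N // 2 * M          # no caching
write  = (S + M * (N - 1)) * 1.25              # cache writes at 1.25x base price
read   = (S * N + (N - 1) * N // 2 * M) * 0.1  # cache reads at 0.1x base price
after  = write + read

print(f"before={before}, after={after:.0f}, saving={1 - after / before:.0%}")
# -> before=75000, after=22500, saving=70%
```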
Test with `just run-ui` and response verification:
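For reference, a cache hit shows up in the `usage` section of the API response; a sketch of the shape to look for (the numbers here are illustrative, not from this test run):

```json
{
  "usage": {
    "input_tokens": 42,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 11520,
    "output_tokens": 256
  }
}
```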