feat: Add Anthropic prompt caching support, add example #1006

Merged
merged 14 commits into main from prompt_caching on Sep 20, 2024

Conversation

vblagoje
Member

@vblagoje vblagoje commented Aug 19, 2024

Why:

Introduces prompt caching for AnthropicChatGenerator. Because prompt caching will be enabled by default in the near future, we don't add a new init parameter for it.

What:

  • Added a new feature for prompt caching: Enables caching of prompts for Anthropic LLMs to avoid repeated data fetches, reducing processing time and improving efficiency.
  • Implemented conditional usage of prompt caching based on configuration: Allows users to enable or disable prompt caching through a configuration flag, giving them control over this feature based on their requirements.
  • Refined message handling for system and chat interactions: Modifies how system and user chat messages are formatted and processed.
  • Updated project configurations: Allow print statements in examples

How can it be used:

  • Set the cache_control meta field on a ChatMessage.
  • In the init of AnthropicChatGenerator, enable caching via extra_headers in generation_kwargs.

See the integrations/anthropic/example/prompt_caching.py example for detailed usage; a minimal sketch of the pattern follows below.
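Roughly, the usage looks like the following. This is a minimal sketch, assuming the integration's import path and the current haystack-ai ChatMessage API; the model name and message contents are placeholders, not taken from the example file.

```python
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.anthropic import AnthropicChatGenerator

# Enable the prompt caching beta via extra_headers in generation_kwargs.
generator = AnthropicChatGenerator(
    model="claude-3-5-sonnet-20240620",
    generation_kwargs={"extra_headers": {"anthropic-beta": "prompt-caching-2024-07-31"}},
)

# Mark the long, reusable part of the prompt (e.g. fetched docs) as cacheable.
system_message = ChatMessage.from_system("<long instructions or fetched documents>")
system_message.meta["cache_control"] = {"type": "ephemeral"}

result = generator.run(messages=[system_message, ChatMessage.from_user("First question?")])
# The usage metadata reports cache_creation_input_tokens / cache_read_input_tokens.
print(result["replies"][0].meta.get("usage"))
```

On the first call the cache is written; subsequent calls that reuse the same cached prefix should report cache_read_input_tokens greater than zero in the usage metadata.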

How did you test it:

  • Added unit tests
  • Using integrations/anthropic/example/prompt_caching.py example and additional manual tests.

@vblagoje vblagoje changed the title feat: Add prompt caching, add example feat: Add Anthropic prompt caching support, add example Aug 19, 2024
@github-actions github-actions bot added the type:documentation Improvements or additions to documentation label Aug 19, 2024
@vblagoje
Member Author

vblagoje commented Aug 19, 2024

cc @Emil-io give us some UX feedback 🙏

@TuanaCelik
Member

Hey @vblagoje - here is the feedback you asked for:

  • Although I understand you're saying that prompt caching will be enabled by default in the near future, that suggests users would also be able to turn it off if needed/wanted. So my intuition is that it would still be beneficial to add an init parameter for this, or something similar, which in the future could indeed be used to disable it.
  • I don't quite see how the example showcases the benefit of/need for prompt caching. Can you explain a bit further, please? It would help me review.

@vblagoje
Member Author

Thanks for the feedback:

  • Prompt caching is turned on at the ChatMessage level; see how the Anthropic examples add cache_control to a message here. There will be no need to set anything at the chat generator level.
  • In the example, the benefit is that the fetched doc is cached and reused for subsequent inference with the questions (everything happens on the Anthropic side, so it is not that visible).
    • The only visible effect is speed: run the example and you'll notice that the second and subsequent questions (if we add them) are answered noticeably faster.

@Emil-io

Emil-io commented Aug 20, 2024

Hey @vblagoje - first of all, this looks very interesting! Here is my feedback; feel free to correct me if I've made some false assumptions.

1. How this fits into Haystack Pipelines
I am trying to figure out the bigger picture and how this fits into Haystack. Prompt caching works fine when the LLM is used as a standalone component, but Haystack is not built for that. Moving to something with retrieval, I assume it makes sense to cache a long system prompt, since this is the only part that stays constant? But this would not perfectly align with the way the prompt builder is designed, as it also allows for dynamic changes anywhere in the Jinja prompt template. Still, it might be nice to have an example along those lines, as I assume most people would not use this as a standalone component but inside some more complex pipeline logic.

2. Specifying the Caching
To correctly specify the caching, the user definitely has to see this example (as this is also not explained in the documentation of the Anthropic Chat Component). Is it intended this way?

@vblagoje
Member Author

Thanks for the feedback, Emil.

  1. Yes, it should work with pipelines. I discovered one bug/oversight in ChatPromptBuilder where we don't copy meta from messages, so right now prompt caching won't work with ChatPromptBuilder. Having said that, as long as we set cache_control somewhere, even in a custom component just before the LLM (sketched below), Anthropic caching should work in pipelines.

  2. Everything revolves around the cache_control meta field of the ChatMessage; prompt caching is rather unobtrusive, and that's how the authors of these APIs intended it to be.
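For reference, the workaround mentioned in point 1 could look roughly like this. It is a sketch only, assuming the standard Haystack 2.x custom component API; the component name CacheControlMarker is illustrative and not part of this PR.

```python
from typing import List

from haystack import component
from haystack.dataclasses import ChatMessage


@component
class CacheControlMarker:
    """Marks the first (system) message as cacheable right before it reaches the generator."""

    @component.output_types(messages=List[ChatMessage])
    def run(self, messages: List[ChatMessage]):
        if messages:
            messages[0].meta["cache_control"] = {"type": "ephemeral"}
        return {"messages": messages}
```

Placed between ChatPromptBuilder and AnthropicChatGenerator in a pipeline, such a component re-adds the meta field that the builder currently drops.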

@vblagoje
Member Author

I've added prompt caching data to be printed to stdout, confirming that prompt caching is active.

@vblagoje
Member Author

@julian-risch please run the example yourself to see the prompt caching effect.
See the discussions above between Tuana, Emil, and me regarding this particular solution.
I didn't add tests until we agree on this approach, but it should be trivial to verify: the example prints proof of caching.
Note the bug discovered in the process of testing.

@vblagoje
Member Author

@Emil-io have you tried the prompt caching example? @TuanaCelik can you take a look once again and run the example?

@julian-risch
Member

I tried it out and the caching works for me. I tried to measure the speedup, but to no avail: time to first token did not seem to improve for me whether I turned caching off or on. Could you double-check that? It would be important for a convincing example.

Other feedback: when I wanted to turn off caching, at first I only commented out generation_kwargs={"extra_headers": {"anthropic-beta": "prompt-caching-2024-07-31"}} and forgot to comment out final_prompt_msg.meta["cache_control"] = {"type": "ephemeral"}. So I ran into anthropic.BadRequestError: Error code: 400 - {'type': 'error', 'error': {'type': 'invalid_request_error', 'message': 'system.0.cache_control: Extra inputs are not permitted'}}. For a better developer experience, we could check that the extra header is set when a message with cache_control is sent to the generator and explain to the user that the header needs to be set (see the sketch below). Otherwise the functionality looks good to me. The main use case I see is reducing TTFT when there is a long system message.
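A possible shape for that check, sketched as a standalone helper rather than the generator's actual code (the function name and warning text are assumptions):

```python
import logging
from typing import Any, Dict, List

from haystack.dataclasses import ChatMessage

logger = logging.getLogger(__name__)


def warn_if_cache_header_missing(messages: List[ChatMessage], generation_kwargs: Dict[str, Any]) -> None:
    """Warn when a message carries cache_control but the prompt caching beta header is not set."""
    extra_headers = (generation_kwargs or {}).get("extra_headers", {})
    uses_cache_control = any("cache_control" in (message.meta or {}) for message in messages)
    if uses_cache_control and "anthropic-beta" not in extra_headers:
        logger.warning(
            "A ChatMessage sets cache_control, but generation_kwargs['extra_headers'] does not include "
            "the 'anthropic-beta' prompt caching header; Anthropic will reject the request with a 400 error."
        )
```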

@vblagoje vblagoje marked this pull request as ready for review September 6, 2024 09:42
@vblagoje vblagoje requested a review from a team as a code owner September 6, 2024 09:42
@vblagoje vblagoje requested review from Amnah199 and removed request for a team September 6, 2024 09:42
@vblagoje
Member Author

vblagoje commented Sep 6, 2024

@Amnah199 please have a look, and I'll ask @julian-risch to review as well. Running the example is a must. The speedup with prompt caching is visible, but I expected it to be more prominent. Another, perhaps equally important, benefit is the cost saving from caching. In conclusion, it is still important to add this feature, as users will ask for it.

@Amnah199
Contributor

@vblagoje, I tried the example, but the printed usage for all questions returned 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0. From the comments in the example, I think that shouldn't be the case.
Additionally, response generation time did not significantly improve. Can we tweak the example to make the benefits of caching more obvious?

@Emil-io

Emil-io commented Sep 17, 2024

@Emil-io have you tried the prompt caching example? @TuanaCelik can you take a look once again and run the example?

Hi @vblagoje,
sorry, I somehow overlooked this. Let me know if I should still try it out and run the example.

@vblagoje
Member Author

vblagoje commented Sep 17, 2024

Have you installed the branch version of the Anthropic integration before running the example? And the latest release of haystack-ai?

@vblagoje
Member Author

@Amnah199 @Emil-io the example should be easier to follow now, please try again 🙏

@Amnah199
Contributor

@vblagoje explained the example in more detail and I have tested it. I think this use of prompt caching would make sense in certain use cases. Tagging @julian-risch for reference.

@vblagoje
Member Author

@julian-risch let's integrate this; I can help @dfokina write a paragraph about it in the AnthropicChatGenerator docs.

@julian-risch
Member

julian-risch commented Sep 17, 2024

@vblagoje I am testing the example code right now. I'm still getting the "Cache not used" message with prompt_caching.py. It works with my own test code, so could there be a small issue in prompt_caching.py?

@vblagoje
Member Author

For some reason, Anthropic caching doesn't seem to work on small messages (i.e. a short instruction); perhaps there is a minimum length they require cached content to be. I could recreate the prompt_caching example as an integration test? cc @julian-risch this is also to test what happens when prompt caching stops being a beta; perhaps we'll get some warning, but I doubt an exception. Perhaps we can monitor the Anthropic prompt caching devs and, when prompt caching eventually becomes the default, adjust our code base at that time.

@julian-risch
Member

@vblagoje Ah, true. I found the minimum cacheable length in their docs: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching#cache-limitations In that case, let's use 1024 tokens in the integration test? $3.75 / MTok is the cost for a cache write, so it's still cheap.
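Such an integration test could look roughly like this; the test name, the amount of padding, and the exact usage keys are assumptions based on the discussion above.

```python
import os

import pytest

from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.anthropic import AnthropicChatGenerator


@pytest.mark.skipif(not os.environ.get("ANTHROPIC_API_KEY"), reason="ANTHROPIC_API_KEY not set")
def test_prompt_caching_reports_cache_usage():
    generator = AnthropicChatGenerator(
        generation_kwargs={"extra_headers": {"anthropic-beta": "prompt-caching-2024-07-31"}}
    )

    # Repeat a phrase so the system message comfortably exceeds the ~1024-token minimum cacheable length.
    system_message = ChatMessage.from_system("This is some long background context. " * 200)
    system_message.meta["cache_control"] = {"type": "ephemeral"}

    result = generator.run(messages=[system_message, ChatMessage.from_user("Summarize the context.")])
    usage = result["replies"][0].meta.get("usage", {})

    # The first call writes the cache; repeated calls with the same prefix read from it.
    assert usage.get("cache_creation_input_tokens", 0) > 0 or usage.get("cache_read_input_tokens", 0) > 0
```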

@vblagoje
Member Author

Amazing, will do 🙏

@vblagoje
Member Author

@Amnah199 and @julian-risch - this one should be ready now; let me know if you see any additional opportunities for improvement.

Member

@julian-risch julian-risch left a comment

LGTM! 👍 Please don't forget to write a paragraph in AnthropicChatGenerator about it.

@vblagoje
Member Author

LGTM! 👍 Please don't forget to write a paragraph in AnthropicChatGenerator about it.

Will do 🙏 - keeping this one open until the prompt caching docs are integrated and a new release is made.

@vblagoje
Member Author

vblagoje commented Sep 20, 2024

Docs updated: https://docs.haystack.deepset.ai/docs/anthropicchatgenerator
@dfokina please move around and adjust the prompt caching section in the docs as you see fit.

@vblagoje vblagoje merged commit 36f16c1 into main Sep 20, 2024
11 checks passed
@vblagoje vblagoje deleted the prompt_caching branch September 20, 2024 07:49
@vblagoje
Member Author

Prompt caching is available in the anthropic-haystack integration from v1.1.0 onward.


Successfully merging this pull request may close these issues.

Support prompt caching in Anthropic generators