feat: Add Anthropic prompt caching support, add example #1006

Merged
merged 14 commits into main from prompt_caching on Sep 20, 2024

Conversation

vblagoje
Member

@vblagoje vblagoje commented Aug 19, 2024

Why:

Introduces prompt caching for AnthropicChatGenerator. Because prompt caching will be enabled by default in the near future, we don't add a new init parameter for it.

What:

  • Added a new feature for prompt caching: Enables caching of prompts for Anthropic LLMs to avoid repeated data fetches, reducing processing time and improving efficiency.
  • Implemented conditional usage of prompt caching based on configuration: Allows users to enable or disable prompt caching through a configuration flag, giving them control over this feature based on their requirements.
  • Refined message handling for system and chat interactions: Modifies how system and user chat messages are formatted and processed.
  • Updated project configurations: Allow print statements in examples

How can it be used:

  • Set the cache_control meta field on a ChatMessage.
  • In the init of AnthropicChatGenerator, enable caching via extra_headers in generation_kwargs.

See the integrations/anthropic/example/prompt_caching.py example for detailed usage; a minimal sketch of the pattern follows below.
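Roughly, the usage looks like the following. This is a minimal sketch, assuming the integration's import path and the current haystack-ai ChatMessage API; the model name and message contents are placeholders, not taken from the example file.

```python
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.anthropic import AnthropicChatGenerator

# Enable the prompt caching beta via extra_headers in generation_kwargs.
generator = AnthropicChatGenerator(
    model="claude-3-5-sonnet-20240620",
    generation_kwargs={"extra_headers": {"anthropic-beta": "prompt-caching-2024-07-31"}},
)

# Mark the long, reusable part of the prompt (e.g. fetched docs) as cacheable.
system_message = ChatMessage.from_system("<long instructions or fetched documents>")
system_message.meta["cache_control"] = {"type": "ephemeral"}

result = generator.run(messages=[system_message, ChatMessage.from_user("First question?")])
# The usage metadata reports cache_creation_input_tokens / cache_read_input_tokens.
print(result["replies"][0].meta.get("usage"))
```

On the first call the cache is written; subsequent calls that reuse the same cached prefix should report cache_read_input_tokens greater than zero in the usage metadata.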

How did you test it:

  • Added unit tests
  • Using integrations/anthropic/example/prompt_caching.py example and additional manual tests.

@vblagoje vblagoje changed the title feat: Add prompt caching, add example feat: Add Anthropic prompt caching support, add example Aug 19, 2024
@github-actions github-actions bot added the type:documentation Improvements or additions to documentation label Aug 19, 2024
@vblagoje
Member Author

vblagoje commented Aug 19, 2024

cc @Emil-io give us some UX feedback 🙏

@TuanaCelik
Member

Hey @vblagoje - here is the feedback you asked for:

  • Although I understand you're saying that prompt caching will be enabled by default in the near future, that suggests users would also be able to turn it off if needed/wanted. So my intuition is that it would still be beneficial to add an init parameter for this, or something similar, which in the future could indeed be used to disable it.
  • I don't quite see how the example showcases the benefit of/need for prompt caching. Can you explain a bit further, please? It would help me review.

@vblagoje
Member Author

Thanks for the feedback:

  • Prompt caching is turned on at the ChatMessage level; see how the Anthropic examples add cache_control to a message here. There will be no need to set anything at the chat generator level.
  • In the example, the benefit is that the fetched doc is cached and reused for subsequent inference with the questions (everything happens on the Anthropic side, so it is not that visible).
    • The only visible effect is speed: run the example and you'll notice that the second and subsequent questions (if we add them) are answered noticeably faster.

@Emil-io

Emil-io commented Aug 20, 2024

Hey @vblagoje - first of all, this looks very interesting! Here is my feedback; feel free to correct me if I've made some false assumptions.

1. How this fits into Haystack Pipelines
I am trying to figure out the bigger picture and how this fits into Haystack. Prompt caching works fine when the LLM is used as a standalone component, but Haystack is not built for that. Moving to something with retrieval, I assume it makes sense to cache a long system prompt, since this is the only part that stays constant? But this would not perfectly align with the way the prompt builder is designed, as it also allows for dynamic changes anywhere in the Jinja prompt template. Still, it might be nice to have an example along those lines, as I assume most people would not use this as a standalone component but inside some more complex pipeline logic.

2. Specifying the Caching
To correctly specify the caching, the user definitely has to see this example (as this is also not explained in the documentation of the Anthropic Chat Component). Is it intended this way?

@vblagoje
Member Author

Thanks for the feedback, Emil.

  1. Yes, it should work with pipelines. I discovered one bug/oversight in ChatPromptBuilder where we don't copy meta from messages, so right now prompt caching won't work with ChatPromptBuilder. Having said that, as long as we set cache_control somewhere, even in a custom component just before the LLM (sketched below), Anthropic caching should work in pipelines.

  2. Everything revolves around the cache_control meta field of the ChatMessage; prompt caching is rather unobtrusive, and that's how the authors of these APIs intended it to be.
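For reference, the workaround mentioned in point 1 could look roughly like this. It is a sketch only, assuming the standard Haystack 2.x custom component API; the component name CacheControlMarker is illustrative and not part of this PR.

```python
from typing import List

from haystack import component
from haystack.dataclasses import ChatMessage


@component
class CacheControlMarker:
    """Marks the first (system) message as cacheable right before it reaches the generator."""

    @component.output_types(messages=List[ChatMessage])
    def run(self, messages: List[ChatMessage]):
        if messages:
            messages[0].meta["cache_control"] = {"type": "ephemeral"}
        return {"messages": messages}
```

Placed between ChatPromptBuilder and AnthropicChatGenerator in a pipeline, such a component re-adds the meta field that the builder currently drops.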

@vblagoje
Member Author

I've added prompt caching data to be printed to stdout, confirming that prompt caching is active.

@vblagoje
Member Author

@julian-risch please run the example yourself to see the prompt caching effect.
See the discussions above between Tuana, Emil, and me regarding this particular solution.
I didn't add tests until we agree on this approach, but it should be trivial to verify: the example prints proof of caching.
Note the bug discovered in the process of testing.

@vblagoje
Member Author

@Emil-io have you tried the prompt caching example? @TuanaCelik can you take a look once again and run the example?

@julian-risch
Member

I tried it out and the caching works for me. I tried to measure the speedup, but to no avail: time to first token did not seem to improve for me whether I turned caching off or on. Could you double-check that? It would be important for a convincing example.

Other feedback: when I wanted to turn off caching, at first I only commented out generation_kwargs={"extra_headers": {"anthropic-beta": "prompt-caching-2024-07-31"}} and forgot to comment out final_prompt_msg.meta["cache_control"] = {"type": "ephemeral"}. So I ran into anthropic.BadRequestError: Error code: 400 - {'type': 'error', 'error': {'type': 'invalid_request_error', 'message': 'system.0.cache_control: Extra inputs are not permitted'}}. For a better developer experience, we could check that the extra header is set when a message with cache_control is sent to the generator and explain to the user that the header needs to be set (see the sketch below). Otherwise the functionality looks good to me. The main use case I see is reducing TTFT when there is a long system message.
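A possible shape for that check, sketched as a standalone helper rather than the generator's actual code (the function name and warning text are assumptions):

```python
import logging
from typing import Any, Dict, List

from haystack.dataclasses import ChatMessage

logger = logging.getLogger(__name__)


def warn_if_cache_header_missing(messages: List[ChatMessage], generation_kwargs: Dict[str, Any]) -> None:
    """Warn when a message carries cache_control but the prompt caching beta header is not set."""
    extra_headers = (generation_kwargs or {}).get("extra_headers", {})
    uses_cache_control = any("cache_control" in (message.meta or {}) for message in messages)
    if uses_cache_control and "anthropic-beta" not in extra_headers:
        logger.warning(
            "A ChatMessage sets cache_control, but generation_kwargs['extra_headers'] does not include "
            "the 'anthropic-beta' prompt caching header; Anthropic will reject the request with a 400 error."
        )
```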

@vblagoje vblagoje marked this pull request as ready for review September 6, 2024 09:42
@vblagoje vblagoje requested a review from a team as a code owner September 6, 2024 09:42
@vblagoje vblagoje requested review from Amnah199 and removed request for a team September 6, 2024 09:42
@vblagoje
Member Author

vblagoje commented Sep 6, 2024

@Amnah199 please have a look, and I'll ask @julian-risch to review as well. Running the example is a must. The speedup with prompt caching is visible, but I expected it to be more prominent. Another, perhaps equally important, benefit is the cost saving from caching. In conclusion, it is still important to add this feature, as users will ask for it.

@Amnah199
Contributor

@vblagoje, I tried the example, but the printed usage for all questions returned 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0. From the comments in the example, I think that shouldn't be the case.
Additionally, response generation time did not significantly improve. Can we tweak the example to make the benefits of caching more obvious?

@Emil-io

Emil-io commented Sep 17, 2024

@Emil-io have you tried the prompt caching example? @TuanaCelik can you take a look once again and run the example?

Hi @vblagoje,
sorry, I somehow overlooked this. Let me know if I should still try it out and run the example.

@vblagoje
Member Author

vblagoje commented Sep 17, 2024

Have you installed the branch version of the Anthropic integration before running the example? And the latest release of haystack-ai?

@vblagoje
Member Author

@Amnah199 @Emil-io the example should be easier to follow now, please try again 🙏

@Amnah199
Contributor

@vblagoje explained the example in more detail and I have tested it. I think this use of prompt caching would make sense in certain use cases. Tagging @julian-risch for reference.

@vblagoje
Member Author

@julian-risch let's integrate this; I can help @dfokina write a paragraph about it in the AnthropicChatGenerator docs.

@julian-risch
Member

julian-risch commented Sep 17, 2024

@vblagoje I am testing the example code right now. I'm still getting the "Cache not used" message with prompt_caching.py. It works with my own test code, so could there be a small issue in prompt_caching.py?

@vblagoje
Member Author

For some reason, Anthropic caching doesn't seem to work on small messages (i.e. a short instruction); perhaps there is a minimum length they require cached content to be. I could recreate the prompt_caching example as an integration test? cc @julian-risch this is also to test what happens when prompt caching stops being a beta; perhaps we'll get some warning, but I doubt an exception. Perhaps we can monitor the Anthropic prompt caching devs and, when prompt caching eventually becomes the default, adjust our code base at that time.

@julian-risch
Member

@vblagoje Ah, true. I found the minimum cacheable length in their docs: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching#cache-limitations In that case, let's use 1024 tokens in the integration test? $3.75 / MTok is the cost for a cache write, so it's still cheap.
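Such an integration test could look roughly like this; the test name, the amount of padding, and the exact usage keys are assumptions based on the discussion above.

```python
import os

import pytest

from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.anthropic import AnthropicChatGenerator


@pytest.mark.skipif(not os.environ.get("ANTHROPIC_API_KEY"), reason="ANTHROPIC_API_KEY not set")
def test_prompt_caching_reports_cache_usage():
    generator = AnthropicChatGenerator(
        generation_kwargs={"extra_headers": {"anthropic-beta": "prompt-caching-2024-07-31"}}
    )

    # Repeat a phrase so the system message comfortably exceeds the ~1024-token minimum cacheable length.
    system_message = ChatMessage.from_system("This is some long background context. " * 200)
    system_message.meta["cache_control"] = {"type": "ephemeral"}

    result = generator.run(messages=[system_message, ChatMessage.from_user("Summarize the context.")])
    usage = result["replies"][0].meta.get("usage", {})

    # The first call writes the cache; repeated calls with the same prefix read from it.
    assert usage.get("cache_creation_input_tokens", 0) > 0 or usage.get("cache_read_input_tokens", 0) > 0
```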

@vblagoje
Member Author

Amazing, will do 🙏

@vblagoje
Member Author

@Amnah199 and @julian-risch - this one should be ready now; let me know if you see any additional opportunities for improvement.

Member

@julian-risch julian-risch left a comment

LGTM! 👍 Please don't forget to write a paragraph in AnthropicChatGenerator about it.

@vblagoje
Member Author

LGTM! 👍 Please don't forget to write a paragraph in AnthropicChatGenerator about it.

Will do 🙏 - keeping this one open until the prompt caching docs are integrated and a new release is made.

@vblagoje
Member Author

vblagoje commented Sep 20, 2024

Docs updated: https://docs.haystack.deepset.ai/docs/anthropicchatgenerator
@dfokina please move around and adjust the prompt caching section in the docs as you see fit.

@vblagoje vblagoje merged commit 36f16c1 into main Sep 20, 2024
11 checks passed
@vblagoje vblagoje deleted the prompt_caching branch September 20, 2024 07:49
@vblagoje
Member Author

Prompt caching is available in the anthropic-haystack integration from v1.1.0 onward.


Successfully merging this pull request may close these issues.

Support prompt caching in Anthropic generators