Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DRAFT] feat: add multimodal support for ChatMessage #145

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

LastRemote
Copy link

Related Issues

Proposed Changes:

Added multimodal support according to deepset-ai/haystack#7848 (comment)
Adjusted openai/anthropic utils to allow converting media contents to their API format (I personally use httpx to send and parse the response. Please let me know if OpenAI/Anthropic SDK expects a different format).

How did you test it?

Added unit tests, and E2E tests with customized httpx-based AzureOpenAI/BedrockAnthropic generators.

Notes for the reviewer

  • I also added _name field back since it is useful in some multi-agent setups. Also finally I have a default value for it so no more headaches when serializing/deserializing this.
  • May I ask what is the intention behind keeping the underscore prefix when serializing and deserializing ChatMessage? Personally I find it very annoying.
  • There is a slight change of behavior in edge cases where a message may have an empty string as the text, or contain multiple text segments. I am trying to make the most sense out of it, but please give a comment if you would prefer the other way.

Checklist

@LastRemote LastRemote requested a review from a team as a code owner December 4, 2024 07:36
@LastRemote LastRemote requested review from mpangrazzi and removed request for a team December 4, 2024 07:36
@vblagoje
Copy link
Member

vblagoje commented Dec 5, 2024

@LastRemote first of all thanks for this effort - kudos. I see how you figured out the same directionI would chose for this endeavour. We need a new dataclass MediaContent to represent images and other modalities like audio in ChatMessageContentT!

Let us get back to you when @anakin87 comes from PTO so we can coordinate this effort. I think we need to break down this PR into several PRs and integration steps:

  1. Make changes to ByteStream to enable base64 (de)encoding
  2. Introduce MediaContent as a new dataclass that gets its content(de)encoded via new ByteStream
  3. Introduce MediaContent support in several ChatGenerators as separate PRs
  4. Make sure that the whole new design makes total sense
  5. Review each ChatGenerators PRs support via individual PR

Thoughts @mpangrazzi ?

@LastRemote
Copy link
Author

LastRemote commented Dec 5, 2024

@LastRemote first of all thanks for this effort - kudos. I see how you figured out the same directionI would chose for this endeavour. We need a new dataclass MediaContent to represent images and other modalities like audio in ChatMessageContentT!

Let us get back to you when @anakin87 comes from PTO so we can coordinate this effort. I think we need to break down this PR into several PRs and integration steps:

  1. Make changes to ByteStream to enable base64 (de)encoding
  2. Introduce MediaContent as a new dataclass that gets its content(de)encoded via new ByteStream
  3. Introduce MediaContent support in several ChatGenerators as separate PRs
  4. Make sure that the whole new design makes total sense
  5. Review each ChatGenerators PRs support via individual PR

Thoughts @mpangrazzi ?

Thanks, glad that it aligns with what your had in mind. Let's hear back from them first and we can break this down into smaller PRs afterwards.

@LastRemote LastRemote changed the title feat: add multimodal support for ChatMessage [DRAFT] feat: add multimodal support for ChatMessage Dec 5, 2024
@mpangrazzi
Copy link

@LastRemote Hi! Discussed with @vblagoje and I agree with him, PR is surely valid but a bit too convoluted. Breaking it down as he suggested would probably be the best for us!

@LastRemote
Copy link
Author

@vblagoje @mpangrazzi Thanks for the feedback, I will try to break this down into smaller PRs. Shall we close this one and use #135 as the megathread to check all the PRs?

@LastRemote
Copy link
Author

Btw here's the first one: #157

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add multimodal support for the new ChatMessage class
3 participants