Feature Area
Core functionality
Is your feature request related to an existing bug? Please link it here.
NA
Describe the solution you'd like
I wish there was a way to inject multipart context within Agents and Tasks (see additional context for further explanations).
Basically, I'd like my system prompt to be composed of the basic role/goal/backstory while also having a place for multipart content that would be transformed into media/text messages at execution time.
For example:

`agent = Agent(role="Cartography analyst", goal="", backstory="", multipart_context=[Image(...), "other gibberish", Image()])`

This way we would have a compound context for advanced use cases without sacrificing the ease of use for basic examples.
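To make the idea concrete, here is a rough sketch of how such a compound context could be flattened into provider messages at execution time. Everything below is hypothetical: neither `Image` nor `multipart_context` exists in CrewAI today, and the message shape follows the common OpenAI-style content-parts convention only as an illustration.

```python
# Hypothetical sketch: how an Agent's multipart_context could be expanded
# into content parts at execution time. Neither `Image` nor `multipart_context`
# exists in CrewAI today; this only illustrates the proposed idea.
from dataclasses import dataclass


@dataclass
class Image:
    """Hypothetical wrapper around an image source (URL or base64 data)."""
    url: str | None = None
    b64: str | None = None

    def to_content_part(self) -> dict:
        source = self.url or f"data:image/png;base64,{self.b64}"
        return {"type": "image_url", "image_url": {"url": source}}


def build_system_content(text_prompt: str, multipart_context: list) -> list[dict]:
    """Combine the interpolated role/goal/backstory prompt with media parts."""
    parts: list[dict] = [{"type": "text", "text": text_prompt}]
    for item in multipart_context:
        if isinstance(item, Image):
            parts.append(item.to_content_part())
        else:  # plain strings stay as text parts
            parts.append({"type": "text", "text": str(item)})
    return parts


# e.g. build_system_content("You are a cartography analyst...",
#                           [Image(url="https://example.com/map.png"), "other gibberish"])
```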
Similarly, some tasks are inherently multimodal in input or output: "analyse this image", "generate an image of...", "add a donkey on this image...".
It would be great to have direct multimodal injection in the task's context in a similar way:
`task = Task(description="Analyze the following image", multipart_context=[Image(...)])`

This `Image()` object could be created either from a placeholder, so it can be interpolated at kickoff() time, or from raw data (URL, binary, base64 string, etc.).
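A minimal sketch of the placeholder idea, assuming it would resolve against the inputs passed to kickoff() the same way "{variable}" interpolation works for task descriptions today (the spec format and the `resolve_image` helper are purely illustrative):

```python
# Hypothetical sketch: an image reference declared from a template key and
# resolved with kickoff() inputs, mirroring the existing "{variable}"
# interpolation used in task descriptions.
def resolve_image(image_spec: dict, inputs: dict) -> dict:
    """Turn a placeholder-based image spec into a concrete URL-based one."""
    if "placeholder" in image_spec:
        return {"url": inputs[image_spec["placeholder"]]}
    return image_spec  # already concrete (url / b64 / binary)


task_image = {"placeholder": "satellite_image_url"}  # declared in the Task
resolved = resolve_image(task_image, {"satellite_image_url": "https://example.com/map.png"})
print(resolved)  # {'url': 'https://example.com/map.png'}
```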
Have you given any previous thought to setting up multimodality within CrewAI?
What's your opinion on such a suggestion?
Kind regards,
Tristan.
Describe alternatives you've considered
- Direct injection of multimodal content in `Agent.kickoff` messages
- Relying on the current `AddImageTool` feature
Additional context
Currently, CrewAI's multimodal capabilities are limited to tool-based interactions.
Agent
When the multimodal parameter is set to True on an Agent, it automatically configures tools like AddImageTool for handling non-text content.
However, this approach has a significant limitation: agent system prompts are purely textual, created by interpolating the agent's goal, backstory, and other attributes into a prompt template. This prevents agents from having multimodal context built into their core instructions.
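For reference, this is roughly what the current tool-based approach looks like. The AddImageTool wiring happens internally when `multimodal=True`; the image URL below is only an example and has to travel through the description text, since neither the Agent nor the Task can carry native media content:

```python
from crewai import Agent, Task, Crew

# Current behaviour: multimodal=True wires up AddImageTool behind the scenes,
# but the agent's system prompt remains pure text (role/goal/backstory only).
analyst = Agent(
    role="Cartography analyst",
    goal="Describe geographic features visible in images",
    backstory="An expert in reading satellite and aerial imagery",
    multimodal=True,
)

# The image can only reach the model via a tool call, so its URL ends up
# embedded in the description text rather than attached as native content.
task = Task(
    description="Analyze the image at https://example.com/map.png",
    expected_output="A textual analysis of the map",
    agent=analyst,
)

crew = Crew(agents=[analyst], tasks=[task])
# result = crew.kickoff()
```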
To bypass this, you must embed the multimodal content directly in the messages passed to the Agent's kickoff method. This is not good practice because:
- It splits the system prompt across the Agent's definition and the kickoff method, which would rather be used to handle user messages
- It does not work well in Crews, since Agents are most useful for executing tasks, and tasks cannot be "serialized" as multipart messages
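The workaround currently looks something like this. The content-parts message shape is provider-specific (shown here in the common OpenAI style), and whether kickoff() forwards it untouched depends on the underlying LLM integration, so treat this as a sketch rather than a guaranteed recipe:

```python
from crewai import Agent

analyst = Agent(
    role="Cartography analyst",
    goal="Describe geographic features visible in images",
    backstory="An expert in reading satellite and aerial imagery",
)

# Workaround: the image is injected as a user message at kickoff time, so part
# of what is really "agent context" lives outside the Agent definition, and
# this cannot be reused when the agent runs inside a Crew via Tasks.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Analyze the following map."},
            {"type": "image_url", "image_url": {"url": "https://example.com/map.png"}},
        ],
    }
]

# result = analyst.kickoff(messages)
```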
Task
Tasks are purely textual. Users cannot provide images or other media directly to tasks without embedding URLs in the description text. This is inefficient because:
- You rely on a tool call to get media information
- You have little control over the prompt that will be sent to describe what you want to do with the image
- Task chaining involving multimodal data: it would be great to pass the context of one multimodal task to the next without relying on a tool call
Willingness to Contribute
Yes, I'd be happy to submit a pull request