[FEATURE] Task-native & Agent-native multimodality #3860

@dardT

Description

Feature Area

Core functionality

Is your feature request related to an existing bug? Please link it here.

NA

Describe the solution you'd like

I wish there were a way to inject multipart context into Agents and Tasks (see Additional context for further explanation).

Basically, I'd like my system prompt to be composed of the basic role/goal/backstory while also having room for multipart content that would be transformed into media/text messages at execution time.

For example:

agent = Agent(role="Cartography analyst", goal="", backstory="", multipart_context=[Image(...), "other gibberish", Image()])

This way we would have a compound context for advanced use cases without sacrificing the ease of use for basic examples.

Similarly, some tasks are inherently multimodal in input or output: "analyse this image", "generate an image of...", "add a donkey on this image...".

It would be great to have direct multimodal injection in the task's context in a similar way:

task = Task(description="Analyze the following image", multipart_context=[Image(...)])

This infamous "Image()" object could be created either from a placeholder, so that it can be interpolated at kickoff() time, or from raw data (URL, binary, base64 string, etc.).
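To make the idea concrete, here is a purely hypothetical sketch of what I have in mind. Image, multipart_context and the placeholder mechanism do not exist in CrewAI today; every name below is illustrative only:

```python
# Purely hypothetical API sketch: Image, multipart_context and placeholder
# interpolation are proposals, not existing CrewAI features.
from crewai import Agent, Task, Crew

reference_map = Image(url="https://example.com/reference_map.png")  # built from raw data / URL
user_map = Image(placeholder="user_map")                            # resolved at kickoff() time

agent = Agent(
    role="Cartography analyst",
    goal="Compare maps and report their differences",
    backstory="You are an expert in reading topographic maps.",
    multipart_context=[reference_map, "Always use the legend of this reference map."],
)

task = Task(
    description="Analyze the following image and compare it to your reference map.",
    expected_output="A short comparison report",
    agent=agent,
    multipart_context=[user_map],
)

crew = Crew(agents=[agent], tasks=[task])
# The placeholder would be filled here, just like {variables} are interpolated today.
crew.kickoff(inputs={"user_map": "https://example.com/uploaded_map.png"})
```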

Have you had any previous thoughts about setting up multimodality within CrewAI?

What is your opinion on such a suggestion?

Kind regards,

Tristan.

Describe alternatives you've considered

  • Direct injection of multimodal content in Agent.kickoff messages
  • Relying on the current AddImageTool feature

Additional context

Currently, CrewAI's multimodal capabilities are limited to tool-based interactions.

Agent

When the multimodal parameter is set to True on an Agent, it automatically configures tools like AddImageTool for handling non-text content.
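For reference, the current tool-based flow looks roughly like this (a minimal sketch, following the usual multimodal=True setup; URLs and role/goal strings are made up):

```python
from crewai import Agent, Task, Crew

# multimodal=True attaches tools such as AddImageTool, but the system prompt
# built from role/goal/backstory remains purely textual.
analyst = Agent(
    role="Cartography analyst",
    goal="Analyze maps provided by the user",
    backstory="Expert in topographic analysis",
    multimodal=True,
)

# Today, the only way to hand the agent an image through a Task is to bury a
# URL in the description and let the tool call fetch and describe it.
task = Task(
    description="Analyze the map at https://example.com/map.png and list its landmarks.",
    expected_output="A list of landmarks",
    agent=analyst,
)

Crew(agents=[analyst], tasks=[task]).kickoff()
```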

However, this approach has a significant limitation: agent system prompts are purely textual, created by interpolating the agent's goal, backstory, and other attributes into a prompt template. This prevents agents from having multimodal context built into their core instructions.

To bypass this, you must embed the multimodal content directly in the messages passed to the Agent's kickoff method (see the sketch after this list). This is not good practice because:

  1. It splits the system prompt across the Agent's definition and the kickoff method, which should rather be used to handle user messages
  2. It does not work well in Crews, since Agents are most useful when executing tasks, and tasks cannot be "serialized" as multipart messages
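A rough sketch of that workaround, assuming Agent.kickoff() forwards OpenAI-style content parts to the underlying LLM (the exact message schema depends on the provider, which is part of the problem):

```python
from crewai import Agent

analyst = Agent(
    role="Cartography analyst",
    goal="Analyze maps provided by the user",
    backstory="Expert in topographic analysis",
)

# The image has to be smuggled into the user messages at kickoff time, so it
# lives outside both the Agent definition and any Task.
result = analyst.kickoff(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Analyze this map and list its landmarks."},
                {"type": "image_url", "image_url": {"url": "https://example.com/map.png"}},
            ],
        }
    ]
)
print(result)
```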

Task

Tasks are purely textual. Users cannot provide images or other media directly to tasks without embedding URLs in the description text. This is inefficient because:

  1. You rely on a tool call to get media information
  2. You have little control over the prompt that will be sent to describe what you want to do with the image.
  3. Task chaining involving multimodal data is awkward: it would be great to pass the context of one multimodal task to the next without relying on a tool call (sketched below)
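A purely hypothetical sketch of point 3, reusing the multipart_context idea plus an imaginary Image(from_task=...) hand-off; none of this exists in CrewAI today:

```python
from crewai import Agent, Task, Crew

illustrator = Agent(
    role="Map illustrator",
    goal="Generate and edit map images",
    backstory="A digital cartographer",
)

generate = Task(
    description="Generate a map of the requested area",
    expected_output="A map image",
    agent=illustrator,
)

# Hypothetical: the second task receives the first task's image directly,
# instead of going through a tool call and a URL pasted into the description.
annotate = Task(
    description="Add a donkey on this image",
    expected_output="The annotated map image",
    agent=illustrator,
    context=[generate],                             # today's textual context passing
    multipart_context=[Image(from_task=generate)],  # imaginary multimodal hand-off
)

Crew(agents=[illustrator], tasks=[generate, annotate]).kickoff()
```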

Willingness to Contribute

Yes, I'd be happy to submit a pull request
