[FEATURE] Task-native & Agent-native multimodality #3860

@dardT

Description

Feature Area

Core functionality

Is your feature request related to an existing bug? Please link it here.

NA

Describe the solution you'd like

I wish there were a way to inject multipart context into Agents and Tasks (see Additional context for further explanation).

Basically, I'd like my system prompt to be composed of the basic role/goal/backstory while also having room for multipart content that would be transformed into media/text messages at execution time.

For example:

agent = Agent(role="Cartography analyst", goal="", backstory="", multipart_context=[Image(...), "other gibberish", Image()])

This way we would have a compound context for advanced use cases without sacrificing the ease of use for basic examples.

Similarly, some tasks are inherently multimodal in input or output: "analyse this image", "generate an image of...", "add a donkey on this image...".

It would be great to have direct multimodal injection in the task's context in a similar way:

task = Task(description="Analyze the following image", multipart_context=[Image(...)])

This infamous "Image()" object could be created either from a placeholder, so that it can be interpolated at kickoff() time, or from raw data (URL, binary, base64 string, etc.).
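To make the idea concrete, here is a purely hypothetical sketch of what I have in mind. Image, multipart_context and the placeholder mechanism do not exist in CrewAI today; every name below is illustrative only:

```python
# Purely hypothetical API sketch: Image, multipart_context and placeholder
# interpolation are proposals, not existing CrewAI features.
from crewai import Agent, Task, Crew

reference_map = Image(url="https://example.com/reference_map.png")  # built from raw data / URL
user_map = Image(placeholder="user_map")                            # resolved at kickoff() time

agent = Agent(
    role="Cartography analyst",
    goal="Compare maps and report their differences",
    backstory="You are an expert in reading topographic maps.",
    multipart_context=[reference_map, "Always use the legend of this reference map."],
)

task = Task(
    description="Analyze the following image and compare it to your reference map.",
    expected_output="A short comparison report",
    agent=agent,
    multipart_context=[user_map],
)

crew = Crew(agents=[agent], tasks=[task])
# The placeholder would be filled here, just like {variables} are interpolated today.
crew.kickoff(inputs={"user_map": "https://example.com/uploaded_map.png"})
```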

Have you had any previous thoughts about setting up multimodality within CrewAI?

What is your opinion on such a suggestion?

Kind regards,

Tristan.

Describe alternatives you've considered

  • Direct injection of multimodal content in Agent.kickoff messages
  • Relying on the current AddImageTool feature

Additional context

Currently, CrewAI's multimodal capabilities are limited to tool-based interactions.

Agent

When the multimodal parameter is set to True on an Agent, it automatically configures tools like AddImageTool for handling non-text content.
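For reference, the current tool-based flow looks roughly like this (a minimal sketch, following the usual multimodal=True setup; URLs and role/goal strings are made up):

```python
from crewai import Agent, Task, Crew

# multimodal=True attaches tools such as AddImageTool, but the system prompt
# built from role/goal/backstory remains purely textual.
analyst = Agent(
    role="Cartography analyst",
    goal="Analyze maps provided by the user",
    backstory="Expert in topographic analysis",
    multimodal=True,
)

# Today, the only way to hand the agent an image through a Task is to bury a
# URL in the description and let the tool call fetch and describe it.
task = Task(
    description="Analyze the map at https://example.com/map.png and list its landmarks.",
    expected_output="A list of landmarks",
    agent=analyst,
)

Crew(agents=[analyst], tasks=[task]).kickoff()
```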

However, this approach has a significant limitation: agent system prompts are purely textual, created by interpolating the agent's goal, backstory, and other attributes into a prompt template. This prevents agents from having multimodal context built into their core instructions.

To bypass this, you must embed the multimodal content directly in the messages passed to the Agent's kickoff method (see the sketch after this list). This is not good practice because:

  1. It splits the system prompt across the Agent's definition and the kickoff method, which should rather be used to handle user messages
  2. It does not work well in Crews, since Agents are most useful when executing tasks, and tasks cannot be "serialized" as multipart messages
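A rough sketch of that workaround, assuming Agent.kickoff() forwards OpenAI-style content parts to the underlying LLM (the exact message schema depends on the provider, which is part of the problem):

```python
from crewai import Agent

analyst = Agent(
    role="Cartography analyst",
    goal="Analyze maps provided by the user",
    backstory="Expert in topographic analysis",
)

# The image has to be smuggled into the user messages at kickoff time, so it
# lives outside both the Agent definition and any Task.
result = analyst.kickoff(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Analyze this map and list its landmarks."},
                {"type": "image_url", "image_url": {"url": "https://example.com/map.png"}},
            ],
        }
    ]
)
print(result)
```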

Task

Tasks are purely textual. Users cannot provide images or other media directly to tasks without embedding URLs in the description text. This is inefficient because:

  1. You rely on a tool call to get media information
  2. You have little control over the prompt that will be sent to describe what you want to do with the image.
  3. Task chaining involving multimodal data is awkward: it would be great to pass the context of one multimodal task to the next without relying on a tool call (sketched below)
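A purely hypothetical sketch of point 3, reusing the multipart_context idea plus an imaginary Image(from_task=...) hand-off; none of this exists in CrewAI today:

```python
from crewai import Agent, Task, Crew

illustrator = Agent(
    role="Map illustrator",
    goal="Generate and edit map images",
    backstory="A digital cartographer",
)

generate = Task(
    description="Generate a map of the requested area",
    expected_output="A map image",
    agent=illustrator,
)

# Hypothetical: the second task receives the first task's image directly,
# instead of going through a tool call and a URL pasted into the description.
annotate = Task(
    description="Add a donkey on this image",
    expected_output="The annotated map image",
    agent=illustrator,
    context=[generate],                             # today's textual context passing
    multipart_context=[Image(from_task=generate)],  # imaginary multimodal hand-off
)

Crew(agents=[illustrator], tasks=[generate, annotate]).kickoff()
```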

Willingness to Contribute

Yes, I'd be happy to submit a pull request
