[BUG] multimodal=True is not supported in Agent.kickoff #3936

@tris16are

Description

Setting multimodal=True is supposed to add the AddImageTool to the agent's tools, but this does not happen when calling the kickoff method.

The current implementation could be patched quite simply by adding the tool at kickoff time when the parameter is true (it's literally a one-liner).

Steps to Reproduce

See code snippet

Expected behavior

We should observe an AddImage tool call

Screenshots/Code snippets

from crewai import Agent

agent = Agent(role="Image captioner", goal="caption images", backstory="You are used to caption images since you are a kid", multimodal=True)

image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Cat_November_2010-1a.jpg/960px-Cat_November_2010-1a.jpg"

result = agent.kickoff(f"What's in this image ? {image_url}")
print(result)

Operating System

Ubuntu 20.04

Python Version

3.10

crewAI Version

1.5.0

crewAI Tools Version

1.5.0

Virtual Environment

Venv

Evidence

The Agent.kickoff code makes the gap clear: it adds platform tools and MCP tools, but never checks self.multimodal.

Possible Solution

Patch the kickoff method:

from crewai.tools.agent_tools.add_image_tool import AddImageTool

def kickoff(...):
        if self.apps:
            platform_tools = self.get_platform_tools(self.apps)
            if platform_tools:
                self.tools.extend(platform_tools)
        if self.mcps:
            mcps = self.get_mcp_tools(self.mcps)
            if mcps:
                self.tools.extend(mcps)

        # PATCH HERE
        if self.multimodal:
            # append, not extend: AddImageTool() is a single tool, not a list of tools
            self.tools.append(AddImageTool())
        # /PATCH HERE

        lite_agent = LiteAgent(
            id=self.id,
            role=self.role,
            goal=self.goal,
            backstory=self.backstory,
            llm=self.llm,
            tools=self.tools or [],
            max_iterations=self.max_iter,
            max_execution_time=self.max_execution_time,
            respect_context_window=self.respect_context_window,
            verbose=self.verbose,
            response_format=response_format,
            i18n=self.i18n,
            original_agent=self,
            guardrail=self.guardrail,
            guardrail_max_retries=self.guardrail_max_retries,
        )

        return lite_agent.kickoff(messages)

But honestly I don't really like doing this in kickoff, because it mutates self.tools and so is not a pure method: if we call kickoff twice, the extra tools get added twice. It would be better to declare a method-scoped tools variable and fill it with the current tools, platform tools, MCP tools and multimodal tools.
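To make the method-scoped idea concrete, here is a minimal sketch that does not depend on crewAI. SketchAgent, FakeTool, resolve_tools and the two placeholder helpers are hypothetical names used only for illustration; in the real code the placeholders would correspond to get_platform_tools, get_mcp_tools and AddImageTool().

```python
from dataclasses import dataclass, field


@dataclass
class FakeTool:
    """Stand-in for a tool object; not a crewAI class."""
    name: str


@dataclass
class SketchAgent:
    """Hypothetical stand-in for Agent, reduced to the fields that matter here."""
    tools: list = field(default_factory=list)
    multimodal: bool = False

    def _platform_tools(self) -> list:
        # Placeholder for get_platform_tools(self.apps) in the real code.
        return [FakeTool("platform_tool")]

    def _mcp_tools(self) -> list:
        # Placeholder for get_mcp_tools(self.mcps) in the real code.
        return [FakeTool("mcp_tool")]

    def resolve_tools(self) -> list:
        """Build the tool list for one kickoff call without mutating self.tools."""
        tools = list(self.tools)  # shallow copy, so self.tools stays untouched
        tools.extend(self._platform_tools())
        tools.extend(self._mcp_tools())
        if self.multimodal:
            # Would be AddImageTool() in the real code.
            tools.append(FakeTool("add_image_tool"))
        return tools


agent = SketchAgent(tools=[FakeTool("search")], multimodal=True)
first = [t.name for t in agent.resolve_tools()]
second = [t.name for t in agent.resolve_tools()]
```

With this shape, kickoff would pass the resolved list to LiteAgent instead of self.tools, so repeated calls see the same tools and the agent's own list never grows.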

Additional context

/