Description
Setting multimodal=True is supposed to add the AddImageTool to the agent's tools, but this does not happen when calling the kickoff method.
The current implementation could be patched quite simply by adding the tool at kickoff time when the parameter is true (it's literally a one-liner).
Steps to Reproduce
See code snippet
Expected behavior
We should observe an AddImageTool call.
Screenshots/Code snippets
from crewai import Agent
agent = Agent(role="Image captioner", goal="caption images", backstory="You are used to caption images since you are a kid", multimodal=True)
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Cat_November_2010-1a.jpg/960px-Cat_November_2010-1a.jpg"
result = agent.kickoff(f"What's in this image ? {image_url}")
print(result)
Operating System
Ubuntu 20.04
Python Version
3.10
crewAI Version
1.5.0
crewAI Tools Version
1.5.0
Virtual Environment
Venv
Evidence
The Agent.kickoff code makes it quite clear that multimodal handling is missing: it never checks the multimodal flag.
Possible Solution
Patch the kickoff method:
from crewai.tools.agent_tools.add_image_tool import AddImageTool

def kickoff(...):
    if self.apps:
        platform_tools = self.get_platform_tools(self.apps)
        if platform_tools:
            self.tools.extend(platform_tools)
    if self.mcps:
        mcps = self.get_mcp_tools(self.mcps)
        if mcps:
            self.tools.extend(mcps)
    # PATCH HERE
    if self.multimodal:
        # append, not extend: AddImageTool() is a single tool, not an iterable of tools
        self.tools.append(AddImageTool())
    # /PATCH HERE
    lite_agent = LiteAgent(
        id=self.id,
        role=self.role,
        goal=self.goal,
        backstory=self.backstory,
        llm=self.llm,
        tools=self.tools or [],
        max_iterations=self.max_iter,
        max_execution_time=self.max_execution_time,
        respect_context_window=self.respect_context_window,
        verbose=self.verbose,
        response_format=response_format,
        i18n=self.i18n,
        original_agent=self,
        guardrail=self.guardrail,
        guardrail_max_retries=self.guardrail_max_retries,
    )
    return lite_agent.kickoff(messages)

But tbh I don't really like doing this in kickoff, because the method is not pure: it mutates self.tools, so calling kickoff twice would add the tools twice. It'd be better to declare a method-scoped tools variable and fill it with the current tools, platform tools, MCP tools, and multimodal tools.
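The method-scoped alternative could look roughly like the sketch below. It is only an illustration of the pattern, not crewAI code: SketchAgent, FakeImageTool, and _collect_tools are hypothetical stand-ins for Agent, AddImageTool, and whatever helper the real fix would use, and the platform/MCP branches are left out for brevity.

```python
from dataclasses import dataclass, field


class FakeImageTool:
    """Hypothetical stand-in for crewAI's AddImageTool."""

    name = "add_image"


@dataclass
class SketchAgent:
    """Hypothetical stand-in for crewai.Agent, reduced to the fields used here."""

    tools: list = field(default_factory=list)
    multimodal: bool = False

    def _collect_tools(self) -> list:
        # Copy the configured tools instead of extending self.tools in place,
        # so repeated kickoff calls do not accumulate duplicates.
        tools = list(self.tools)
        # Platform and MCP tools would be added to the local list here as well.
        if self.multimodal and not any(isinstance(t, FakeImageTool) for t in tools):
            tools.append(FakeImageTool())
        return tools

    def kickoff(self, messages: str) -> list:
        tools = self._collect_tools()
        # The real method would pass `tools` to LiteAgent(...);
        # this sketch just returns the tool names so the effect is visible.
        return [t.name for t in tools]


if __name__ == "__main__":
    agent = SketchAgent(multimodal=True)
    print(agent.kickoff("first call"))
    print(agent.kickoff("second call"))
    print(agent.tools)  # the agent's own tool list is left untouched
```

With this shape, the image tool is present on every call, and self.tools never grows no matter how many times kickoff runs.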
Additional context
/