feat: Add multimodal (audio, image and video) analysis toolkits #1496

harryeqs · 2025-01-23T22:52:35Z

Description

Describe your changes in detail.

Motivation and Context

Why is this change required? What problem does it solve?
If it fixes an open issue, please link to the issue here.
You can use the syntax close #15213 if this solves the issue #15213

I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds core functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation (update in the documentation)
Example (update in the folder of example)

Implemented Tasks

Subtask 1
Subtask 2
Subtask 3

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

I have read the CONTRIBUTION guide. (required)
My change requires a change to the documentation.
I have updated the tests accordingly. (required for a bug fix or a new feature)
I have updated the documentation accordingly.

Aaron617 · 2025-01-25T06:36:21Z

video_toolkit_test.py

+    with exactly the name of the species without any additional text."
+
+    answer = test_toolkit.ask_question_about_video(video_path, question)
+    print(f"\nThe final answer is: {answer}")


Will the current toolkit pass these unit tests?

Hi Mengkang, I ran the test a few more times and found that, although it gets the correct number for the first question, it does not return the answer in the right format, so the answer is likely a hallucination. I am working on a fix and will finish it today. Thanks!
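A format problem like this could be caught by asserting on the shape of the answer instead of printing it. The sketch below assumes a hypothetical helper, `is_bare_number`, which is not part of the toolkit or this PR. It also assumes that "the right format" for the first question means an integer with no surrounding text.

```python
import re


def is_bare_number(answer: str) -> bool:
    """Return True if `answer` is only an integer, with no extra text.

    Hypothetical helper for illustration; surrounding whitespace is allowed.
    """
    return re.fullmatch(r"\s*\d+\s*", answer) is not None
```

A test could then use `assert is_bare_number(answer)` so that a verbose or malformed reply fails instead of passing silently.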

Fixed! I changed the VL model back to Qwen VL Max, and it now answers all three questions correctly. However, a major problem is that, in my experiments, Qwen VL Max only accepts up to 28 images at once, which may cause significant information loss for long videos. I am adding a key_frame_extraction method to try to cope with this problem.
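A minimal sketch of the frame budget described above, assuming a hypothetical helper named `sample_frame_indices` (this is not the PR's key-frame extraction code). The 28-image default comes from the experiment reported in this comment, not from documented model limits. The helper only picks evenly spaced frame indices so that a long video fits within the limit; a real key-frame extractor would instead favor frames where the scene changes.

```python
from __future__ import annotations


def sample_frame_indices(total_frames: int, max_frames: int = 28) -> list[int]:
    """Pick at most `max_frames` evenly spaced frame indices.

    Hypothetical helper for illustration; the default of 28 is the limit
    observed experimentally in this thread, not a documented constant.
    """
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short video: every frame fits within the budget.
        return list(range(total_frames))
    # Split the video into `max_frames` equal segments and take the
    # frame at the midpoint of each segment.
    step = total_frames / max_frames
    return [int(step * i + step / 2) for i in range(max_frames)]
```

Uniform sampling keeps the request within the limit, but it can skip short events between samples, which is the information loss the comment is concerned about.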

Wendong-Fan

Thanks @harryeqs, some tests and pre-commit checks did not pass. Could you fix them? Thanks!

harryeqs · 2025-01-26T20:52:39Z

Thanks @harryeqs, some tests and pre-commit checks did not pass. Could you fix them? Thanks!

Thanks @Wendong-Fan! Sorry, I am busy working on a feature for the Eigentbot at the moment. I will fix the tests and add unit tests as soon as possible.

Aaron617 and others added 7 commits January 8, 2025 23:20

update multimodal_toolkit

6cb1002

Create image_toolkit.py

92df377

update unit test for video toolkit

34d22c2

merge video toolkits and temporary mypy fix

02ad436

add visual and audio transcription pipeline

6b752c6

change model to gpt-4o-mini

81531b4

refactor code for video analysis toolkit

0e5e351

harryeqs self-assigned this Jan 23, 2025

Merge branch 'master' into feat/multimodal_toolkit

6e12346

Wendong-Fan requested review from mohamadkav and Aaron617 January 25, 2025 06:24

Wendong-Fan added the New Feature label Jan 25, 2025

Wendong-Fan added this to the Sprint 21 milestone Jan 25, 2025

Aaron617 reviewed Jan 25, 2025

harryeqs and others added 3 commits January 25, 2025 12:07

Merge branch 'master' into feat/multimodal_toolkit

7b6f227

change vl model to Qwen-VL-Max and add _extract_keyframes

bc59bc4

Merge branch 'feat/multimodal_toolkit' of https://github.com/camel-ai…

75ca523

…/camel into feat/multimodal_toolkit

Wendong-Fan reviewed Jan 26, 2025

Merge branch 'master' into feat/multimodal_toolkit

1ce1f8c
