Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add multimodal (audio, image and video) analysis toolkits #1496

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

harryeqs
Copy link
Collaborator

Description

Describe your changes in detail.

Motivation and Context

Why is this change required? What problem does it solve?
If it fixes an open issue, please link to the issue here.
You can use the syntax close #15213 if this solves the issue #15213

  • I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds core functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)
  • Example (update in the folder of example)

Implemented Tasks

  • Subtask 1
  • Subtask 2
  • Subtask 3

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

  • I have read the CONTRIBUTION guide. (required)
  • My change requires a change to the documentation.
  • I have updated the tests accordingly. (required for a bug fix or a new feature)
  • I have updated the documentation accordingly.

@harryeqs harryeqs self-assigned this Jan 23, 2025
with exactly the name of the species without any additional text."

answer = test_toolkit.ask_question_about_video(video_path, question)
print(f"\nThe final answer is: {answer}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will the current toolkit pass these unit tests

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Mengkang, I ran the test a few more times and discovered that although it could get the correct number for the first question, it is not returning in the right format hence the answer is likely a hallucination. I am working on fixing this problem and will finish today. Thanks!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! Changed the vl model to Qwen VL Max again and now it can answer all three questions correctly. However, a major problem is that Qwen VL Max could only support passing 28 images simultaneously (from my experiment), which may cause significant information loss for long videos. I am adding a key_frame_extraction method to try to cope with this problem.

Copy link
Member

@Wendong-Fan Wendong-Fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @harryeqs , some test and pre-commit check seems not passed, could you fix this? thanks!

@harryeqs
Copy link
Collaborator Author

Thanks @harryeqs , some test and pre-commit check seems not passed, could you fix this? thanks!

Thanks @Wendong-Fan ! Sorry I am busy working on some feature on the Eigentbot at the moment, will fix the test and add unit tests as soon as possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

3 participants