
Multimodal: <img x.jpg> will only detect filename, and ignore HTML #54

Open · wants to merge 6 commits into base: main
Conversation

BabyCNM
Collaborator

@BabyCNM BabyCNM commented Oct 5, 2024

Why are these changes needed?

The autogen tag parsing system uses HTML-like tags to allow users to input images and audio directly from text. However, this system may mistakenly interpret actual HTML content (such as website source code) as multimodal components for GPT-4o and other VLMs, which is undesirable.

Fortunately, autogen's tag format differs from HTML: in autogen, file paths do not require quotation marks. To improve parsing accuracy, we've introduced a `strict_filepath_match` parameter for the multimodal utilities. When enabled (`True`), only simple tag contents (without spaces or quotes) are matched, which makes it especially useful for detecting filenames while ignoring HTML syntax. This parameter is turned on (`True`) when parsing multimodal agents' messages.
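A minimal sketch of the idea behind strict matching (the regex and function names here are illustrative assumptions, not autogen's actual implementation): a strict pattern accepts an `<img ...>` tag only when its body contains no spaces or quotes, so HTML attributes like `src="..."` never match.

```python
import re

# Strict: tag body may not contain whitespace, quotes, or angle brackets,
# so only bare file paths like <img photo.jpg> are captured.
STRICT_IMG_TAG = re.compile(r"<img\s+([^\s'\"<>]+)\s*>")

def find_image_paths(text: str, strict_filepath_match: bool = True):
    """Hypothetical helper mirroring the strict_filepath_match behavior."""
    if strict_filepath_match:
        return STRICT_IMG_TAG.findall(text)
    # Loose matching: grab everything between "<img " and ">",
    # which would also (wrongly) capture HTML attribute lists.
    return re.findall(r"<img\s+([^>]+)>", text)

text = 'Screenshot: <img photo.jpg> and HTML: <img src="a.jpg" alt="x">'
print(find_image_paths(text))                               # ['photo.jpg']
print(find_image_paths(text, strict_filepath_match=False))  # ['photo.jpg', 'src="a.jpg" alt="x"']
```

With strict matching on, the HTML `<img src=...>` tag is rejected because its body contains quotes and a space, while the bare filename still parses.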

Note: This is a custom tagging convention, which may be confusing for some users. Please share any recommendations regarding the current design. Further simplification of the message component is planned for future updates.

Related issue number

Checks

@marklysze
Collaborator

Hey @BabyCNM... thanks for submitting this and it looks useful. Would you be able to comment with some short examples of how images and audio are currently included, and then updated examples which work with your code?

I'm happy to test it out :)

@BabyCNM
Collaborator Author

BabyCNM commented Oct 24, 2024

Here is an example where the current implementation fails but the edited version works.

```python
prompt = """Read the screenshot image and the website's source code. Then, answer the user's question.

User Question: is the button below or above the image?
Screenshot: <img C:/User/xyz/Desktop/screenshot_3.jpg>

--- HTML Code ---
<!DOCTYPE html>
<html lang="en">
<body>
    <img src="website/relative/path/300.jpg" alt="Placeholder Image">
    <button onclick="alert('Button clicked!')">Click Me</button>
</body>
</html>
"""
```

Note that the `<img` tag appears in two places. The first should be interpreted as an image to send to GPT-4o, while the second (embedded in the HTML code) should remain plain code rather than being parsed as an image.
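To illustrate how a strict matcher distinguishes the two occurrences (the regex below is an assumption for demonstration, not autogen's code): the bare file path contains no spaces or quotes and matches, while the HTML attribute list does not.

```python
import re

# Strict pattern: tag body must be free of whitespace, quotes, and angle brackets.
strict = re.compile(r"<img\s+([^\s'\"<>]+)\s*>")

prompt_excerpt = (
    "Screenshot: <img C:/User/xyz/Desktop/screenshot_3.jpg>\n"
    '<img src="website/relative/path/300.jpg" alt="Placeholder Image">\n'
)
print(strict.findall(prompt_excerpt))
# Only the plain file path matches; the HTML <img src=...> tag is ignored.
```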
