
Multimodal: <img x.jpg> will only detect filename, and ignore HTML #54

Open · wants to merge 6 commits into base: main
Conversation

BabyCNM
Collaborator

@BabyCNM BabyCNM commented Oct 5, 2024

Why are these changes needed?

The autogen tag parsing system uses HTML-like tags to allow users to input images and audio directly from text. However, this system may mistakenly interpret actual HTML content (such as website source code) as multimodal components for GPT-4o and other VLMs, which is undesirable.

Fortunately, autogen's tag format differs from HTML: in autogen, file paths do not require quotation marks. To improve parsing accuracy, we've introduced a `strict_filepath_match` parameter for the multimodal utilities. When enabled (`True`), only simple tag contents (without spaces or quotes) are matched, which makes it especially useful for detecting filenames while ignoring HTML syntax. This parameter is turned on (`True`) when parsing multimodal agents' messages.
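A minimal sketch of the idea behind strict matching (the regex and function names here are illustrative assumptions, not autogen's actual implementation): a strict pattern accepts an `<img ...>` tag only when its body contains no spaces or quotes, so HTML attributes like `src="..."` never match.

```python
import re

# Strict: tag body may not contain whitespace, quotes, or angle brackets,
# so only bare file paths like <img photo.jpg> are captured.
STRICT_IMG_TAG = re.compile(r"<img\s+([^\s'\"<>]+)\s*>")

def find_image_paths(text: str, strict_filepath_match: bool = True):
    """Hypothetical helper mirroring the strict_filepath_match behavior."""
    if strict_filepath_match:
        return STRICT_IMG_TAG.findall(text)
    # Loose matching: grab everything between "<img " and ">",
    # which would also (wrongly) capture HTML attribute lists.
    return re.findall(r"<img\s+([^>]+)>", text)

text = 'Screenshot: <img photo.jpg> and HTML: <img src="a.jpg" alt="x">'
print(find_image_paths(text))                               # ['photo.jpg']
print(find_image_paths(text, strict_filepath_match=False))  # ['photo.jpg', 'src="a.jpg" alt="x"']
```

With strict matching on, the HTML `<img src=...>` tag is rejected because its body contains quotes and a space, while the bare filename still parses.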

Note: This is a custom tagging convention, which may be confusing for some users. Please share any recommendations regarding the current design. Further simplification of the message component is planned for future updates.

Related issue number

Checks

@marklysze
Collaborator

Hey @BabyCNM... thanks for submitting this and it looks useful. Would you be able to comment with some short examples of how images and audio are currently included, and then updated examples which work with your code?

I'm happy to test it out :)

@BabyCNM
Collaborator Author

BabyCNM commented Oct 24, 2024

Here is an example where the current implementation fails but the edited version works.

```python
prompt = """Read the screenshot image and the website's source code. Then, answer the user's question.

User Question: is the button below or above the image?
Screenshot: <img C:/User/xyz/Desktop/screenshot_3.jpg>

--- HTML Code ---
<!DOCTYPE html>
<html lang="en">
<body>
    <img src="website/relative/path/300.jpg" alt="Placeholder Image">
    <button onclick="alert('Button clicked!')">Click Me</button>
</body>
</html>
"""
```

Note that the `<img` tag appears in two places. The first should be interpreted as an image to send to GPT-4o, while the second (embedded in the HTML code) should remain plain code rather than being parsed as an image.
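To illustrate how a strict matcher distinguishes the two occurrences (the regex below is an assumption for demonstration, not autogen's code): the bare file path contains no spaces or quotes and matches, while the HTML attribute list does not.

```python
import re

# Strict pattern: tag body must be free of whitespace, quotes, and angle brackets.
strict = re.compile(r"<img\s+([^\s'\"<>]+)\s*>")

prompt_excerpt = (
    "Screenshot: <img C:/User/xyz/Desktop/screenshot_3.jpg>\n"
    '<img src="website/relative/path/300.jpg" alt="Placeholder Image">\n'
)
print(strict.findall(prompt_excerpt))
# Only the plain file path matches; the HTML <img src=...> tag is ignored.
```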
