- [2024.12.15] 🔥🔥 Our DCE pipeline is launching soon—stay tuned!
- [2024.12.15] 🔥🤗 Our DCE-1M dataset is released; please check it out and download it from DCE-1M!
DCE leverages visual specialists to replicate various human visual capabilities and subsequently employs large language models (LLMs) to simulate the human cognitive process. This combined approach enables us to generate high-quality image captions that closely mimic the way humans perceive and interpret visual information.
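To make the two-stage idea concrete, here is a minimal sketch of that flow (not the released pipeline code): hypothetical stand-in specialists produce structured facts, which are packed into a prompt for an LLM to fuse into a caption. All function names and outputs below are illustrative assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for real visual specialists; the released DCE pipeline
# would plug in dedicated models (detection, depth, emotion, HOI, ...) here.
def detect_objects(image_path):
    return [{"label": "person", "box": [40, 30, 220, 400]},
            {"label": "dog", "box": [230, 260, 380, 410]}]

def estimate_depth(image_path):
    return {"person": "foreground", "dog": "foreground"}

@dataclass
class VisualFacts:
    objects: list = field(default_factory=list)
    depth: dict = field(default_factory=dict)

def gather_facts(image_path):
    """Stage 1: visual specialists replicate individual visual capabilities."""
    return VisualFacts(objects=detect_objects(image_path),
                       depth=estimate_depth(image_path))

def build_caption_prompt(facts):
    """Stage 2: pack specialist outputs into a prompt for an LLM to fuse."""
    lines = ["Write a detailed caption using only the facts below."]
    for obj in facts.objects:
        depth = facts.depth.get(obj["label"], "unknown depth")
        lines.append(f"- {obj['label']} at box {obj['box']}, {depth}")
    return "\n".join(lines)

facts = gather_facts("example.jpg")
print(build_caption_prompt(facts))  # send this prompt to an LLM of your choice
```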
- Open-Source Accessibility: The DCE pipeline is built entirely from open-source models, providing an accessible, cost-effective solution for generating high-quality image captions without reliance on proprietary technologies.
- Customizable and Flexible Design: The pipeline supports a DIY approach, allowing users to integrate and combine different visual specialist models tailored to their specific needs. This flexibility empowers users to generate customized captions enriched with targeted visual information (see the sketch after this list).
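One way to picture the DIY design is a small registry of specialists that users can enable or swap freely. This is only an illustrative sketch; the registry, decorator, and specialist names are assumptions, not the pipeline's actual API.

```python
from typing import Callable, Dict, List

# Hypothetical registry illustrating the DIY idea: each specialist maps an
# image path to a dict of visual facts, and users choose which ones to run.
SPECIALISTS: Dict[str, Callable[[str], dict]] = {}

def register(name: str):
    def wrap(fn: Callable[[str], dict]) -> Callable[[str], dict]:
        SPECIALISTS[name] = fn
        return fn
    return wrap

@register("depth")
def depth_specialist(image_path: str) -> dict:
    return {"depth": "subject in the foreground"}  # placeholder output

@register("emotion")
def emotion_specialist(image_path: str) -> dict:
    return {"emotion": "smiling"}  # placeholder output

def collect(image_path: str, enabled: List[str]) -> dict:
    """Run only the specialists the user enabled and merge their outputs."""
    facts = {}
    for name in enabled:
        facts.update(SPECIALISTS[name](image_path))
    return facts

print(collect("example.jpg", enabled=["depth", "emotion"]))
```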
Training Large Multimodal Models (LMMs) relies on descriptive image captions that connect images and language. Existing methods either distill captions from LMMs, construct them from internet images, or rely on human annotation. We propose to leverage off-the-shelf visual specialists, which were originally trained on annotated images for tasks other than image captioning, to enhance image captions.
Our approach, named DCE, explores object low-level and fine-grained attributes (e.g., depth, emotion, and fine-grained categories) and object relations (e.g., relative location and human-object interaction (HOI)), and combines these attributes into the descriptive caption. Experiments demonstrate that such visual specialists improve performance on visual understanding tasks, as well as on reasoning tasks that benefit from more accurate visual understanding. We will release the source code and the pipeline so that other visual specialists can easily be incorporated into it.
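As a rough illustration of how such attributes and relations might be serialized into fact sentences before the LLM fusion step, here is a simplified sketch with made-up specialist outputs (the categories, boxes, and HOI triplet are fabricated for the example and do not come from the released pipeline):

```python
def relative_position(box_a, box_b):
    """Coarse left/right relation from two boxes given as [x1, y1, x2, y2]."""
    center_a = (box_a[0] + box_a[2]) / 2
    center_b = (box_b[0] + box_b[2]) / 2
    return "to the left of" if center_a < center_b else "to the right of"

# Made-up specialist outputs: fine-grained categories, emotion, and an HOI triplet.
objects = {
    "person": {"box": [40, 30, 220, 400], "category": "young woman", "emotion": "happy"},
    "dog":    {"box": [230, 260, 380, 410], "category": "golden retriever"},
}
hoi = [("person", "petting", "dog")]

facts = []
for subj, verb, obj in hoi:
    facts.append(f"A {objects[subj]['category']} (looking {objects[subj].get('emotion', 'neutral')}) "
                 f"is {verb} a {objects[obj]['category']}.")
facts.append(f"The {objects['person']['category']} is "
             f"{relative_position(objects['person']['box'], objects['dog']['box'])} "
             f"the {objects['dog']['category']}.")

# These fact sentences would then be handed to the LLM to fuse into one fluent caption.
print(" ".join(facts))
```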