Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model

Li Zhou, Xu Yuan, Zenghui Sun, Zikun Zhou, Jinsong Lan

* Equally contributing first authors

TAO Technology of Alibaba Group, The Hong Kong Polytechnic University, Peng Cheng Laboratory

📢 Latest Updates

🌟 Featured: We will release the MGLMM demo, code and datasets as soon as possible. 🌟

MGLMM Overview

The pixel-wise understanding capability of existing Large Multimodal Models (LMMs) remains at the instance level: they show limited ability to generate fine-grained textual responses and segmentation masks, even when provided with detailed instruction cues. To overcome this limitation, we introduce the Multi-Granularity Large Multimodal Model (MGLMM), which can seamlessly adjust the granularity of Segmentation and Captioning (SegCap) following user instructions, from panoptic SegCap to fine-grained SegCap. We name this new task Multi-Granularity Segmentation and Captioning (MGSC). Observing the lack of a benchmark for training and evaluating models on the MGSC task, we establish a benchmark with aligned masks and captions at multiple granularities using our customized automated annotation pipeline. This benchmark comprises 10K images and more than 30K image-question pairs. We will release our dataset along with the implementation of our automated dataset annotation pipeline for further research. In addition, we propose a novel unified SegCap data format to unify heterogeneous segmentation datasets; it effectively facilitates learning to associate object concepts with visual features during multi-task training.

🏆 Contributions

MGLMM Introduction. We propose the Multi-Granularity Large Multimodal Model (MGLMM), the first model capable of seamlessly switching between granularities of segmentation and captioning, covering both panoptic and fine-grained segmentation and captioning. MGLMM achieves state-of-the-art performance on multiple downstream tasks.
Novel Task & Evaluation. We introduce a novel benchmark, MGSCData, to train and evaluate the multi-granularity segmentation and captioning ability of LMMs. It comprises over 30K high-quality image-question pairs.
Unified Data Format. We propose a unified data format that facilitates learning the alignment between object concepts and segmentation masks at multiple granularities.

👁️💬 MGLMM: Multi-granularity Large Multimodal Model

The left side of the figure illustrates the model architecture of MGLMM, and the right side illustrates the proposed unified data format for multi-task learning.

💡 Motivation

The left figure shows a case where previous work (e.g., GLaMM) overlooks the tennis racket, tennis ball, and microphone in both its mask and text responses. Moreover, these models can only describe the image at the instance level and produce instance masks aligned with the output text. Hence, they can hardly perceive fine-grained objects, such as the player's hat, wristband, and skirt in the right figure, even when provided with detailed textual cues. The lack of these abilities limits the universality and comprehension of LMMs.

🔍 Multi-granularity Segmentation and Captioning Dataset (MGSCData)

We annotate 10K SAM images, which are inherently diverse and exhibit multi-granularity. The resulting dataset comprises 30K conversations, over 45M tokens, and more than 300K segmentation masks, each mask accompanied by a short semantic label and a detailed caption.
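To make the shape of such an annotation concrete, here is a minimal sketch of how one conversation and its masks could be represented in memory. The class names, field names, and the nested-list mask encoding are all hypothetical and chosen only for illustration; they are not the released MGSCData schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MaskAnnotation:
    """One segmentation mask with its two text descriptions (hypothetical schema)."""
    label: str              # short semantic label, e.g. "tennis racket"
    caption: str            # detailed caption for the masked region
    mask: List[List[int]]   # binary mask as rows of 0/1; real datasets often store RLE or polygons


@dataclass
class Conversation:
    """One image-question pair, its answer, and the masks the answer refers to."""
    image_id: str
    question: str
    answer: str
    annotations: List[MaskAnnotation] = field(default_factory=list)

    def labels(self) -> List[str]:
        """Return the short labels of all annotated masks, in order."""
        return [a.label for a in self.annotations]


# Illustrative record with made-up values.
example = Conversation(
    image_id="sam_000001",
    question="Describe the player's outfit in detail with masks.",
    answer="The player wears a hat and a wristband.",
    annotations=[
        MaskAnnotation("hat", "A white cap worn by the player.", [[1, 1], [0, 0]]),
        MaskAnnotation("wristband", "A dark wristband on the left wrist.", [[0, 0], [0, 1]]),
    ],
)
print(example.labels())
```

Keeping the short label and the detailed caption as separate fields mirrors the description above, where each mask carries both.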

🚀 Qualitative and Quantitative results

📷 Multi-Granularity Segmentation and Captioning (MGSC)

The MGSC task aims to evaluate the ability of LMMs to seamlessly adjust the granularity of segmentation and captioning.

Performance on multi-granularity segmentation and captioning. We compare our model with GLaMM using METEOR, CIDEr, AP50, mIoU, and mask recall metrics.
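As a reference for the mask-quality metrics, the following is a minimal sketch of mIoU computed over prediction/ground-truth mask pairs that have already been matched. The nested-list mask representation, the function names, and the convention for two empty masks are assumptions made for this example; the evaluation code used for the reported numbers may differ.

```python
from typing import List

Mask = List[List[int]]  # binary mask as rows of 0/1 values (assumed representation)


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds: List[Mask], gts: List[Mask]) -> float:
    """Average the per-pair IoU over already-matched mask pairs."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return sum(mask_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


pred_a = [[1, 1], [0, 0]]
gt_a = [[1, 0], [0, 0]]
print(mean_iou([pred_a, gt_a], [gt_a, gt_a]))  # (0.5 + 1.0) / 2 = 0.75
```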

📷 Grounded Conversation Generation (GCG)

The GCG task proposed by GLaMM primarily focuses on aligning the textual response with the segmentation mask at the instance level. In comparison to previous models, MGLMM provides high-quality and fine-grained captioning and segmentation results.

Performance on the grounded conversation generation benchmark. We report the metrics including METEOR (M), CIDEr (C), AP50, mIoU, and Mask Recall (MR).

🎯 Referring Expression Segmentation

Our model is also an expert at the traditional referring segmentation task, i.e., producing corresponding segmentation masks based on the provided referring expressions.

Performance on referring and reasoning segmentation benchmarks. The table only shows the cIoU values for referring segmentation.
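For readers unfamiliar with cIoU: it is commonly computed as the cumulative intersection over the cumulative union across all samples, so larger objects weigh more than they would in a per-sample average. The sketch below illustrates this under assumed conventions (nested-list binary masks, hypothetical function names, a score of 0.0 when every mask is empty); it is not the evaluation script behind the table.

```python
from typing import Iterable, List, Tuple

Mask = List[List[int]]  # binary mask as rows of 0/1 values (assumed representation)


def intersection_and_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count intersecting and united foreground pixels of two same-shape masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    return inter, union


def cumulative_iou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Sum intersections and unions over all samples, then divide once."""
    total_inter = 0
    total_union = 0
    for pred, gt in pairs:
        inter, union = intersection_and_union(pred, gt)
        total_inter += inter
        total_union += union
    # Assumed convention: return 0.0 when there are no foreground pixels at all.
    return total_inter / total_union if total_union else 0.0


samples = [
    ([[1, 1]], [[1, 0]]),              # small object: intersection 1, union 2
    ([[1, 1, 1, 1]], [[1, 1, 1, 1]]),  # larger object: intersection 4, union 4
]
print(cumulative_iou(samples))  # 5 / 6, versus a per-sample mean IoU of 0.75
```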

🖼️ Multiple and Empty Segmentation

MGLMM features the ability to segment multiple targets and reject empty targets, outperforming all competitive models in zero-shot scenarios.

Performance comparison on generalized referring expression segmentation dataset, which contains multiple or empty segmentation targets.

📷 Image Captioning

Our model also achieves excellent performance on image-level captioning.

Performance comparison on image-level captioning.

📜 Citation

@misc{zhou2024instructionguidedmultigranularitysegmentationcaptioning,
      title={Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model}, 
      author={Li Zhou and Xu Yuan and Zenghui Sun and Zikun Zhou and Jingsong Lan},
      year={2024},
      eprint={2409.13407},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2409.13407}, 
}

🙏 Acknowledgement

We are thankful to LLaVA, LISA, and GLaMM for releasing their models and code as open-source contributions.
