ChatterBox

ChatterBox: Multi-round Multimodal Referring and Grounding

Yunjie Tian*¹, Tianren Ma*¹, Lingxi Xie², Jihao Qiu¹, Xi Tang¹, Yuan Zhang¹, Jianbin Jiao¹, Qi Tian², Qixiang Ye¹

¹ University of Chinese Academy of Sciences, ² HUAWEI Inc.

Abstract

In this study, we establish a baseline for a new task named multimodal multi-round referring and grounding (MRG), opening up a promising direction for instance-level multimodal dialogues. We present a new benchmark and an efficient vision-language model for this purpose. The new benchmark, named CB-300K, spans challenges including multi-round dialogue, complex spatial relationships among multiple instances, and consistent reasoning, which are beyond those shown in existing benchmarks. The proposed model, named ChatterBox, utilizes a two-branch architecture to collaboratively handle vision and language tasks. By tokenizing instance regions, the language branch acquires the ability to perceive referential information. Meanwhile, ChatterBox feeds a query embedding in the vision branch to a token receiver for visual grounding. A two-stage optimization strategy is devised, making use of both CB-300K and auxiliary external data to improve the model's stability and capacity for instance-level understanding. Experiments show that ChatterBox outperforms existing models in MRG both quantitatively and qualitatively, paving a new path towards multimodal dialogue scenarios with complicated and precise interactions.

Overview

The architecture of the ChatterBox model.

Key Contributions:

CB-300K - We establish the CB-300K benchmark to facilitate research in multi-round referring and grounding.
ChatterBox Model - We build the ChatterBox model on a dual-branch architecture to solve the multi-round referring and grounding problem.

Updates

Jan. 24th, 2024: The paper, code, and dataset are released.

Release

Install

Clone this repository and navigate to the ChatterBox folder

git clone https://github.com/sunsmarterjie/ChatterBox
cd ChatterBox

Install Packages

conda create -n chatterbox python=3.11.5 
conda activate chatterbox
pip install --upgrade pip  # enable PEP 660 support
pip install -r requirements.txt
pip install deepspeed==0.11.1
unzip mmcv-1.4.7.zip
cd mmcv-1.4.7/
MMCV_WITH_OPS=1 pip install -e .
cd ../model/GroundingDINO/ops
python setup.py build install
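After the steps above, a quick sanity check can confirm that the pinned packages are importable at the expected versions. The following is a minimal sketch, not part of the repository: the `check_versions` helper is hypothetical, and it assumes that both `deepspeed` and `mmcv` expose a `__version__` attribute matching the versions named in the install commands.

```python
import importlib

# Versions pinned by the install steps above (deepspeed==0.11.1, mmcv-1.4.7.zip).
EXPECTED = {"deepspeed": "0.11.1", "mmcv": "1.4.7"}


def check_versions(expected, importer=importlib.import_module):
    """Return {module: (expected, found)} for each missing or mismatched module.

    `found` is None when the module cannot be imported or has no __version__.
    """
    problems = {}
    for name, want in expected.items():
        try:
            found = getattr(importer(name), "__version__", None)
        except ImportError:
            found = None
        if found != want:
            problems[name] = (want, found)
    return problems


if __name__ == "__main__":
    issues = check_versions(EXPECTED)
    for name, (want, found) in issues.items():
        print(f"{name}: expected {want}, found {found or 'not importable'}")
    if not issues:
        print("deepspeed and mmcv match the pinned versions")
```

Run it inside the `chatterbox` conda environment; an empty result means both imports succeeded with the pinned versions.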

Train

We build the visual branch of ChatterBox using GroundingDINO and DINO; only the GroundingDINO version is provided for now.

Prepare datasets/models:

Download CB-300K, VG, COCO2017, COCO2014, RefCOCO, RefCOCO+, RefCOCOg, Flickr30K, OpenSource, clip-vit-large-patch14, LLaVA-Instruct-150K, llava-llama-2-13b, CB-materials, groundingdino_swinb.

├── datasets
│   ├── CB-300K
│   │   ├── CB-MRG
│   │   ├── CB-LC
│   │   └── ...
│   ├── VG
│   │   ├── VG_100K
│   │   ├── VG_100K_2
│   │   └── ...
│   ├── MSCOCO2017
│   │   ├── train2017
│   │   └── ...
│   ├── MSCOCO2014
│   │   ├── train2014
│   │   └── ...
│   ├── Flickr30K
│   │   ├── flickr30k-images
│   │   └── ...
│   ├── llava_instruct_150k.json
│   └── CB_materials
│       ├── CB-refcoco-GND
│       ├── CB-coco-GND
│       ├── CB-refcoco-REF
│       └── ...
├── clip-vit-large-patch14
│   ├── config.json
│   └── ...
├── llava-llama-2-13b-chat-lightning-preview
│   ├── config.json
│   └── ...
├── OpenSource
│   ├── finetune_refcoco_train.json
│   ├── finetune_refcoco+_train.json
│   └── ...
└── groundingdino_swinb_cogcoor.pth
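Before launching training, it may help to verify that the downloads match the layout above. The sketch below is illustrative only: `missing_paths` is a hypothetical helper, the path list covers just the entries named explicitly in the tree (the `...` entries are not guessed), and it assumes that `clip-vit-large-patch14`, `llava-llama-2-13b-chat-lightning-preview`, `OpenSource`, and the GroundingDINO checkpoint sit next to `datasets` at the top level, as the tree suggests.

```python
from pathlib import Path

# Entries named explicitly in the directory tree, relative to the download root.
EXPECTED_PATHS = [
    "datasets/CB-300K/CB-MRG",
    "datasets/CB-300K/CB-LC",
    "datasets/VG/VG_100K",
    "datasets/VG/VG_100K_2",
    "datasets/MSCOCO2017/train2017",
    "datasets/MSCOCO2014/train2014",
    "datasets/Flickr30K/flickr30k-images",
    "datasets/llava_instruct_150k.json",
    "datasets/CB_materials/CB-refcoco-GND",
    "datasets/CB_materials/CB-coco-GND",
    "datasets/CB_materials/CB-refcoco-REF",
    "clip-vit-large-patch14/config.json",
    "llava-llama-2-13b-chat-lightning-preview/config.json",
    "OpenSource/finetune_refcoco_train.json",
    "OpenSource/finetune_refcoco+_train.json",
    "groundingdino_swinb_cogcoor.pth",
]


def missing_paths(root, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # "." is an assumption: run this from the directory that contains `datasets`.
    missing = missing_paths(".")
    for rel in missing:
        print(f"missing: {rel}")
    if not missing:
        print("all expected paths are present")
```

The check only tests for existence; it does not validate file contents or checkpoint integrity.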

Train ChatterBox on 8xA800 GPUs (80GB).

python startup_stage1.py  # stage1
python startup_stage2.py  # stage2
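The two commands above can be chained so that stage 2 starts only if stage 1 exits cleanly. This is a minimal sketch under the assumption that the stages are meant to run one after the other and that each startup script signals failure through a non-zero exit code; `run_stages` is a hypothetical wrapper, not something the repository provides.

```python
import subprocess
import sys

# Script names as given in the training commands above.
STAGES = ["startup_stage1.py", "startup_stage2.py"]


def run_stages(scripts, python=sys.executable):
    """Run each script in order and stop at the first non-zero exit code.

    Returns the name of the failing script, or None if every stage succeeded.
    """
    for script in scripts:
        result = subprocess.run([python, script])
        if result.returncode != 0:
            return script
    return None


if __name__ == "__main__":
    failed = run_stages(STAGES)
    if failed is not None:
        sys.exit(f"{failed} failed; later stages were not started")
```

Run it from the repository root so the relative script names resolve.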

Evaluation

See the evaluation folder for details.

Demo

Coming soon

Citation

If this project has been helpful or if you've used our dataset, please cite:

@article{tian2024chatterbox,
  title={ChatterBox: Multi-round Multimodal Referring and Grounding},
  author={Tian, Yunjie and Ma, Tianren and Xie, Lingxi and Qiu, Jihao and Tang, Xi and Zhang, Yuan and Jiao, Jianbin and Tian, Qi and Ye, Qixiang},
  journal={arXiv preprint arXiv:2401.13307},
  year={2024}
}

Acknowledgment

This project is based on LLaVA (paper, code), LISA (paper, code), and GPT4RoI (paper, code). We thank the authors for their excellent work.
