1 Great Bay University
2 Harbin Institute of Technology, Shenzhen
3 University of Oxford
4 Shenzhen Campus of Sun Yat-sen University
*Corresponding author
- [09/2024] We have released the code for fine-tuning and ADPO.
- [07/2024] We have released the collected AVinstruct dataset.
- [07/2024] Our work has been accepted by ECCV 2024!
- [03/2024] arXiv paper released.
- [03/2024] Project page released.
We introduce CAT, which enhances MLLMs in three ways:
1) We design a clue aggregator that gathers question-related clues in dynamic audio-visual scenarios, enriching the detailed knowledge available to the large language model (see the aggregator sketch after this list).
2) CAT is trained on a mixed multimodal dataset, allowing direct application in audio-visual scenarios. Notably, we collect an audio-visual joint instruction dataset named AVinstruct, to further enhance the capacity of CAT to model cross-semantic correlations.
3) We propose AI-assisted ambiguity-aware direct preference optimization (ADPO), a strategy that retrains the model to favor non-ambiguous responses and improves its ability to localize specific audio-visual objects (see the preference-loss sketch after this list).
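The clue aggregator can be pictured as question-conditioned cross-attention over the audio-visual tokens. The sketch below is illustrative only: the module name, dimensions, and the specific use of multi-head cross-attention with residual fusion are our assumptions for exposition, not the exact architecture in the paper.

```python
# Minimal sketch of a question-conditioned clue aggregator (illustrative assumption).
import torch
import torch.nn as nn

class ClueAggregator(nn.Module):
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        # Question tokens act as queries; audio-visual tokens act as keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, question_tokens, av_tokens):
        # question_tokens: (B, Lq, dim); av_tokens: (B, Lav, dim)
        # Each question token gathers the audio-visual clues most relevant to it.
        clues, _ = self.cross_attn(question_tokens, av_tokens, av_tokens)
        # Residual fusion keeps the original question semantics while adding clues.
        return self.norm(question_tokens + clues)
```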
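For intuition on the preference-optimization step, the snippet below shows the standard DPO objective that a strategy like ADPO builds on, under the assumption that the "preferred" response is the non-ambiguous one and the "rejected" response is the ambiguous one; see ADPO.md for the actual training recipe.

```python
# Standard DPO loss sketch (not the full ADPO implementation).
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_pref, policy_logp_rej,
             ref_logp_pref, ref_logp_rej, beta=0.1):
    """Inputs are sequence-level log-probs of the preferred (non-ambiguous) and
    rejected (ambiguous) responses under the policy and a frozen reference model."""
    # Implicit reward margins relative to the reference model.
    pref_margin = policy_logp_pref - ref_logp_pref
    rej_margin = policy_logp_rej - ref_logp_rej
    # Push the policy to rank the non-ambiguous response above the ambiguous one.
    return -F.logsigmoid(beta * (pref_margin - rej_margin)).mean()
```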
We have collected an audio-visual joint instruction dataset named AVinstruct; see Data.md for details.
The fine-tuning process is described in SFT.md.
The ADPO process is described in ADPO.md.
@misc{ye2024cat,
title={CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios},
author={Qilang Ye and Zitong Yu and Rui Shao and Xinyu Xie and Philip Torr and Xiaochun Cao},
year={2024},
eprint={2403.04640},
archivePrefix={arXiv},
primaryClass={cs.CV}
}