This is the code release for the ICASSP 2023 paper "MMCosine: Multi-Modal Cosine Loss Towards Balanced Audio-Visual Fine-Grained Learning", implemented with PyTorch.
Title: MMCosine: Multi-Modal Cosine Loss Towards Balanced Audio-Visual Fine-Grained Learning
Authors: Ruize Xu, Ruoxuan Feng, Shi-xiong Zhang, Di Hu
🚀 Project page here: Project Page
📄 Paper here: Paper
🔍 Supplementary material: Supplementary
Recent studies show that the imbalanced optimization of uni-modal encoders in a joint-learning model is a bottleneck to enhancing the model's performance. We further find that up-to-date imbalance-mitigating methods fail on some audio-visual fine-grained tasks, which have a higher demand for distinguishable feature distributions. Fueled by the success of cosine loss, which builds hyperspherical feature spaces and achieves lower intra-class angular variability, this paper proposes the Multi-Modal Cosine loss, MMCosine. It performs a modality-wise L2 normalization on features and weights towards balanced and better multi-modal fine-grained learning.
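The core idea can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' exact implementation: the class name `MMCosineHead`, the feature dimensions, and the default scaling value are placeholders; each modality's features and classifier weights are L2-normalized so both contribute bounded cosine logits, which are summed and rescaled.

```python
import torch
import torch.nn.functional as F

class MMCosineHead(torch.nn.Module):
    """Sketch of a multi-modal cosine classifier head.

    Features and per-modality classifier weights are L2-normalized,
    so each modality contributes a cosine logit in [-1, 1]; the
    shared factor `scaling` restores a usable logit magnitude.
    """
    def __init__(self, dim_a, dim_v, num_classes, scaling=10.0):
        super().__init__()
        self.w_a = torch.nn.Parameter(torch.randn(num_classes, dim_a))
        self.w_v = torch.nn.Parameter(torch.randn(num_classes, dim_v))
        self.scaling = scaling

    def forward(self, feat_a, feat_v):
        # Per-modality cosine logits: normalize features and weights.
        logit_a = F.linear(F.normalize(feat_a, dim=1), F.normalize(self.w_a, dim=1))
        logit_v = F.linear(F.normalize(feat_v, dim=1), F.normalize(self.w_v, dim=1))
        # Fuse by summing the cosine logits, then rescale.
        return self.scaling * (logit_a + logit_v)

# Usage: plain cross-entropy on the fused cosine logits.
head = MMCosineHead(dim_a=512, dim_v=512, num_classes=6, scaling=10.0)
logits = head(torch.randn(4, 512), torch.randn(4, 512))
loss = F.cross_entropy(logits, torch.randint(0, 6, (4,)))
```

Because both modalities' logits are bounded cosines before scaling, neither modality can dominate the fused logit purely through feature or weight magnitude, which is what encourages balanced uni-modal optimization.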
- Download the original datasets: CREMAD, SSW60, Voxceleb1&2, and UCF101 (supplementary).
- Preprocessing:
- CREMAD: Refer to OGM-GE for video processing.
- SSW60: Refer to the original repo for details.
- Voxceleb1&2: After extracting frames (2 fps) from the raw videos, we use RetinaFace to detect and align faces. The official pipeline trains on Voxceleb2 and tests on the Voxceleb1 test set; we additionally validate on a manually constructed Voxceleb2 test set. The annotations are in the `/data` folder.
- ubuntu 18.04
- CUDA Version: 11.6
- Python: 3.9.7
- torch: 1.10.1
- torchaudio: 0.10.1
- torchvision: 0.11.2
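Assuming a pip-based environment, the pinned versions above can be installed in one step (the exact install command is an assumption, not taken from the repo; pick the wheel matching your CUDA version):

```shell
# hypothetical setup matching the pinned versions listed above
pip install torch==1.10.1 torchaudio==0.10.1 torchvision==0.11.2
```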
You can train your model on the provided datasets (e.g., CREMAD) simply by running:

```shell
python main_CD.py --train --fusion_method gated --mmcosine True --scaling 10
```
Apart from the fusion method and scaling parameter, you can also adjust settings such as `batch_size`, `lr_decay`, and `epochs`.
You can also record intermediate variables with TensorBoard by setting `use_tensorboard` and specifying `tensorboard_path` for saving logs.
If you find this work useful, please consider citing it.
@inproceedings{xu2023mmcosine,
title={MMCosine: Multi-Modal Cosine Loss Towards Balanced Audio-Visual Fine-Grained Learning},
author={Xu, Ruize and Feng, Ruoxuan and Zhang, Shi-Xiong and Hu, Di},
booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={1--5},
year={2023},
organization={IEEE}
}
This research was supported by Public Computing Cloud, Renmin University of China.
If you have any questions or suggestions, feel free to email us: xrz0315@ruc.edu.cn