Paper | Supplementary | Arxiv | Video | Poster
by Tanvir Mahmud,
Yapeng Tian,
Diana Marculescu
T-VSL incorporates the text modality as an intermediate feature guide using tri-modal joint embedding models (e.g., AudioCLIP) to disentangle the semantic audio-visual source correspondence in multi-source mixtures.
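One way to picture the text-guided idea is sketched below: in a shared embedding space, each class gets its own guide vector, and every visual patch is scored against that guide to produce one localization map per source. This is a toy illustration only, not the T-VSL implementation. The function names (`cosine`, `localization_map`), the 2x2 grid, the 2-dimensional features, and the class labels are all hypothetical; in practice the embeddings would come from a tri-modal model such as AudioCLIP.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def localization_map(patch_feats, guide):
    """Score every visual patch against one class-specific guide embedding.

    patch_feats: H x W grid (nested lists) of D-dimensional vectors.
    guide:       D-dimensional vector for a single sounding class.
    Returns an H x W grid of similarity scores.
    """
    return [[cosine(patch, guide) for patch in row] for row in patch_feats]


# Hypothetical toy inputs: a 2x2 patch grid with 2-dimensional features.
patches = [[[1.0, 0.0], [0.0, 1.0]],
           [[1.0, 1.0], [0.0, 2.0]]]
guitar_guide = [1.0, 0.0]  # assumed guide for class "guitar"
drum_guide = [0.0, 1.0]    # assumed guide for class "drum"

# One map per class, so the two sources in the mixture are scored separately.
guitar_map = localization_map(patches, guitar_guide)
drum_map = localization_map(patches, drum_guide)
```

Because each class has its own guide, patches that match one source score high in that source's map and low in the other, which is the intuition behind disentangling a multi-source mixture.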
To set up the environment, run
pip install -r requirements.txt
MUSIC data can be downloaded from Sound of Pixels
VGGSound-Instruments data can be downloaded from Mix and Localize: Localizing Sound Sources in Mixtures
VGG-Sound Source data can be downloaded from Localizing Visual Sounds the Hard Way
To train the T-VSL model, run
python main.py --train_data_path ./data/vggsound \
--mode train --test_data_path ./data/vggsound \
--test_gt_path ./metadata/vggsound_duet_test.csv \
--output_dir ./path/to/output/dir \
--id vggsound_duet --model tvsl \
--trainset vggsound_duet --num_class 221 \
--testset vggsound_duet --epochs 100 \
--batch_size 256 --init_lr 0.01 \
--lr_schedule cos --multiprocessing_distributed \
--ngpu 4 --port 11342 --ciou_thr 0.3 \
--iou_thr 0.3 --save_visualizations \
--audioclip_ckpt_path ./path/to/audioclip/pretrained/ckpt
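The flags `--lr_schedule cos`, `--init_lr 0.01`, and `--epochs 100` suggest a cosine learning-rate decay. A minimal sketch of the standard cosine annealing formula is given here for orientation. Whether the codebase uses exactly this form (per-epoch rather than per-step updates, no warmup, a floor of zero) is an assumption, and `cosine_lr` and `min_lr` are hypothetical names.

```python
import math


def cosine_lr(epoch, total_epochs=100, init_lr=0.01, min_lr=0.0):
    """Standard cosine annealing from init_lr down to min_lr.

    Assumption: the rate is updated once per epoch with no warmup;
    the repository's own schedule may differ in these details.
    """
    progress = epoch / total_epochs
    return min_lr + 0.5 * (init_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint, and end of a 100-epoch run.
lr_start = cosine_lr(0)
lr_mid = cosine_lr(50)
lr_end = cosine_lr(100)
```

Under these assumptions the rate starts at the initial value, falls to half of it at the midpoint, and reaches the floor at the final epoch.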
For testing and visualization, run
python main.py --mode test \
--train_data_path ./data/vggsound \
--test_data_path ./data/vggsound \
--test_gt_path ./metadata/vggsound_duet_test.csv \
--output_dir ./path/to/output/dir \
--id vggsound_duet --model tvsl \
--trainset vggsound_duet --num_class 221 \
--testset vggsound_duet --epochs 100 \
--batch_size 256 --init_lr 0.01 \
--lr_schedule cos --multiprocessing_distributed \
--ngpu 4 --port 11342 --ciou_thr 0.3 \
--iou_thr 0.3 --save_visualizations \
--load /path/to/pretrained/ckpt \
--audioclip_ckpt_path ./path/to/audioclip/pretrained/ckpt
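The `--iou_thr 0.3` flag sets the overlap a predicted localization must reach to count as correct. The sketch below shows the generic intersection-over-union check on a binarized heatmap. It is an illustration, not the repository's evaluation code: the `binarize` cutoff of 0.5, the toy heatmap, and the function names are hypothetical, and the class-aware variant controlled by `--ciou_thr` is not reproduced here.

```python
def binarize(heatmap, cutoff=0.5):
    """Turn a score map into a boolean mask (cutoff is an assumed value)."""
    return [[value >= cutoff for value in row] for row in heatmap]


def iou(pred_mask, gt_mask):
    """Intersection over union of two same-sized binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter / union if union else 0.0


# Hypothetical 3x3 predicted heatmap and ground-truth mask.
heatmap = [[0.9, 0.8, 0.1],
           [0.7, 0.2, 0.1],
           [0.1, 0.1, 0.1]]
gt = [[1, 1, 0],
      [0, 0, 0],
      [0, 0, 0]]

score = iou(binarize(heatmap), gt)  # 2 shared cells out of 3 in the union
is_hit = score >= 0.3               # same threshold value as --iou_thr 0.3
```

Raising the threshold makes the success criterion stricter, so reported accuracy at 0.3 is not directly comparable to results computed at other thresholds.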
This codebase is based on AVGN and AudioCLIP. We thank the authors for their excellent work.
T-VSL is licensed under the UT Austin Research License.
If you find this work useful, please consider citing our paper:
@inproceedings{mahmud2024t,
title={{T-VSL}: Text-Guided Visual Sound Source Localization in Mixtures},
author={Mahmud, Tanvir and Tian, Yapeng and Marculescu, Diana},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={26742--26751},
year={2024}
}