enyac-group/T-VSL

T-VSL: Text-Guided Visual Sound Source Localization in Mixtures (CVPR 2024)

Paper | Supplementary | arXiv | Video | Poster
by Tanvir Mahmud, Yapeng Tian, Diana Marculescu

T-VSL incorporates the text modality as an intermediate feature guide using tri-modal joint embedding models (e.g., AudioCLIP) to disentangle the semantic audio-visual source correspondence in multi-source mixtures.
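To make the idea concrete, here is a minimal sketch, not the authors' implementation. It assumes that audio, text, and visual-patch embeddings already live in a shared space (as a tri-modal model such as AudioCLIP provides), uses the mixture's audio embedding to pick the most likely class prompts, and then scores visual patches against each selected text embedding. The function names and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_classes(audio_emb, text_embs, k):
    """Pick the k class names whose text embeddings best match the audio mixture."""
    ranked = sorted(
        text_embs,
        key=lambda name: cosine(audio_emb, text_embs[name]),
        reverse=True,
    )
    return ranked[:k]

def localize(patch_embs, text_emb):
    """Score every visual patch against one class's text embedding."""
    return [[cosine(patch, text_emb) for patch in row] for row in patch_embs]

# Toy 3-d embeddings (hypothetical values, for illustration only).
text_embs = {
    "guitar": [1.0, 0.0, 0.0],
    "violin": [0.0, 1.0, 0.0],
    "drum": [0.0, 0.0, 1.0],
}
audio_mix = [0.7, 0.6, 0.1]  # a mixture dominated by guitar and violin
patches = [  # a 2x2 grid of visual patch embeddings
    [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]],
    [[0.0, 0.1, 0.9], [0.5, 0.5, 0.0]],
]

# Text acts as the intermediate guide: audio -> class prompts -> visual patches.
sources = select_classes(audio_mix, text_embs, k=2)
maps = {name: localize(patches, text_embs[name]) for name in sources}
```

Each entry of `maps` is a per-class similarity grid, so the two sources in the mixture get separate localization maps instead of one map shared by the whole mixture.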

MA-AVT Illustration

Environment

To set up the environment, run

pip install -r requirements.txt

Datasets

MUSIC

Data can be downloaded from Sound of Pixels

VGG-Instruments

Data can be downloaded from Mix and Localize: Localizing Sound Sources in Mixtures

VGG-Sound Source

Data can be downloaded from Localizing Visual Sounds the Hard Way

Train

To train the T-VSL model, run

python main.py --train_data_path ./data/vggsound \
        --mode train --test_data_path ./data/vggsound \
        --test_gt_path ./metadata/vggsound_duet_test.csv \
        --output_dir ./path/to/output/dir \
        --id vggsound_duet --model tvsl \
        --trainset vggsound_duet --num_class 221 \
        --testset vggsound_duet --epochs 100 \
        --batch_size 256 --init_lr 0.01 \
        --lr_schedule cos --multiprocessing_distributed \
        --ngpu 4 --port 11342 --ciou_thr 0.3 \
        --iou_thr 0.3 --save_visualizations \
        --audioclip_ckpt_path ./path/to/audioclip/pretrained/ckpt

Test

To test the model and save visualizations, run

python main.py --mode test \
        --train_data_path ./data/vggsound \
        --test_data_path ./data/vggsound \
        --test_gt_path ./metadata/vggsound_duet_test.csv \
        --output_dir ./path/to/output/dir \
        --id vggsound_duet --model tvsl \
        --trainset vggsound_duet --num_class 221 \
        --testset vggsound_duet --epochs 100 \
        --batch_size 256 --init_lr 0.01 \
        --lr_schedule cos --multiprocessing_distributed \
        --ngpu 4 --port 11342 --ciou_thr 0.3 \
        --iou_thr 0.3 --save_visualizations \
        --load /path/to/pretrained/ckpt \
        --audioclip_ckpt_path ./path/to/audioclip/pretrained/ckpt
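The `--iou_thr 0.3` and `--ciou_thr 0.3` flags suggest that a predicted localization map counts as correct when its overlap with the ground-truth region reaches 0.3. The following is a minimal sketch of such a check under that assumption; the repository's actual metric code may differ, and the helper names and the 0.5 binarization level are hypothetical.

```python
def binarize(heatmap, level=0.5):
    """Turn a heatmap with values in [0, 1] into a binary mask (level is an assumption)."""
    return [[1 if value >= level else 0 for value in row] for row in heatmap]

def iou(pred_mask, gt_mask):
    """Intersection over union of two binary masks of the same shape."""
    pairs = [
        (p, g)
        for pred_row, gt_row in zip(pred_mask, gt_mask)
        for p, g in zip(pred_row, gt_row)
    ]
    intersection = sum(p & g for p, g in pairs)
    union = sum(p | g for p, g in pairs)
    return intersection / union if union else 0.0

def is_hit(heatmap, gt_mask, iou_thr=0.3, level=0.5):
    """A prediction is a hit when its IoU with the ground truth reaches iou_thr."""
    return iou(binarize(heatmap, level), gt_mask) >= iou_thr

# Toy 2x2 example (hypothetical values).
heatmap = [[0.9, 0.2], [0.6, 0.1]]
gt_mask = [[1, 0], [0, 0]]
score = iou(binarize(heatmap), gt_mask)
hit = is_hit(heatmap, gt_mask, iou_thr=0.3)
```

Here the binarized prediction covers two cells, one of which overlaps the ground truth, so the IoU is 1/2 and the sample passes a 0.3 threshold.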

👍 Acknowledgments

This codebase is based on AVGN and AudioCLIP. We thank the authors for their excellent work.

LICENSE

T-VSL is licensed under a UT Austin Research License.

Citation

If you find this work useful, please consider citing our paper:

BibTeX

@inproceedings{mahmud2024t,
  title={T-vsl: Text-guided visual sound source localization in mixtures},
  author={Mahmud, Tanvir and Tian, Yapeng and Marculescu, Diana},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={26742--26751},
  year={2024}
}
