Here is the PyTorch implementation of our paper.
Paper Title: "CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization"
Authors: Xiang He*, Xiangxi Liu*, Yang Li*, Dongcheng Zhao, Guobin Shen, Qingqun Kong, Xin Yang, Yi Zeng
Accepted by: MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
Figure 1: Method Overview showing the main components of CACE-Net.
This repository contains three folders, AVELCLIP, CACE, and Encoders, which hold the targeted fine-tuning code, CACE-Net, and the efficient encoders, respectively.
+---AVELCLIP # Targeted fine-tuning
| \---audioset_tagging_cnn
| +---checkpoints
| +---clip
| +---pytorch
| +---resources
| +---scripts
| \---utils
+---CACE # CACE-Net
| +---braincog
| +---configs
| +---data
| +---dataset
| +---model
| \---utils
\---Encoders # efficient encoders
\---audioset_tagging_cnn
+---pytorch
+---resources
+---scripts
\---utils
We are grateful to @YapengTian for sharing the features and code.
The AVE dataset is a subset of AudioSet and contains 4097 videos covering 28 event categories and 1 background category. The dataset can be obtained from https://drive.google.com/open?id=1FjKwe79e0u96vdjIVwfRQ1V6SoDHe7kK.
The AVE-ECCV18 repository provides features extracted from the videos with a VGGish audio model pre-trained on AudioSet and a VGG-19 visual model pre-trained on ImageNet.
Note that the number of feature entries is 4143 rather than 4097 because a video may belong to multiple categories at the same time.
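For reference, the pre-extracted features from AVE-ECCV18 are distributed as HDF5 files and can be loaded with `h5py`. The sketch below is a minimal example; the file names, the dataset key, and the array shapes are assumptions based on that release and may differ from your local copy.

```python
# Minimal sketch of loading the pre-extracted AVE features with h5py.
# The file names, the dataset key "avadataset", and the shapes in the comments
# are assumptions based on the AVE-ECCV18 release; adjust them to your copy.
import h5py

with h5py.File("data/audio_feature.h5", "r") as f:
    audio_features = f["avadataset"][:]   # assumed shape: (4143, 10, 128), VGGish embeddings

with h5py.File("data/visual_feature.h5", "r") as f:
    visual_features = f["avadataset"][:]  # assumed shape: (4143, 10, 7, 7, 512), VGG-19 features

print(audio_features.shape, visual_features.shape)
```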
If you want to retrain the models and reproduce the results in the paper, please refer to the commands in the scripts run_aba.sh and run_supp.sh.
As an example, the script for using our method on the AVE dataset would look like this:
CUDA_VISIBLE_DEVICES=1 python supv_main.py --gpu 1 --lr 0.0007 --clip_gradient 0.5 --snapshot_pref "./Exps/Supv_supp/expLoss" --n_epoch 200 --b 64 --test_batch_size 64 --print_freq 10 --seed 3917 --guide Co-Guide --psai 0.3 --contrastive --Lambda 0.6 --contras_coeff 1.0 # reproduces the 80.796% accuracy reported in our paper
Training takes about 0.5 h on a 40 GB A100 GPU.
We also provide model weights for the experimental results in the paper. The well-trained model can be found here.
The script for validation can be:
CUDA_VISIBLE_DEVICES=1 python supv_main.py --seed 3917 --gpu 1 --test_batch_size 64 --guide Co-Guide --psai 0.3 --contrastive --Lambda 0.6 --contras_coeff 1.0 --evaluate --resume /home/hexiang/CACE/Exps/Supv_supp/expLoss_Seed3917_guide_Co-Guide_psai_0.3_Contrastive_True_contras-coeff_1.0__lambda_0.6/model_epoch_46_top1_80.796_task_Supervised_best_model_psai_0.3_lambda_0.6.pth.tar
The output results are as follows:
Loading Checkpoint: /home/hexiang/CACE/Exps/Supv_supp/expLoss_Seed3917_guide_Co-Guide_psai_0.3_Contrastive_True_contras-coeff_1.0__lambda_0.6/model_epoch_46_top1_80.796_task_Supervised_best_model_psai_0.3_lambda_0.6.pth.tar
2024-07-17 03:57:14,726 INFO
Start Evaluation..
2024-07-17 03:57:16,209 INFO Test Epoch [0][0/7] Loss 2.2928 (2.2928) Prec@1 75.000 (75.000)
2024-07-17 03:57:16,532 INFO ************************************************************************** Evaluation results (acc): 80.7960%.
2024-07-17 03:57:16,532 INFO completed in 1.81 seconds.
Using the features provided by AVE-ECCV18 keeps the comparison with other methods fair. If you want to further improve task performance, consider using more efficient encoders to extract features.
- Video frame extraction: We use `cv2.VideoCapture` (see the sketch after this list); please refer to the file visual_feature_extractor.py.
- Audio file extraction: We use `moviepy`; please refer to the file audioclip.py.
- Visual Encoder: We use `ResNet50`; `EfficientNet` is also ok. Please refer to the file visual_feature_extractor.py.
- Audio Encoder: We use `CNN14` from PANN. Please refer to the file inference.py.
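For illustration, the sketch below combines the first two steps: sampling frames with `cv2.VideoCapture` and extracting the audio track with `moviepy`. The input file name, frame-sampling rate, and output sampling rate are placeholders, not the exact settings used in visual_feature_extractor.py or audioclip.py.

```python
# Minimal sketch of frame and audio extraction; the video path, one-frame-per-second
# sampling, and 16 kHz output rate are illustrative placeholders.
import cv2
from moviepy.editor import VideoFileClip

video_path = "example.mp4"  # hypothetical input video

# Sample roughly one frame per second with cv2.VideoCapture.
cap = cv2.VideoCapture(video_path)
fps = cap.get(cv2.CAP_PROP_FPS)
frames, idx = [], 0
while True:
    ret, frame = cap.read()
    if not ret:
        break
    if fps > 0 and idx % int(round(fps)) == 0:
        frames.append(frame)  # BGR frame as a NumPy array
    idx += 1
cap.release()

# Extract the audio track with moviepy and save it as a 16 kHz wav file.
clip = VideoFileClip(video_path)
clip.audio.write_audiofile("example.wav", fps=16000)
clip.close()
```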
The commands for extracting audio features are as follows:
python inference.py audio_tagging --model_type Cnn14 --checkpoint_path /home/hexiang/Encoders/audioset_tagging_cnn/checkpoints/Cnn14_mAP=0.431.pth --audio_path="non.wav" --cuda
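If you prefer a programmatic route, the sketch below shows one way to obtain a CNN14 clip-level embedding with the `panns_inference` package instead of the bundled inference.py; the checkpoint and audio paths are illustrative.

```python
# Minimal sketch of extracting a CNN14 embedding with the panns_inference package
# (an alternative to the bundled inference.py); paths are illustrative placeholders.
import librosa
from panns_inference import AudioTagging

# Load the audio at 32 kHz, the sampling rate CNN14 was trained with.
waveform, _ = librosa.load("non.wav", sr=32000, mono=True)
waveform = waveform[None, :]  # shape (1, num_samples): batch dimension expected by inference()

at = AudioTagging(checkpoint_path="checkpoints/Cnn14_mAP=0.431.pth", device="cuda")
clipwise_output, embedding = at.inference(waveform)
print(embedding.shape)  # clip-level embedding (2048-d) used as the audio feature
```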
In the Encoders folder, the pre-trained models can be fine-tuned specifically for the audio-visual event localization task to obtain more generalized representations.
The fine-tuning script is shown below:
python main.py --batch-size 128 --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0
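For intuition, here is a minimal single-GPU sketch of such task-specific fine-tuning: it replaces the classifier head of an ImageNet-pretrained ResNet50 with a 28-way AVE event classifier and fine-tunes it at a low learning rate. The data loader and hyperparameters are placeholders, not the distributed settings used in main.py.

```python
# Minimal sketch of fine-tuning a visual encoder on the 28 AVE event categories;
# the loader and hyperparameters are placeholders, not the settings in main.py.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 28)  # replace the 1000-way ImageNet head
model = model.cuda()

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

def finetune_one_epoch(loader):
    # `loader` is a hypothetical DataLoader yielding (frame_batch, event_label) pairs.
    model.train()
    for frames, labels in loader:
        frames, labels = frames.cuda(), labels.cuda()
        optimizer.zero_grad()
        loss = criterion(model(frames), labels)
        loss.backward()
        optimizer.step()
```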
| AVCA | BECE | Efficient Encoder | Accuracy (%) |
|---|---|---|---|
| - | - | - | 78.83 |
| ✓ |   |   | 78.93 |
|   | ✓ |   | 80.30 |
| ✓ | ✓ |   | 80.80 |
| ✓ | ✓ | ✓ | 82.36 |
Figure 2: Visualization of different attention guidance methods.
If our paper is useful for your research, please consider citing it:
@inproceedings{10.1145/3664647.3681503,
author = {He, Xiang and Liu, Xiangxi and Li, Yang and Zhao, Dongcheng and Shen, Guobin and Kong, Qingqun and Yang, Xin and Zeng, Yi},
title = {CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization},
year = {2024},
isbn = {9798400706868},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3664647.3681503},
doi = {10.1145/3664647.3681503},
booktitle = {Proceedings of the 32nd ACM International Conference on Multimedia},
pages = {985–993},
numpages = {9},
keywords = {audio-visual co-guidance attention, audio-visual event localization, contrastive enhancement},
location = {Melbourne VIC, Australia},
series = {MM '24}
}
This code started from CMRAN and CMBS; the visualization code is from AVE-ECCV18, and the SNN implementation is from Brain-Cog. Thanks for their great work. If you have questions about usage or any other feedback and comments, please feel free to contact us at hexiang2021@ia.ac.cn.
Have a good day!