Progressive Spatio-temporal Perception for Audio-Visual Question Answering (ACMMM'23) [arXiv]
PyTorch code accompanying our PSTP-Net.
Guangyao Li, Wenxuan Hou, Di Hu
python 3.6+
pytorch 1.6.0
tensorboardX
ffmpeg
numpy
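A quick way to confirm the requirements above are present before running anything is a small check script. The sketch below is not part of the repository. It assumes the import names `torch`, `tensorboardX`, and `numpy`, and that `ffmpeg` is expected on `PATH`. It only reports whether each item can be found and does not verify the PyTorch version.

```python
import importlib.util
import shutil
import sys

# Import names for the Python packages in the requirements list (assumed).
REQUIRED_MODULES = ["torch", "tensorboardX", "numpy"]


def check_environment():
    """Return a dict mapping each requirement to True (found) or False."""
    status = {"python>=3.6": sys.version_info >= (3, 6)}
    for name in REQUIRED_MODULES:
        # find_spec locates a top-level package without importing it.
        status[name] = importlib.util.find_spec(name) is not None
    # ffmpeg is a command-line tool, so look for the executable on PATH.
    status["ffmpeg"] = shutil.which("ffmpeg") is not None
    return status


if __name__ == "__main__":
    for requirement, found in check_environment().items():
        print("{:<12} {}".format(requirement, "ok" if found else "MISSING"))
```

Any line printed as `MISSING` points to a requirement to install before feature extraction or training.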
- Clone this repo
git clone https://github.com/GeWu-Lab/PSTP-Net.git
- Download data
MUSIC-AVQA: https://gewu-lab.github.io/MUSIC-AVQA/
- Feature extraction
cd feat_script/extract_clip_feat
python extract_patch-level_feat.py
- Training
python main_train.py \
    --temp_select True --segs 12 --top_k 2 \
    --spat_select True --top_m 25 \
    --a_guided_attn True \
    --global_local True \
    --batch-size 64 --epochs 30 --lr 1e-4 --gpu 0 \
    --checkpoint PSTP_Net \
    --model_save_dir models_pstp
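Several switches in the training command are passed as the literal strings `True` and `False`. With Python's `argparse`, `type=bool` would treat any non-empty string, including `"False"`, as true, so such flags need an explicit converter. The sketch below shows one way to parse a command line of this shape. It is illustrative only and is not taken from `main_train.py`. The flag names are copied from the command, while `str2bool`, `build_parser`, and all types and defaults are assumptions.

```python
import argparse


def str2bool(value):
    """Convert 'True'/'False'-style strings to bool (hypothetical helper)."""
    if isinstance(value, bool):
        return value
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got {!r}".format(value))


def build_parser():
    # Flag names mirror the training command; types and defaults are assumed.
    parser = argparse.ArgumentParser(description="Illustrative parser only")
    parser.add_argument("--temp_select", type=str2bool, default=False)
    parser.add_argument("--segs", type=int, default=12)
    parser.add_argument("--top_k", type=int, default=2)
    parser.add_argument("--spat_select", type=str2bool, default=False)
    parser.add_argument("--top_m", type=int, default=25)
    parser.add_argument("--a_guided_attn", type=str2bool, default=False)
    parser.add_argument("--global_local", type=str2bool, default=False)
    parser.add_argument("--batch-size", type=int, default=64)  # stored as batch_size
    parser.add_argument("--epochs", type=int, default=30)
    parser.add_argument("--lr", type=float, default=1e-4)
    parser.add_argument("--gpu", type=str, default="0")
    parser.add_argument("--checkpoint", type=str, default="PSTP_Net")
    parser.add_argument("--model_save_dir", type=str, default="models_pstp")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```

With this converter, `--spat_select False` disables the switch instead of silently enabling it. Note that `argparse` stores `--batch-size` under the attribute name `batch_size`.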
- Testing
python main_test.py
If you find this work useful, please consider citing it.
coming soon!
This research was supported by Public Computing Cloud, Renmin University of China.