Repetitive Activity Counting by Sight and Sound (CVPR 2021)
Yunhua Zhang, Ling Shao, Cees G.M. Snoek
demo.mp4
- Python 3.7.4
- PyTorch 1.4.0
- librosa 0.8.0
- opencv 3.4.2
- tqdm 4.54.1
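- The dependencies can be installed with pip, for example (the exact opencv-python build below is an assumption; any package providing OpenCV 3.4.2 should work):
pip install torch==1.4.0 librosa==0.8.0 opencv-python==3.4.2.17 tqdm==4.54.1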
- We provide an example video and the corresponding audio file, which exhibit the scale variation challenge, for the demo code.
- The pretrained checkpoints of our model can be downloaded at this link.
- To run the demo code:
python run_demo.py
- The "VGGSound" folder is modified from the original VGGSound repository.
- The "sync_batchnorm" folder comes from this repository.
- As cited in the paper, the regression function for counting uses the technique proposed in this paper "Deep expectation of real and apparent age from a single image without facial landmarks".
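- Below is a minimal sketch of that expectation-based regression idea, assuming the count range is discretized into bins; the class name, feature size, and maximum count are illustrative and not taken from this repository.

import torch
import torch.nn as nn

class ExpectationCountingHead(nn.Module):
    # Illustrative DEX-style head: classify over discrete count bins,
    # then take the softmax-weighted expectation as the regressed count.
    def __init__(self, in_features=512, max_count=32):
        super().__init__()
        self.fc = nn.Linear(in_features, max_count)
        # assumed discretization: counts 1..max_count
        self.register_buffer("bins", torch.arange(1, max_count + 1).float())

    def forward(self, x):
        probs = torch.softmax(self.fc(x), dim=-1)   # (batch, max_count)
        return (probs * self.bins).sum(dim=-1)      # expected count per sample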
- Some variables with "sr" in the names (sample rate) are for the temporal stride decision module.
- The performance of the released model on Countix-AV and Extreme Countix-AV is slightly higher than that reported in the paper, due to some hyperparameter adjustments.
- In our experiment, we extract the audio files (.wav) from videos with "moviepy", using the following code:
import moviepy.editor as mp
# cut the segment of interest from the video, then write its audio track to a .wav file
clip = mp.VideoFileClip(path_to_video).subclip(start_time, end_time)
clip.audio.write_audiofile(path_for_save)
If you want our extracted audio files, please send me an email or create an issue with your email address.
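A possible batch version of the extraction above, assuming pandas is available and that the released CSVs contain columns named video_id, repetition_start, and repetition_end (adjust these to the actual headers):

import pandas as pd
import moviepy.editor as mp

df = pd.read_csv("CountixAV_train.csv")
for _, row in df.iterrows():
    # cut each annotated segment and save its audio track as .wav
    clip = mp.VideoFileClip("videos/%s.mp4" % row["video_id"]).subclip(row["repetition_start"], row["repetition_end"])
    clip.audio.write_audiofile("audio/%s.wav" % row["video_id"])
    clip.close()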
For the following code, we train the modules separately, so two NVIDIA 1080Ti GPUs are enough for training. The visual model is trained on Countix, while the audio model and the cross-modal modules are trained on Countix-AV. The resulting overall model is meant to be tested on Countix-AV; to test on the Countix dataset, the reliability estimation module should be retrained on Countix. The hyperparameters influence performance to some extent; see the supplementary material for more details. Specifically, we try the number of branches from 20 to 50 to find the best one, and for the margin of the temporal stride decision module we try values from 1.0 to 3.0.
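As a toy illustration of how a reliability score for the audio stream could blend the two modalities' count predictions at test time (this is a simplified stand-in, not the exact fusion rule of the released code):

import torch

def fuse_counts(visual_count, audio_count, reliability):
    # reliability in [0, 1]: weight the audio prediction by how trustworthy
    # the sound signal is estimated to be, and the visual one by the rest
    return reliability * audio_count + (1.0 - reliability) * visual_count

# hypothetical values for one clip
print(fuse_counts(torch.tensor(7.2), torch.tensor(8.1), torch.tensor(0.3)))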
- Train the visual counting model
python train.py
Then, generate counting predictions with the trained model for each sample rate from 1 to 7 (a sketch of stride-based frame sampling is given below). After that, run this script to get the csv file for training the temporal stride decision module:
python generate_csv4sr.py
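The sample rate ("sr") here is a temporal stride over frames; a minimal sketch of sampling a clip at a given stride with OpenCV is shown below (the frame count and file name are placeholders, not the repository's actual settings):

import cv2

def sample_frames(video_path, stride, num_frames=64):
    # keep every stride-th frame, so larger strides span a longer time window
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while len(frames) < num_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

# e.g. gather inputs for the stride sweep before running the visual model
# clips = {sr: sample_frames("example.mp4", sr) for sr in range(1, 8)}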
- Train the temporal stride decision module based on the visual modality only
python train_sr.py
- Train the temporal stride decision module based on sight and sound
python train_sr_audio.py
- Train the audio counting model
python train_audio.py
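For reference, a minimal librosa snippet for turning a .wav file into a log-mel spectrogram, a typical input representation for the sound stream; the sampling rate, FFT size, and number of mel bands here are assumptions, not the repository's settings:

import numpy as np
import librosa

y, sr = librosa.load("example.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)   # (n_mels, time) array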
- Train the reliability estimation module
python train_conf.py
- Here we use the R(2+1)D model; replacing it with a stronger backbone, e.g. one from mmaction2, should yield better performance (a small sketch appears below).
- The code provided by https://github.com/Xiaodomgdomg/Deep-Temporal-Repetition-Counting is helpful.
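- As a sketch of the backbone swap mentioned above, torchvision's R(2+1)D-18 can be loaded as a clip-level feature extractor, assuming torchvision is installed; the input resolution and clip length below are just examples, and the repository's actual backbone configuration may differ.

import torch
from torchvision.models.video import r2plus1d_18

backbone = r2plus1d_18(pretrained=True)
backbone.fc = torch.nn.Identity()        # keep the 512-d pooled clip feature
clip = torch.randn(1, 3, 16, 112, 112)   # (batch, channels, frames, height, width)
features = backbone(clip)                # -> (1, 512)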
We provide the train, validation, and test sets of the Countix-AV dataset in CountixAV_train.csv, CountixAV_val.csv, and CountixAV_test.csv.
The dataset can be downloaded at this link.
If you have any problems with the code, feel free to send an email to me: y.zhang9@uva.nl or create an issue.