This repo hosts menovideo, the code accompanying the paper "Data Efficient Video Transformer for Violence Detection" (DeVTr).
One of the big challenges facing computer-vision researchers working with transformers, especially on video tasks, is the need for large datasets and high computational resources. Our method, DeVTr (Data Efficient Video Transformer for Violence Detection), was designed to overcome both of these constraints.
In this work, we propose a data-efficient video transformer (DeVTr) based on the transformer network as a spatio-temporal learning method, with a pre-trained 2D convolutional neural network (2D-CNN) as the embedding layer for the input data. The model was trained and tested on the Real-Life Violence Situations (RLVS) dataset and achieved an accuracy of 96.25%. A comparison with previous techniques shows that the suggested method provides the best result among prior studies on violence detection.
On the RLVS dataset the model reaches 96.25% accuracy; it is also worth mentioning that it outperformed TimeSformer in memory efficiency, convergence speed, and accuracy.
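To make the recipe concrete, here is a minimal PyTorch sketch of the idea described above (a pre-trained 2D-CNN embedding feeding a transformer encoder). The backbone choice, layer sizes, and mean-pooling head are illustrative assumptions, not the paper's exact configuration:

import torch
import torch.nn as nn
import torchvision.models as models

class DeVTrSketch(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_layers=2, num_classes=2):
        super().__init__()
        # Pre-trained 2D-CNN used as a per-frame embedding layer (VGG19-BN here,
        # guessed from the checkpoint name below; the real backbone may differ).
        backbone = models.vgg19_bn(pretrained=True).features
        self.embed = nn.Sequential(backbone, nn.AdaptiveAvgPool2d(1), nn.Flatten())
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):                                    # x: (batch, frames, 3, H, W)
        b, t = x.shape[:2]
        feats = self.embed(x.reshape(b * t, *x.shape[2:]))   # (b*t, 512) per-frame features
        feats = feats.reshape(b, t, -1)                      # (batch, frames, 512)
        out = self.encoder(feats)                            # self-attention across frames
        return self.head(out.mean(dim=1))                    # pool over time, then classify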
Comparison of DeVTr vs. other methods on the RLVS dataset
Saliency map for a random video of a violent action
The menovideo package helps you build video action recognition / video understanding models. It offers:
1- model building with our novel DeVTr architecture, with full customization
2- a video dataset reader and preprocessing utilities that make it easy to read videos and turn them into PyTorch-ready DataLoaders
3- a TimeDistributed wrapper, similar to the Keras TimeDistributed wrapper, which makes it easy to build classical CNN+LSTM models (see the sketch after this list)
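For context, this is a minimal sketch of what a Keras-style TimeDistributed wrapper looks like in PyTorch (illustrative only; menovideo's own wrapper may differ in details):

import torch
import torch.nn as nn

class TimeDistributed(nn.Module):
    # Applies the wrapped module independently to every time step of a
    # (batch, time, ...) tensor, like Keras' TimeDistributed layer.
    def __init__(self, module):
        super().__init__()
        self.module = module

    def forward(self, x):
        b, t = x.shape[0], x.shape[1]
        # Fold time into the batch axis, run the module, then unfold.
        y = self.module(x.reshape(b * t, *x.shape[2:]))
        return y.reshape(b, t, *y.shape[1:])

Wrapping a 2D CNN this way yields per-frame features that can be fed straight into an LSTM, which is the classical CNN+LSTM pattern mentioned above.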
DeVTr is a novel transformer network combined with a convolutional network, designed to deliver highly accurate video action recognition with limited data and hardware resources.
Install:
pip install menovideo
Import it:
import menovideo.menovideo as menoformer
import menovideo.videopre as vide_reader
Initialize the DeVTr model without pre-trained weights:
model = menoformer.DeVTr()
Initialize DeVTr with pre-trained weights; the trained weights can be downloaded from this URL:
weights = 'drive/MyDrive/Colab Notebooks/transformers/violance-detaction-myresearch/vg19bn40convtransformer-ep-0.pth'
model2 = menoformer.DeVTr(w=weights, base='default')
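A quick hypothetical smoke test; the (batch, frames, channels, height, width) layout and the 40-frame, 160x160 sizes are assumptions, so check package_test.ipynb for the real values:

import torch

model2.eval()
dummy = torch.randn(1, 40, 3, 160, 160)  # assumed input layout and sizes
with torch.no_grad():
    scores = model2(dummy)
print(scores.shape)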
The video reader and preprocessing helper takes the following parameters:
- a pandas DataFrame containing the path and label of each video
- timesep: the number of frames sampled from a single video
- rgb: the number of color channels
- h: the height of each frame
- w: the width of each frame
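For illustration, a hypothetical valid_df and parameter values; the column names and sizes here are assumptions, so match whatever package_test.ipynb uses:

import pandas as pd

# hypothetical columns and values
valid_df = pd.DataFrame({
    'path': ['videos/violent_001.mp4', 'videos/normal_001.mp4'],
    'label': [1, 0],
})
time_stp, RGB, H, W = 40, 3, 160, 160  # assumed frame count and frame size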
valid_dataset = vide_reader.TaskDataset(valid_df, timesep=time_stp, rgb=RGB, h=H, w=W)
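Since TaskDataset is described as producing PyTorch-ready data, a DataLoader can be wrapped around it in the usual way (the batch size here is arbitrary):

from torch.utils.data import DataLoader

valid_loader = DataLoader(valid_dataset, batch_size=4, shuffle=False)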
For a detailed example of using the library, see package_test.ipynb.
To cite our paper/code:
@INPROCEEDINGS{9530829,
  author={Abdali, Almamon Rasool},
  booktitle={2021 IEEE International Conference on Communication, Networks and Satellite (COMNETSAT)},
  title={Data Efficient Video Transformer for Violence Detection},
  year={2021},
  volume={},
  number={},
  pages={195-199},
  doi={10.1109/COMNETSAT53002.2021.9530829}}