Yet another multimodal video feature extractor.
- unimodal: audio-only, visual-only
- multimodal: audio, visual, text
- multi-GPU: supports running on multiple GPUs
- multilingual: English and Japanese VLM backbones
- synchronization: temporally aligned audio-visual features (the audio and visual feature sequences should have the same length)
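The synchronization point above can be illustrated with a minimal sketch. It assumes NumPy and uses plain linear interpolation to bring an audio feature sequence to the same length as a visual one. The function name `resample_features` and the array shapes are hypothetical, and this is not necessarily how this project aligns its features.

```python
import numpy as np


def resample_features(features, target_len):
    """Linearly interpolate a (T, D) feature sequence to (target_len, D).

    Hypothetical helper for illustration only; not part of this package.
    """
    features = np.asarray(features, dtype=np.float64)
    src_len = features.shape[0]
    if src_len == target_len:
        return features.copy()
    # Put both sequences on the same normalized time axis [0, 1].
    src_t = np.linspace(0.0, 1.0, num=src_len)
    dst_t = np.linspace(0.0, 1.0, num=target_len)
    # Interpolate each feature dimension independently along time.
    columns = [
        np.interp(dst_t, src_t, features[:, d])
        for d in range(features.shape[1])
    ]
    return np.stack(columns, axis=1)


# Example shapes (made up): 98 audio frames vs. 25 visual frames.
audio = np.random.rand(98, 128)
visual = np.random.rand(25, 512)
audio_aligned = resample_features(audio, visual.shape[0])
print(audio_aligned.shape, visual.shape)
```

After resampling, both sequences share the time dimension, so they can be concatenated or fused frame by frame.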
FFmpeg
apt install ffmpeg
PyTorch
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 torchtext==0.16.0 --index-url https://download.pytorch.org/whl/cu118
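After installing the requirements above, a quick check can confirm that they are visible from Python. The following is a small sketch using only the standard library. The helper name `check_environment` is hypothetical and not part of this package.

```python
import importlib.util
import shutil


def check_environment(packages=("torch", "torchvision", "torchaudio", "torchtext")):
    """Return a dict mapping each requirement to True/False availability.

    Hypothetical helper for illustration only.
    """
    # FFmpeg is a command-line tool, so look for it on PATH.
    status = {"ffmpeg": shutil.which("ffmpeg") is not None}
    # Python packages are checked without importing them.
    for name in packages:
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

This only reports whether each requirement can be located. It does not verify versions or CUDA support.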
- TIMM models
Action
- I3D
- SlowFast
- VideoMAE
Optical flow
- RAFT
Audio-only
- PANNs
- VGGish
Image-text
- CLIP
- Japanese CLIP
Video-text
- CLIP4Clip
- InternVideo
Audio-text
- CLAP (Microsoft)
- CLAP (LAION)
pytest tests
mypy firefly
ruff check firefly