[WIP] Yet another multimodal video-audio feature extractor based on recent research

awkrail/firefly

firefly

Yet another multimodal video feature extractor.

Features

  • unimodal: audio-only and visual-only features
  • multimodal: audio, visual, and text features
  • multi-GPU: extraction across multiple GPUs
  • multilingual: English and Japanese VLM backbones
  • synchronization: time-aligned audio-visual features (the audio and visual sequence lengths should be the same)
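The synchronization goal above (audio and visual features sharing one sequence length) can be illustrated with a small resampling sketch. This is not firefly's implementation. The function name `resample_features` and the choice of linear interpolation are assumptions made for illustration, and plain Python lists stand in for tensors so the example runs without extra dependencies.

```python
from typing import List, Sequence


def resample_features(
    features: Sequence[Sequence[float]], target_len: int
) -> List[List[float]]:
    """Linearly interpolate a (T, D) feature sequence to (target_len, D).

    Hypothetical helper for illustration only; not part of firefly.
    """
    src_len = len(features)
    if src_len == 0 or target_len <= 0:
        raise ValueError("features must be non-empty and target_len positive")
    if src_len == 1:
        # A single time step can only be repeated.
        return [list(features[0]) for _ in range(target_len)]
    if target_len == 1:
        return [list(features[0])]

    # Map each output index onto a fractional position in the source sequence.
    scale = (src_len - 1) / (target_len - 1)
    out: List[List[float]] = []
    for i in range(target_len):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, src_len - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(features[lo], features[hi])])
    return out


# Toy example: 3 visual steps stretched to match 5 audio steps.
visual = [[0.0], [2.0], [4.0]]
audio = [[0.0], [1.0], [2.0], [3.0], [4.0]]
visual_synced = resample_features(visual, len(audio))
assert len(visual_synced) == len(audio)
```

Interpolating the shorter stream up to the longer one keeps every original time step. Pooling the longer stream down would be an equally valid choice when memory matters more than temporal resolution.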

Libraries

FFmpeg

apt install ffmpeg

PyTorch

pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 torchtext==0.16.0 --index-url https://download.pytorch.org/whl/cu118
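After installing the libraries above, a quick check can confirm that the `ffmpeg` binary is on `PATH` and the PyTorch packages are importable. The sketch below uses only the standard library. The helper name `check_environment` is an assumption for illustration and is not part of firefly.

```python
import importlib.util
import shutil
from typing import Dict, Iterable


def check_environment(
    binaries: Iterable[str] = ("ffmpeg",),
    modules: Iterable[str] = ("torch", "torchvision", "torchaudio", "torchtext"),
) -> Dict[str, bool]:
    """Report whether each binary is on PATH and each module can be found.

    Hypothetical helper for illustration only; it does not import the
    modules, it only checks that Python can locate them.
    """
    status: Dict[str, bool] = {}
    for name in binaries:
        status[name] = shutil.which(name) is not None
    for name in modules:
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

This only verifies presence, not versions. Comparing `torch.__version__` against the pinned `2.1.0` would need an actual import.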

Models

Vision-only

  • TIMM models

Action

  • I3D
  • SlowFast
  • VideoMAE

Optical flow

  • RAFT

Audio-only

  • PANNs
  • VGGish

Image-text

  • CLIP
  • Japanese CLIP

Video-text

  • CLIP4Clip
  • InternVideo

Audio-text

  • CLAP (Microsoft)
  • CLAP (LAION)

Test

pytest tests

Mypy + Ruff

mypy firefly
ruff check firefly
