This is the PyTorch code for "Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing". The code is built on top of the AV-HuBERT codebase.
We propose a novel framework, Visual Speech Processing incorporated with LLMs (VSP-LLM), that maximizes context modeling ability by leveraging the power of LLMs. Specifically, VSP-LLM is designed to perform both visual speech recognition and translation, where the given instruction controls the type of task. The input video is mapped to the input latent space of an LLM by employing a self-supervised visual speech model. Because the input frames contain redundant information, we propose a novel deduplication method that reduces the number of embedded visual features by employing visual speech units. Through the proposed deduplication and Low-Rank Adaptation (LoRA), VSP-LLM can be trained in a computationally efficient manner.
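The deduplication idea can be illustrated with a small sketch. The text above only states that visual speech units are used to reduce the embedded visual features; the sketch below assumes that consecutive frames sharing the same unit are merged by averaging their feature vectors. The function name `deduplicate` and the toy data are hypothetical and are not taken from the repository.

```python
from itertools import groupby


def deduplicate(features, units):
    """Average feature vectors over runs of consecutive identical units.

    Assumption: one unit id per frame, and frames in the same run are
    merged by a plain mean. The real model may differ in detail.
    """
    if len(features) != len(units):
        raise ValueError("features and units must have the same length")
    reduced = []
    start = 0
    for _, run in groupby(units):
        length = sum(1 for _ in run)          # frames in this run
        chunk = features[start:start + length]
        dim = len(chunk[0])
        reduced.append([sum(vec[d] for vec in chunk) / length for d in range(dim)])
        start += length
    return reduced


# Four frames, where the first two share unit 7.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
units = [7, 7, 3, 7]
print(deduplicate(features, units))  # [[2.0, 3.0], [5.0, 6.0], [7.0, 8.0]]
```

Note that only adjacent repeats are merged: the final frame keeps its own vector even though unit 7 appeared earlier, so the temporal order of the sequence is preserved.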
You can find the checkpoint of our model here.
Move the checkpoint to the checkpoints directory.
conda create -n vsp-llm python=3.9 -y
conda activate vsp-llm
git clone https://github.com/Sally-SH/VSP-LLM.git
cd VSP-LLM
pip install -r requirements.txt
cd fairseq
pip install --editable ./
- Download the AV-HuBERT pre-trained model AV-HuBERT Large (LRS3 + VoxCeleb2) from here.
- Download LLaMA2-7B from here.
Move the AV-HuBERT pre-trained model checkpoint and the LLaMA2-7B checkpoint to the checkpoints directory.
Follow the Auto-AVSR preparation to preprocess the LRS3 dataset.
Then, follow the AV-HuBERT preparation from step 3 to create the manifest of the LRS3 dataset.
Follow the steps in clustering to create {train,valid}.km frame-aligned pseudo label files. The label_rate is the same as the feature frame rate used for clustering, which is 25 Hz for AV-HuBERT features by default.
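As a quick sanity check on the label files, the number of labels per utterance can be related to clip duration through the label rate. The sketch below assumes that each line of a .km file holds the space-separated cluster ids of one utterance; this layout is an assumption, and the helper names are hypothetical.

```python
LABEL_RATE_HZ = 25  # default frame rate of AV-HuBERT features, per the text above


def parse_km_line(line):
    """Parse one line of space-separated integer cluster ids (assumed format)."""
    return [int(token) for token in line.split()]


def duration_seconds(labels, label_rate=LABEL_RATE_HZ):
    """Duration covered by frame-aligned labels at the given label rate."""
    return len(labels) / label_rate


labels = parse_km_line("12 12 12 7 7 431 431 431 431 9")
print(len(labels), duration_seconds(labels))  # 10 labels cover 0.4 seconds
```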
.
├── lrs3
│   ├── lrs3_video_seg24s        # Preprocessed video and audio data
│   └── lrs3_text_seg24s         # Preprocessed text data
├── muavic_dataset               # Mix of VSR data and VST(En-X) data
│   ├── train.tsv                # List of audio and video path for training
│   ├── train.wrd                # List of target label for training
│   ├── train.cluster_counts     # List of clusters to deduplicate speech units in training
│   ├── test.tsv                 # List of audio and video path for testing
│   ├── test.wrd                 # List of target label for testing
│   └── test.cluster_counts      # List of clusters to deduplicate speech units in testing
└── test_data
    ├── vsr
    │   └── en
    │       ├── test.tsv
    │       ├── test.wrd
    │       └── test.cluster_counts
    └── vst
        └── en
            ├── es
            :   ├── test.tsv
            :   ├── test.wrd
            :   └── test.cluster_counts
            └── pt
                ├── test.tsv
                ├── test.wrd
                └── test.cluster_counts
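Before training or decoding, it can help to confirm that a data directory contains the three files the layout above lists for each split. The following is a minimal sketch; `missing_split_files` is a hypothetical helper and not part of the repository, and the demo runs against a temporary directory rather than a real dataset.

```python
import tempfile
from pathlib import Path

# File suffixes listed per split in the directory layout above.
REQUIRED_SUFFIXES = (".tsv", ".wrd", ".cluster_counts")


def missing_split_files(data_dir, split):
    """Return the names of required files for `split` that are absent from `data_dir`."""
    data_dir = Path(data_dir)
    return [
        f"{split}{suffix}"
        for suffix in REQUIRED_SUFFIXES
        if not (data_dir / f"{split}{suffix}").is_file()
    ]


# Demo: a directory that only contains test.tsv.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "test.tsv").touch()
    demo_missing = missing_split_files(tmp, "test")
print(demo_missing)  # ['test.wrd', 'test.cluster_counts']
```

An empty list means the split is complete, so a call such as `missing_split_files("muavic_dataset", "train")` could be used as a pre-flight check.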
The test manifest is provided in labels. You need to replace the LRS3 path in the manifest file with the path to your preprocessed LRS3 dataset using the following command:
cd src/dataset
python replace_path.py --lrs3 /path/to/lrs3
The modified test manifest is then saved in the dataset directory.
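For intuition only, the kind of rewrite this step performs can be sketched as a prefix substitution over tab-separated manifest fields. This is not the implementation of replace_path.py: the manifest columns, the old root, and the function `replace_root` are all assumptions made for illustration.

```python
def replace_root(manifest_text, old_root, new_root):
    """Swap `old_root` for `new_root` at the start of any tab-separated field.

    Hypothetical illustration; the real script and manifest layout may differ.
    """
    out_lines = []
    for line in manifest_text.splitlines():
        fields = [
            new_root + field[len(old_root):] if field.startswith(old_root) else field
            for field in line.split("\t")
        ]
        out_lines.append("\t".join(fields))
    return "\n".join(out_lines)


# Made-up manifest row: id, video path, audio path, two frame counts.
row = "utt1\t/old/lrs3/video/a.mp4\t/old/lrs3/audio/a.wav\t50\t32000"
print(replace_root(row, "/old/lrs3", "/path/to/lrs3"))
```

Fields that do not start with the old root, such as ids and frame counts, pass through unchanged.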
Open the training script (scripts/train.sh) and replace these variables:
# path to train dataset dir
DATA_PATH=???
# path where output trained models will be located
OUT_PATH=???
Run the training script:
$ bash scripts/train.sh
Open the decoding script (scripts/decode.sh) and replace these variables:
# language direction (e.g., 'en' for the VSR task / 'en-es' for the En-to-Es VST task)
LANG=???
# path to the trained model
MODEL_PATH=???
# path where decoding results and scores will be located
OUT_PATH=???
Run the decoding script:
$ bash scripts/decode.sh
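The LANG value used by the decoding script encodes the task: a single language code selects VSR, and a source-target pair selects VST. The sketch below shows one way to interpret such a value and map it onto the test_data layout described earlier. Both functions are hypothetical helpers, and whether the decoding script resolves paths this way is an assumption.

```python
from pathlib import PurePosixPath


def parse_lang(lang):
    """Split a language direction into (task, source, target).

    'en' -> ('vsr', 'en', None); 'en-es' -> ('vst', 'en', 'es').
    """
    parts = lang.split("-")
    if len(parts) == 1 and parts[0]:
        return "vsr", parts[0], None
    if len(parts) == 2 and all(parts):
        return "vst", parts[0], parts[1]
    raise ValueError(f"unsupported language direction: {lang!r}")


def eval_data_dir(lang, root="test_data"):
    """Directory under `root` that holds the test files for `lang` (assumed layout)."""
    task, source, target = parse_lang(lang)
    path = PurePosixPath(root) / task / source
    return path / target if target else path


print(eval_data_dir("en"))     # test_data/vsr/en
print(eval_data_dir("en-es"))  # test_data/vst/en/es
```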