Dev mzy #73

Merged: 2 commits, May 20, 2024
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/bug.yml
@@ -6,7 +6,7 @@ body:
   attributes:
     value: >
       #### Before submitting a bug, please make sure the issue hasn't been already addressed by searching through [the
-      existing and past issues](https://github.com/facebookresearch/llama-recipes/issues), the [FAQ](https://github.com/facebookresearch/llama-recipes/blob/main/docs/FAQ.md)
+      existing and past issues](https://github.com/ddlBoJack/SLAM-LLM/issues).

 - type: textarea
   id: system-info
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/feature-request.yml
@@ -1,5 +1,5 @@
 name: 🚀 Feature request
-description: Submit a proposal/request for a new llama-recipes feature
+description: Submit a proposal/request for a new slam-llm feature

 body:
   - type: textarea
7 changes: 1 addition & 6 deletions .gitignore
@@ -9,9 +9,4 @@ wandb/
 log/
 *.log
 outputs/
-data/
-
-.gitignore
-examples/vsr_LRS3/scripts/decode_avhubert_vo_vicuna_7b_noself.sh
-examples/asr_librispeech/scripts/decode_hubert_xtralarge_linear_vicuna_7b_copy.sh
-examples/vsr_LRS3/scripts/decode_avhubert_vo_vicuna_7b_copy.sh
+data/
3 changes: 3 additions & 0 deletions README.md
@@ -27,6 +27,7 @@ developers to train custom multimodal large language model (MLLM), focusing on
 5. [Acknowledge](#acknowledge)

 # News
+- [Update May. 20, 2024] Recipes for [music caption (MC)](examples/mc_musiccaps/README.md) has been supported.
 - [Update May. 8, 2024] Recipes for [visual speech recognition (VSR)](examples/vsr_LRS3/README.md) has been supported.
 - [Update May. 4, 2024] Recipes for [zero-shot text-to-speech (TTS)](examples/vallex/README.md) has been supported.
 - [Update Apr. 28, 2024] Recipes for [automated audio captioning (AAC)](examples/aac_audiocaps/README.md) has been supported.
@@ -66,6 +67,8 @@ We provide reference implementations of various LLM-based speech, audio, and music
 - [Visual Speech Recognition (VSR)](examples/vsr_LRS3/README.md)
 - **Audio Task**
 - [Automated Audio Captioning (AAC)](examples/aac_audiocaps/README.md)
+- **Music Task**
+- [Music Caption (MC)](examples/mc_musiccaps/README.md)

 ## Configuration Priority
 We provide hierarchical configuration inheritance relationships as follows:
(three ASR decode scripts follow; their file paths were not captured in this view)
@@ -55,7 +55,3 @@ python $code_dir/inference_asr_batch.py \
 # ++dataset_config.normalize=true \
 # ++model_config.encoder_projector=q-former \
 # ++dataset_config.fix_length_audio=64 \
-
-python src/slam_llm/utils/whisper_tn.py ${decode_log}_gt ${decode_log}_gt.proc
-python src/slam_llm/utils/whisper_tn.py ${decode_log}_pred ${decode_log}_pred.proc
-python src/slam_llm/utils/compute_wer.py ${decode_log}_gt.proc ${decode_log}_pred.proc ${decode_log}.proc.wer
@@ -52,7 +52,3 @@ python $code_dir/inference_asr_batch.py \
 # ++dataset_config.normalize=true \
 # ++model_config.encoder_projector=q-former \
 # ++dataset_config.fix_length_audio=64 \
-
-python src/slam_llm/utils/whisper_tn.py ${decode_log}_gt ${decode_log}_gt.proc
-python src/slam_llm/utils/whisper_tn.py ${decode_log}_pred ${decode_log}_pred.proc
-python src/slam_llm/utils/compute_wer.py ${decode_log}_gt.proc ${decode_log}_pred.proc ${decode_log}.proc.wer
@@ -51,7 +51,3 @@ python $code_dir/inference_asr_batch.py \
 # ++dataset_config.normalize=true \
 # ++model_config.encoder_projector=q-former \
 # ++dataset_config.fix_length_audio=64 \
-
-python src/slam_llm/utils/whisper_tn.py ${decode_log}_gt ${decode_log}_gt.proc
-python src/slam_llm/utils/whisper_tn.py ${decode_log}_pred ${decode_log}_pred.proc
-python src/slam_llm/utils/compute_wer.py ${decode_log}_gt.proc ${decode_log}_pred.proc ${decode_log}.proc.wer
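The hunks above remove the inline Whisper text normalization and WER scoring from the ASR decode scripts (the standalone scripts/compute_wer.sh is also deleted further down). For a quick offline sanity check after decoding, a minimal sketch along these lines can score the two decode logs; it assumes the third-party jiwer package and a hypothetical "utt_id text" line format, neither of which this PR specifies.

# Minimal WER check for decode logs; assumes the third-party jiwer package and a
# hypothetical "utt_id text" per-line log format, not anything defined by this PR.
import sys
from jiwer import wer

def read_log(path):
    texts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split(maxsplit=1)
            if len(parts) == 2:
                texts[parts[0]] = parts[1]
    return texts

if __name__ == "__main__":
    gt, pred = read_log(sys.argv[1]), read_log(sys.argv[2])
    keys = sorted(set(gt) & set(pred))  # score only utterances present in both logs
    print(f"utterances: {len(keys)}  WER: {wer([gt[k] for k in keys], [pred[k] for k in keys]):.4f}")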
(music caption recipe README follows; its file path was not captured in this view)
@@ -1,9 +1,9 @@
-# Music Caption
+# MC_MusicCaps

 ## Performance and checkpoints
 Here is a recipe for music captioning, using MusicFM as encoder. We only train the linear projector. For more about MusicFM and its checkpoints, please refer to [this repository](https://github.com/minzwon/musicfm).

-The following results are obtained by training on the LP-MusicCaps-MC training set and evaluating on the LP-MusicCaps-MC test set.
+The following results are obtained by training on the [LP-MusicCaps-MC](https://huggingface.co/datasets/seungheondoh/LP-MusicCaps-MC) training set and evaluating on the [LP-MusicCaps-MC](https://huggingface.co/datasets/seungheondoh/LP-MusicCaps-MC) test set.
 Encoder | Projector | LLM | BLEU-1 | METEOR | SPICE | SPIDER
 |---|---|---|---|---|---|---
 [MusicFM(pretrained with MSD)](https://huggingface.co/minzwon/MusicFM/resolve/main/pretrained_msd.pt) | [Linear](https://drive.google.com/file/d/1-9pob6QvJRoq5Dy-LZbiDfF6Q7QRO8Au/view?usp=sharing)(~18.88M) | [vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5) | 25.6 | 10.0 | 8.7 | 6.9
78 changes: 78 additions & 0 deletions examples/mc_musiccaps/model/slam_model_mir.py
@@ -0,0 +1,78 @@
import torch
import os
import logging
from slam_llm.models.slam_model import (
    slam_model,
    setup_tokenizer,
    setup_encoder,
    setup_encoder_projector,
    setup_llm,
)
from slam_llm.utils.train_utils import print_model_size

logger = logging.getLogger(__name__)

def model_factory(train_config, model_config, **kwargs):
    # return necessary components for training
    tokenizer = setup_tokenizer(train_config, model_config, **kwargs)

    encoder = setup_encoder(train_config, model_config, **kwargs)

    # llm
    llm = setup_llm(train_config, model_config, **kwargs)

    # projector
    encoder_projector = setup_encoder_projector(
        train_config, model_config, **kwargs
    )
    model = slam_model_mir(
        encoder,
        llm,
        encoder_projector,
        tokenizer,
        train_config,
        model_config,
        **kwargs,
    )

    ckpt_path = kwargs.get(
        "ckpt_path", None
    )  # FIX(MZY): load model ckpt (mainly projector, related to model_checkpointing/checkpoint_handler.py: save_model_checkpoint_peft)

    if ckpt_path is not None:
        logger.info("loading other parts from: {}".format(ckpt_path))
        ckpt_dict = torch.load(ckpt_path, map_location="cpu")
        model.load_state_dict(ckpt_dict, strict=False)

    print_model_size(
        model,
        train_config,
        (
            int(os.environ["RANK"])
            if train_config.enable_fsdp or train_config.enable_ddp
            else 0
        ),
    )
    return model, tokenizer


class slam_model_mir(slam_model):
    # Task-specific subclass used by the music caption recipe; it currently
    # inherits slam_model behavior unchanged.
    def __init__(
        self,
        encoder,
        llm,
        encoder_projector,
        tokenizer,
        train_config,
        model_config,
        **kwargs,
    ):
        super().__init__(
            encoder,
            llm,
            encoder_projector,
            tokenizer,
            train_config,
            model_config,
            **kwargs,
        )
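One detail worth highlighting in model_factory above is the ckpt_path branch: it restores only the weights present in the checkpoint (mainly the projector) via load_state_dict(strict=False). The toy sketch below, which uses a stand-in nn.Module rather than the real slam_model_mir, illustrates that behavior.

# Toy illustration of partial checkpoint loading with strict=False; the module and
# key names here are stand-ins, not the real slam_model_mir.
import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder_projector = nn.Linear(8, 4)  # pretend this part was fine-tuned and saved
        self.llm_head = nn.Linear(4, 2)           # pretend this part is not in the checkpoint

model = Toy()
partial_ckpt = {
    "encoder_projector.weight": torch.zeros(4, 8),
    "encoder_projector.bias": torch.zeros(4),
}
missing, unexpected = model.load_state_dict(partial_ckpt, strict=False)
print("missing keys:", missing)        # llm_head.* are simply left at their current values
print("unexpected keys:", unexpected)  # []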
156 changes: 0 additions & 156 deletions examples/music_caption/model/slam_model_mir.py

This file was deleted.

4 changes: 0 additions & 4 deletions examples/vsr_LRS3/scripts/decode_avhubert_vo_vicuna_7b.sh
@@ -49,7 +49,3 @@ python $code_dir/inference_vsr_batch.py \
 # +dataset_config.normalize=true \
 # +model_config.encoder_projector=q-former \
 # +dataset_config.fix_length_audio=64 \
-
-python src/slam_llm/utils/whisper_tn.py ${decode_log}_gt ${decode_log}_gt.proc
-python src/slam_llm/utils/whisper_tn.py ${decode_log}_pred ${decode_log}_pred.proc
-python src/slam_llm/utils/compute_wer.py ${decode_log}_gt.proc ${decode_log}_pred.proc ${decode_log}.proc.wer
11 changes: 3 additions & 8 deletions pyproject.toml
@@ -5,12 +5,7 @@ build-backend = "hatchling.build"
 [project]
 name = "slam-llm"
 version = "0.0.1"
-authors = [
-  { name="Hamid Shojanazeri", email="hamidnazeri@meta.com" },
-  { name="Matthias Reso", email="mreso@meta.com" },
-  { name="Geeta Chauhan", email="gchauhan@meta.com" },
-]
-description = "To be done"
+description = "SLAM-LLM is a deep learning toolkit that allows researchers and developers to train custom multimodal large language model (MLLM), focusing on Speech, Language, Audio, Music processing. We provide detailed recipes for training and high-performance checkpoints for inference."
 readme = "README.md"
 requires-python = ">=3.8"
 classifiers = [
@@ -26,8 +21,8 @@ tests = ["pytest-mock"]
 auditnlg = ["auditnlg"]

 [project.urls]
-"Homepage" = "https://github.com/facebookresearch/llama-recipes/"
-"Bug Tracker" = "https://github.com/facebookresearch/llama-recipes/issues"
+"Homepage" = "https://github.com/ddlBoJack/SLAM-LLM"
+"Bug Tracker" = "https://github.com/ddlBoJack/SLAM-LLM/issues"

 [tool.hatch.build]
 exclude = [
19 changes: 0 additions & 19 deletions scripts/compute_wer.sh

This file was deleted.
