Merge pull request #190 from X-LANCE/dev-slam-omni
Add reproduction for SLAM-Omni
ddlBoJack authored Jan 22, 2025
2 parents ad0be72 + c9b7d9e commit 43b5293
Showing 219 changed files with 81,661 additions and 24 deletions.
6 changes: 6 additions & 0 deletions .gitignore
@@ -3,14 +3,20 @@ __pycache__
.ipynb_checkpoints
.vscode
debug.py
debug.ipynb
debug.sh
.idea/*
transformers
wandb/
log/
*.log
outputs/
data/
jobs/
debug/
audio/

examples/s2s/scripts/debug
examples/vsr_LRS3/scripts/decode_avhubert_vo_vicuna_7b_noself.sh
examples/asr_librispeech/scripts/decode_hubert_xtralarge_linear_vicuna_7b_copy.sh
examples/vsr_LRS3/scripts/decode_avhubert_vo_vicuna_7b_copy.sh
43 changes: 36 additions & 7 deletions README.md
@@ -28,6 +28,22 @@ developers to train custom multimodal large language model (MLLM), focusing on <
6. [Citation](#citation)

# News
- [Update Jan. 22, 2025] 🔥🔥🔥 Full reproduction (including all data preparation, model training, and inference) for [SLAM-Omni](examples/s2s/README.md) has been supported.
![](docs/slam-omni-model.png)
- SLAM-Omni is a **timbre-controllable** voice interaction system that requires only **single-stage training** and minimal resources to achieve high-quality, end-to-end speech dialogue, supporting multi-turn conversations in both Chinese and English. ([paper](https://arxiv.org/abs/2412.15649), [demo](https://slam-omni.github.io))
- We have fully reproduced the **training and inference** processes of SLAM-Omni and open-sourced all related training datasets. The provided code framework theoretically supports all codec-based spoken dialogue models. Additionally, we offer the reproduction code for [Mini-Omni](https://github.com/gpt-omni/mini-omni).

<table class="center">
<tr>
<td width=50% style="border: none">
<video controls autoplay loop src="https://github.com/user-attachments/assets/73597edb-0d66-453b-b10c-8cf8dd3cae18" muted="false"></video>
</td>
<td width=50% style="border: none">
<video controls autoplay loop src="https://github.com/user-attachments/assets/7a797491-0509-4da8-8662-f2107bd8856a" muted="false"></video>
</td>
</tr>
</table>

- [Update Nov. 17, 2024] Recipes for [LLM-Based Contextual ASR](examples/contextual_asr/README.md) have been supported.
- [Update Nov. 5, 2024] Recipes for [speech emotion captioning (SEC)](examples/sec_emotioncaps/README.md) with [emotion2vec](https://github.com/ddlBoJack/emotion2vec) as the encoder have been supported.
- [Update Oct. 12, 2024] Recipes for [SLAM-AAC](examples/slam_aac/README.md) with [EAT](https://github.com/cwx-worst-one/EAT) as the encoder have been supported.
@@ -94,13 +110,17 @@ We provide reference implementations of various LLM-based speech, audio, and mus
- Text-to-Speech (TTS)
- [VALL-E-X](examples/vallex/README.md)
- [Speech Emotion Captioning (SEC)](examples/sec_emotioncaps/README.md)
- Voice Interaction System
- [SLAM-Omni](examples/s2s/README.md)

- **Audio Task**
- [Automated Audio Captioning (AAC)](examples/aac_audiocaps/README.md)
- [SLAM-AAC](examples/slam_aac/README.md)
- [DRCap](examples/drcap_zeroshot_aac/README.md)

- Spatial Audio Understanding
- [BAT](examples/seld_spatialsoundqa/README.md)

- **Music Task**
- [Music Caption (MC)](examples/mc_musiccaps/README.md)

@@ -163,24 +183,33 @@ CoT-ST:
}
```

SLAM-Omni:
```
@article{chen2024slam,
title={SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training},
author={Chen, Wenxi and Ma, Ziyang and Yan, Ruiqi and Liang, Yuzhe and Li, Xiquan and Xu, Ruiyang and Niu, Zhikang and Zhu, Yanqiao and Yang, Yifan and Liu, Zhanxun and others},
journal={arXiv preprint arXiv:2412.15649},
year={2024}
}
```

## Audio Task
SLAM-AAC:
```
-@article{chen2024slam,
+@article{chen2025slam,
title={SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs},
author={Chen, Wenxi and Ma, Ziyang and Li, Xiquan and Xu, Xuenan and Liang, Yuzhe and Zheng, Zhisheng and Yu, Kai and Chen, Xie},
-journal={arXiv preprint arXiv:2410.09503},
-year={2024}
+journal={Proc. ICASSP},
+year={2025}
}
```
DRCap:
```
-@article{li2024drcap,
+@article{li2025drcap,
title={DRCap: Decoding CLAP Latents with Retrieval-augmented Generation for Zero-shot Audio Captioning},
author={Li, Xiquan and Chen, Wenxi and Ma, Ziyang and Xu, Xuenan and Liang, Yuzhe and Zheng, Zhisheng and Kong, Qiuqiang and Chen, Xie},
-journal={arXiv preprint arXiv:2410.09472},
-year={2024}
+journal={Proc. ICASSP},
+year={2025}
}
```
BAT:
@@ -191,4 +220,4 @@
journal={Proc. ICML},
year={2024}
}
```
Binary file added docs/slam-omni-model.png
147 changes: 147 additions & 0 deletions examples/s2s/README.md
@@ -0,0 +1,147 @@
# SLAM-Omni
[![Python 3.10](https://img.shields.io/badge/Python-3.10-blue.svg)](https://www.python.org/downloads/release/python-3100/) [![arXiv](https://img.shields.io/badge/arXiv-2412.15649-B31B1B.svg)](https://arxiv.org/abs/2412.15649) [![GitHub Demo Page](https://img.shields.io/badge/Github-Demo%20Page-orange.svg)](https://slam-omni.github.io/) [![License](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)

(*Reproduction of the [paper](https://arxiv.org/abs/2412.15649) SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training.*)

## Environment Setup
Set up the environment using the following commands after preparing the SLAM-LLM environment:
```bash
pip install -r ./examples/s2s/requirements.txt
```

Alternatively, you can use our provided Docker image:
```bash
docker pull worstchan/slam-omni:v0
docker run -it --gpus all --name slam-omni worstchan/slam-omni:v0 /bin/bash
```

## Data Preparation

Our project supports two data formats: **Parquet** and **JSONL**. The open-source datasets are available on the Hugging Face Hub in **Parquet** format. Example usage is provided in [this notebook](./demo/demo_data/demo.ipynb).

### Supported Datasets
We provide three re-synthesized datasets for SLAM-Omni training:
- [VoiceAssistant-400K](https://huggingface.co/datasets/worstchan/VoiceAssistant-400K-SLAM-Omni): Single-round English dialogue dataset.
- [UltraChat-300K](https://huggingface.co/datasets/worstchan/UltraChat-300K-SLAM-Omni): Multi-round English dialogue dataset.
- [Belle_1.4M](https://huggingface.co/datasets/worstchan/Belle_1.4M-SLAM-Omni): Multi-round Chinese dialogue dataset.

#### Usage
You can load any of these datasets using the following code:
```python
from datasets import load_dataset

# Replace "DATASET_NAME" with one of the following:
# - "worstchan/VoiceAssistant-400K-SLAM-Omni"
# - "worstchan/UltraChat-300K-SLAM-Omni"
# - "worstchan/Belle_1.4M-SLAM-Omni"

ds = load_dataset("DATASET_NAME")
```
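
If you just want to sanity-check the data before training, the snippet below is a minimal sketch for peeking at one sample. It assumes the Hub datasets expose a `train` split; consult the dataset cards for the authoritative split and column names.
```python
from datasets import load_dataset

# Minimal sketch: stream one sample instead of downloading all Parquet shards.
# Assumption: the dataset has a "train" split; check the dataset card if not.
ds = load_dataset("worstchan/VoiceAssistant-400K-SLAM-Omni", split="train", streaming=True)

sample = next(iter(ds))
print(sample.keys())  # inspect which columns (text, audio, etc.) are available
```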

### JSONL
We also support JSONL format for its concise structure. Below is an example:
```jsonl
{"key": "1", "source_wav": "/xxx/1.wav", "source_text": "Can you recommend some Chinese food for me?", "target_wav": "/xxx/1.wav", "target_text": "Sure! I recommend trying dumplings, Peking duck, and mapo tofu for a mix of flavors and textures in Chinese cuisine. These dishes offer a good balance of savory, spicy, and crispy elements."}
```
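
If you prepare your own data in this format, a small helper like the sketch below can write and sanity-check a manifest (the paths and output filename here are purely illustrative):
```python
import json

# Illustrative records only; point source_wav/target_wav at your own audio files.
records = [
    {
        "key": "1",
        "source_wav": "/path/to/user_turn_1.wav",
        "source_text": "Can you recommend some Chinese food for me?",
        "target_wav": "/path/to/assistant_turn_1.wav",
        "target_text": "Sure! I recommend trying dumplings, Peking duck, and mapo tofu.",
    },
]

required = {"key", "source_wav", "source_text", "target_wav", "target_text"}

with open("train.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        missing = required - rec.keys()
        assert not missing, f"record {rec['key']} is missing fields: {missing}"
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")  # one JSON object per line
```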

## Checkpoints
We reproduced the single-stage fine-tuning results of SLAM-Omni with a group size of **3**. The following checkpoints are available for download:
- [Single-Round Dialogue (English)](https://drive.google.com/drive/folders/1ZmM1h5ZTvS-piuN-msmctmZdi51GWLAu?usp=sharing): Trained on VoiceAssistant-400K.
- [Multi-Round Dialogue (English)](https://drive.google.com/drive/folders/1xBNrqR2LWC0uEjezjx4aUgdsbstisboS?usp=sharing): Trained on VoiceAssistant-400K and UltraChat-300K.
- [Multi-Round Dialogue (Chinese)](https://drive.google.com/drive/folders/1sExIp-UDdL37gb-mh9YlhuDIib0-wUVP?usp=sharing): Trained on Belle_1.4M.
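
The checkpoints are hosted as Google Drive folders. One way to fetch a folder non-interactively is `gdown` (an extra dependency, not necessarily listed in this example's requirements), sketched below with an illustrative output path:
```python
# pip install gdown  (assumed extra dependency, not part of this repo's requirements)
import gdown

# Single-round English checkpoint folder from the list above.
url = "https://drive.google.com/drive/folders/1ZmM1h5ZTvS-piuN-msmctmZdi51GWLAu?usp=sharing"
gdown.download_folder(url, output="ckpts/slam-omni-en-single-round", quiet=False)
```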


## Training

You can pre-train the S2S model using TTS or ASR tasks with our provided scripts, though we recommend proceeding directly to fine-tuning. Alternatively, you may directly train a TTS or ASR model under the SLAM-Omni framework. For detailed instructions, refer to the [pre-training README](./scripts/pretrain/README.md).

### Fine-tuning
We provide two primary fine-tuning options for **SLAM-Omni** modeling:
```bash
# Fine-tune with grouping strategy (Recommended)
bash ./examples/s2s/scripts/finetune/finetune_s2s_group.sh

# Fine-tune without grouping
bash ./examples/s2s/scripts/finetune/finetune_s2s.sh
```

We also include scripts for reproducing [Mini-Omni](https://github.com/gpt-omni/mini-omni). Note that this requires the original [VoiceAssistant-400K](https://huggingface.co/datasets/gpt-omni/VoiceAssistant-400K) dataset for training:
```bash
bash ./examples/s2s/scripts/finetune/mini-omni/finetune_s2s.sh
```

#### Note💫
Our framework theoretically supports training **any codec-based spoken dialogue model**: simply re-synthesize the target tokens (e.g., CosyVoice2 tokens) for your training data to make it compatible, as sketched below.
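
The sketch below is purely illustrative of that workflow: it walks over a JSONL manifest and attaches discrete codec token IDs to each record. `extract_codec_tokens` and the `target_token` field are hypothetical placeholders, not APIs or schema defined in this repo; substitute your codec's tokenizer and the field names your training config expects.
```python
import json
from typing import List


def extract_codec_tokens(wav_path: str) -> List[int]:
    """Hypothetical hook: run a codec tokenizer of your choice (e.g. CosyVoice2
    or SNAC) on the target audio and return a flat list of discrete token IDs."""
    raise NotImplementedError


with open("train.jsonl", encoding="utf-8") as src, \
        open("train_with_tokens.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        rec = json.loads(line)
        # Attach codec tokens for the target side; the field name is illustrative.
        rec["target_token"] = extract_codec_tokens(rec["target_wav"])
        dst.write(json.dumps(rec, ensure_ascii=False) + "\n")
```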

## Inference
We provide scripts for both **online** and **batch** inference. You can run inference with your own trained model or with the provided checkpoints. For detailed guidance, refer to the [inference README](./scripts/inference/README.md).



### Online Inference
Run the following commands for real-time inference:

```bash
# Multi-turn (Recommended)
bash ./examples/s2s/scripts/inference/inference_s2s_online_multi-round.sh

# Single-turn
bash ./examples/s2s/scripts/inference/inference_s2s_online.sh
```

For Mini-Omni modeling, use the following commands:
```bash
# Single-turn non-streaming
bash ./examples/s2s/scripts/inference/mini-omni/inference_s2s_online.sh

# Single-turn streaming
bash ./examples/s2s/scripts/inference/mini-omni/inference_s2s_online_stream.sh
```


### Batch Inference

For batch inference, ensure the data format matches the training format (**Parquet** or **JSONL**). Use the following commands:

```bash
# SLAM-Omni framework
bash ./examples/s2s/scripts/inference/inference_s2s_batch.sh

# Mini-Omni framework
bash ./examples/s2s/scripts/inference/mini-omni/inference_s2s_batch.sh
```

## TODO
- [ ] Add evaluation scripts.
- [ ] Add streaming inference scripts for SLAM-Omni.


<!-- ## Gradio Demo -->

## Citation
SLAM-Omni:
```bibtex
@article{chen2024slam,
title={SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training},
author={Chen, Wenxi and Ma, Ziyang and Yan, Ruiqi and Liang, Yuzhe and Li, Xiquan and Xu, Ruiyang and Niu, Zhikang and Zhu, Yanqiao and Yang, Yifan and Liu, Zhanxun and others},
journal={arXiv preprint arXiv:2412.15649},
year={2024}
}
```
Mini-Omni:
```bibtex
@article{xie2024mini,
title={Mini-omni: Language models can hear, talk while thinking in streaming},
author={Xie, Zhifei and Wu, Changqiao},
journal={arXiv preprint arXiv:2408.16725},
year={2024}
}
```

## Acknowledgement
- We borrow some code from [Mini-Omni](https://github.com/gpt-omni/mini-omni) for SNAC-based modeling.
- We borrow some code from [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) for the vocoder.


## License
Our code is released under the MIT License. The Chinese dialogue model is licensed under GPL-3.0 due to its use of Belle data and is intended for research purposes only.
Binary file added examples/s2s/audio_prompt/en/prompt_1.wav
Binary file not shown.
Binary file added examples/s2s/audio_prompt/en/prompt_2.wav
Binary file not shown.
Binary file added examples/s2s/audio_prompt/en/prompt_3.wav
Binary file not shown.
Binary file added examples/s2s/audio_prompt/en/prompt_4.wav
Binary file not shown.
Binary file added examples/s2s/audio_prompt/en/prompt_5.wav
Binary file not shown.
Binary file added examples/s2s/audio_prompt/en/prompt_6.wav
Binary file not shown.
Binary file added examples/s2s/audio_prompt/zh/prompt_1.wav
Binary file not shown.
Binary file added examples/s2s/audio_prompt/zh/prompt_2.wav
Binary file not shown.
Binary file added examples/s2s/audio_prompt/zh/prompt_3.wav
Binary file not shown.
Binary file added examples/s2s/audio_prompt/zh/prompt_4.wav
Binary file not shown.
Binary file added examples/s2s/audio_prompt/zh/prompt_5.wav
Binary file not shown.
Binary file added examples/s2s/audio_prompt/zh/prompt_6.wav
Binary file not shown.
19 changes: 19 additions & 0 deletions examples/s2s/conf/ds_config.json
@@ -0,0 +1,19 @@
{
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 1e-4
        }
    },
    "fp16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu"
        }
    }
}
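
For reference, the effective global batch size implied by this config is `train_micro_batch_size_per_gpu × gradient_accumulation_steps × number of GPUs`. The short sketch below computes it from the file; the GPU count of 4 is only an assumption for illustration.
```python
import json

# Read the DeepSpeed config added in this example.
with open("examples/s2s/conf/ds_config.json") as f:
    ds_cfg = json.load(f)

world_size = 4  # assumption: set this to the number of GPUs you actually launch with
effective_batch = (
    ds_cfg["train_micro_batch_size_per_gpu"]
    * ds_cfg["gradient_accumulation_steps"]
    * world_size
)
print(f"Effective global batch size: {effective_batch}")  # 4 * 1 * 4 = 16 in this illustration
```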
3 changes: 3 additions & 0 deletions examples/s2s/conf/prompt.yaml
@@ -0,0 +1,3 @@
dataset_config:
    # we put the prompt here because the hydra override in the shell script only supports a small subset of characters
    prompt: "Conduct a spoken conversation with the user. "
2 changes: 2 additions & 0 deletions examples/s2s/conf/prompt_asr.yaml
@@ -0,0 +1,2 @@
dataset_config:
    prompt: "Transcribe the provided audio into accurate text. "
4 changes: 4 additions & 0 deletions examples/s2s/conf/prompt_tts.yaml
@@ -0,0 +1,4 @@
dataset_config:
    # we put the prompt here because the hydra override in the shell script only supports a small subset of characters
    # prompt: "Transcribe speech to text. Output the transcription directly without redundant content. Ensure that the output is not duplicated. "
    prompt: "Generate a natural and expressive spoken version of the given text. "
47 changes: 47 additions & 0 deletions examples/s2s/deepspeed_finetune_s2s.py
@@ -0,0 +1,47 @@
from slam_llm.pipeline.finetune_deepspeed import main as train
from slam_llm.utils.deepspeed_utils import deepspeed_main_wrapper

import logging
from dataclasses import dataclass, field
from omegaconf import DictConfig, ListConfig, OmegaConf
from s2s_config import ModelConfig, TrainConfig, DataConfig, LogConfig


@dataclass
class RunConfig:
    dataset_config: DataConfig = field(default_factory=DataConfig)
    model_config: ModelConfig = field(default_factory=ModelConfig)
    train_config: TrainConfig = field(default_factory=TrainConfig)
    log_config: LogConfig = field(default_factory=LogConfig)
    debug: bool = field(default=False, metadata={"help": "Use pdb when true"})
    metric: str = field(default="acc", metadata={"help": "The metric for evaluation"})
    deepspeed_config: str = field(default="examples/asr_librispeech/conf/ds_config.json", metadata={"help": "Path to the DeepSpeed config JSON file"})


@deepspeed_main_wrapper(config_name=None, version_base=None)
def main_hydra(cfg: DictConfig):
    # Merge CLI/Hydra overrides on top of the dataclass defaults.
    run_config = RunConfig()
    cfg = OmegaConf.merge(run_config, cfg)

    def to_plain_list(cfg_item):
        if isinstance(cfg_item, ListConfig):
            return OmegaConf.to_container(cfg_item, resolve=True)
        elif isinstance(cfg_item, DictConfig):
            return {k: to_plain_list(v) for k, v in cfg_item.items()}
        else:
            return cfg_item

    # kwargs = to_plain_list(cfg)
    kwargs = cfg
    log_level = getattr(logging, kwargs.get("log_level", "INFO").upper())

    logging.basicConfig(level=log_level)

    if kwargs.get("debug", False):
        import pdb
        pdb.set_trace()

    train(kwargs)


if __name__ == "__main__":
    main_hydra()

0 comments on commit 43b5293
