Ultra-low-bitrate Speech Codec for Speech Language Modeling Applications

SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model based Text-To-Speech Synthesis

Haohan Guo, Fenglong Xie, Kun Xie, Dongchao Yang, Dake Guo, Xixin Wu, Helen Meng

This repository contains the inference scripts for SoCodec, an ultra-low-bitrate speech codec designed for speech language models. It was introduced in the paper "SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model based Text-To-Speech Synthesis".

Paper
📈 Demo Site
Model Weights

👉 With SoCodec, you can compress audio into discrete codes at an ultra-low bitrate of 0.47 kbps with a 120 ms frame shift, which keeps token sequences short.
👌 It can serve as a drop-in replacement for EnCodec or other multi-stream codecs in speech language modeling applications.
📚 The released checkpoint currently supports Chinese only. A multilingual version is being trained.

News

  • Sep 2024 (v1.0):
    • We have released the checkpoint and inference code for SoCodec.

Installation

Clone the repository and download the pretrained checkpoints into ckpts/:

git clone https://github.com/hhguo/SoCodec
cd SoCodec
mkdir ckpts && cd ckpts
wget https://huggingface.co/TencentGameMate/chinese-hubert-large/resolve/main/chinese-hubert-large-fairseq-ckpt.pt
wget https://huggingface.co/hhguo/SoCodec/resolve/main/socodec_16384x4_120ms_16khz_chinese.safetensors
wget https://huggingface.co/hhguo/SoCodec/resolve/main/mel_vocoder_80dim_10ms_16khz.safetensors
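
After the downloads finish, it can help to confirm that all three files actually landed in ckpts/ before running inference. The following is a minimal sketch, not part of the repository: the function name `missing_checkpoints` is hypothetical, the file names are copied from the wget commands above, and it assumes the script is run from the repository root.

```python
from pathlib import Path

# File names copied from the wget commands in the installation steps.
EXPECTED_CHECKPOINTS = (
    "chinese-hubert-large-fairseq-ckpt.pt",
    "socodec_16384x4_120ms_16khz_chinese.safetensors",
    "mel_vocoder_80dim_10ms_16khz.safetensors",
)


def missing_checkpoints(ckpt_dir="ckpts"):
    """Return the expected checkpoint files that are absent or empty in ckpt_dir."""
    root = Path(ckpt_dir)
    missing = []
    for name in EXPECTED_CHECKPOINTS:
        path = root / name
        # An empty file usually indicates an interrupted download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing or empty checkpoints:", ", ".join(missing))
    else:
        print("All checkpoints found.")
```

This only checks presence and non-zero size. It does not validate the contents of the files.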

Usage

# For analysis-synthesis
python example.py -i ground_truth.wav -o synthesis.wav
# For speech analysis
python example.py -i ground_truth.wav -o features.pt
# For token-to-audio synthesis
python example.py -i features.pt -o synthesis.wav
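
The three invocations above differ only in the extensions of the input and output files. For batch scripts, a small wrapper can validate a file pair and build the matching command line. This is a sketch under assumptions: `infer_mode` and `build_command` are hypothetical helpers, and whether example.py itself dispatches on file extensions in this way is not confirmed by this README.

```python
from pathlib import Path

# Extension pairs taken from the three documented invocations.
_MODES = {
    (".wav", ".wav"): "analysis-synthesis",
    (".wav", ".pt"): "analysis",
    (".pt", ".wav"): "synthesis",
}


def infer_mode(input_path, output_path):
    """Map the (input, output) extensions to one of the three documented modes."""
    src = Path(input_path).suffix.lower()
    dst = Path(output_path).suffix.lower()
    try:
        return _MODES[(src, dst)]
    except KeyError:
        raise ValueError(
            f"unsupported combination: {src or '<none>'} -> {dst or '<none>'}"
        ) from None


def build_command(input_path, output_path):
    """Return the argv list for example.py after validating the file pair."""
    infer_mode(input_path, output_path)  # raises ValueError on an unknown pair
    return ["python", "example.py", "-i", str(input_path), "-o", str(output_path)]


if __name__ == "__main__":
    print(infer_mode("ground_truth.wav", "features.pt"))
    print(build_command("features.pt", "synthesis.wav"))
```

The returned list can be passed to `subprocess.run` from the repository root.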

Pretrained Models

We provide the pretrained models on Hugging Face Collections.

| Model Name | Frame Shift | Codebook Size | Number of Streams | Dataset |
| --- | --- | --- | --- | --- |
| socodec_16384x4_120ms_16khz_chinese | 120 ms | 16384 | 4 | WenetSpeech4TTS |
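
The 0.47 kbps figure can be reproduced from the values in this table. The sketch below uses the standard bitrate formula for a multi-stream codec and assumes each stream emits exactly one token per frame. The helper name `codec_stats` is hypothetical and not part of the repository.

```python
import math


def codec_stats(codebook_size, num_streams, frame_shift_ms):
    """Compute token rate and bitrate for a multi-stream discrete codec.

    Assumes every stream emits exactly one token per frame.
    """
    bits_per_token = math.log2(codebook_size)
    frames_per_second = 1000.0 / frame_shift_ms
    tokens_per_second = frames_per_second * num_streams
    bitrate_kbps = tokens_per_second * bits_per_token / 1000.0
    return {
        "bits_per_token": bits_per_token,
        "frames_per_second": frames_per_second,
        "tokens_per_second": tokens_per_second,
        "bitrate_kbps": bitrate_kbps,
    }


stats = codec_stats(codebook_size=16384, num_streams=4, frame_shift_ms=120)
print(f"{stats['bits_per_token']:.0f} bits per token")
print(f"{stats['tokens_per_second']:.1f} tokens per second")
print(f"{stats['bitrate_kbps']:.2f} kbps")
```

With a codebook of 16384 entries each token carries 14 bits, and a 120 ms frame shift gives about 8.3 frames per second, so four streams yield roughly 467 bits per second.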

We also provide a pretrained vocoder that converts the Mel spectrogram produced by SoCodec into a waveform.

| Model Name | Sample Rate | Mel Bins | fmax (Hz) | Upsampling Ratio | Dataset |
| --- | --- | --- | --- | --- | --- |
| mel_vocoder_80dim_10ms_16khz | 16 kHz | 80 | 8000 | 160 | WenetSpeech4TTS |
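
The vocoder's upsampling ratio and sample rate determine its Mel frame shift, which matches the "10ms" in the model name. The sketch below works through that arithmetic. The helper names are hypothetical, and the number of Mel frames per codec frame is an inference from the two frame shifts, not a documented property of the decoder.

```python
def hop_ms(sample_rate_hz, upsampling_ratio):
    """Duration in milliseconds of one Mel frame, given samples generated per frame."""
    return 1000.0 * upsampling_ratio / sample_rate_hz


def mel_frames_per_codec_frame(codec_frame_shift_ms, sample_rate_hz, upsampling_ratio):
    """How many Mel frames span one codec frame (inferred, not documented)."""
    return codec_frame_shift_ms / hop_ms(sample_rate_hz, upsampling_ratio)


print(hop_ms(16000, 160))                           # Mel frame shift in ms
print(mel_frames_per_codec_frame(120, 16000, 160))  # Mel frames per 120 ms codec frame
```

At 16 kHz, 160 samples per frame correspond to a 10 ms Mel frame shift, so one 120 ms codec frame spans 12 Mel frames. The fmax of 8000 Hz equals the Nyquist frequency at this sample rate.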

TODO

  • Provide the checkpoint and inference code of multi-stream LLM
  • Provide the single-codebook version
  • Provide a higher-quality neural vocoder
  • Provide a multi-lingual version (Chinese, English, etc.)
