Typhoon2-Audio



The repository of Typhoon2-Audio, a speech/audio-language model that supports speech input and speech output. It is built on the Typhoon2 LLM and optimized for Thai and English.

Usage

Requirements

  • torch==2.3.0
  • transformers==4.45.2
  • fairseq==0.12.2
  • flash-attn

Installation

# Python 3.10
pip install pip==24.0
pip install torch==2.3.0
pip install transformers==4.45.2
pip install fairseq==0.12.2 # fairseq requires pip==24.0 and installs only on Python 3.10
pip install flash-attn==2.5.9.post1
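
A quick way to verify the environment matches the pinned versions above (this snippet is an illustrative sketch, not part of the original instructions):

import torch
import transformers
import fairseq

print("torch:", torch.__version__)               # expected: 2.3.0
print("transformers:", transformers.__version__) # expected: 4.45.2
print("fairseq:", fairseq.__version__)           # expected: 0.12.2
print("CUDA available:", torch.cuda.is_available())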

Load Model

import torch
from transformers import AutoModel
model = AutoModel.from_pretrained(
    "scb10x/llama3.1-typhoon2-audio-8b-instruct",
    torch_dtype=torch.float16, 
    trust_remote_code=True
)
model.to("cuda")

Inference - Single turn example

Note: the audio file (audio_url) must be sampled at 16,000 Hz.
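
If your audio file uses a different sampling rate, resample it to 16 kHz first. A minimal sketch, assuming librosa and soundfile are available (the file paths are placeholders):

import librosa
import soundfile as sf

# load the audio and resample it to 16 kHz in one step
audio, sr = librosa.load("examples/input.wav", sr=16000)
sf.write("examples/input_16k.wav", audio, sr)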

conversation = [
    {"role": "system", "content": "You are a helpful female assistant named ไต้ฝุ่น."},
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "audio_url": "examples/tmp-2860cd0a094b64043226167340af03a3.wav",
            },
            {"type": "text", "text": "Transcribe this audio"},
        ],
    },
]
x = model.generate(
    conversation=conversation,
    max_new_tokens=500,
    do_sample=True,
    num_beams=1,
    top_p=0.9,
    repetition_penalty=1.0,
    length_penalty=1.0,
    temperature=0.7,
)
# x => x['text'] (text), x['audio'] (numpy array)
# to save the audio output
# import soundfile as sf
# sf.write("examples/speechout.wav", x["audio"]["array"], x["audio"]["sampling_rate"])

Inference - Multi turn example

Note: the audio file (audio_url) must be sampled at 16,000 Hz (see the resampling sketch above).

conversation_multi_turn = [
    {
        "role": "system",
        "content": "You are a helpful female assistant named ไต้ฝุ่น. Respond conversationally to the speech provided in the language it is spoken in.",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "audio_url": "examples/tmp-2860cd0a094b64043226167340af03a3.wav",
                # บอกชื่อเมืองใหญ่ๆในอเมริกามาให้หน่อยสิ -- "List some names of US cities"
            },
            {
                "type": "text",
                "text": "",
            },
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "โอเคค่ะ, ฉันจะบอกชื่อเมืองใหญ่ๆ ในอเมริกาให้คุณฟัง:\n\n1. นิวยอร์ก\n2. ลอสแอนเจลิส\n3. ชิคาโก\n4. ฮิวสตัน\n5. ฟิลาเดลเฟีย\n6. บอสตัน\n7. ซานฟรานซิสโก\n8. วอชิงตัน ดี.ซี. (Washington D.C.)\n9. แอตแลนต้า\n10. ซีแอตเทิล\n\nถ้าคุณต้องการข้อมูลเพิ่มเติมหรือมีคำถามอื่นๆ กรุณาถามได้เลยค่ะ'",
            },
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "audio_url": "examples/tmp-2284cd76e1c875525ff75327a2fc3610.wav",
                # แล้วถ้าเป็นประเทศอังกฤษล่ะ -- "How about the UK"

            },
        ],
    },
]
x = model.generate(conversation=conversation_multi_turn)
# x => x['text'] (text), x['audio'] (numpy array)
# to save the audio output
# import soundfile as sf
# sf.write("examples/speechout.wav", x["audio"]["array"], x["audio"]["sampling_rate"])

TTS functionality

y = model.synthesize_speech("Hello, my name is ไต้ฝุ่น I am a language model specialized in Thai")
# y => numpy array
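
To save the synthesized audio, a sketch along the lines of the earlier examples; the output sampling rate here is an assumption, since this README does not state it -- check the model card for the actual value:

import soundfile as sf

SAMPLING_RATE = 16000  # assumed output rate, not confirmed by this README
sf.write("examples/tts_out.wav", y, SAMPLING_RATE)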

To run a demo

Install the additional packages required to host the demo, then launch it:

pip install gradio_webrtc==0.0.27
pip install twilio==9.4.1
pip install onnxruntime-gpu==1.20.1
python demo.py

Note: a GPU is required to run the demo.

To Do

  • Merge LoRA weights
  • Integrate Encoder + LLM + Generator + Vocoder
  • Local build to upload to HF
  • Implement Typhoon2AudioForConditionalGeneration and Typhoon2Audio2AudioForConditionalGeneration
  • Test loading normal and auto class
  • Implement .forward() for Typhoon2AudioForConditionalGeneration
  • Implement .generate() for Typhoon2AudioForConditionalGeneration
  • Implement .forward() for Typhoon2Audio2AudioForConditionalGeneration
  • Implement .generate() for Typhoon2Audio2AudioForConditionalGeneration
  • Allow streaming for .generate() for Typhoon2AudioForConditionalGeneration
  • Allow streaming for .generate_stream() for Typhoon2Audio2AudioForConditionalGeneration
  • Allow multi-turn for .generate() for Typhoon2AudioForConditionalGeneration
  • Allow multi-turn for .generate() for Typhoon2Audio2AudioForConditionalGeneration
  • Add TTS functionality to Typhoon2Audio2AudioForConditionalGeneration
  • Move prompt pattern to Qwen2-Audio input style: vllm-project/vllm#9248
  • Write doc & method string
  • Allow flash_attention for LLM
  • Allow device_map="auto"
  • Make the code self-contained (LLM)
  • Make the code self-contained (Vocoder) -- done, but still requires importing fairseq

Build a model locally

Run the build script:

python local_build.py

Acknowledgements

We are grateful to the previous open-source projects that provided useful resources for the development of Typhoon2-Audio.

Citation

Typhoon 2 Technical Report:

@misc{typhoon2,
      title={Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models}, 
      author={Kunat Pipatanakul and Potsawee Manakul and Natapong Nitarach and Warit Sirichotedumrong and Surapon Nonesung and Teetouch Jaknamon and Parinthapat Pengpun and Pittawat Taveekitworachai and Adisai Na-Thalang and Sittipong Sripaisarnmongkol and Krisanapong Jirayoot and Kasima Tharnpipitchai},
      year={2024},
      eprint={2412.13702},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13702}, 
}

The first Typhoon-Audio work, focusing on improved understanding and instruction following as well as Thai performance:

@article{manakul2024enhancing,
  title={Enhancing low-resource language and instruction following capabilities of audio language models},
  author={Manakul, Potsawee and Sun, Guangzhi and Sirichotedumrong, Warit and Tharnpipitchai, Kasima and Pipatanakul, Kunat},
  journal={arXiv preprint arXiv:2409.10999},
  year={2024}
}
