Typhoon2-Audio



The repository of Typhoon2-Audio, a speech/audio-language model that supports speech input and speech output. It is built on the Typhoon2 LLM and optimized for Thai and English.

Usage

Requirements

  • torch==2.3.0
  • transformers==4.45.2
  • fairseq==0.12.2
  • flash-attn

Installation

# Python 3.10
pip install pip==24.0
pip install torch==2.3.0
pip install transformers==4.45.2
pip install fairseq==0.12.2 # fairseq requires pip==24.0 and installs only on Python 3.10
pip install flash-attn==2.5.9.post1
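
A quick way to verify the environment matches the pinned versions above (this snippet is an illustrative sketch, not part of the original instructions):

import torch
import transformers
import fairseq

print("torch:", torch.__version__)               # expected: 2.3.0
print("transformers:", transformers.__version__) # expected: 4.45.2
print("fairseq:", fairseq.__version__)           # expected: 0.12.2
print("CUDA available:", torch.cuda.is_available())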

Load Model

import torch
from transformers import AutoModel
model = AutoModel.from_pretrained(
    "scb10x/llama3.1-typhoon2-audio-8b-instruct",
    torch_dtype=torch.float16, 
    trust_remote_code=True
)
model.to("cuda")

Inference - Single turn example

Note: the audio file (audio_url) must be sampled at 16,000 Hz.
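
If your audio file uses a different sampling rate, resample it to 16 kHz first. A minimal sketch, assuming librosa and soundfile are available (the file paths are placeholders):

import librosa
import soundfile as sf

# load the audio and resample it to 16 kHz in one step
audio, sr = librosa.load("examples/input.wav", sr=16000)
sf.write("examples/input_16k.wav", audio, sr)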

conversation = [
    {"role": "system", "content": "You are a helpful female assistant named ไต้ฝุ่น."},
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "audio_url": "examples/tmp-2860cd0a094b64043226167340af03a3.wav",
            },
            {"type": "text", "text": "Transcribe this audio"},
        ],
    },
]
x = model.generate(
    conversation=conversation,
    max_new_tokens=500,
    do_sample=True,
    num_beams=1,
    top_p=0.9,
    repetition_penalty=1.0,
    length_penalty=1.0,
    temperature=0.7,
)
# x => x['text'] (text), x['audio'] (numpy array)
# to save the audio output
# import soundfile as sf
# sf.write("examples/speechout.wav", x["audio"]["array"], x["audio"]["sampling_rate"])

Inference - Multi turn example

Note: the audio file (audio_url) must be sampled at 16,000 Hz (see the resampling sketch above).

conversation_multi_turn = [
    {
        "role": "system",
        "content": "You are a helpful female assistant named ไต้ฝุ่น. Respond conversationally to the speech provided in the language it is spoken in.",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "audio_url": "examples/tmp-2860cd0a094b64043226167340af03a3.wav",
                # บอกชื่อเมืองใหญ่ๆในอเมริกามาให้หน่อยสิ -- "List some names of US cities"
            },
            {
                "type": "text",
                "text": "",
            },
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "โอเคค่ะ, ฉันจะบอกชื่อเมืองใหญ่ๆ ในอเมริกาให้คุณฟัง:\n\n1. นิวยอร์ก\n2. ลอสแอนเจลิส\n3. ชิคาโก\n4. ฮิวสตัน\n5. ฟิลาเดลเฟีย\n6. บอสตัน\n7. ซานฟรานซิสโก\n8. วอชิงตัน ดี.ซี. (Washington D.C.)\n9. แอตแลนต้า\n10. ซีแอตเทิล\n\nถ้าคุณต้องการข้อมูลเพิ่มเติมหรือมีคำถามอื่นๆ กรุณาถามได้เลยค่ะ'",
            },
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "audio_url": "examples/tmp-2284cd76e1c875525ff75327a2fc3610.wav",
                # แล้วถ้าเป็นประเทศอังกฤษล่ะ -- "How about the UK"

            },
        ],
    },
]
x = model.generate(conversation=conversation_multi_turn)
# x => x['text'] (text), x['audio'] (numpy array)
# to save the audio output
# import soundfile as sf
# sf.write("examples/speechout.wav", x["audio"]["array"], x["audio"]["sampling_rate"])

TTS functionality

y = model.synthesize_speech("Hello, my name is ไต้ฝุ่น I am a language model specialized in Thai")
# y => numpy array
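
To save the synthesized audio, a sketch along the lines of the earlier examples; the output sampling rate here is an assumption, since this README does not state it -- check the model card for the actual value:

import soundfile as sf

SAMPLING_RATE = 16000  # assumed output rate, not confirmed by this README
sf.write("examples/tts_out.wav", y, SAMPLING_RATE)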

To run a demo

Install the additional packages required to host the demo, then launch it:

pip install gradio_webrtc==0.0.27
pip install twilio==9.4.1
pip install onnxruntime-gpu==1.20.1
python demo.py

Note: a GPU is required to run the demo.

To Do

  • Merge LoRA weights
  • Integrate Encoder + LLM + Generator + Vocoder
  • Local build to upload to HF
  • Implement Typhoon2AudioForConditionalGeneration and Typhoon2Audio2AudioForConditionalGeneration
  • Test loading normal and auto class
  • Implement .forward() for Typhoon2AudioForConditionalGeneration
  • Implement .generate() for Typhoon2AudioForConditionalGeneration
  • Implement .forward() for Typhoon2Audio2AudioForConditionalGeneration
  • Implement .generate() for Typhoon2Audio2AudioForConditionalGeneration
  • Allow streaming for .generate() for Typhoon2AudioForConditionalGeneration
  • Allow streaming for .generate_stream() for Typhoon2Audio2AudioForConditionalGeneration
  • Allow multi-turn for .generate() for Typhoon2AudioForConditionalGeneration
  • Allow multi-turn for .generate() for Typhoon2Audio2AudioForConditionalGeneration
  • Add TTS functionality to Typhoon2Audio2AudioForConditionalGeneration
  • Move prompt pattern to Qwen2-Audio input style: vllm-project/vllm#9248
  • Write doc & method string
  • Allow flash_attention for LLM
  • Allow device_map="auto"
  • Make the code self-contained (LLM)
  • Make the code self-contained (Vocoder) -- done, but still requires importing fairseq

Build a model locally

Run the build script:

python local_build.py

Acknowledgements

We are grateful to the previous open-source projects that provided useful resources for the development of Typhoon2-Audio.

Citation

Typhoon 2 Technical Report:

@misc{typhoon2,
      title={Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models}, 
      author={Kunat Pipatanakul and Potsawee Manakul and Natapong Nitarach and Warit Sirichotedumrong and Surapon Nonesung and Teetouch Jaknamon and Parinthapat Pengpun and Pittawat Taveekitworachai and Adisai Na-Thalang and Sittipong Sripaisarnmongkol and Krisanapong Jirayoot and Kasima Tharnpipitchai},
      year={2024},
      eprint={2412.13702},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13702}, 
}

The first Typhoon-Audio work, focusing on improved understanding and instruction following as well as Thai performance:

@article{manakul2024enhancing,
  title={Enhancing low-resource language and instruction following capabilities of audio language models},
  author={Manakul, Potsawee and Sun, Guangzhi and Sirichotedumrong, Warit and Tharnpipitchai, Kasima and Pipatanakul, Kunat},
  journal={arXiv preprint arXiv:2409.10999},
  year={2024}
}
