Table of Contents
  1. About The Project
  2. Getting Started
  3. How it works
  4. Contributing
  5. Acknowledgments
  6. Contact

About The Project

SMIT is a versatile tool designed to streamline the integration of audio modality into your LLMs. Currently, SMIT exclusively supports audio as a new modality. However, our goal is to expand its capabilities to accommodate any new modality seamlessly. We welcome contributions from the open-source community to help us achieve this aim.

(back to top)

Getting Started

Welcome to SMIT! Follow these simple steps to get started:

Begin by cloning the SMIT repository to your local machine using Git:

git clone https://github.com/Thytu/SMIT/
cd SMIT

We highly recommend using a virtual environment to manage dependencies and prevent conflicts. Create and activate a virtual environment using your preferred tool (e.g., virtualenv, conda):

# Example using virtualenv
virtualenv venv
source venv/bin/activate

Once inside the project directory and your virtual environment is activated, install the required dependencies listed in requirements.txt using pip:

pip install -r requirements.txt

Run the Example

You can quickly run the default example provided in SMIT by executing the following command:

python src/main.py

This will train the amazing abacaj/phi-2-super model to do ASR using the librispeech_asr dataset and facebook/hubert-large-ls960-ft as the speech encoder, reproducing the Thytu/phi-2-audio-super model.

Important

Make sure you have at least 30GB of available VRAM before running this command. If you have >=80GB of VRAM, it's recommended to deactivate quantization (decreasing the batch size to compensate for the extra memory use), which speeds up training. You can achieve this by running:

python src/main.py ~model.decoder.quantization_config ++training.training_args.per_device_train_batch_size=1

Customize Your Model

To customize your own Language Model (LLM), create a configuration file. You can use the provided config file template as a starting point. Then, use Hydra syntax to provide your configuration file:

python src/main.py model=my_config

Hydra offers extensive options for parameter overriding, allowing you to tailor the model according to your specific requirements. Refer to Hydra documentation for more details on customization options.
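As a sketch, a custom config might look like the following. The field names below are illustrative, not SMIT's actual schema — refer to the config file template in the repository for the real keys:

```yaml
# my_config.yaml — hypothetical example; mirror the repo's template config
decoder:
  name: abacaj/phi-2-super
  quantization_config:
    load_in_4bit: true
encoder:
  name: facebook/hubert-large-ls960-ft
```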

Inference

Once your model is trained, you can effortlessly load it for inference:

model = SMIT.from_pretrained("path_to_your_safetensor")

For inference tasks, you can utilize the generate method:

model.generate("Tell me how to add a modality to my model")

To employ the generate method with multiple modalities, follow this approach:

model.generate(
    prompt=[
        "Tell me how to add a modality to my model",
        "Transcribe this audio from speech to text {audio}",
    ],
    raw_speech=[None, your_audio],
)

Note

When providing multiple prompts, ensure that the length of raw_speech matches the length of prompt.
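The note above can be made concrete with a small validation helper. This is a hypothetical sketch of the check, not SMIT's actual implementation — the `{audio}` placeholder convention is taken from the example above:

```python
from typing import Optional, Sequence


def validate_multimodal_inputs(
    prompt: Sequence[str],
    raw_speech: Sequence[Optional[list]],
) -> None:
    """Check that each prompt has a matching raw_speech entry
    (None for text-only prompts, audio samples otherwise)."""
    if len(prompt) != len(raw_speech):
        raise ValueError(
            f"len(raw_speech) must match len(prompt): "
            f"{len(raw_speech)} != {len(prompt)}"
        )
    for i, (p, audio) in enumerate(zip(prompt, raw_speech)):
        if "{audio}" in p and audio is None:
            raise ValueError(
                f"prompt[{i}] references audio but raw_speech[{i}] is None"
            )


# Valid: one text-only prompt, one audio prompt with samples attached.
validate_multimodal_inputs(
    prompt=[
        "Tell me how to add a modality to my model",
        "Transcribe this audio from speech to text {audio}",
    ],
    raw_speech=[None, [0.0, 0.1, -0.1]],
)
```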

(back to top)

How it works

SMIT simplifies the process of enhancing your LLM with audio capabilities, following the principles outlined in this paper. It links a speech encoder to the decoder through a trainable linear projector, adding the audio modality to your LLM. SMIT automates the integration process, making it as easy as configuring a single file.

To use SMIT, simply define your desired configurations in the provided config file; SMIT will then handle the rest, seamlessly incorporating the audio modality into your models.
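The core idea — a trainable linear projector between the speech encoder and the decoder — can be sketched in a few lines. The dimensions below are illustrative (1024 matches HuBERT-large's hidden size; 2560 is the phi-2 embedding size), and this toy uses NumPy rather than SMIT's actual training code:

```python
import numpy as np

D_SPEECH, D_LLM = 1024, 2560  # encoder hidden size -> decoder embedding size

rng = np.random.default_rng(0)

# The trainable linear projector: its weights are learned during training,
# while the speech encoder and decoder can stay frozen.
W = rng.standard_normal((D_SPEECH, D_LLM)) * 0.02
b = np.zeros(D_LLM)


def project(speech_features: np.ndarray) -> np.ndarray:
    """Map (num_frames, D_SPEECH) encoder outputs to (num_frames, D_LLM)
    so they can be fed to the decoder alongside text-token embeddings."""
    return speech_features @ W + b


speech_features = rng.standard_normal((50, D_SPEECH))  # 50 encoder frames
projected = project(speech_features)
print(projected.shape)  # (50, 2560)
```

After projection, the audio frames live in the same embedding space as the decoder's text tokens, which is what lets a single linear layer bridge the two modalities.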


(back to top)

Contributing

There are multiple ways to contribute to this project, either regarding the UX (e.g. docs, or making the example faster) or regarding the core product itself (e.g. handling the vision modality). Any contributions you make are greatly appreciated; if you have a suggestion that would make this better, feel free to tell me :D You can also check the open issues for more things to improve.

Don't forget to give the project a star! 🌟 Thanks again!

(back to top)

Acknowledgments

This project draws significant inspiration from the An Embarrassingly Simple Approach for LLM with Strong ASR Capacity paper. I thank the authors for sharing their expertise. Huge thanks to the CoolKids for their help in debugging some pesky issues I ran into. And last but definitely not least, a massive thank you to Oursin – this project simply wouldn't exist without you!

(back to top)

Contact

Hey, I'm Valentin De Matos, passionate about AI and always working on some new side project.

You can reach me at vltn.dematos@gmail.com.

(back to top)