MIO: A Foundation Model on Multimodal Tokens

Environment Setup

cd MIO
conda create -n mio python=3.10
conda activate mio
pip install -r requirements.txt

tokenization_mio.py: (1) image/speech preprocessing, quantization, and decoding; (2) multimodal tokenization and detokenization; (3) applying the chat template.
utils.py: frame extraction from videos (both keyframe extraction and uniform frame extraction).
infer.py: inference script for MIO, with examples.
/image_tokenizer
/speech_tokenizer
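
For orientation, uniform frame extraction can be thought of as picking evenly spaced frame indices from a video. The sketch below is only an illustration under that assumption; the function name uniform_frame_indices and its arguments are hypothetical and are not taken from utils.py, whose actual implementation may differ.

```python
def uniform_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Return up to num_frames indices spread evenly over [0, total_frames).

    Hypothetical helper for illustration; not the API of utils.py.
    """
    if total_frames <= 0 or num_frames <= 0:
        return []
    if num_frames >= total_frames:
        # Fewer frames available than requested: keep every frame.
        return list(range(total_frames))
    # Split the video into num_frames equal segments and take the midpoint
    # of each one. Integer arithmetic avoids floating-point rounding.
    return [(2 * i + 1) * total_frames // (2 * num_frames) for i in range(num_frames)]


if __name__ == "__main__":
    print(uniform_frame_indices(100, 4))
```

Sampling segment midpoints rather than segment starts avoids always selecting frame 0, which is often a black or title frame.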

Read the TODOs and examples in infer.py to see how to run inference for each modality.

python infer.py

Set an appropriate generation config (a hyperparameter search is recommended).
Pay attention to the input data structures, formats, instructions, and prompt templates.
Tokenize the input data, and apply the chat template in the suitable mode (voice vs. std).
Generate the responses.
Detokenize the responses and save the results (detokenized_{modality}_{sample_id}_{image/speech_index}.{suffix}).
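
Two of the steps above can be sketched without touching the model: enumerating candidate generation configs for a hyperparameter search, and building the output filename. This is a minimal sketch under assumptions. The helper names, the parameter names (temperature, top_p), their values, and the jpg suffix are all illustrative and are not taken from infer.py; only the filename pattern comes from this README.

```python
from itertools import product


def generation_config_grid(search_space: dict[str, list]) -> list[dict]:
    """Expand candidate values into one config dict per combination.

    Hypothetical helper for illustration; not part of infer.py.
    """
    keys = list(search_space)
    value_lists = [search_space[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


def detokenized_filename(modality: str, sample_id: int, index: int, suffix: str) -> str:
    """Build a name following detokenized_{modality}_{sample_id}_{index}.{suffix}."""
    return f"detokenized_{modality}_{sample_id}_{index}.{suffix}"


if __name__ == "__main__":
    # Assumed search space; replace with the options infer.py actually uses.
    search_space = {"temperature": [0.7, 1.0], "top_p": [0.7, 0.9]}
    for config in generation_config_grid(search_space):
        print(config)
    # Assumed modality name and suffix for a generated image.
    print(detokenized_filename("image", 0, 1, "jpg"))
```

Each config in the grid would be passed to one inference run, and the outputs compared per modality before settling on a final setting.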
