Support huggingface AutoTokenizer #127
Hi 👋
Thank you very much. We are working on a multimodal project that requires a tokenizer covering both text and symbolic music. We therefore need to merge the BPE vocabulary produced by MidiTok into the LLaMA BPE vocabulary, or merge the LLaMA BPE vocabulary into MidiTok's. We are familiar with Hugging Face's tokenizers but not very familiar with MidiTok. Is there a quick way to achieve this today? P.S. This is the method we used previously: we first trained a SentencePiece model on the original tokens (REMI without BPE), and then used a script (https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/scripts/merge_tokenizer/merge_tokenizers.py) to merge it into the LLaMA vocabulary.
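For context, the core of that merge script is roughly the following protobuf-level merge (a minimal sketch, untested here; the file paths are placeholders):

```python
import sentencepiece as spm
import sentencepiece.sentencepiece_model_pb2 as sp_pb2

# Load both SentencePiece models (paths are placeholders)
llama_sp = spm.SentencePieceProcessor(model_file="llama/tokenizer.model")
music_sp = spm.SentencePieceProcessor(model_file="music_sp.model")

llama_proto = sp_pb2.ModelProto()
llama_proto.ParseFromString(llama_sp.serialized_model_proto())
music_proto = sp_pb2.ModelProto()
music_proto.ParseFromString(music_sp.serialized_model_proto())

# Append every music piece that is not already in the LLaMA vocabulary
existing_pieces = {p.piece for p in llama_proto.pieces}
for piece in music_proto.pieces:
    if piece.piece not in existing_pieces:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = piece.piece
        new_piece.score = 0.0
        llama_proto.pieces.append(new_piece)

# Write the merged model, to be loaded as a regular SentencePiece tokenizer
with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_proto.SerializeToString())
```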
I'm not totally familiar with merging tokenizers / vocabularies.
Initially, we considered separating text and music into two modalities, meaning their token IDs would not overlap. However, we later realized that different business scenarios require different ways of representing music tokens. Sometimes we need a tokenization method suitable for music generation (REMI), sometimes we need a representation that is human-readable (ABC), and sometimes we need an accurate representation of sheet music format (MusicXML). Designing a separate tokenizer for each task is impractical, so we decided to use a more flexible approach similar to text. That's why we allowed the tokens for text and music to overlap, considering them as different dialects of the same language.
That's probably because I don't have all the details, but I fail to see how a single music tokenizer would not be practical (on the contrary!). If you want to cover several music file formats at once (MIDI/ABC/MusicXML), I would suggest choosing one of them, converting files in the other formats to the chosen one, and using one music tokenizer for that format. Merging vocabularies is possible for text, as text from all languages is made of bytes that a text tokenizer can process. This is different for music, as the data is loaded and processed differently.
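As a rough sketch of what I mean (untested; music21 is just one possible converter and is not part of MidiTok, and the file paths are placeholders):

```python
from pathlib import Path

from music21 import converter
from miditok import REMI

# Convert an ABC (or MusicXML) file to MIDI, the chosen pivot format
score = converter.parse("tune.abc")
score.write("midi", fp="tune.mid")

# Tokenize the resulting MIDI with a single MidiTok tokenizer
tokenizer = REMI()
tokens = tokenizer(Path("tune.mid"))  # returns TokSequence(s), depending on the config
```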
The tokenizers of MidiTok are excellent, but most models today are built on Hugging Face. Would it be possible to make MidiTok compatible with AutoTokenizer, so that the many Hugging Face projects and existing code bases could use a MidiTok tokenizer without modifying their code?
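Something along these lines is what we have in mind (an untested sketch; it assumes recent MidiTok versions train their BPE model with the Hugging Face `tokenizers` library, and the `_model` attribute name and the `train()` call are assumptions that may differ between versions):

```python
from pathlib import Path

from miditok import REMI, TokenizerConfig
from transformers import AutoTokenizer, PreTrainedTokenizerFast

# Train a MidiTok BPE tokenizer on a folder of MIDI files (path is a placeholder)
music_tokenizer = REMI(TokenizerConfig())
music_tokenizer.train(vocab_size=20000, files_paths=list(Path("midis").glob("*.mid")))

# Wrap the inner `tokenizers.Tokenizer` so it can be saved in a format
# that AutoTokenizer understands (attribute name `_model` is assumed)
wrapped = PreTrainedTokenizerFast(tokenizer_object=music_tokenizer._model)
wrapped.save_pretrained("remi_hf_tokenizer")

# Existing Hugging Face code could then load it without changes
hf_tok = AutoTokenizer.from_pretrained("remi_hf_tokenizer")
```

This would only cover the BPE step (mapping music token strings to IDs); the MIDI-to-token conversion itself would still go through MidiTok.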