
Support huggingface AutoTokenizer #127

Closed

oiabtt opened this issue Jan 13, 2024 · 5 comments

oiabtt commented Jan 13, 2024

The tokenizer of MidiTok is excellent, but currently most models are based on Hugging Face. Is it possible to make MidiTok compatible with AutoTokenizer, so that the many Hugging Face projects and existing codebases could use the MidiTok tokenizer without modifying their code?

Natooz (Owner) commented Jan 13, 2024

Hi 👋
A direct implementation with the actual AutoTokenizer module from the Hugging Face packages isn't possible, unless the HF team does it within the package, but I really doubt that as they have little interest in doing so. 😄
However, you can replace it with `miditok.MIDITokenizer.from_pretrained()` to get the same result! A code change is still needed, but hopefully it isn't much.
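For illustration, a minimal sketch of the swap, assuming a MidiTok tokenizer was previously pushed to the Hugging Face Hub (the repo id and file path below are placeholders):

```python
from pathlib import Path
from miditok import MIDITokenizer

# Instead of: tokenizer = AutoTokenizer.from_pretrained("user/repo")
tokenizer = MIDITokenizer.from_pretrained("user/midi-tokenizer")  # placeholder repo id

# Tokenize a MIDI file (path is a placeholder). Depending on the MidiTok
# version and config, this returns a TokSequence or a list of them (one per track).
tokens = tokenizer(Path("path/to/file.mid"))
```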

oiabtt (Author) commented Jan 13, 2024

> Hi 👋 A direct implementation with the actual AutoTokenizer module from the Hugging Face packages isn't possible, unless the HF team does it within the package, but I really doubt that as they have little interest in doing so. 😄 However, you can replace it with `miditok.MIDITokenizer.from_pretrained()` to get the same result! A code change is still needed, but hopefully it isn't much.

Thank you very much. We are working on a multimodal project that requires a tokenizer covering both text and symbolic music. Therefore, we need to merge the BPE results from MidiTok into the LLaMA BPE vocabulary, or merge the LLaMA BPE vocabulary into the MidiTok BPE vocabulary. We are familiar with Hugging Face's tokenizers but not very familiar with MidiTok. Is there a quick way to achieve this?

P.S. This is the method we used previously: first, we trained a model with SentencePiece on the original tokens (REMI, no BPE), and then used a script (https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/scripts/merge_tokenizer/merge_tokenizers.py) to merge it into the LLaMA vocabulary.
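For context, the core of that merge approach boils down to appending the new SentencePiece pieces to the LLaMA model proto. A condensed, hedged sketch (model paths are placeholders; see the linked merge_tokenizers.py for the full version):

```python
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Load both SentencePiece model protos (paths are placeholders).
llama_spm, remi_spm = sp_pb2.ModelProto(), sp_pb2.ModelProto()
llama_spm.ParseFromString(open("llama/tokenizer.model", "rb").read())
remi_spm.ParseFromString(open("remi_sp/tokenizer.model", "rb").read())

# Append every REMI piece that is not already in the LLaMA vocabulary.
existing = {p.piece for p in llama_spm.pieces}
for p in remi_spm.pieces:
    if p.piece not in existing:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece, new_piece.score = p.piece, 0
        llama_spm.pieces.append(new_piece)

with open("merged/tokenizer.model", "wb") as f:
    f.write(llama_spm.SerializeToString())
```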

Natooz (Owner) commented Jan 13, 2024

I'm not totally familiar with the merging of tokenizers / vocabularies.
Before answering more specifically, how do you plan to use the merged tokenizer?
We are talking about two totally different modalities here, which require different preprocessing. Is it impossible to use the two tokenizers separately for the appropriate data type (text/MIDI)? The model could still be fed the token ids (integers) coming from both, as long as they do not "overlap" (i.e. text ids range from $0$ to $x_1$, and MIDI ids range from $x_1 + 1$ to $x_2$).
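A minimal sketch of that offset scheme, assuming a HF text tokenizer and a MidiTok REMI tokenizer (the model name and paths are placeholders):

```python
from pathlib import Path
from miditok import REMI
from transformers import AutoTokenizer

text_tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # placeholder
midi_tok = REMI()  # in practice, a configured/trained tokenizer

x1 = len(text_tok)  # text ids occupy [0, x1)

def encode_text(s: str) -> list[int]:
    return text_tok(s, add_special_tokens=False)["input_ids"]

def encode_midi(path: Path) -> list[int]:
    seq = midi_tok(path)  # may be a list of TokSequence (one per track)
    ids = seq.ids if not isinstance(seq, list) else seq[0].ids
    return [i + x1 for i in ids]  # MIDI ids shifted into [x1, x1 + len(midi_tok))

# The model's embedding matrix then needs len(text_tok) + len(midi_tok) rows.
```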

oiabtt (Author) commented Jan 13, 2024

> I'm not totally familiar with the merging of tokenizers / vocabularies. Before answering more specifically, how do you plan to use the merged tokenizer? We are talking about two totally different modalities here, which require different preprocessing. Is it impossible to use the two tokenizers separately for the appropriate data type (text/MIDI)? The model could still be fed the token ids (integers) coming from both, as long as they do not "overlap" (i.e. text ids range from $0$ to $x_1$, and MIDI ids range from $x_1 + 1$ to $x_2$).

Initially, we considered separating text and music into two modalities, meaning their token IDs would not overlap. However, we later realized that different business scenarios require different ways of representing music tokens. Sometimes we need a tokenization method suitable for music generation (REMI), sometimes we need a representation that is human-readable (ABC), and sometimes we need an accurate representation of sheet music format (MusicXML). Designing a separate tokenizer for each task is impractical, so we decided to use a more flexible approach similar to text. That's why we allowed the tokens for text and music to overlap, considering them as different dialects of the same language.

Natooz (Owner) commented Jan 13, 2024

That's probably because I don't have all the details, but I fail to see how a unique music tokenizer would not be practical (on the contrary!). If you want to cover several music file formats at once (MIDI/abc/MusicXML), I would suggest choosing one of them, converting the files in the other formats to the chosen one, and using a single music tokenizer for that format.
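For instance, a hedged sketch of such a conversion step using music21 (not part of MidiTok; the directory layout is a placeholder):

```python
from pathlib import Path
from music21 import converter

# Convert abc/MusicXML files to MIDI so a single MIDI tokenizer covers them all.
for src in Path("dataset").rglob("*"):
    if src.suffix.lower() in {".abc", ".xml", ".musicxml", ".mxl"}:
        score = converter.parse(src)  # music21 auto-detects the input format
        score.write("midi", src.with_suffix(".mid"))
```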

Merging vocabs is possible for text, as text from all languages is made of bytes that a text tokenizer can process. This is different with music, as the data is loaded and processed differently.
What you could do is build a separate set of bytes dedicated only to the music attributes (i.e. the music vocabulary: pitch, durations...). Then tokenize your music files, convert the ids to bytes (using the dedicated set of music bytes), and save them as text files that the HF tokenizer could load.
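A minimal sketch of that byte-mapping idea, giving each MIDI token id a dedicated character (the private-use-area offset and paths are assumptions on my side):

```python
from pathlib import Path
from miditok import REMI

midi_tok = REMI()
OFFSET = 0xE000  # Unicode private-use area: these "music bytes" never occur in text

def ids_to_music_text(ids: list[int]) -> str:
    # One dedicated character per MIDI token id.
    return "".join(chr(OFFSET + i) for i in ids)

out_dir = Path("music_as_text")
out_dir.mkdir(exist_ok=True)
for midi_path in Path("midis").glob("*.mid"):
    seq = midi_tok(midi_path)  # may be a list of TokSequence (one per track)
    ids = seq.ids if not isinstance(seq, list) else seq[0].ids
    # These text files can then be fed to a HF tokenizer trainer.
    (out_dir / f"{midi_path.stem}.txt").write_text(ids_to_music_text(ids))
```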
