
Support huggingface AutoTokenizer #127

Closed

oiabtt opened this issue Jan 13, 2024 · 5 comments

oiabtt commented Jan 13, 2024

The tokenizer of MidiTok is excellent, but currently most models are based on Hugging Face. Is it possible to make MidiTok compatible with AutoTokenizer, so that the many Hugging Face projects and existing codebases could use the MidiTok tokenizer without modifying their code?

Natooz (Owner) commented Jan 13, 2024

Hi 👋
A direct implementation with the actual AutoTokenizer module from the Hugging Face packages isn't possible, unless the HF team does it within the package, but I really doubt that as they have little interest in doing so. 😄
However, you can replace it with `miditok.MIDITokenizer.from_pretrained()` to get the same result! A code change is still needed, but hopefully it isn't much.
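For illustration, a minimal sketch of the swap, assuming a MidiTok tokenizer was previously pushed to the Hugging Face Hub (the repo id and file path below are placeholders):

```python
from pathlib import Path
from miditok import MIDITokenizer

# Instead of: tokenizer = AutoTokenizer.from_pretrained("user/repo")
tokenizer = MIDITokenizer.from_pretrained("user/midi-tokenizer")  # placeholder repo id

# Tokenize a MIDI file (path is a placeholder). Depending on the MidiTok
# version and config, this returns a TokSequence or a list of them (one per track).
tokens = tokenizer(Path("path/to/file.mid"))
```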

oiabtt (Author) commented Jan 13, 2024

> Hi 👋 A direct implementation with the actual AutoTokenizer module from the Hugging Face packages isn't possible, unless the HF team does it within the package, but I really doubt that as they have little interest in doing so. 😄 However, you can replace it with `miditok.MIDITokenizer.from_pretrained()` to get the same result! A code change is still needed, but hopefully it isn't much.

Thank you very much. We are working on a multimodal project that requires a tokenizer covering both text and symbolic music. Therefore, we need to merge the BPE results from MidiTok into the LLaMA BPE vocabulary, or merge the LLaMA BPE vocabulary into the MidiTok BPE vocabulary. We are familiar with Hugging Face's tokenizers but not very familiar with MidiTok. Is there a quick way to achieve this?

P.S. This is the method we used previously: first, we trained a model with SentencePiece on the original tokens (REMI, no BPE), and then used a script (https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/scripts/merge_tokenizer/merge_tokenizers.py) to merge it into the LLaMA vocabulary.
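For context, the core of that merge approach boils down to appending the new SentencePiece pieces to the LLaMA model proto. A condensed, hedged sketch (model paths are placeholders; see the linked merge_tokenizers.py for the full version):

```python
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Load both SentencePiece model protos (paths are placeholders).
llama_spm, remi_spm = sp_pb2.ModelProto(), sp_pb2.ModelProto()
llama_spm.ParseFromString(open("llama/tokenizer.model", "rb").read())
remi_spm.ParseFromString(open("remi_sp/tokenizer.model", "rb").read())

# Append every REMI piece that is not already in the LLaMA vocabulary.
existing = {p.piece for p in llama_spm.pieces}
for p in remi_spm.pieces:
    if p.piece not in existing:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece, new_piece.score = p.piece, 0
        llama_spm.pieces.append(new_piece)

with open("merged/tokenizer.model", "wb") as f:
    f.write(llama_spm.SerializeToString())
```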

Natooz (Owner) commented Jan 13, 2024

I'm not totally familiar with the merging of tokenizers / vocabularies.
Before answering more specifically, how do you plan to use the merged tokenizer?
We are talking about two totally different modalities here, which require different preprocessing. Is it impossible to use the two tokenizers separately for the appropriate data type (text/MIDI)? The model could still be fed the token ids (integers) coming from both, as long as they do not "overlap" (i.e. text ids range from $0$ to $x_1$, and MIDI ids range from $x_1 + 1$ to $x_2$).
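A minimal sketch of that offset scheme, assuming a HF text tokenizer and a MidiTok REMI tokenizer (the model name and paths are placeholders):

```python
from pathlib import Path
from miditok import REMI
from transformers import AutoTokenizer

text_tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # placeholder
midi_tok = REMI()  # in practice, a configured/trained tokenizer

x1 = len(text_tok)  # text ids occupy [0, x1)

def encode_text(s: str) -> list[int]:
    return text_tok(s, add_special_tokens=False)["input_ids"]

def encode_midi(path: Path) -> list[int]:
    seq = midi_tok(path)  # may be a list of TokSequence (one per track)
    ids = seq.ids if not isinstance(seq, list) else seq[0].ids
    return [i + x1 for i in ids]  # MIDI ids shifted into [x1, x1 + len(midi_tok))

# The model's embedding matrix then needs len(text_tok) + len(midi_tok) rows.
```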

oiabtt (Author) commented Jan 13, 2024

> I'm not totally familiar with the merging of tokenizers / vocabularies. Before answering more specifically, how do you plan to use the merged tokenizer? We are talking about two totally different modalities here, which require different preprocessing. Is it impossible to use the two tokenizers separately for the appropriate data type (text/MIDI)? The model could still be fed the token ids (integers) coming from both, as long as they do not "overlap" (i.e. text ids range from $0$ to $x_1$, and MIDI ids range from $x_1 + 1$ to $x_2$).

Initially, we considered separating text and music into two modalities, meaning their token IDs would not overlap. However, we later realized that different business scenarios require different ways of representing music tokens. Sometimes we need a tokenization method suitable for music generation (REMI), sometimes we need a representation that is human-readable (ABC), and sometimes we need an accurate representation of sheet music format (MusicXML). Designing a separate tokenizer for each task is impractical, so we decided to use a more flexible approach similar to text. That's why we allowed the tokens for text and music to overlap, considering them as different dialects of the same language.

Natooz (Owner) commented Jan 13, 2024

That's probably because I don't have all the details, but I fail to see how a unique music tokenizer would not be practical (on the contrary!). If you want to cover several music file formats at once (MIDI/abc/MusicXML), I would suggest choosing one of them, converting the files in the other formats to the chosen one, and using a single music tokenizer for that format.
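For instance, a hedged sketch of such a conversion step using music21 (not part of MidiTok; the directory layout is a placeholder):

```python
from pathlib import Path
from music21 import converter

# Convert abc/MusicXML files to MIDI so a single MIDI tokenizer covers them all.
for src in Path("dataset").rglob("*"):
    if src.suffix.lower() in {".abc", ".xml", ".musicxml", ".mxl"}:
        score = converter.parse(src)  # music21 auto-detects the input format
        score.write("midi", src.with_suffix(".mid"))
```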

Merging vocabs is possible for text, as text from all languages is made of bytes that a text tokenizer can process. This is different with music, as the data is loaded and processed differently.
What you could do is build a separate set of bytes dedicated only to the music attributes (i.e. the music vocabulary: pitch, durations...). Then tokenize your music files, convert the ids to bytes (using the dedicated set of music bytes), and save them as text files that the HF tokenizer could load.
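A minimal sketch of that byte-mapping idea, giving each MIDI token id a dedicated character (the private-use-area offset and paths are assumptions on my side):

```python
from pathlib import Path
from miditok import REMI

midi_tok = REMI()
OFFSET = 0xE000  # Unicode private-use area: these "music bytes" never occur in text

def ids_to_music_text(ids: list[int]) -> str:
    # One dedicated character per MIDI token id.
    return "".join(chr(OFFSET + i) for i in ids)

out_dir = Path("music_as_text")
out_dir.mkdir(exist_ok=True)
for midi_path in Path("midis").glob("*.mid"):
    seq = midi_tok(midi_path)  # may be a list of TokSequence (one per track)
    ids = seq.ids if not isinstance(seq, list) else seq[0].ids
    # These text files can then be fed to a HF tokenizer trainer.
    (out_dir / f"{midi_path.stem}.txt").write_text(ids_to_music_text(ids))
```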
