Multi voc tokenizers decoding #59
Hi, thanks for reporting this bug! 😃 I'll look into it in a few days when I have a computer.
I would greatly appreciate any help in solving this issue! The attached zip file detokenize_multivocab_error.zip contains a .ipynb file with examples of the errors for each multi-vocab tokenizer, as well as the respective params and the (generated) sequences that cause the errors.
Hi 👋, as promised I took a look around. It also seems that some of the calls in the notebook didn't send tokens in the expected format, but I must admit that this was not very well documented. Here is a new section that will be added to the documentation in the next commit, still in progress:
Depending on the tokenizer in use, the format of the tokens to give to and returned by the tokenizer differs. The format is deduced from the tokenizer's configuration. This results in four situations, where I is the number of tracks, T is the number of tokens (or time steps) and C is the number of subtokens per time step.
Note that for multitrack tokenizations the I dimension represents the tracks. Some tokenizer examples to illustrate:
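As a rough sketch of what those four situations can look like in practice (the variable names and literal ids below are made up purely for illustration, they are not taken from the documentation):

# Illustrative shapes only, the id values are arbitrary:
ids_single_voc_single_track = [12, 45, 7, 33]                       # shape (T,)
ids_multi_voc_single_track = [[12, 4, 1], [45, 4, 2], [7, 5, 2]]    # shape (T, C)
ids_single_voc_multi_track = [[12, 45, 7], [3, 9, 21]]              # shape (I, T)
ids_multi_voc_multi_track = [[[12, 4, 1], [45, 4, 2]],
                             [[3, 5, 1], [9, 5, 3]]]                # shape (I, T, C)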
Here is the code from the notebook with some fixes / tweaks, as the json files seem to not have been saved in the expected formats:

from pathlib import Path
from copy import deepcopy
import miditok
mumidi = miditok.MuMIDI(params='2/mumidi_params.json')
print('MuMIDI: ', mumidi)
octuplemono = miditok.OctupleMono(params='2/octuple_params.json')
print('Octuple Mono: ', octuplemono)
cpword = miditok.CPWord(params='2/cpword_params.json')
print('CP Word: ', cpword)
"""from miditoolkit import MidiFile
midi = MidiFile("tests/Multitrack_MIDIs/All The Small Things.mid")
toto = octuplemono.midi_to_tokens(midi)
octuplemono.save_tokens(toto, "toto.json")
data_octuplemono = octuplemono.load_tokens("toto.json")"""
data_mumidi = mumidi.load_tokens(path=Path('2/mumidi_error_detokenizing_sequence.json'))
print('MuMIDI: ')
print(data_mumidi['ids'][:5])
print()
data_octuplemono = octuplemono.load_tokens(path=Path('2/octuple_error_detokenizing_sequence.json'))
print('Octuple Mono: ')
print(data_octuplemono['ids'][:5])
print()
data_cpword = cpword.load_tokens(path=Path('2/cpword_error_detokenizing_sequence.json'))
print('CP Word: ')
print(data_cpword['ids'][:5])
print()
tokSeq_mumidi = miditok.TokSequence(ids=data_mumidi['ids'])
tokSeq_completed_mumidi = deepcopy(tokSeq_mumidi)
mumidi.complete_sequence(tokSeq_completed_mumidi)
tokSeq_octuplemono = [miditok.TokSequence(ids=data_octuplemono['ids'])]
tokSeq_completed_octuplemono = deepcopy(tokSeq_octuplemono)
for seq in tokSeq_completed_octuplemono:
    octuplemono.complete_sequence(seq)
data_octuplemono = [data_octuplemono['ids']]
tokSeq_cpword = [miditok.TokSequence(ids=data_cpword['ids'])]
tokSeq_completed_cpword = deepcopy(tokSeq_cpword)
for seq in tokSeq_completed_cpword:
    cpword.complete_sequence(seq)
data_cpword = [data_cpword['ids']]
# List of lists
mumidi.tokens_to_midi(data_mumidi['ids'])
octuplemono.tokens_to_midi(data_octuplemono)
cpword.tokens_to_midi(data_cpword)
# TokSequence
mumidi.tokens_to_midi(tokSeq_mumidi)
octuplemono.tokens_to_midi(tokSeq_octuplemono)
cpword.tokens_to_midi(tokSeq_cpword)
# Completed TokSequence
mumidi.tokens_to_midi(tokSeq_completed_mumidi)
octuplemono.tokens_to_midi(tokSeq_completed_octuplemono)
cpword.tokens_to_midi(tokSeq_completed_cpword)

I'm still working on it.
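As a side note, a small usage sketch, assuming these miditok versions return a miditoolkit MidiFile from tokens_to_midi (the output path is illustrative):

midi = mumidi.tokens_to_midi(tokSeq_completed_mumidi)
midi.dump('decoded_mumidi.mid')  # write the decoded MIDI to disk (assumed miditoolkit API)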
The last commit (089fa74) brings a few modifications. The code from my message above still produces errors.
For MuMIDI, we can set a default Program in case none is found for the first note. For CPWord, I guess you should check again how the tokens are generated in your code.
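To make the MuMIDI suggestion concrete, here is a minimal sketch of the kind of fallback meant above; it is not miditok's actual implementation, and the names are hypothetical:

DEFAULT_PROGRAM = 0  # MIDI program 0 = Acoustic Grand Piano

def resolve_program(detected_program=None, default=DEFAULT_PROGRAM):
    # Use the Program found in the token stream, otherwise fall back to the default
    # when the generated sequence starts with notes before any Program token.
    return detected_program if detected_program is not None else default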
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Hey!
I've encountered a problem and would like a second opinion on it. Note that this bug report concerns the earlier miditok version 2.0.6. EDIT: just checked, the problem persists on the newest version (2.1.1).
Right now I'm running experiments with a Transformer architecture and multi-voc tokenizers (CPWord, Octuple and MuMIDI). At the end of every training epoch, I call a PyTorch Lightning Callback to generate some samples. When I then call tokens_to_midi() I encounter numerous problems. These issues occurred while using the MuMIDI tokenizer:
(Note that, for processing the data, I add a PAD_None event to every time step that is shorter than the maximum length; a padding/filtering sketch is shown after the list below.)
1. If I transform the sequence to a TokSequence, run complete_sequence() and then input it to tokens_to_midi(), I get this issue:
2. If I transform the sequence to a list of lists, the error is the same as 0.
3. I wrote a function to filter out 0 after generation so that the sequence resembles the way the MuMIDI tokenizer outputs the tokens originally (different lengths depending on the token type).
3a. I transform such a list of lists to a TokSequence, run complete_sequence() and then input it to tokens_to_midi(), similar to 1. I get the same error (the error output is from the notebook though):
3b. I directly input the list of lists (with different lengths) to the tokens_to_midi() function, and I get the following error:
Edit: Here is the example_sequence.zip with the tokenizer parameters to get the error in detokenization.