When using REMI for tokenization with use_time_signatures=True, many duplicate measures can be encoded. #74
Comments
This problem usually occurs in MIDI files with a large number of measures. |
Thank you for the file! This was indeed a bug, and fixing it required modifying the time encoding / decoding quite a bit. Thanks again for spotting it! I'll include this MIDI file in the MIDI test cases (unless you would rather I didn't). |
Fix merged! Edit: I tested with additional files containing multiple time signature changes, and everything seems all right. Do you have other MIDI files you would like to be tested? In case you would like to do it yourself, you can use the test_one_track.py script in the tests. |
The MIDI tokens look normal now, but I'm a bit confused: the original MusicXML file has 124 measures, while the corresponding MIDI file has only 107 measures after REMI tokenization (in [6354774_Macabre Waltz]). Why is this happening? |
Are you using the latest commit on the main branch? |
I'm on the latest commit on the main branch, but
the result is 93 |
Yes, this is because of the use of rests. As some rests may last longer than a bar, some bars do not come with |
Yes, you're right. I got the result of 122 with use_rests=False. Strangely, there are 124 measures in the corresponding MusicXML, but the MIDI was obtained by converting an MXL file using MuseScore. Why isn't it matching here? |
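The measure counts quoted in this exchange can be reproduced by counting `Bar` tokens in a token list. Below is a minimal sketch, assuming the tokens are plain strings in the `Bar_None` form shown in this issue. `count_bars` is a hypothetical helper and not part of MidiTok, and the `Rest_4.0.8` token is only an illustrative stand-in for a rest longer than a bar.

```python
def count_bars(tokens):
    """Count tokens whose type is 'Bar' (for example 'Bar_None')."""
    return sum(1 for tok in tokens if tok.split("_", 1)[0] == "Bar")


# Illustrative token list (not taken from a real file): the long rest
# covers a whole measure, so that measure has no 'Bar' token of its own.
example = [
    "Bar_None", "TimeSig_4/4", "Position_0", "Pitch_60", "Velocity_79", "Duration_1.0.8",
    "Rest_4.0.8",
    "Bar_None", "TimeSig_4/4", "Position_0", "Pitch_62", "Velocity_79", "Duration_1.0.8",
]
print(count_bars(example))
```

With use_rests=True this count can be lower than the number of measures in the score, which matches the explanation given above.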
I don't know for sure, but maybe some rests were "cut" at the beginning or the end of the original musicXML file ? 🤷♂️ |
Maybe, I'll look for the reason again. At least now, the REMI tokenizer is working properly. |
Hi, I think I have found the reason. First, I made a mistake: when converting MusicXML to MIDI through MuseScore, the MIDI has two tracks, but I only took the first one, which resulted in fewer measures (as in 6354774_Macabre Waltz). Second, MuseScore merges adjacent notes, so there is a long note at the end that the REMI tokenizer does not split. As a result, the duration of some tokens is greater than the TimeSig, which also leads to the inconsistency in the number of measures.
debug
notes in last bar (one track)
REMI tokens in last bar
Compared with the MIDI, in the original MXL the chord at the end is divided into two parts in two measures. Moreover, REMI did not split it during tokenization.
other
When I tried to use the REMI detokenizer and save the result as MIDI, I opened the converted MIDI with MuseScore and found that it contained several empty measures (measures 7 and 14). |
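The first cause described above (keeping only one of the two tracks) can be illustrated with a small measure count over note end times. This is a sketch under stated assumptions: a single fixed time signature, notes given as hypothetical `(start_tick, duration_ticks)` pairs, and a made-up helper `measures_spanned` that is not part of MidiTok.

```python
import math


def measures_spanned(tracks, ticks_per_bar):
    """Number of bars needed to contain every note of every given track."""
    last_end = max(
        (start + duration for track in tracks for start, duration in track),
        default=0,
    )
    return math.ceil(last_end / ticks_per_bar)


ticks_per_bar = 4 * 480                  # 4/4 at 480 ticks per quarter note (assumed)
track_1 = [(0, 480), (1920, 960)]        # last note ends inside bar 2
track_2 = [(0, 480), (5760, 1920)]       # last note ends at the end of bar 4

print(measures_spanned([track_1], ticks_per_bar))           # first track only
print(measures_spanned([track_1, track_2], ticks_per_bar))  # both tracks
```

Counting over the first track alone gives fewer measures than counting over both, which is the mismatch reported here.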
I'm looking into it. |
Sorry, I missed this requirement. Now the output is normal after detokenization. |
Well, thank you: thanks to this MIDI I realized that decoding in multiple-stream mode with several identical programs would merge tracks. I'm working on a fix. |
Your file covered a case not yet covered, when using both time signatures and rests, that I managed to fix! Edit: if you ever encounter other problematic cases, please don't hesitate to share them; it is thanks to them that bugs can be found and fixed :) |
Great, thank you very much. Besides, when I tried to split notes in the tokenizer so that the duration of each measure would not exceed the time signature, I did get the correct number of measures. However, this also introduced a problem, namely that the notes at the measure boundaries were split into two, requiring additional markers or merging during detokenization to ensure consistency with the original. Moreover, I realized that only when it comes to musicxml would a correct number of measures be necessary, and others may not need this feature. |
I'm not sure I follow 100% what's being done exactly 🤔 |
What I mean is that if a note's duration spans two measures, then in MusicXML it is split into two notes connected with a tie to indicate that they are played as one. In the MIDI tokenizer, however, this note is placed in a single measure with a longer duration, such as 9.0.4 when the time signature is 8/4. This makes the total duration of the measure exceed the time signature and also reduces the number of measures produced by the MIDI tokenizer. To solve this, I need to split the notes that exceed the measure duration and put the excess part in the second measure. However, this introduces another problem: during detokenization, notes of this type need to be merged again. |
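The split-then-merge idea described above can be sketched in a few lines. This is only an illustration, not MidiTok code: `Note`, `split_at_barlines`, and `merge_tied` are hypothetical names, a single fixed time signature is assumed, and the merge step assumes the parts of a split note are adjacent in the list (as in a monophonic line).

```python
from dataclasses import dataclass


@dataclass
class Note:
    pitch: int
    start: int                  # onset in ticks
    duration: int               # length in ticks
    tied_to_next: bool = False  # marker: the note continues in the next part


def split_at_barlines(note, ticks_per_bar):
    """Cut a note at every barline it crosses, marking each cut with a tie."""
    parts = []
    start, end = note.start, note.start + note.duration
    while start < end:
        bar_end = (start // ticks_per_bar + 1) * ticks_per_bar
        seg_end = min(end, bar_end)
        parts.append(Note(note.pitch, start, seg_end - start, tied_to_next=seg_end < end))
        start = seg_end
    return parts


def merge_tied(notes):
    """Inverse step for detokenization: rejoin consecutive tied parts."""
    merged = []
    for note in notes:
        prev = merged[-1] if merged else None
        if (prev is not None and prev.tied_to_next
                and prev.pitch == note.pitch
                and prev.start + prev.duration == note.start):
            prev.duration += note.duration
            prev.tied_to_next = note.tied_to_next
        else:
            merged.append(Note(note.pitch, note.start, note.duration, note.tied_to_next))
    return merged


ticks_per_bar = 8 * 480                      # 8/4 at 480 ticks per quarter note (assumed)
long_note = Note(pitch=60, start=0, duration=9 * 480)  # 9 beats in an 8/4 bar
parts = split_at_barlines(long_note, ticks_per_bar)
print(parts)
print(merge_tied(parts))
```

The `tied_to_next` flag plays the role of the "additional marker" mentioned earlier in the thread: without it, the detokenizer could not tell a split note from two separate repeated notes.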
OK, thank you for clarifying; I'm not very familiar with MusicXML. More broadly, MusicXML support in MidiTok could be added one day, if there is enough demand or if someone is motivated to add it. It shouldn't be that hard: basically just adding a layer of MusicXML <--> MIDI conversion. |
Yes, what I tried was the conversion process audio -> MIDI file -> MIDI tokens -> MXL tokens -> MusicXML. Sometimes the number of notes in the MIDI file is too large, so it usually needs to be split into multiple parts (multiple measures in each part), which makes it necessary to align the number and position of notes between the MIDI (tokens) and the MXL (tokens). |
What do you mean by "MIDI tokens (MidiTok) --> MXL tokens"? Ultimately, this is something that could be worth integrating into MuseScore, if I follow correctly. |
Yes, your understanding is absolutely correct. I have already implemented this part, but the overall effectiveness is limited due to the scarcity of data. |
This issue is stale because it has been open for 30 days with no activity. |
miditok version
2.1.5
Problem summary
When using REMI for tokenization with use_time_signatures=True, many duplicate measures can be encoded.
Steps to reproduce
from miditok import REMI, TokenizerConfig

# Placeholder path: replace with the MIDI file that triggers the problem
midi_path = "path/to/file.mid"
config = TokenizerConfig(use_tempos=True, nb_tempos=240, tempo_range=(20, 259), use_rests=True, use_time_signatures=True, time_signature_range={8: (3,12), 4: (1,6)}, use_programs=False)
tokenizer = REMI(config)
tokens = tokenizer(midi_path)  # tokenize the MIDI file
print("MIDI tokens: \n", tokens)
Expected vs. actual behavior
MIDI tokens:
[TokSequence(tokens=[[..., 'Velocity_79', 'Duration_1.0.8', 'Bar_None', 'TimeSig_4/4', 'Bar_None', 'TimeSig_4/4', 'Bar_None', 'TimeSig_4/4', 'Bar_None', 'TimeSig_4/4', 'Bar_None', 'TimeSig_4/4', 'Bar_None', 'TimeSig_4/4', 'Bar_None', 'TimeSig_4/4', 'Position_0', 'Position_3', 'Pitch_37', 'Velocity_79', 'Duration_1.0.8', 'Position_7', 'Pitch_66', 'Velocity_79', 'Duration_2.0.8', 'Position_11', ...]
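Runs of repeated `Bar_None`, `TimeSig_4/4` pairs like the one in this output can be located automatically. A minimal sketch, assuming the tokens are available as plain strings; `bars_with_notes` and `longest_empty_run` are hypothetical helpers, not MidiTok functions. An empty bar may also be legitimate silence, so a long run is only a hint to compare against the source file.

```python
def bars_with_notes(tokens):
    """Return one flag per 'Bar' token: True if a 'Pitch' token follows in that bar."""
    flags = []
    for tok in tokens:
        kind = tok.split("_", 1)[0]
        if kind == "Bar":
            flags.append(False)
        elif kind == "Pitch" and flags:
            flags[-1] = True
    return flags


def longest_empty_run(tokens):
    """Length of the longest run of consecutive bars without any note."""
    longest = current = 0
    for has_note in bars_with_notes(tokens):
        current = 0 if has_note else current + 1
        longest = max(longest, current)
    return longest


# Excerpt shaped like the output reported in this issue
excerpt = (
    ["Velocity_79", "Duration_1.0.8"]
    + ["Bar_None", "TimeSig_4/4"] * 7
    + ["Position_0", "Position_3", "Pitch_37", "Velocity_79", "Duration_1.0.8"]
)
print(longest_empty_run(excerpt))
```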
debug
/usr/local/lib/python3.10/dist-packages/miditok/tokenizations/remi.py, line 101:
if previous_note_end > event.time, current_bar is not updated
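One way to avoid this kind of drift is to derive the bar index from the event time alone. The sketch below is not MidiTok's implementation: `bar_tokens_for_onsets` is a hypothetical function, a single fixed time signature is assumed, and the `Position_` values are raw tick offsets inside the bar rather than MidiTok's quantized positions.

```python
def bar_tokens_for_onsets(onsets, ticks_per_bar, time_sig="4/4"):
    """Emit one Bar/TimeSig pair per bar crossed, based only on each onset time."""
    tokens = []
    current_bar = -1  # no bar opened yet
    for tick in sorted(onsets):
        event_bar = tick // ticks_per_bar
        # Open every bar between the last opened one and the event's bar
        for _ in range(event_bar - current_bar):
            tokens += ["Bar_None", f"TimeSig_{time_sig}"]
        current_bar = event_bar  # always updated, whatever the previous note's end
        tokens.append(f"Position_{tick % ticks_per_bar}")
    return tokens


result = bar_tokens_for_onsets([0, 100, 2 * 1920 + 10], ticks_per_bar=1920)
print(result)
```

Because `current_bar` is updated for every event, the number of `Bar_None` tokens always equals the index of the last bar reached plus one, with no duplicates.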
Also, I found that the number of measures in the tokens obtained by MIDI tokenizer is less than expected. For example, I have a MusicXML file with 210 measures, which I converted to MIDI using MuseScore, and it should have retained information such as tempo and time signature. However, after using REMI tokenizer, I only got 201 measures in the tokens.