Releases: Natooz/MidiTok
v2.1.3 New tokenization workflow, speedups, time signature and PyTorch data loading module
This big update brings a few important changes and improvements.
A new common tokenization workflow for all tokenizers.
We now distinguish three types of tokens:
- Global MIDI tokens, which represent attributes and events affecting the music globally, such as the tempo or time signature;
- Track tokens, representing values of distinct tracks such as the notes, chords or effects;
- Time tokens, which serve to structure and place the previous categories of tokens in time.
All tokenizations now follow the pattern:
- Preprocess the MIDI;
- Gather global MIDI events (tempo...);
- Gather track events (notes, chords);
- If "one token stream", concatenate all global and track events and sort them by time of occurrence. Else, concatenate the global events to each sequence of track events;
- Deduce the time events for all the sequences of events (only one if "one token stream");
- Return the tokens, as a combination of a list of strings and a list of integers (token ids).
This considerably cleans up the code (DRY, fewer redundant methods) while bringing speedups, as the number of calls to sorting methods has been reduced.
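The workflow above can be sketched in plain Python. This is only an illustration of the steps, not MidiTok's implementation: the `Event` dataclass, the `add_time_events` and `tokenize` helpers, the `TimeShift` token name and the `ticks_per_step` value are all hypothetical, MIDI preprocessing is skipped, and only the string tokens are returned (no ids).

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical event: a type, a value and an onset time in ticks."""
    type: str
    value: object
    time: int


def add_time_events(events, ticks_per_step=120):
    """Insert a TimeShift event each time the onset time advances."""
    out, previous_time = [], 0
    for event in events:
        if event.time > previous_time:
            steps = (event.time - previous_time) // ticks_per_step
            out.append(Event("TimeShift", steps, previous_time))
            previous_time = event.time
        out.append(event)
    return out


def tokenize(global_events, tracks_events, one_token_stream):
    """Merge global and track events, then deduce the time events."""
    if one_token_stream:
        # One sequence for the whole MIDI: concatenate everything, sort once.
        merged = global_events + [e for track in tracks_events for e in track]
        sequences = [sorted(merged, key=lambda e: e.time)]
    else:
        # One sequence per track, each one also carrying the global events.
        sequences = [
            sorted(global_events + track, key=lambda e: e.time)
            for track in tracks_events
        ]
    return [
        [f"{e.type}_{e.value}" for e in add_time_events(seq)]
        for seq in sequences
    ]


global_events = [Event("Tempo", 120, 0)]
tracks_events = [
    [Event("Pitch", 60, 0), Event("Pitch", 64, 240)],
    [Event("Pitch", 36, 120)],
]
print(tokenize(global_events, tracks_events, one_token_stream=True))
print(tokenize(global_events, tracks_events, one_token_stream=False))
```

Python's sort is stable, so in this sketch global events stay ahead of track events that share the same onset time.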
TL;DR: other changes
- New submodule pytorch_data offering PyTorch Dataset objects and a data collator, to be used when training a PyTorch model. Learn more in the documentation of the module;
- MIDILike, CPWord and Structured now natively handle Program tokens in a multitrack / one_token_stream way;
- Time signature changes are now handled by TSD, MIDILike and CPWord;
- The time_signature_range config option is now more flexible / convenient.
Changelog
- #61 new pytorch_data submodule, with DatasetTok and DatasetJsonIO classes. This module is only loaded if torch is installed in the Python environment;
- #61 tokenize_midi_dataset() method now has a tokenizer_config_file_name argument, allowing the tokenizer config to be saved with a custom file name;
- #61 "all-in-one" DataCollator object to be used with PyTorch DataLoaders;
- #62 Structured and MIDILike now natively handle Program tokens. When setting config.use_programs to true, a Program token will be added before each Pitch / NoteOn / NoteOff token to associate its instrument. MIDIs will also be treated as a single stream of tokens in this case, whereas otherwise each track is converted into independent token sequences;
- #62 miditok.utils.remove_duplicated_notes method can now remove notes with the same pitch and onset time, regardless of their offset time / duration;
- #62 miditok.utils.merge_same_program_tracks is now called in preprocess_midi when config.use_programs is True;
- #62 Big refactor of the REMI codebase, which now has all the features of REMIPlus, along with code cleanup and speedups (fewer calls to sorting). The REMIPlus class is now basically only a wrapped REMI with programs and time signature enabled;
- #62 TSD and MIDILike now encode and decode time signature changes;
- #63 @ilya16 The Tempos can now be created with a logarithmic scale, instead of the default linear scale;
- c53a008 and 5d1c12e track_to_tokens and tokens_to_track methods are now partially removed. They are now protected for classes that still rely on them, and removed from the others. These methods were made for internal calls and are not recommended for direct use. Instead, the midi_to_tokens method is recommended;
- #65 @ilya16 changes time_signature_range into a dictionary {denom_i: [num_i1, ..., num_in] / (min_num_i, max_num_i)};
- #65 @ilya16 fix in the formula computing the number of ticks per bar.
- #66 Adds an option to TokenizerConfig to delete the successive tempo / time signature changes carrying the same value during MIDI preprocessing;
- #66 now using xdist for tests, big speedup on GitHub Actions (ty @ilya16 !);
- #66 CPWord and Octuple now follow the common tokenization workflow;
- #66 As a consequence of the previous point, OctupleMono is removed as there was no record of its use. It is now equivalent to Octuple without config.use_programs;
- #66 CPWord now handles time signature changes;
- #66 tests for tempo and time signature changes are now more robust; exceptions were removed and fixed.
- 5a6378b save_tokens now by default doesn't save programs if config.use_programs is False
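To illustrate the logarithmic tempo scale introduced in #63, here is a small comparison of linearly and geometrically spaced tempo values. It is a sketch only: the function names, the rounding and the example range are assumptions and do not reflect how MidiTok builds its Tempo tokens.

```python
def linear_tempos(low, high, count):
    """Tempo values evenly spaced by a constant difference."""
    step = (high - low) / (count - 1)
    return [round(low + i * step, 2) for i in range(count)]


def log_tempos(low, high, count):
    """Tempo values spaced by a constant ratio (logarithmic scale)."""
    ratio = (high / low) ** (1 / (count - 1))
    return [round(low * ratio**i, 2) for i in range(count)]


print(linear_tempos(40, 160, 5))
print(log_tempos(40, 160, 5))
```

With a constant ratio, the values are denser at slow tempos and sparser at fast ones, while both scales cover the same range.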
Compatibility
- Calls to track_to_tokens and tokens_to_track methods are no longer supported. If you used these methods, you may replace them with midi_to_tokens and tokens_to_midi (or just call the tokenizer) while selecting the appropriate token sequences / tracks;
- time_signature_range now needs to be given as a dictionary;
- Due to changes in the order of vocabularies of Octuple (as programs are now optional), tokenizers and tokens made with previous versions will not be compatible unless the vocabulary order is swapped, idx 3 moved to 5.
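As a rough illustration of the dictionary form of time_signature_range, the sketch below expands such a dictionary into explicit (numerator, denominator) pairs. The helper name is hypothetical, and treating a tuple as an inclusive (min, max) range is an assumption based on the notation in the changelog, not a confirmed detail of the library.

```python
def expand_time_signatures(time_signature_range):
    """Expand {denominator: numerators} into (numerator, denominator) pairs.

    A list is read as explicit numerators, a tuple as an inclusive
    (min, max) range (assumption).
    """
    signatures = []
    for denominator, numerators in time_signature_range.items():
        if isinstance(numerators, tuple):
            low, high = numerators
            numerators = range(low, high + 1)
        signatures.extend((num, denominator) for num in numerators)
    return signatures


print(expand_time_signatures({8: [3, 6, 12], 4: (1, 4)}))
```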
v2.1.2 I/O fixes
Thanks to @Kapitan11, who spotted bugs when decoding tokens given as ids / integers (#59), this update brings a few fixes that solve them, alongside tests ensuring that the input / output (i/o) formats of the tokenizers are handled correctly in every case.
The documentation on this subject, which was unclear until now, has also been updated.
Changes
- 394dc4d Fix in MuMIDI and Octuple token encodings that performed the preprocessing steps twice;
- 394dc4d code of single track tests improved and now covering tempos for most tokenizations;
- 394dc4d MuMIDI can now decode tempo tokens;
- 394dc4d _in_as_seq decorator now used solely for the tokens_to_midi() method, and removed from tokens_to_track() which explicitly expects a TokSequence object as argument (089fa74);
- 089fa74 _in_as_seq decorator now handling all token ids input formats as it should;
- 9fe7639 Fix in TSD decoding with multiple input sequences when not in one_token_stream mode;
- 9fe7639 Adding i/o input ids tests;
- 8c2349b unique_track property renamed to one_token_stream as it is more explicit and accurate;
- 8c2349b new convert_sequence_to_tokseq method, which can convert any input sequence holding ids (integer), tokens (string) or events (Event) data into a TokSequence or a list of TokSequence objects, with the appropriate format depending on the tokenizer. This method is used by the _in_as_seq decorator;
- 8c2349b new io_format tokenizer property, returning the tokenizer's i/o format as a tuple of strings. Their meanings are: I for instrument (for non one_token_stream tokenizers), T for token, C for sub-token class (for multi-voc tokenizers);
- Minor code lint improvements.
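The meaning of the io_format letters can be sketched as follows. The function and its two boolean parameters are hypothetical stand-ins for the tokenizer's own attributes, and the ordering of the letters (I, then T, then C) is an assumption taken from the order in which they are listed above.

```python
def io_format(one_token_stream, is_multi_voc):
    """Build an i/o format tuple from two tokenizer characteristics."""
    fmt = []
    if not one_token_stream:
        fmt.append("I")  # one token sequence per instrument / track
    fmt.append("T")      # the token dimension, always present
    if is_multi_voc:
        fmt.append("C")  # each token is made of several sub-token classes
    return tuple(fmt)


print(io_format(one_token_stream=True, is_multi_voc=False))
print(io_format(one_token_stream=False, is_multi_voc=True))
```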
Compatibility
- All good 🙌
v2.1.1 Minor fixes
Changes
- 220f384 Fix in learn_bpe() for tokenizers in unique_track mode;
- 30d5546 Fixes in data augmentation (on tokens) in unique_track mode: 1) it was skipping files (detected as drums) and 2) it now augments all pitches except drum ones (as opposed to all pitches before);
- 30d5546 Tokenizer creating Program tokens from tokenizer.config.programs given by user.
Compatibility
- If you used custom Program tokens, make sure to give (-1, 128) as argument for your tokenizer's config (TokenizerConfig programs arg). This is already the default; this message only applies if you gave something else.
V2.1.0 TokenizerConfig
Major change
This "mid-size" update brings a new TokenizerConfig object, holding any tokenizer's configuration. This object is now used to instantiate all tokenizers, and replaces the now removed beat_res, nb_velocities, pitch_range and additional_tokens arguments. It simplifies the code, reduces exceptions, and exposes a simpler way to customize tokenizers.
You can read the documentation and example to see how to use it.
Changes
- e586b1f New TokenizerConfig object to hold config and instantiate tokenizers
- 26a67a6 @tingled Fix in __repr__
- 9970ec4 Fix in CPWord token type graph
- 69e64a7 max_bar_embedding argument for REMIPlus is now by default set to False
- 62292d6 @Kapitan11 load_params is now a private method, and documentation updated for this feature
- 3aeb7ff Removing the deprecated "slow" BPE methods
- f8ca854 @ilya16 Fixing PitchBend time attribute in merge_tracks method
- b12d270 TSD now natively handles Program tokens, the same way REMIPlus does. Using the use_programs option will convert MIDIs into a single token sequence for all tracks, instead of one sequence per track;
- Other minor code, lint and docstring improvements
Compatibility
- On your current / previous projects, you will need to update your code, specifically the way you create tokenizers, to use this update. This doesn't apply to code creating tokenizers from a config file (params arg);
- Slow BPE removed. If you still use these methods, we encourage you to switch to the new fast ones. Your trained models will need to be used with the old slow tokenizers.
V2.0.6 MMM tokenizer
Changes
- 811bd68 #40 #41 Adding the MMM tokenizer (Multi-Track Music Machine)
Compatibility
- All good 🙌
v2.0.5 Bug fixes and safety checks
Changes
- f9f63d0 (related to #37) adding a compatibility check to learn_bpe method
- f1af66a fixing an issue when loading tokens in learn_bpe with unique_track compatible tokenizer (REMIPlus) causing no BPE learning
- f1af66a in learn_bpe: checking that the total number of unique base tokens (chars) is lower than the target vocabulary size
- 47b6166 handling multi-voc indexing with tokens present in all vocabs, e.g. special tokens
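The vocabulary size check added to learn_bpe can be pictured with the sketch below. The helper name, its arguments and the error message are hypothetical; it only illustrates the condition described above, namely that BPE can only add merges on top of the base tokens, so the target size has to exceed their count.

```python
def check_bpe_vocab_size(token_sequences, target_vocab_size):
    """Return the number of unique base tokens, or raise if the target is too small."""
    base_tokens = {token for seq in token_sequences for token in seq}
    if len(base_tokens) >= target_vocab_size:
        raise ValueError(
            f"Target vocabulary size ({target_vocab_size}) must be greater than "
            f"the number of unique base tokens ({len(base_tokens)})."
        )
    return len(base_tokens)


sequences = [["Pitch_60", "Velocity_90"], ["Velocity_90", "Duration_1.0"]]
print(check_bpe_vocab_size(sequences, target_vocab_size=10))
```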
Compatibility
- All good 🙌
v2.0.4 Bugfix
Changes
- 456a6ce bugfix on the velocity feature when performing data augmentation at token level
v2.0.3 Minor improvements
Changes
- ff1bb5e and 195cb65 the __call__ magic method can now load MIDI and json files before converting them
- c045630 TokSequences are now subscriptable! (you can do tok_seq[id_])
- a632214 Special tokens are now stored without the None value
- Minor code and documentation improvements
Compatibility
- In case you use token_type_graph and tokens_errors: previous config files store special tokens with a None value (e.g. PAD_None), and have to be modified to remove it (e.g. just PAD) (special_tokens entry only). No change in vocabulary / tokens.
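Updating an older config for this change could look like the sketch below. It works on an in-memory dictionary rather than a real file, the helper name is hypothetical, and the other keys of the example config are placeholders; only the special_tokens entry and the PAD_None to PAD renaming come from the note above.

```python
def strip_none_suffix(config):
    """Return a copy of the config whose special tokens drop the '_None' suffix."""
    suffix = "_None"
    updated = dict(config)
    updated["special_tokens"] = [
        token[: -len(suffix)] if token.endswith(suffix) else token
        for token in config["special_tokens"]
    ]
    return updated


old_config = {"special_tokens": ["PAD_None", "BOS_None", "EOS"], "other_entry": 1}
print(strip_none_suffix(old_config))
```

Tokens that already lack the suffix are left untouched, so running the helper twice gives the same result.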
v2.0.2 Fix _ids_are_bpe_encoded
- 63110d7 fix in _ids_are_bpe_encoded method
V2.0.1 REMI+ and new Chord params
Changes
- e26b088 from @atsukoba + help from @muthissar: REMI+ is now implemented! 🎉 This multitrack tokenization can be seen as an extension of REMI.
- 2962211 Chord tokens can now represent the root note within tokens (versus only chord quality previously). Chord parameters have to be specified in the additional_tokens argument, with the keys chord_maps, chord_tokens_with_root_note and chord_unknown. You can use the default value as an example.
- e402b0d _in_as_seq decorator now automatically checks if the input ids are encoded with BPE
- 2064ee9 fix with BPE containing spaces in merges, could not load tokenizers after training
Compatibility
- Due to 2064ee9, bytes and merges are shifted from v2.0.0. BPE tokenizers will be incompatible and will have to be retrained, or the bytes from their vocabularies and merges will have to be shifted. This only applies to BPE.