Releases · Natooz/MidiTok
v1.2.8
Changes
- 82b2a1b Fix in MuMIDI's `token_types_errors()`
- 0869c23 Fix: BPE tokenizers now update the vocabulary's `_token_types_indexes` attribute after being modified
- b3642c1 `EOS` key added to `token_types_graph`, to prevent a possible crash
- 7d873ca MIDI objects converted from tokens now have their `max_tick` attribute calculated
- 770d8b8 0869c23 Small fixes and typo corrections
- Fixes in tests and GitHub Actions integration
Compatibility
- All good!
v1.2.7 Small improvements
Changes
- 22fee1d `TimeSignature` parameter automatically set to `False` for incompatible tokenizers, also fixing a bug when it was not provided by the user
- 2e958f1 Time signature of the MIDI set to 4/4 if the original MIDI had none (rare, but it can happen)
- a46fd56 Unused import removed
- f416ff5 BPE calculation in the `apply_bpe` method sped up by precomputing token successions in a class attribute
Compatibility
- All good!
v1.2.6 Bugfixes
Changes
- 168c8c3 Bugfix in `Octuple` vocabulary creation: now only creates the selected programs
- bfe987e Fix in the `MuMIDI` and `Octuple` `token_types_errors` methods, which could crash when analyzing special tokens (Pad, Mask, ...)
- 9567387 Bugfix in `CPWord` decoding (crash with special tokens); `Octuple` now saves its `_sos_eos` and `_mask` attributes in `save_params`
Compatibility
- All good!
v1.2.5 TSD tokenizer & small fixes
Changes
- 67c2926 Introducing TSD tokenization (Time Shift Duration). It is similar to MIDI-Like but uses `Duration` tokens instead of `Note-Off` tokens, and its main difference with REMI is the way it represents time (see the sketch after this list).
- 8af6a6b The `_add_pad_type_to_graph` method has been renamed `_add_special_tokens_to_types_graph`, and now also adds `SOS`, `EOS` and `MASK` tokens to the graph.
- f755c70 and 4b069a2 `add_bpe_to_tokens_type_graph` method for byte pair encoding, fixing a bug when loading a tokenizer from a config file.
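To make the new tokenization concrete, here is a minimal sketch of tokenizing a MIDI with TSD and decoding it back. It assumes the default constructor values of the 1.x tokenizers and miditoolkit for MIDI I/O; the file paths are placeholders, and the exact defaults may differ from what the comments suggest.

```python
# Minimal sketch (v1.2.5): tokenizing a MIDI with TSD and decoding it back.
# Assumes default constructor values; the paths are placeholders.
from miditoolkit import MidiFile
from miditok import TSD

tokenizer = TSD()  # default pitch range, beat resolution, velocities, additional tokens

midi = MidiFile("path/to/file.mid")       # hypothetical path
tokens = tokenizer.midi_to_tokens(midi)   # one token sequence per track

# Time is represented with time-shift tokens, and note ends with Duration
# tokens (instead of Note-Off as in MIDI-Like)
decoded = tokenizer.tokens_to_midi(tokens)
decoded.dump("path/to/decoded.mid")       # hypothetical output path
```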
Compatibility
- `_add_pad_type_to_graph` is still supported but will be removed in a future update; replace it with `_add_special_tokens_to_types_graph` in your code to stay up to date
v1.2.4 Byte Pair Encoding
Changes
- Byte Pair Encoding is here! It works with any tokenizer (except multi-embedding ones like CP Word or Octuple) as a wrapper to use as `bpe(tokenizer_class, params)` (see the example in the readme and the sketch after this list)
- 72a0f32 The `Vocabulary` class now has an `update_token_types_indexes` method to create its `_token_types_indexes` attribute, which can be called after loading a tokenizer with its saved vocabulary (as with BPE)
- d232f4a `Structured` now takes `additional_tokens` as a constructor argument, to align with all other tokenizers
- 4b0dc9f Bugfix in the `MIDITokenizer` base class for the rest and beat range attributes when loading the class from params
- eb3612f `save_tokens` now saves tokens as a dictionary with `tokens` and `programs` keys so that the distinction is clear
- tqdm is now used (and required) in the `tokenize_dataset` and `bpe` methods
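Here is a minimal sketch of building a BPE-capable tokenizer and saving tokens in the new dictionary format. The `bpe(tokenizer_class, params)` call and the `tokens` / `programs` keys come from this release note; the top-level `bpe` import, the exact signatures of `save_tokens`, `load_tokens` and `get_midi_programs`, and the file paths are assumptions, not a reference.

```python
# Minimal sketch (v1.2.4): BPE wrapper and the new save_tokens dictionary format.
# The signatures of save_tokens / load_tokens / get_midi_programs are assumptions.
from miditoolkit import MidiFile
from miditok import REMI, bpe
from miditok.utils import get_midi_programs

tokenizer = bpe(REMI)  # wraps REMI with Byte Pair Encoding capabilities

midi = MidiFile("path/to/file.mid")        # hypothetical path
tokens = tokenizer.midi_to_tokens(midi)

# Since v1.2.4, tokens are saved as a dict with "tokens" and "programs" keys
tokenizer.save_tokens(tokens, "file_tokens.json", get_midi_programs(midi))
saved = tokenizer.load_tokens("file_tokens.json")
tokens_back, programs = saved["tokens"], saved["programs"]

# BPE is then learned from a pre-tokenized dataset with the wrapper's bpe()
# method (argument names below are assumptions, see the readme example):
# tokenizer.bpe(tokens_path="path/to/tokens", vocab_size=500, out_dir="path/to/tokens_bpe")
```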
Compatibility
- `Structured` now takes `additional_tokens` as a constructor argument, to align with all other tokenizers
- As of v1.2.4, tokens saved with the `save_tokens` method are saved as a dictionary, so that no confusion is made between tracks and programs (as could happen before). You can still load tokens saved with versions < v1.2.4 using `load_tokens` with no consequences, as you handle how to index into them yourself.
v1.2.3 Bugfix in merge_tracks_per_class
Changes
- 87db480 Fix in `merge_tracks_per_class`: some tracks were omitted when filtering by pitch / tessitura
v1.2.2 Multitrack tokenization & reduced program sets
Changes
- bd951ec `merge_tracks_per_class` now allows removing notes whose pitch is outside the recommended range (tessitura) defined by the General MIDI 2 specs. Use the `filter_pitches` argument (see the sketch after this list).
- 611754d `MuMIDI` and `Octuple` now allow using custom sets of programs, reducing their vocabulary size. Use the `program` argument when constructing the tokenizers.
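As a concrete illustration of the first point, here is a minimal sketch. Only the `filter_pitches` argument is taken from this release note; the import path, the in-place behaviour and the file path are assumptions about the 1.x API.

```python
# Minimal sketch (v1.2.2): merging tracks per instrument class while dropping
# notes whose pitch falls outside the recommended General MIDI 2 range.
# The import path and the in-place behaviour are assumptions; the path is a placeholder.
from miditoolkit import MidiFile
from miditok.utils import merge_tracks_per_class

midi = MidiFile("path/to/file.mid")                 # hypothetical path
merge_tracks_per_class(midi, filter_pitches=True)   # assumed to modify the MIDI in place
```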
v1.2.1 Constants format update & utils module
Changes
- 4141e00 The `get_midi_programs`, `remove_duplicated_notes`, `detect_chords`, `merge_tracks`, `merge_same_program_tracks` and `current_bar_pos` methods have been moved from `miditok/midi_tokenizer_base.py` to `miditok/utils.py`; you can call them with `miditok.utils.the_method()` (see the sketch after this list)
- New method `merge_tracks_per_class`, which allows merging the tracks of a MIDI that belong to the same instrument class
- `MIDI_INSTRUMENTS` pitch range value changed from a tuple to a range
- `INSTRUMENT_CLASSES` changed from type `Dict[int, Tuple[int, str]]` to `List[Dict[str, Union[str, range]]]` so that it fits the format of the other constants. The index in the list corresponds to the index of each class.
- `INSTRUMENT_CLASSES_RANGES` replaced by `CLASS_OF_INST`, to easily get the class of any instrument / track from its program
- Minor cleanups in imports
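A minimal sketch of the relocated util functions and the new constants layout. The function names, module paths and constant names come from this release note; the call details, the returned formats and the contents of `INSTRUMENT_CLASSES` entries are assumptions.

```python
# Minimal sketch (v1.2.1): util functions now live in miditok.utils, and
# CLASS_OF_INST maps a program number to its instrument class index.
from miditoolkit import MidiFile
from miditok.utils import get_midi_programs, merge_tracks_per_class
from miditok.constants import CLASS_OF_INST, INSTRUMENT_CLASSES

midi = MidiFile("path/to/file.mid")   # hypothetical path
programs = get_midi_programs(midi)    # program info per track (assumed return format)
merge_tracks_per_class(midi)          # merge tracks of the same instrument class

class_index = CLASS_OF_INST[0]                 # class of program 0 (Acoustic Grand Piano)
piano_class = INSTRUMENT_CLASSES[class_index]  # dict describing the class
```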
Compatibility
- See the first point above if you used the utils functions
- See above if you used the `MIDI_INSTRUMENTS`, `INSTRUMENT_CLASSES` and `INSTRUMENT_CLASSES_RANGES` constants
v1.2.0 Multi-vocabulary tokenizers for CP Word, Octuple & MuMIDI
Changes
- 7fe9df6 becea47: CP Word, Octuple and MuMIDI tokenizers now have several `Vocabulary` objects within `self.vocab`, one for each token type (Pitch, Duration, ...). This makes it easy to create several input / output layers of different sizes, fitting the vocabulary size of each token type (see the sketch after this list).
- 05c1ab9 The `MIDITokenizer` base class now has `__call__` (linked to `midi_to_tokens`), `__len__` (returns `len(self.vocab)`) and `__getitem__` (returns `self.vocab[item]`, converting a token to an event and vice versa) magic methods.
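A minimal sketch of what these changes enable, assuming default constructor values and that each `Vocabulary` object exposes its size through `len()`; the embedding-size list simply mirrors the "one input / output layer per token type" idea described above.

```python
# Minimal sketch (v1.2.0): multi-vocabulary tokenizers and the new magic methods.
# Assumes default constructor values; the path is a placeholder.
from miditoolkit import MidiFile
from miditok import CPWord

tokenizer = CPWord()                  # self.vocab is now a list of Vocabulary objects

num_token_types = len(tokenizer)      # __len__ returns len(self.vocab)
embedding_sizes = [len(vocab) for vocab in tokenizer.vocab]  # one size per token type

midi = MidiFile("path/to/file.mid")   # hypothetical path
tokens = tokenizer(midi)              # __call__ is linked to midi_to_tokens

first_vocab = tokenizer[0]            # __getitem__ returns self.vocab[item]
```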
Compatibility
- CP Word, Octuple and MuMIDI tokenizations from versions < v1.2.0 are no longer compatible; datasets have to be re-tokenized
Thanks
Special thanks to @envilk for his contribution!
v1.1.11 Octuple bugfix & mask class argument
Changes
- #13 d930de5 Fail check when decoding tokens with `Octuple`, which could lead to errors with wrong `TimeSignature` tokens
- a39b390 The `mask` argument is now present in all tokenizer constructors; masking tokens are then added to the vocabularies at initialization (see the sketch after this list)
- af85740 Unused `Bar` token removed from the vocabulary of `Structured`
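A minimal sketch of the new `mask` argument, assuming the 1.x constructors accept it as a keyword with defaults for everything else; the tokenizer choices below are only examples.

```python
# Minimal sketch (v1.1.11): enabling the MASK special token at construction.
# Assumes default values for the other constructor arguments.
from miditok import REMI, MIDILike

remi_tokenizer = REMI(mask=True)          # MASK token added to the vocabulary at init
midilike_tokenizer = MIDILike(mask=True)  # the argument is available on every tokenizer
```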
Compatibility
- Structured: the `Bar` token (value 1) has been removed; subsequent token values should be decreased by 1
- The `MASK` token is now added to the vocabulary at tokenizer initialization, so token indexes may be shifted compared with versions < v1.1.11; you should re-tokenize your data and retrain your models with v1.1.11 if you used masking tokens