v3.0.2 New data loading and preprocessing methods
TL;DR
This new version introduces a new DatasetMIDI class to use when training PyTorch models. It replaces the class previously named DatasetTok, adding a pre-tokenizing option and better handling of BOS and EOS tokens.
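A minimal sketch of the intended usage follows; the constructor arguments shown (e.g. max_seq_len, bos_token_id, eos_token_id) and the DataCollator pairing are assumptions based on the typical MidiTok PyTorch setup and may differ slightly from the actual signatures:

```python
from pathlib import Path

from miditok import REMI
from miditok.pytorch_data import DataCollator, DatasetMIDI
from torch.utils.data import DataLoader

tokenizer = REMI()  # any MidiTok tokenizer
midi_paths = list(Path("data", "midis").glob("**/*.mid"))

# Build a dataset directly from MIDI file paths; sequences can be
# tokenized on the fly or up front with the pre-tokenizing option.
dataset = DatasetMIDI(
    files_paths=midi_paths,
    tokenizer=tokenizer,
    max_seq_len=1024,
    bos_token_id=tokenizer["BOS_None"],
    eos_token_id=tokenizer["EOS_None"],
)
collator = DataCollator(tokenizer.pad_token_id)
dataloader = DataLoader(dataset, batch_size=16, collate_fn=collator)
```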
A new miditok.pytorch_data.split_midis_for_training
method dynamically chunks MIDIs into smaller parts whose token sequences approximately match the desired sequence length, based on the note densities of their bars. These chunks can be used to train a model while maximizing the overall amount of data used.
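A minimal sketch of running this splitting step ahead of training; the save_dir and max_seq_len argument names are assumptions based on the typical workflow:

```python
from pathlib import Path

from miditok import REMI
from miditok.pytorch_data import split_midis_for_training

tokenizer = REMI()
midi_paths = list(Path("data", "midis").glob("**/*.mid"))

# Split each MIDI into chunks whose estimated token sequence length
# (derived from per-bar note density) approaches max_seq_len, saving
# the resulting files so they can be fed to DatasetMIDI afterwards.
chunk_paths = split_midis_for_training(
    files_paths=midi_paths,
    tokenizer=tokenizer,
    save_dir=Path("data", "midis_chunked"),
    max_seq_len=1024,
)
```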
A few new utility methods have been created for this feature, e.g. to split, concatenate or merge symusic.Score
objects.
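For illustration only, a sketch of the kind of Score-level helpers this refers to; the names split_midi_per_ticks and concat_midis below are hypothetical, check miditok.utils for the actual functions and signatures:

```python
from pathlib import Path

from symusic import Score
# Hypothetical names used for illustration; see miditok.utils for the real ones.
from miditok.utils import concat_midis, split_midi_per_ticks

score = Score(Path("data", "midis", "song.mid"))

# Split the Score in two at its middle tick, then stitch the chunks back together.
first_half, second_half = split_midi_per_ticks(score, [score.end() // 2])
restored = concat_midis([first_half, second_half])
```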
Thanks @Kinyugo for the discussions and tests that guided the development of the features! (#147)
The update also brings a few minor fixes, and the docs have a new theme!
What's Changed
- Fix token_paths to files_paths, and config to model_config by @sunsetsobserver in #145
- Fix issues in Octuple with multiple different-beat time signatures by @ilya16 in #146
- Pitch interval decoding: discarding notes outside the tokenizer pitch range by @Natooz in #149
- Fixing save_pretrained to comply with huggingface_hub v0.21 by @Natooz in #150
- Ability to overwrite _create_durations_tuples in init by @JLenzy in #153
- Refactor of PyTorch data loading classes and methods by @Natooz and @Kinyugo in #148
- The docs have a new theme, Furo!
New Contributors
- @sunsetsobserver made their first contribution in #145
- @JLenzy made their first contribution in #153
Full Changelog: v3.0.1...v3.0.2