
v3.0.2 New data loading and preprocessing methods

@Natooz released this 24 Mar 14:38
· 57 commits to main since this release

TL;DR

This new version introduces a new DatasetMIDI class to use when training PyTorch models. It builds on the previously named DatasetTok class, adding a pre-tokenizing option and better handling of BOS and EOS tokens.
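
A minimal sketch of how DatasetMIDI might be paired with a PyTorch DataLoader is shown below. The constructor arguments used here (files_paths, max_seq_len, the BOS/EOS token ids) and the DataCollator pairing are assumptions rather than the confirmed v3.0.2 signatures, so check the documentation for the exact API.

```python
# Hedged sketch: argument names below are assumptions, not the exact v3.0.2 API.
from pathlib import Path

from torch.utils.data import DataLoader

from miditok import REMI
from miditok.pytorch_data import DatasetMIDI, DataCollator

tokenizer = REMI()  # any MidiTok tokenizer
midi_paths = list(Path("dataset").glob("**/*.mid"))

# Build the dataset; BOS/EOS ids are looked up from the vocabulary (assumed access pattern).
dataset = DatasetMIDI(
    midi_paths,
    tokenizer=tokenizer,
    max_seq_len=1024,
    bos_token_id=tokenizer["BOS_None"],
    eos_token_id=tokenizer["EOS_None"],
)
collator = DataCollator(pad_token_id=tokenizer.pad_token_id)
data_loader = DataLoader(dataset, batch_size=16, collate_fn=collator)

for batch in data_loader:
    pass  # feed the batched token ids to your model
```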
A new miditok.pytorch_data.split_midis_for_training method dynamically chunks MIDIs into smaller parts that approximately match the desired token sequence length, based on the note densities of their bars. These chunks can be used to train a model while maximizing the overall amount of data used.
A few new utility methods have been created for these features, e.g. to split, concatenate or merge symusic.Score objects.
Thanks @Kinyugo for the discussions and tests that guided the development of these features! (#147)
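
The chunking workflow could look roughly like the following. split_midis_for_training is named in this release, but the argument names (files_paths, tokenizer, save_dir, max_seq_len) and the assumption that it returns the paths of the saved chunks should be verified against the docs.

```python
# Hedged sketch: argument names and return value are assumptions about v3.0.2.
from pathlib import Path

from miditok import REMI
from miditok.pytorch_data import split_midis_for_training

tokenizer = REMI()
midi_paths = list(Path("dataset").glob("**/*.mid"))

# Split each MIDI into chunks whose token sequences approximate max_seq_len,
# estimated from the note density of the bars, and save them to save_dir.
chunk_paths = split_midis_for_training(
    files_paths=midi_paths,
    tokenizer=tokenizer,
    save_dir=Path("dataset_chunks"),
    max_seq_len=1024,
)
# The resulting chunk paths can then be given to DatasetMIDI (see the sketch above).
```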

The update also brings a few minor fixes, and the docs have a new theme!

What's Changed

  • Fix token_paths to files_paths, and config to model_config by @sunsetsobserver in #145
  • Fix issues in Octuple with multiple different-beat time signatures by @ilya16 in #146
  • Pitch interval decoding: discarding notes outside the tokenizer pitch range by @Natooz in #149
  • Fixing save_pretrained to comply with huggingface_hub v0.21 by @Natooz in #150
  • Ability to overwrite _create_durations_tuples in init by @JLenzy in #153
  • Refactor of PyTorch data loading classes and methods by @Natooz and @Kinyugo in #148
  • The docs now use the furo theme!

New Contributors

Full Changelog: v3.0.1...v3.0.2