New tokenization workflow, fixes in time signature (#66)

* option to delete equal successive tempo / time sig changes, black formatting
* fixes in tokenizations when encoding / decoding Tempo messages, passing pytest with xdist
* HUGE: CPWord and Octuple now adopting common workflow, OctupleMono removed, CPWord handling Time Signature, fixes in tempo and time sig decoding for MIDILike & REMI & TSD when one_token_stream is False, common TIME_SIGNATURE_RANGE set in constants, fixes in tests that now also test all time signature and tempo changes
* fix in tests, doc update (additional tokens table)
* black code formatting
* fix in MuMIDI init + removed from data augment test (not compatible)
* fix in MMM init when no config object is given
Natooz · Aug 17, 2023 · 3f33a12
1 parent 114d253
commit 3f33a12

Showing 24 changed files with 816 additions and 1,037 deletions.
diff --git a/.github/workflows/pytest.yml b/.github/workflows/pytest.yml
@@ -25,7 +25,7 @@ jobs:
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
-          pip install setuptools flake8 pytest coverage torch tensorflow
+          pip install setuptools flake8 pytest-xdist[psutil] coverage torch tensorflow
           pip install -r requirements.txt
       - name: Lint with flake8
         run: |
@@ -35,6 +35,6 @@ jobs:
           flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
       - name: Test with pytest
         run: |
-          coverage run -m pytest
+          coverage run -m pytest -n auto
       - name: Codecov
         uses: codecov/codecov-action@v3.1.0
diff --git a/README.md b/README.md
@@ -10,7 +10,7 @@ Python package to tokenize MIDI music files, presented at the ISMIR 2021 LBD.
 [![GitHub CI](https://github.com/Natooz/MidiTok/actions/workflows/pytest.yml/badge.svg)](https://github.com/Natooz/MidiTok/actions/workflows/pytest.yml)
 [![Codecov](https://img.shields.io/codecov/c/github/Natooz/MidiTok)](https://codecov.io/gh/Natooz/MidiTok)
 [![GitHub license](https://img.shields.io/github/license/Natooz/MidiTok.svg)](https://github.com/Natooz/MidiTok/blob/main/LICENSE)
-[![Downloads](https://pepy.tech/badge/MidiTok)](https://pepy.tech/project/MidiTok)
+[![Downloads](https://static.pepy.tech/badge/miditok)](https://pepy.tech/project/MidiTok)
 [![Code style](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 
 Using Deep Learning with symbolic music? MidiTok can take care of converting (tokenizing) your MIDI files into tokens, ready to be fed to models such as Transformer, for any generation, transcription or MIR task.

diff --git a/docs/additional_tokens_table.csv b/docs/additional_tokens_table.csv
@@ -0,0 +1,9 @@
+Tokenization,Tempo,Time signature,Chord,Rest
+MIDILike,✅,✅,✅,✅
+REMI,✅,✅,✅,✅
+TSD,✅,✅,✅,✅
+Structured,❌,❌,❌,❌
+CPWord,✅,✅,✅,✅
+Octuple,✅,✅,❌,❌
+MuMIDI,✅,❌,✅,❌
+MMM,✅,✅,✅,❌
diff --git a/docs/midi_tokenizer.rst b/docs/midi_tokenizer.rst
@@ -62,72 +62,10 @@ Additional tokens
 MidiTok offers to include additional tokens on music information. You can specify them in the ``tokenizer_config`` argument (:class:`miditok.TokenizerConfig`) when creating a tokenizer. The :class:`miditok.TokenizerConfig` documentation details the role of each of them and their associated parameters.
 Cells with a ❕ marker mean the additional token is implemented by default and not optional.
 
-.. list-table:: Compatibility table of tokenizations and additional tokens.
+.. csv-table:: Compatibility table of tokenizations and additional tokens.
+   :file: additional_tokens_table.csv
    :header-rows: 1
 
-   * - Token type
-     - :ref:`REMI`
-     - :ref:`REMIPlus`
-     - :ref:`MIDI-Like`
-     - :ref:`TSD`
-     - :ref:`Structured`
-     - :ref:`CPWord`
-     - :ref:`Octuple`
-     - :ref:`MuMIDI`
-     - :ref:`MMM`
-   * - Chord
-     - ✅
-     - ✅
-     - ✅
-     - ✅
-     - ✅
-     - ❌
-     - ❌
-     - ✅
-     - ✅
-   * - Rest
-     - ✅
-     - ✅
-     - ✅
-     - ✅
-     - ✅
-     - ❌
-     - ❌
-     - ❌
-     - ❌
-   * - Tempo
-     - ✅
-     - ✅
-     - ✅
-     - ✅
-     - ✅
-     - ❌
-     - ✅
-     - ✅
-     - ✅
-   * - Program
-     - ✅¹
-     - ✅¹
-     - ✅¹
-     - ✅¹
-     - ✅¹
-     - ✅²
-     - ✅❕
-     - ✅❕
-     - ✅❕
-   * - Time signature
-     - ✅
-     - ✅
-     - ✅
-     - ✅
-     - ❌
-     - ❌
-     - ✅
-     - ❌
-     - ✅
-
-**¹** the tokenizer will add `Program` tokens before each `Pitch` / `NoteOn` token, and will treat all the tracks of a MIDI as a single sequence of tokens.
-**²** unimplemented, the tokenizer's vocabulary will contain the `Program` tokens, but it will not use it.
 
 Special tokens
 ------------------------
@@ -148,7 +86,7 @@ Tokens & TokSequence input / output format
 
 Depending on the tokenizer in use, the **format** of the tokens returned by the ``midi_to_tokens`` method may vary, as well as the expected format for the ``tokens_to_midi`` method. The format is given by the ``tokenizer.io_format`` property. For any tokenizer, the format is the same for both methods.
 
-The format is deduced from the ``is_multi_voc`` and ``one_token_stream`` tokenizer properties. In short: **one_token_stream** being True means that the tokenizer will convert a MIDI file into a single stream of tokens for all instrument tracks, otherwise it will convert each track to a distinct token stream; **is_mult_voc** being True means that each token stream is a list of lists of tokens, of shape ``(T,C)`` for T time steps and C subtokens per time step.
+The format is deduced from the ``is_multi_voc`` and ``one_token_stream`` tokenizer properties. **one_token_stream** being True means that the tokenizer will convert a MIDI file into a single stream of tokens for all instrument tracks, otherwise it will convert each track to a distinct token sequence. **is_multi_voc** being True means that each token stream is a list of lists of tokens, of shape ``(T,C)`` for T time steps and C subtokens per time step.
 
 This results in four situations, where I is the number of tracks, T is the number of tokens (or time steps) and C the number of subtokens per time step:
 
@@ -163,7 +101,7 @@ Some tokenizer examples to illustrate:
 
 * **TSD** without ``config.use_programs`` will not have multiple vocabularies and will treat each MIDI track as a unique stream of tokens, hence it will convert MIDI files to a list of ``TokSequence`` objects, ``(I,T)`` format.
 * **TSD** with ``config.use_programs`` being True will convert all MIDI tracks to a single stream of tokens, hence one ``TokSequence`` object, ``(T)`` format.
-* **CPWord** is a multi-voc tokenizer and treats each MIDI track as a distinct stream of tokens, hence it will convert MIDI files to a list of ``TokSequence`` objects with ``(I,T,C)`` format.
+* **CPWord** is a multi-voc tokenizer; without ``config.use_programs`` it will treat each MIDI track as a distinct stream of tokens, hence it will convert MIDI files to a list of ``TokSequence`` objects with the ``(I,T,C)`` format.
 * **Octuple** is a multi-voc tokenizer and converts all MIDI tracks to a single stream of tokens, hence it will convert MIDI files to a ``TokSequence`` object, ``(T,C)`` format.
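
The rule described in this hunk (format deduced from ``one_token_stream`` and ``is_multi_voc``) can be illustrated with a minimal sketch. The function name `sketch_io_format` is hypothetical and only mirrors the documented behaviour; it is not claimed to be the library's implementation of ``tokenizer.io_format``.

```python
from typing import Tuple


def sketch_io_format(one_token_stream: bool, is_multi_voc: bool) -> Tuple[str, ...]:
    """Hypothetical helper: derive the format axes from the two properties."""
    axes = []
    if not one_token_stream:
        axes.append("I")  # one token sequence per instrument track
    axes.append("T")  # tokens (or time steps)
    if is_multi_voc:
        axes.append("C")  # subtokens per time step
    return tuple(axes)


# The four situations, matched to the tokenizer examples listed above.
print(sketch_io_format(one_token_stream=False, is_multi_voc=False))  # TSD without use_programs
print(sketch_io_format(one_token_stream=True, is_multi_voc=False))   # TSD with use_programs
print(sketch_io_format(one_token_stream=False, is_multi_voc=True))   # CPWord without use_programs
print(sketch_io_format(one_token_stream=True, is_multi_voc=True))    # Octuple
```

The two flags are independent, which is why exactly four formats arise.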
 
 

diff --git a/docs/tokenizations.rst b/docs/tokenizations.rst
@@ -85,13 +85,6 @@ Octuple
     :noindex:
     :show-inheritance:
 
-Octuple Mono
-------------------------
-
-.. autoclass:: miditok.OctupleMono
-    :noindex:
-    :show-inheritance:
-
 MuMIDI
 ------------------------
 

diff --git a/miditok/__init__.py b/miditok/__init__.py
@@ -6,7 +6,6 @@
     TSD,
     Structured,
     Octuple,
-    OctupleMono,
     CPWord,
     MuMIDI,
     MMM,
@@ -44,7 +43,6 @@ def _tweak_config_before_creating_voc(self):
     "TSD",
     "Structured",
     "Octuple",
-    "OctupleMono",
     "CPWord",
     "MuMIDI",
     "MMM",

diff --git a/miditok/classes.py b/miditok/classes.py
@@ -26,7 +26,9 @@
     NB_TEMPOS,
     TEMPO_RANGE,
     LOG_TEMPOS,
+    DELETE_EQUAL_SUCCESSIVE_TEMPO_CHANGES,
     TIME_SIGNATURE_RANGE,
+    DELETE_EQUAL_SUCCESSIVE_TIME_SIG_CHANGES,
     PROGRAMS,
     CURRENT_VERSION_PACKAGE,
 )
@@ -164,8 +166,8 @@ class TokenizerConfig:
             add more TimeSignatureChange objects. (default: False)
     :param use_programs: will use ``Program`` tokens, if the tokenizer is compatible.
             Used to specify an instrument / MIDI program. The :ref:`Octuple`, :ref:`MMM` and :ref:`MuMIDI` tokenizers
-            use natively `Program` tokens, this option is always enabled. :ref:`TSD`, :ref:`REMI`, :ref:`REMIPlus`,
-            :ref:`MIDILike` and :ref:`Structured` will add `Program` tokens before each `Pitch` / `NoteOn` token to
+            use natively `Program` tokens, this option is always enabled. :ref:`TSD`, :ref:`REMI`, :ref:`MIDILike`,
+            :ref:`Structured` and :ref:`CPWord` will add `Program` tokens before each `Pitch` / `NoteOn` token to
             indicate its associated instrument and will treat all the tracks of a MIDI as a single sequence of tokens.
             :ref:`CPWord`, :ref:`Octuple` and :ref:`MuMIDI` add a `Program` token with the stacks of `Pitch`,
             `Velocity` and `Duration` tokens. (default: False)
@@ -183,8 +185,25 @@ class TokenizerConfig:
     :param nb_tempos: number of tempos "bins" to use. (default: 32)
     :param tempo_range: range of minimum and maximum tempos within which the bins fall. (default: (40, 250))
     :param log_tempos: will use log scaled tempo values instead of linearly scaled. (default: False)
+    :param delete_equal_successive_tempo_changes: setting this option True will delete identical successive tempo
+            changes when preprocessing a MIDI file after loading it. For example, if a MIDI has a tempo change
+            for tempo 120 at tick 1000 and the next one is for tempo 121 at tick 1200, during preprocessing the tempo
+            values are likely to be downsampled and become identical (120 or 121). If that's the case, the second
+            tempo change will be deleted and not tokenized. This parameter doesn't apply to tokenizations that natively
+            inject the tempo information at recurrent timings (e.g. Octuple). For others, note that setting it True
+            might reduce the number of `Tempo` tokens and in turn the recurrence of this information. Leave it False if
+            you want to have recurrent `Tempo` tokens, which you might inject yourself by adding `TempoChange` objects to
+            your MIDIs. (default: False)
     :param time_signature_range: range as a dictionary {denom_i: [num_i1, ..., num_in] / (min_num_i, max_num_i)}.
             (default: {4: [4]})
+    :param delete_equal_successive_time_sig_changes: setting this option True will delete identical successive time
+            signature changes when preprocessing a MIDI file after loading it. For example, if a MIDI has a time
+            signature change for 4/4 at tick 1000 and the next one is also 4/4 at tick 1200, the second time signature
+            change will be deleted and not tokenized. This parameter doesn't apply to tokenizations that natively
+            inject the time signature information at recurrent timings (e.g. Octuple). For others, note that setting it
+            True might reduce the number of `TimeSig` tokens and in turn the recurrence of this information. Leave it
+            False if you want to have recurrent `TimeSig` tokens, which you might inject yourself by adding
+            `TimeSignatureChange` objects to your MIDIs. (default: False)
     :param programs: sequence of MIDI programs to use. Note that `-1` is used and reserved for drums tracks.
             (default: from -1 to 127 included)
     :param **kwargs: additional parameters that will be saved in `config.additional_params`.
@@ -208,7 +227,11 @@ def __init__(
         nb_tempos: int = NB_TEMPOS,
         tempo_range: Tuple[int, int] = TEMPO_RANGE,
         log_tempos: bool = LOG_TEMPOS,
-        time_signature_range: Dict[int, Union[List[int], Tuple[int, int]]] = TIME_SIGNATURE_RANGE,
+        delete_equal_successive_tempo_changes: bool = DELETE_EQUAL_SUCCESSIVE_TEMPO_CHANGES,
+        time_signature_range: Dict[
+            int, Union[List[int], Tuple[int, int]]
+        ] = TIME_SIGNATURE_RANGE,
+        delete_equal_successive_time_sig_changes: bool = DELETE_EQUAL_SUCCESSIVE_TIME_SIG_CHANGES,
         programs: Sequence[int] = PROGRAMS,
         **kwargs,
     ):
@@ -239,12 +262,20 @@ def __init__(
         self.nb_tempos: int = nb_tempos  # nb of tempo bins for additional tempo tokens, quantized like velocities
         self.tempo_range: Tuple[int, int] = tempo_range  # (min_tempo, max_tempo)
         self.log_tempos: bool = log_tempos
+        self.delete_equal_successive_tempo_changes = (
+            delete_equal_successive_tempo_changes
+        )
 
         # Time signature params
         self.time_signature_range: Dict[int, List[int]] = {
-            beat_res: list(range(beats[0], beats[1] + 1)) if isinstance(beats, tuple) else beats
+            beat_res: list(range(beats[0], beats[1] + 1))
+            if isinstance(beats, tuple)
+            else beats
             for beat_res, beats in time_signature_range.items()
         }
+        self.delete_equal_successive_time_sig_changes = (
+            delete_equal_successive_time_sig_changes
+        )
 
         # Programs
         self.programs: Sequence[int] = programs
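
The behaviour documented for `delete_equal_successive_tempo_changes` can be sketched as follows. This is a minimal illustration under stated assumptions: the `TempoChange` dataclass below is a hypothetical stand-in (not the MIDI library's class), the function name is invented for the example, and tempo values are assumed to be already downsampled to their bins.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TempoChange:
    """Hypothetical stand-in for a tempo change event (tempo in BPM, time in ticks)."""
    tempo: int
    time: int


def delete_equal_successive(changes: List[TempoChange]) -> List[TempoChange]:
    """Keep a change only if its tempo differs from the previously kept one."""
    kept: List[TempoChange] = []
    for change in sorted(changes, key=lambda c: c.time):
        if kept and kept[-1].tempo == change.tempo:
            continue  # identical to the previous change: drop it
        kept.append(change)
    return kept


# After downsampling, the changes at ticks 1000 and 1200 share the same value.
changes = [TempoChange(120, 1000), TempoChange(120, 1200), TempoChange(90, 2000)]
filtered = delete_equal_successive(changes)
print([(c.tempo, c.time) for c in filtered])
```

The same idea would apply to time signature changes by comparing the (numerator, denominator) pair instead of the tempo value.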

diff --git a/miditok/constants.py b/miditok/constants.py
@@ -62,9 +62,11 @@
 NB_TEMPOS = 32
 TEMPO_RANGE = (40, 250)  # (min_tempo, max_tempo)
 LOG_TEMPOS = False  # log or linear scale tempos
+DELETE_EQUAL_SUCCESSIVE_TEMPO_CHANGES = False
 
 # Time signature params
-TIME_SIGNATURE_RANGE = {4: [4]}  # {denom_i: [num_i1, ..., num_in] / (min_num_i, max_num_i)}
+# {denom_i: [num_i1, ..., num_in] / (min_num_i, max_num_i)}
+TIME_SIGNATURE_RANGE = {8: [3, 12, 6], 4: [5, 6, 3, 2, 1, 4]}
 
 # Programs
 PROGRAMS = list(range(-1, 128))
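
For illustration, the `{denom_i: [num_i1, ..., num_in] / (min_num_i, max_num_i)}` convention from the comment above can be expanded as in this sketch, which mirrors the dictionary comprehension shown in the `classes.py` hunk. The function name `expand_time_signature_range` is hypothetical.

```python
from typing import Dict, List, Tuple, Union

TimeSigRange = Dict[int, Union[List[int], Tuple[int, int]]]


def expand_time_signature_range(ts_range: TimeSigRange) -> Dict[int, List[int]]:
    """Turn (min_num, max_num) tuples into explicit lists; keep lists as given."""
    return {
        denom: list(range(nums[0], nums[1] + 1)) if isinstance(nums, tuple) else nums
        for denom, nums in ts_range.items()
    }


# A tuple is read as an inclusive range of numerators, a list as explicit values.
expanded = expand_time_signature_range({8: (3, 6), 4: [4]})
print(expanded)

# The new default value lists its numerators explicitly, so it is left unchanged.
default = expand_time_signature_range({8: [3, 12, 6], 4: [5, 6, 3, 2, 1, 4]})
print(sum(len(nums) for nums in default.values()))  # number of time signatures
```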
@@ -80,6 +82,7 @@
 TIME_DIVISION = 384  # 384 and 480 are convenient as divisible by 4, 8, 12, 16, 24, 32
 TEMPO = 120
 TIME_SIGNATURE = (4, 4)
+DELETE_EQUAL_SUCCESSIVE_TIME_SIG_CHANGES = False
 
 # Used with chords
 PITCH_CLASSES = [