This project aims to develop a machine learning model capable of predicting the activations of MIDI notes from an audio file containing a musical piece.
├── data_analysis.ipynb
├── amt_baseline.ipynb # includes a baseline multiclass logistic regression model
├── amt_dnn.ipynb # includes deep neural network models
├── amt_lstm.ipynb # includes lstm (long short term memory) and transfer learning
├── audio_results/ # audio and midi note activations for a test and predicted audio
└── README.md
The dataset used is the OMAPS2 dataset, which consists of audio recordings from a piano in .wav format and corresponding manually annotated music transcription sheets in .txt format. The dataset is already split into train, validation, and test sets.
The audio data was converted into a constant-Q transform (CQT) representation, which decomposes the audio signal into frequency components over time. The CQT vectors were then time-aligned with the MIDI annotations, which were one-hot encoded per frame.
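A minimal sketch of this preprocessing step, assuming `librosa` for the CQT and a binary frame-level activation matrix as the target. The hop length, sampling rate, 88-key pitch range, and the `(onset, offset, pitch)` annotation format are illustrative assumptions, not details taken from the notebooks:

```python
import librosa
import numpy as np

def compute_cqt(wav_path, sr=22050, hop_length=512, n_bins=88):
    """Load audio and return a log-magnitude CQT of shape (n_frames, n_bins)."""
    y, sr = librosa.load(wav_path, sr=sr)
    cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=hop_length,
                             fmin=librosa.note_to_hz("A0"), n_bins=n_bins))
    return librosa.amplitude_to_db(cqt, ref=np.max).T

def notes_to_activations(notes, n_frames, frames_per_sec=22050 / 512,
                         min_pitch=21, n_pitches=88):
    """Convert (onset_sec, offset_sec, midi_pitch) annotations into a binary
    frame-level activation matrix aligned with the CQT frames."""
    roll = np.zeros((n_frames, n_pitches), dtype=np.float32)
    for onset, offset, pitch in notes:
        start = int(round(onset * frames_per_sec))
        end = min(int(round(offset * frames_per_sec)), n_frames - 1)
        roll[start:end + 1, pitch - min_pitch] = 1.0
    return roll
```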
Baseline Model: The One-vs-Rest Classifier with a Logistic Regression estimator was used as a baseline for comparison.
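A sketch of how such a baseline could be set up with scikit-learn; `X_train`, `Y_train`, and `X_test` are assumed to be the frame-level CQT features and binary activations produced by the preprocessing step above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# One binary logistic-regression classifier per MIDI pitch (multi-label setting).
# X_train: (n_frames, n_cqt_bins) features, Y_train: (n_frames, n_pitches) binary targets.
baseline = OneVsRestClassifier(LogisticRegression(max_iter=1000), n_jobs=-1)
baseline.fit(X_train, Y_train)
Y_pred_baseline = baseline.predict(X_test)
```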
Two main architectures were explored (a sketch of both follows the list):

- Deep Neural Network (DNN): The DNN takes CQT vectors as input and outputs the one-hot encoded MIDI activations. It consists of several hidden layers with ReLU activation functions and employs techniques such as dropout and early stopping to prevent overfitting.
- Long Short-Term Memory (LSTM): The LSTM architecture was implemented to capture short-term and long-term dependencies in the audio data. Transfer learning was also explored by initializing the LSTM weights with pre-trained weights from the DNN models.
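A hedged sketch of both architectures, assuming Keras; the layer sizes, sequence length, and the choice to transfer only the output-layer weights are illustrative, not the notebooks' actual configurations:

```python
from tensorflow import keras
from tensorflow.keras import layers

N_BINS, N_PITCHES = 88, 88  # illustrative feature and output dimensions

def build_dnn(dropout=0.3):
    """Frame-wise DNN: one CQT frame in, sigmoid multi-label pitch activations out."""
    return keras.Sequential([
        keras.Input(shape=(N_BINS,)),
        layers.Dense(512, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(256, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(N_PITCHES, activation="sigmoid"),
    ])

def build_lstm(seq_len=100, dropout=0.3):
    """Sequence model: a window of CQT frames in, per-frame activations out."""
    return keras.Sequential([
        keras.Input(shape=(seq_len, N_BINS)),
        layers.LSTM(256, return_sequences=True),
        layers.Dropout(dropout),
        layers.Dense(N_PITCHES, activation="sigmoid"),
    ])

dnn = build_dnn()
dnn.compile(optimizer="adam", loss="binary_crossentropy")
# dnn.fit(...)  # train the frame-wise model first

# Transfer learning: seed the LSTM's output layer with the trained DNN's output layer
# (both are Dense layers mapping 256 units to N_PITCHES, so the weight shapes match).
lstm = build_lstm()
lstm.layers[-1].set_weights(dnn.layers[-1].get_weights())
lstm.compile(optimizer="adam", loss="binary_crossentropy")
```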
Additional techniques explored (see the training sketch after this list):
- Hyperparameter tuning (grid search) over batch size and dropout rate
- Regularization: dropout and early stopping
- Optimization: cyclical learning rate and mini-batch training
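A sketch of how these techniques could be combined in one training loop, reusing the hypothetical `build_dnn` from the sketch above and assuming `X_train`/`Y_train`/`X_val`/`Y_val` exist. The triangular cyclical learning rate is applied per epoch here for simplicity (it is usually applied per batch), and all grid values are illustrative:

```python
import numpy as np
from tensorflow import keras

def triangular_clr(epoch, base_lr=1e-4, max_lr=1e-3, step_size=5):
    """Triangular cyclical learning rate, oscillating between base_lr and max_lr."""
    cycle = np.floor(1 + epoch / (2 * step_size))
    x = np.abs(epoch / step_size - 2 * cycle + 1)
    return float(base_lr + (max_lr - base_lr) * max(0.0, 1 - x))

callbacks = [
    keras.callbacks.LearningRateScheduler(lambda epoch, lr: triangular_clr(epoch)),
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                  restore_best_weights=True),
]

# Grid search over batch size and dropout rate, keeping the best validation loss.
best_cfg, best_val = None, np.inf
for batch_size in (64, 128, 256):
    for dropout in (0.2, 0.3, 0.5):
        model = build_dnn(dropout=dropout)
        model.compile(optimizer="adam", loss="binary_crossentropy")
        history = model.fit(X_train, Y_train,
                            validation_data=(X_val, Y_val),
                            batch_size=batch_size, epochs=50,
                            callbacks=callbacks, verbose=0)
        val_loss = min(history.history["val_loss"])
        if val_loss < best_val:
            best_cfg, best_val = (batch_size, dropout), val_loss
```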
The models' performance was evaluated using accuracy (a modified version that does not count true negatives (TN)), precision, recall, and F1-score. The predictions were compared against the time-aligned MIDI annotations.
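A sketch of these frame-level metrics, assuming the modified accuracy is TP / (TP + FP + FN), i.e. standard accuracy with the true negatives removed from the denominator; the 0.5 activation threshold is also an assumption:

```python
import numpy as np

def frame_metrics(y_true, y_prob, threshold=0.5):
    """Frame-level metrics on binary piano rolls; the accuracy ignores true negatives."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = tp / (tp + fp + fn) if tp + fp + fn else 0.0  # true negatives excluded
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```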
The best DNN model achieved an accuracy of 37.32%, outperforming the baseline logistic regression model, which achieved an accuracy of 10.75%. However, the LSTM models struggled to achieve satisfactory accuracy despite various optimization strategies.
Below, we can see the similarities between the actual MIDI note activation and the predicted MIDI note activation for a particular audio.
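Comparison plots of this kind can be reproduced with a simple matplotlib sketch along these lines; `Y_test` and `Y_pred_binary` are assumed to be the binary piano rolls for one test audio:

```python
import matplotlib.pyplot as plt

# Y_test and Y_pred_binary: (n_frames, n_pitches) binary piano rolls for one test audio.
fig, axes = plt.subplots(2, 1, figsize=(12, 6), sharex=True)
axes[0].imshow(Y_test.T, aspect="auto", origin="lower", cmap="gray_r")
axes[0].set_title("Actual MIDI Note Activation")
axes[1].imshow(Y_pred_binary.T, aspect="auto", origin="lower", cmap="gray_r")
axes[1].set_title("Predicted MIDI Note Activation")
axes[1].set_xlabel("Frame")
for ax in axes:
    ax.set_ylabel("Pitch index")
plt.tight_layout()
plt.show()
```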
Furthermore, we can listen to and compare the actual audio (`y_test_output60000.mid`) mentioned above and the audio generated from the model predictions (`predictions_output60000.mid`) in the `audio_results/` folder.
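Rendering the predicted activations to a playable `.mid` file could be done roughly as follows, assuming `pretty_midi`; the frame rate, pitch offset, and velocity are illustrative assumptions, not the notebooks' actual settings:

```python
import numpy as np
import pretty_midi

def activations_to_midi(roll, out_path, frames_per_sec=22050 / 512,
                        min_pitch=21, velocity=80):
    """Write a binary (n_frames, n_pitches) activation matrix to a MIDI file."""
    pm = pretty_midi.PrettyMIDI()
    piano = pretty_midi.Instrument(program=0)  # acoustic grand piano
    for pitch_idx in range(roll.shape[1]):
        active = np.flatnonzero(roll[:, pitch_idx])
        if active.size == 0:
            continue
        # Group consecutive active frames into single sustained notes.
        for seg in np.split(active, np.where(np.diff(active) > 1)[0] + 1):
            piano.notes.append(pretty_midi.Note(
                velocity=velocity,
                pitch=int(pitch_idx + min_pitch),
                start=seg[0] / frames_per_sec,
                end=(seg[-1] + 1) / frames_per_sec))
    pm.instruments.append(piano)
    pm.write(out_path)

# activations_to_midi(Y_pred_binary, "predictions_output60000.mid")
```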
Possible directions for future work include:

- Online Platform for Testing: Create a web-based application where users can upload their `.midi` or `.mid` files and compare the generated MIDI note activations against the original files. This will provide a convenient way to test and evaluate the model's predictions.
- Investigate LSTM Issues: Perform a thorough analysis of why the LSTM models struggled in comparison to the DNN. This will involve investigating hyperparameters, architecture, and input reshaping to identify potential areas of improvement.
- Use Transformers and Attention Mechanisms: Incorporate advanced architectures such as transformers and attention mechanisms to better capture complex temporal dependencies in the music data. This should improve the model's ability to differentiate between notes and enhance transcription accuracy.
While the accuracy scores may not seem ideal, with the best DNN model achieving an accuracy of 37.32%, it is important to note that the predicted MIDI note activations capture the overall structure and pattern of the actual MIDI note activations quite well. This can be observed from the visual similarities between the actual and predicted MIDI note activation plots. In the future, exploring more advanced architectures like attention-based models, transformers, etc. could potentially lead to further improvements in the model's performance. Despite the challenges, the progress made in this project demonstrates the potential for developing accurate automatic music transcription systems.