This project aims to develop a machine learning model capable of predicting the activations of MIDI notes from an audio file containing a musical piece.
├── data_analysis.ipynb
├── amt_baseline.ipynb # includes a baseline multiclass logistic regression model
├── amt_dnn.ipynb # includes deep neural network models
├── amt_lstm.ipynb # includes lstm (long short term memory) and transfer learning
├── audio_results/ # audio and midi note activations for a test and predicted audio
└── README.md
The dataset used is the OMAPS2 dataset, which consists of audio recordings from a piano in .wav format and corresponding manually annotated music transcription sheets in .txt format. The dataset is already split into train, validation, and test sets.
The audio data was converted into a constant-Q transform (CQT) representation, which decomposes the audio signal into frequency components over time. The CQT vectors were then time-aligned with the MIDI annotations, which were one-hot encoded per frame.
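A minimal sketch of this preprocessing step, assuming `librosa` for the CQT and a binary frame-level activation matrix as the target. The hop length, sampling rate, 88-key pitch range, and the `(onset, offset, pitch)` annotation format are illustrative assumptions, not details taken from the notebooks:

```python
import librosa
import numpy as np

def compute_cqt(wav_path, sr=22050, hop_length=512, n_bins=88):
    """Load audio and return a log-magnitude CQT of shape (n_frames, n_bins)."""
    y, sr = librosa.load(wav_path, sr=sr)
    cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=hop_length,
                             fmin=librosa.note_to_hz("A0"), n_bins=n_bins))
    return librosa.amplitude_to_db(cqt, ref=np.max).T

def notes_to_activations(notes, n_frames, frames_per_sec=22050 / 512,
                         min_pitch=21, n_pitches=88):
    """Convert (onset_sec, offset_sec, midi_pitch) annotations into a binary
    frame-level activation matrix aligned with the CQT frames."""
    roll = np.zeros((n_frames, n_pitches), dtype=np.float32)
    for onset, offset, pitch in notes:
        start = int(round(onset * frames_per_sec))
        end = min(int(round(offset * frames_per_sec)), n_frames - 1)
        roll[start:end + 1, pitch - min_pitch] = 1.0
    return roll
```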
Baseline Model: The One-vs-Rest Classifier with a Logistic Regression estimator was used as a baseline for comparison.
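A sketch of how such a baseline could be set up with scikit-learn; `X_train`, `Y_train`, and `X_test` are assumed to be the frame-level CQT features and binary activations produced by the preprocessing step above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# One binary logistic-regression classifier per MIDI pitch (multi-label setting).
# X_train: (n_frames, n_cqt_bins) features, Y_train: (n_frames, n_pitches) binary targets.
baseline = OneVsRestClassifier(LogisticRegression(max_iter=1000), n_jobs=-1)
baseline.fit(X_train, Y_train)
Y_pred_baseline = baseline.predict(X_test)
```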
Two main architectures were explored (a sketch of both follows the list):

- Deep Neural Network (DNN): The DNN takes CQT vectors as input and outputs the one-hot encoded MIDI activations. It consists of several hidden layers with ReLU activation functions and employs techniques such as dropout and early stopping to prevent overfitting.
- Long Short-Term Memory (LSTM): The LSTM architecture was implemented to capture short-term and long-term dependencies in the audio data. Transfer learning was also explored by initializing the LSTM weights with pre-trained weights from the DNN models.
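A hedged sketch of both architectures, assuming Keras; the layer sizes, sequence length, and the choice to transfer only the output-layer weights are illustrative, not the notebooks' actual configurations:

```python
from tensorflow import keras
from tensorflow.keras import layers

N_BINS, N_PITCHES = 88, 88  # illustrative feature and output dimensions

def build_dnn(dropout=0.3):
    """Frame-wise DNN: one CQT frame in, sigmoid multi-label pitch activations out."""
    return keras.Sequential([
        keras.Input(shape=(N_BINS,)),
        layers.Dense(512, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(256, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(N_PITCHES, activation="sigmoid"),
    ])

def build_lstm(seq_len=100, dropout=0.3):
    """Sequence model: a window of CQT frames in, per-frame activations out."""
    return keras.Sequential([
        keras.Input(shape=(seq_len, N_BINS)),
        layers.LSTM(256, return_sequences=True),
        layers.Dropout(dropout),
        layers.Dense(N_PITCHES, activation="sigmoid"),
    ])

dnn = build_dnn()
dnn.compile(optimizer="adam", loss="binary_crossentropy")
# dnn.fit(...)  # train the frame-wise model first

# Transfer learning: seed the LSTM's output layer with the trained DNN's output layer
# (both are Dense layers mapping 256 units to N_PITCHES, so the weight shapes match).
lstm = build_lstm()
lstm.layers[-1].set_weights(dnn.layers[-1].get_weights())
lstm.compile(optimizer="adam", loss="binary_crossentropy")
```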
Additional techniques explored (see the training sketch after this list):
- Hyperparameter tuning (grid search) over batch size and dropout rate
- Regularization: dropout and early stopping
- Optimization: cyclical learning rate and mini-batch training
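A sketch of how these techniques could be combined in one training loop, reusing the hypothetical `build_dnn` from the sketch above and assuming `X_train`/`Y_train`/`X_val`/`Y_val` exist. The triangular cyclical learning rate is applied per epoch here for simplicity (it is usually applied per batch), and all grid values are illustrative:

```python
import numpy as np
from tensorflow import keras

def triangular_clr(epoch, base_lr=1e-4, max_lr=1e-3, step_size=5):
    """Triangular cyclical learning rate, oscillating between base_lr and max_lr."""
    cycle = np.floor(1 + epoch / (2 * step_size))
    x = np.abs(epoch / step_size - 2 * cycle + 1)
    return float(base_lr + (max_lr - base_lr) * max(0.0, 1 - x))

callbacks = [
    keras.callbacks.LearningRateScheduler(lambda epoch, lr: triangular_clr(epoch)),
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                  restore_best_weights=True),
]

# Grid search over batch size and dropout rate, keeping the best validation loss.
best_cfg, best_val = None, np.inf
for batch_size in (64, 128, 256):
    for dropout in (0.2, 0.3, 0.5):
        model = build_dnn(dropout=dropout)
        model.compile(optimizer="adam", loss="binary_crossentropy")
        history = model.fit(X_train, Y_train,
                            validation_data=(X_val, Y_val),
                            batch_size=batch_size, epochs=50,
                            callbacks=callbacks, verbose=0)
        val_loss = min(history.history["val_loss"])
        if val_loss < best_val:
            best_cfg, best_val = (batch_size, dropout), val_loss
```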
The models' performance was evaluated using accuracy (a modified version that does not count true negatives (TN)), precision, recall, and F1-score. The predictions were compared against the time-aligned MIDI annotations.
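A sketch of these frame-level metrics, assuming the modified accuracy is TP / (TP + FP + FN), i.e. standard accuracy with the true negatives removed from the denominator; the 0.5 activation threshold is also an assumption:

```python
import numpy as np

def frame_metrics(y_true, y_prob, threshold=0.5):
    """Frame-level metrics on binary piano rolls; the accuracy ignores true negatives."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = tp / (tp + fp + fn) if tp + fp + fn else 0.0  # true negatives excluded
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```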
The best DNN model achieved an accuracy of 37.32%, outperforming the baseline logistic regression model, which achieved an accuracy of 10.75%. However, the LSTM models struggled to achieve satisfactory accuracy despite various optimization strategies.
Below, we can see the similarities between the actual MIDI note activation and the predicted MIDI note activation for a particular audio.
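Comparison plots of this kind can be reproduced with a simple matplotlib sketch along these lines; `Y_test` and `Y_pred_binary` are assumed to be the binary piano rolls for one test audio:

```python
import matplotlib.pyplot as plt

# Y_test and Y_pred_binary: (n_frames, n_pitches) binary piano rolls for one test audio.
fig, axes = plt.subplots(2, 1, figsize=(12, 6), sharex=True)
axes[0].imshow(Y_test.T, aspect="auto", origin="lower", cmap="gray_r")
axes[0].set_title("Actual MIDI Note Activation")
axes[1].imshow(Y_pred_binary.T, aspect="auto", origin="lower", cmap="gray_r")
axes[1].set_title("Predicted MIDI Note Activation")
axes[1].set_xlabel("Frame")
for ax in axes:
    ax.set_ylabel("Pitch index")
plt.tight_layout()
plt.show()
```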
Furthermore, we can listen to and compare the actual audio (`y_test_output60000.mid`) mentioned above and the audio generated from the model predictions (`predictions_output60000.mid`) in the `audio_results/` folder.
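Rendering the predicted activations to a playable `.mid` file could be done roughly as follows, assuming `pretty_midi`; the frame rate, pitch offset, and velocity are illustrative assumptions, not the notebooks' actual settings:

```python
import numpy as np
import pretty_midi

def activations_to_midi(roll, out_path, frames_per_sec=22050 / 512,
                        min_pitch=21, velocity=80):
    """Write a binary (n_frames, n_pitches) activation matrix to a MIDI file."""
    pm = pretty_midi.PrettyMIDI()
    piano = pretty_midi.Instrument(program=0)  # acoustic grand piano
    for pitch_idx in range(roll.shape[1]):
        active = np.flatnonzero(roll[:, pitch_idx])
        if active.size == 0:
            continue
        # Group consecutive active frames into single sustained notes.
        for seg in np.split(active, np.where(np.diff(active) > 1)[0] + 1):
            piano.notes.append(pretty_midi.Note(
                velocity=velocity,
                pitch=int(pitch_idx + min_pitch),
                start=seg[0] / frames_per_sec,
                end=(seg[-1] + 1) / frames_per_sec))
    pm.instruments.append(piano)
    pm.write(out_path)

# activations_to_midi(Y_pred_binary, "predictions_output60000.mid")
```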
Possible directions for future work include:

- Online Platform for Testing: Create a web-based application where users can upload their `.midi` or `.mid` files and compare the generated MIDI note activations against the original files. This will provide a convenient way to test and evaluate the model's predictions.
- Investigate LSTM Issues: Perform a thorough analysis of why the LSTM models struggled in comparison to the DNN. This will involve investigating hyperparameters, architecture, and input reshaping to identify potential areas of improvement.
- Use Transformers and Attention Mechanisms: Incorporate advanced architectures such as transformers and attention mechanisms to better capture complex temporal dependencies in the music data. This should improve the model's ability to differentiate between notes and enhance transcription accuracy.
While the accuracy scores may not seem ideal, with the best DNN model achieving an accuracy of 37.32%, it is important to note that the predicted MIDI note activations capture the overall structure and pattern of the actual MIDI note activations quite well. This can be observed from the visual similarities between the actual and predicted MIDI note activation plots. In the future, exploring more advanced architectures like attention-based models, transformers, etc. could potentially lead to further improvements in the model's performance. Despite the challenges, the progress made in this project demonstrates the potential for developing accurate automatic music transcription systems.