| Presentation | Technical Documentation | Project Documentation | Experiments |
|---|---|---|---|

Team members:
- Vinko Dragušica
- Filip Mirković
- Ivan Rep
- Matej Ciglenečki
Create and populate the virtual environment. Simply put, the virtual environment allows you to install Python packages for this project only (which you can easily delete later). This way, we won't clutter your global Python packages.
Step 1: Execute the following command:
python3 -m venv venv
source venv/bin/activate
sleep 1
pip install -r requirements.txt
pip install -r requirements-dev.txt
Step 2: Install the current directory as an editable Python module:
pip install -e .
(optional) Step 3: Activate pre-commit hook
pre-commit install
Pre-commit, defined in .pre-commit-config.yaml,
will fix your imports and make sure the code follows Python standards.
To remove pre-commit run: rm -rf .git/hooks
| Directory | Description |
|---|---|
| data | datasets |
| docs | documentation |
| figures | figures |
| models | model checkpoints, model metadata, training reports |
| references | research papers and competition guidelines |
| src | python source code |
- create eval script which will calculate ALL metrics for the whole dataset (see the metrics sketch after this TODO list)
  - y_true, y_pred
  - confusion matrix
  - distribution of prediction metrics (hamming score, F1, accuracy)
  - plot per instrument for each metric
  - ROC curve
- multiple dataset distribution plotting
- instrument count histogram plot
- create backend API/inference
  - load model in inference, calculate metrics for the whole IRMAS test dataset (analytics)
    - should reuse the train.py script, just use different modes?
  - HTTP server with some loaded model which returns responses
- technical documentation
- visualize embedded features: for each model with tensorboard embedder https://projector.tensorflow.org/
- add a feature that uses different features per channel - convolutional models expect a 3-channel tensor, so let's make full use of those 3 channels
- add fluffy support for all models
- try out focal loss and label smoothing: https://pytorch.org/vision/main/_modules/torchvision/ops/focal_loss.html
- convert all augmentations so they happen on the GPU
  - remove audio transform from the dataset class
  - implement on_after_batch_transfer
  - put the whole audio transform (along with augmentations) in the model itself
  - the model then calls on_after_batch_transfer automatically and performs the augmentations
  - run experiments in both cases
  - make sure augmentations happen in batch
- Add ArcFace module in codebase
- Rep vs IRMAS: perform validation on Rep's corrected dataset to check how many labels are correctly marked in the original dataset
  - check if all instruments are correct
  - check if at least one instrument is correct
  - hamming distance between Rep's and the original labels
  - how dirty is the training set in terms of including non-predominant instruments
- train with relabeled data (cleanlab): include the train override csv (@matej has to provide the csv). No augmentations. Compare both models' metrics.
- Inference analysis: run inference on single audio with multiple different durations (run on 10, 20, ..., 590, 600 seconds)
- Train Wav2Vec2 CNN: IRMAS only no aug
- Fluffy: Directly compare Fluffy Deep head CNN to standard Deep head CNN
- Add Focal Loss and InstrumentFamilyLoss to src/model/loss_functions.py and add them to SupportedLosses
- check what's up with pretrained weights (crop and resize) -> everything is fine
  - turns out that the models use average pooling over the height and width, which means that the final representation only has dimension (B, C)
  - the model silently fails instead of breaking, so keep an eye out in case something doesn't work
- train with relabeled data (rep): include Ivan's relabeled data and retrain some model to check the performance boost (make sure to pick a model which already works)
- Train EfficientNet on IRMAS only, no augmentations, with small batch size=4
- train ResNeXt 50_32x4d on MelSpectrogram
  - compare how augmentations affect the final metrics:
    - with no augmentations
    - with augmentations
- train ResNeXt 50_32x4d on MFCC
  - compare how augmentations affect the final metrics:
    - with no augmentations
    - with augmentations
- OpenMIC guitars: use cleanlab and k-means to find guitars. OpenMIC has 1 guitar label. Take a pretrained AST and do feature extraction on IRMAS train, only on electric and acoustic guitar examples. Create a script which takes the AST features and fits k-means between the two classes. Cluster OpenMIC guitars, take the most confident examples and save the examples (and new labels).
- ⚠️ create a CSV which splits IRMAS validation into train and validation. First, group the .wav examples by the same song and find the union of labels. Apply multi-label data stratification (http://scikit.ml/stratification.html) to split the data.
  - use validation examples in train (without data leakage), check what's the total time of audio in train and val
- augmentations: time shift, pitch shift, sox
- add normalization after augmentations
- add gradient/activation visualization for a predicted image
- write summary of Wavelet transform and how it affects the results
- Wav2Vec results, and train
- write summary of LSTM results
- implement an argument which accepts a list of numbers [1000, 500, 4] and will create the appropriate deep CNN
  - use a module called deep head and pass it as an argument
- finish experiments and interpretation of the wavelet transformation
- implement spectrogram cropping and zero padding instead of resizing
- implement an SVM model which uses classical audio features for multi-label classification
  - research if an SVM can perform multi-label classification or use 11 SVMs
- add more augmentations
- check if wavelet works
- implement chunking of the audio in inference and perform multiple forward passes
- implement saving the embeddings of each model for visualizations using dimensionality reduction
- think about and research what happens with a variable sampling rate and how we can avoid issues with time-length change; solution: chunking
- add explained variance percentage in PCA
- Create a script/notebook for plotting SVM results. There should be a total of 22 plots. You can reduce dimensionality with t-SNE and PCA from sklearn. Save the plots to .png so we can easily include them in the documentation.
  - find features which show the highest amount of variance!
  - iterate through the whole dataset, calculate the features and save them. Then calculate the variance of each feature over the whole dataset.
- clean up the audio transform for spectrograms (remove repeat)
  - you still need to resize because the height isn't 224 (it's 128), but make sure the width is the same as the pretrained model's image width
  - use caculate_spectrogram_duration_in_seconds to dynamically determine the audio length
- implement spectrogram normalization (mean, std) and use those parameters to preprocess the image before training
- implement Fluffy nn.Module
- use Fluffy on Torch CNN, multi-head
- train some model Fluffy
- Wav2Vec2 feature extractor only
- move spectrogram chunking to collate
- prototype pretraining phase:
  - shuffle parts of the spectrogram in the following way (16x16 grid):
    - shuffle 15% of the patches
    - ELECTRA-style objective: is the patch shuffled?
- ESC50: download non-instrument audio files and write a data loader for examples which are NOT instruments (@matej). This might not be important since the model usually gives [0,0,0,0,0] anyway.
- any dataset/csv loader
- ⚠️ download the whole IRMAS dataset
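A minimal sketch of the multi-label metrics listed in the eval-script item above (not the project's actual eval script), assuming y_true and y_pred are multi-hot arrays of shape (num_examples, num_instruments):

```python
# Illustrative only: random data stands in for real predictions.
import numpy as np
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    hamming_loss,
    multilabel_confusion_matrix,
    roc_curve,
)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 11))   # ground-truth instrument labels
y_score = rng.random(size=(100, 11))          # model probabilities
y_pred = (y_score > 0.5).astype(int)          # thresholded predictions

metrics = {
    "hamming_score": 1.0 - hamming_loss(y_true, y_pred),
    "f1_macro": f1_score(y_true, y_pred, average="macro", zero_division=0),
    "subset_accuracy": accuracy_score(y_true, y_pred),
}
per_instrument_f1 = f1_score(y_true, y_pred, average=None, zero_division=0)
confusion = multilabel_confusion_matrix(y_true, y_pred)   # one 2x2 matrix per instrument
fpr, tpr, _ = roc_curve(y_true[:, 0], y_score[:, 0])       # ROC curve for one instrument
print(metrics, per_instrument_f1.shape, confusion.shape, fpr.shape)
```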
General links:
- Audio Deep Learning Made Simple State-of-the-Art Techniques:
- https://towardsdatascience.com/audio-deep-learning-made-simple-part-1-state-of-the-art-techniques-da1d3dff2504
- https://towardsdatascience.com/audio-deep-learning-made-simple-part-2-why-mel-spectrograms-perform-better-aad889a93505
- https://towardsdatascience.com/audio-deep-learning-made-simple-part-3-data-preparation-and-augmentation-24c6e1f6b52
- paperswithcode Audio Classification: https://paperswithcode.com/task/audio-classification
- Music and Instrument Classification using Deep Learning Technics: https://cs230.stanford.edu/projects_fall_2019/reports/26225883.pdf
- AUDIO MANIPULATION WITH TORCHAUDIO: https://pytorch.org/tutorials/beginner/audio_preprocessing_tutorial.html
Use cleanlab to find bad labels: https://docs.cleanlab.ai/stable/tutorials/audio.html?highlight=encoderclassifier
Do this without introducing data leakage, but make sure that we still have enough validation data.
Chunking should happen only in inference, in the following way:
- preprocess 20 sec of audio normally, send the spectrogram to the model and chunk the spectrogram inside of the predict_step.
We don't do chunking in the train step because we can't chunk the label.
The time window of the spectrogram is defined by the maximum audio length of some train sample. If we chunk that sample, we don't know if the label will appear in every chunk.
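A minimal sketch of this inference-time chunking, assuming a Lightning-style module whose backbone expects fixed-width spectrograms; the names (self.backbone, chunk_width, the max aggregation) are illustrative, not the project's actual API:

```python
import torch
import pytorch_lightning as pl


class ChunkedInferenceModel(pl.LightningModule):
    def __init__(self, backbone: torch.nn.Module, chunk_width: int = 224):
        super().__init__()
        self.backbone = backbone
        self.chunk_width = chunk_width

    def predict_step(self, batch, batch_idx):
        spectrogram, _ = batch                      # (B, C, H, W); W grows with audio length
        # Split the time axis into fixed-width chunks; pad the last one if needed.
        chunks = spectrogram.split(self.chunk_width, dim=-1)
        logits = []
        for chunk in chunks:
            if chunk.shape[-1] < self.chunk_width:
                pad = self.chunk_width - chunk.shape[-1]
                chunk = torch.nn.functional.pad(chunk, (0, pad))
            logits.append(self.backbone(chunk))
        # Aggregate chunk predictions, e.g. max probability per instrument.
        return torch.stack(logits).sigmoid().max(dim=0).values
```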
Add a low-dimensional (t-SNE) plot of features to check clusters. How to do that (see the sketch below):
- forward pass every example
- now you have an embedding for each example
- run t-SNE on the embeddings
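A minimal sketch of that plot; here random data stands in for the embeddings that would come from the backbone's forward passes:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 768))   # one embedding per example
labels = rng.integers(0, 11, size=500)     # dominant instrument per example

points = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab20", s=8)
plt.title("t-SNE of backbone embeddings")
plt.savefig("tsne_embeddings.png", dpi=150)
```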
Masked Autoencoders (MAE)
https://huggingface.co/docs/transformers/model_doc/vit_mae#transformers.ViTMAEForPreTraining
It has a script for pretraining, but does it work? Written as an nn.Module.
Pretraining on CNNs:
- SparK: https://github.com/keyu-tian/SparK
- timm: https://github.com/huggingface/pytorch-image-models/tree/main/
Instead of training the transformer backbone, add layers in between the backbone's blocks and train only those layers. Those layers are called adapters.
https://docs.adapterhub.ml/ https://docs.adapterhub.ml/adapter_composition.html
- parameter-efficient tuning (possible adapters): https://github.com/huggingface/peft
Normalization of the audio in the time domain (amplitude). Librosa already does this?
Spectrogram normalization is the same as normalization in any image problem: pre-calculate the mean and std and use them in the preprocessing step.
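A small sketch of pre-calculating that mean/std over the training set, assuming a hypothetical train_loader that yields (spectrogram, label) batches:

```python
import torch

def spectrogram_mean_std(train_loader):
    total, total_sq, count = 0.0, 0.0, 0
    for spectrograms, _ in train_loader:
        total += spectrograms.sum().item()
        total_sq += (spectrograms ** 2).sum().item()
        count += spectrograms.numel()
    mean = total / count
    std = (total_sq / count - mean ** 2) ** 0.5
    return mean, std  # plug into e.g. torchvision.transforms.Normalize([mean], [std])
```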
IRMAS dataset https://www.upf.edu/web/mtg/irmas:
- IRMAS Test dataset only contains the information about presence of the instruments. Drums and music genre information is not present.
- examples: 6705
- instruments: 11
- duration: 3sec
NSynth: Neural Audio Synthesis https://magenta.tensorflow.org/datasets/nsynth
- examples: 305 979
- instruments: 1006
- A novel WaveNet-style autoencoder model that learns codes that meaningfully represent the space of instrument sounds.
MusicNet:
- examples: 330
- instruments: 11
- duration: song
MedleyDB:
- examples: 122
- instruments: 80
OpenMIC-2018 https://zenodo.org/record/1432913#.W6dPeJNKjOR
- paper: http://ismir2018.ircam.fr/doc/pdfs/248_Paper.pdf
- num examples: 20 000
- instruments: 20
- duration: 10sec
- metric learning losses: https://kevinmusgrave.github.io/pytorch-metric-learning/losses/
- how to construct triplets: https://omoindrot.github.io/triplet-loss
- softmax loss and center loss: https://hav4ik.github.io/articles/deep-metric-learning-survey
Some instruments are similar and their class should be (somehow) close together.
Standard classification loss + (alpha * distance between two classes)
- the distance is probably computed between embeddings from some pretrained audio model (audio transformer)
Triplet loss: how do we form triplets? (see the sketch below)
- anchor (real): guitar
- positive: guitar
- negative: not guitar?
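A sketch of the combined objective described above: multi-label BCE plus an alpha-weighted triplet term. The triplet part uses torch.nn.TripletMarginLoss; how anchors, positives and negatives are mined is still the open question in these notes, and alpha is an illustrative value:

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
triplet = nn.TripletMarginLoss(margin=1.0)
alpha = 0.1  # illustrative weight

def combined_loss(logits, targets, anchor_emb, positive_emb, negative_emb):
    # logits/targets: (B, num_instruments); *_emb: (B, embedding_dim)
    return bce(logits, targets) + alpha * triplet(anchor_emb, positive_emb, negative_emb)
```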
Research audio files which are NOT instruments. Both background noises and sounds SIMILAR to instruments! Download the datasets and write dataset loaders for them (@matej). Label everything [0, ..., 0].
Problem: how to encode additional features (drums/no drums, music genre)? We can't create a spectrogram out of those values. Maybe simply append the one-hot encoded values after the spectrogram is flattened to a 1D vector?
Current state-of-the-art model for audio classification on multiple datasets and multiple metrics.
paper: https://arxiv.org/pdf/2212.09058.pdf github: https://github.com/microsoft/unilm/tree/master/beats https://paperswithcode.com/sota/audio-classification-on-audioset
- github: https://github.com/YuanGongND/ast
- paper: https://arxiv.org/pdf/2104.01778.pdf
- pretrained model: https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593
- hugging face: https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer
AST max duration is 10.23 sec for 16 kHz audio.
Notes:
- They used 16kHz audio for the pretrained model, so if you want to use the pretrained model, please prepare your data in 16kHz
Idea: introduce multiple MLP (fully connected layer) heads. Each head will detect a single instrument instead of trying to detect all instruments at once.
- explore how to implement this in PyTorch efficiently (see the sketch below):
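A minimal sketch of the multi-head idea: a shared backbone followed by one small MLP per instrument, each emitting a single binary logit. The class name and layer sizes are illustrative, not the project's Fluffy implementation:

```python
import torch
import torch.nn as nn

class MultiHeadClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, feature_dim: int, num_instruments: int):
        super().__init__()
        self.backbone = backbone
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(), nn.Linear(128, 1))
            for _ in range(num_instruments)
        )

    def forward(self, x):
        features = self.backbone(x)                       # (B, feature_dim)
        logits = [head(features) for head in self.heads]  # num_instruments x (B, 1)
        return torch.cat(logits, dim=1)                   # (B, num_instruments)
```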
Idea: train on single-instrument wavs, then later introduce the irmas_combinatorics dataset which contains multiple wavs.
Trained LSTM (with and without Bahdanau attention) on mel spectrogram and MFCC features, for single and multiple instrument classification. Adding instruments according to genre and randomly was also explored. This approach retains high accuracy due to the class imbalance of the train and validation set; however, the F1 metric, with macro averaging in the multi-instrument case, remains low, in the 0.26-0.35 interval. All instruments with higher F1 metrics use Bahdanau attention.
Aside from sliding wavelet filters, the output of the wavelet transform needs to be log-scaled or, preferably, transformed with amplitude_to_db. This does not seem to improve or degrade the performance of the LSTM model with attention, and the F1 score remains in similar margins.
Still doing some research on wavelets (April 3rd)...
Adding instrument waveforms to imitate the examples with multiple instruments needs to be handled with greater care, otherwise it only improves the F1 metric slightly (LSTM) or even lowers it (Wav2Vec2 backbone). A bug was present that I did not catch before. I'm redoing the experiments.
The idea was to implement a pretrained feature extractor with multiple FCNN (but not necessarily FCNN) heads that serve as disconnected binary instrument classifiers. E.g. if we want to classify 5 instruments, we use a backbone with 5 FCNNs, and each FCNN searches for its "own" instrument among the 5.
As was already mentioned, we used only the feature extractor of the pretrained Wav2Vec2 model and completely disposed of the transformer component for efficiency. Up until this point, the training was performed for ~35 epochs, and while the average validation F1 metric remains in the 0.5-0.6 region, it varies significantly across instruments. For most instruments the F1 score remains in the 0.6-0.7 range with numerous outliers; on the high end we have the acoustic guitar and the human voice with F1 above 0.8. This is to be expected, considering the backbone was trained on many instances of human voices. On the low end we have the organ with an F1 of ~0.2 and, most likely due to bugs in the code, the electric guitar with an F1 of 0. This could also be attributed to its similarity with other instruments such as the violin or acoustic guitar. This leaves us with a "death rattle" of sorts for this whole "let's use only IRMAS" idea. The idea is to pretrain a feature extractor based on contrastive loss; margins within genres and instrument families should also be applied. If this doesn't produce better results, the only solution I propose is getting more data, e.g. OpenMIC.
This model has been trained for far fewer epochs (~7), and so far it exhibits the same issues as Fluffy with just the feature extractor. Perhaps more training would be needed; however, using such large models requires considerable memory, and their use during inference might be limited.
Option A:
- create 4 MobileNets which cover 11 instruments
- forward pass to get features
- create 4 FC heads (each FC covers 3 instruments)
- concat predictions
Option B:
- create 4 MobileNets which cover 11 instruments
- forward pass to get features
- concat all features
Introduce an SVM and train it additionally on high-level features of the spectrogram (MFCC). For example, one can calculate the entropy of an audio/spectrogram for a given timeframe (@vinko).
If you have 3 sec of audio, calculate ~30 entropies (one every 0.1 sec) and use those entropies as SVM features. Also try using a lot more librosa features (see the sketch below).
The ensemble should combine features of some backbone and Vinko's SVM.
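A sketch of this SVM branch: frame-level librosa features (MFCC statistics plus a simple spectral-entropy proxy) fed into one-vs-rest SVMs for multi-label output. The feature choice and random training data are illustrative, not Vinko's actual feature set:

```python
import numpy as np
import librosa
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def extract_features(audio: np.ndarray, sr: int) -> np.ndarray:
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)        # (20, frames)
    spectrogram = np.abs(librosa.stft(audio)) ** 2
    prob = spectrogram / (spectrogram.sum(axis=0, keepdims=True) + 1e-10)
    entropy = -(prob * np.log2(prob + 1e-10)).sum(axis=0)         # per-frame spectral entropy
    return np.concatenate(
        [mfcc.mean(axis=1), mfcc.std(axis=1), [entropy.mean(), entropy.std()]]
    )

# Illustrative training data: 50 examples x 42 features, 11 multi-hot labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 42))
y = rng.integers(0, 2, size=(50, 11))
clf = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)  # one binary SVM per instrument
```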
https://www.audiolabs-erlangen.de/resources/MIR/FMP/C8/C8S1_HPS.html
Loosely speaking, a harmonic sound is what we perceive as pitched sound, what makes us hear melodies and chords. The prototype of a harmonic sound is the acoustic realization of a sinusoid, which corresponds to a horizontal line in a spectrogram representation. The sound of a violin is another typical example of what we consider a harmonic sound. Again, most of the observed structures in the spectrogram are of horizontal nature (even though they are intermingled with noise-like components). On the other hand, a percussive sound is what we perceive as a clash, a knock, a clap, or a click. The sound of a drum stroke or a transient that occurs in the attack phase of a musical tone are further typical examples. The prototype of a percussive sound is the acoustic realization of an impulse, which corresponds to a vertical line in a spectrogram representation.
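Librosa ships an implementation of exactly this harmonic/percussive separation; a small sketch (the bundled "trumpet" example clip is just a stand-in for our audio):

```python
import librosa

y, sr = librosa.load(librosa.example("trumpet"))    # any mono audio works here
harmonic, percussive = librosa.effects.hpss(y)      # time-domain components
D_h, D_p = librosa.decompose.hpss(librosa.stft(y))  # spectrogram-domain components
```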
https://pytorch.org/audio/stable/transforms.html https://pytorch.org/audio/stable/functional.html#feature-extractions
Note: in practice, Mel spectrograms are used instead of the classical spectrogram. We have to normalize spectrogram images just like any other image dataset (mean/std).
Take an audio sequence and perform the STFT (Short-time Fourier transform) to get spectra in multiple time intervals. The result is a 2D time-frequency representation (one spectrum per time frame). The STFT has a time window size, which is defined by the sampling frequency, and is also defined by a window type.
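A small torchaudio sketch of this pipeline (STFT window defined by n_fft/hop_length, Hann window type, Mel scaling, dB conversion); the parameter values and the random waveform are illustrative, not the project's config:

```python
import torch
import torchaudio

waveform = torch.randn(1, 3 * 44100)  # stand-in for 3 s of 44.1 kHz mono audio
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=44100,
    n_fft=1024,
    hop_length=512,
    n_mels=128,
    window_fn=torch.hann_window,
)
to_db = torchaudio.transforms.AmplitudeToDB()
spectrogram = to_db(mel(waveform))  # shape: (1, 128, time_frames)
```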
Spectrogram vs. Mel spectrogram (figure omitted).
- https://github.com/asteroid-team/torch-audiomentations
- https://github.com/Spijkervet/torchaudio-augmentations
- https://pytorch.org/audio/main/tutorials/audio_feature_augmentation_tutorial.html#specaugment
- https://pytorch.org/audio/stable/transforms.html#augmentations
- white noise
- time shift
- amplitude change / normalization
PyTorch Sox effects
allpass, band, bandpass, bandreject, bass, bend, biquad, chorus, channels, compand, contrast, dcshift, deemph, delay, dither, divide, downsample, earwax, echo, echos, equalizer, fade, fir, firfit, flanger, gain, highpass, hilbert, loudness, lowpass, mcompand, norm, oops, overdrive, pad, phaser, pitch, rate, remix, repeat, reverb, reverse, riaa, silence, sinc, speed, stat, stats, stretch, swap, synth, tempo, treble, tremolo, trim, upsample, vad, vol
SpecAugment: https://ai.googleblog.com/2019/04/specaugment-new-data-augmentation.html SpecAugment PyTorch: https://github.com/zcaceres/spec_augment SpecAugment torchaudio: https://pytorch.org/audio/main/tutorials/audio_feature_augmentation_tutorial.html#specaugment
Naive: concat multiple audio sequences into one and merge their labels. Introduce some overlapping, but not too much!
Use the same genre for data generation: combine sounds which come from the same genre instead of different genres
How to sample?
- sample [3, 5] audio files but don't use more than 4 instruments
- sample different starting positions at which each audio will start playing
  - START-----x---x----------x--------x----------END
- cut off the audio sequence at max length? (see the sketch below)
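A minimal sketch of this naive combination: sum a few waveforms at random offsets and take the union of their labels. The offset and scaling strategy is deliberately simple and illustrative:

```python
import numpy as np

def mix_examples(waveforms, labels, max_len):
    """waveforms: list of 1-D float arrays, labels: list of multi-hot arrays."""
    mix = np.zeros(max_len, dtype=np.float32)
    for w in waveforms:
        start = np.random.randint(0, max(1, max_len - len(w)))
        end = min(max_len, start + len(w))
        mix[start:end] += w[: end - start]
    mix /= max(1, len(waveforms))                     # crude amplitude normalization
    merged_label = np.clip(np.sum(labels, axis=0), 0, 1)  # union of labels
    return mix, merged_label                          # cut off at max length
```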
Librosa and torchaudio give the same array (sound) if both are normalized and converted to mono.
Librosa gives the same array if you load it with sr=None and resample afterwards, compared to resampling on load.
For best results with AST feature extraction, use torchaudio.load with resampling.
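A sketch of that equivalence check, assuming some illustrative file "some_file.wav": load with torchaudio, convert to mono and resample to 16 kHz, then compare against librosa loaded with sr=None and resampled afterwards:

```python
import numpy as np
import librosa
import torchaudio
import torchaudio.functional as F

waveform, sr = torchaudio.load("some_file.wav")             # (channels, samples)
waveform = waveform.mean(dim=0)                             # convert to mono
waveform = F.resample(waveform, orig_freq=sr, new_freq=16_000)

y, _ = librosa.load("some_file.wav", sr=None, mono=True)    # keep native sample rate
y = librosa.resample(y, orig_sr=sr, target_sr=16_000)

n = min(len(y), waveform.shape[-1])
print(np.allclose(waveform.numpy()[:n], y[:n], atol=1e-4))  # compare the two arrays
```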
Kaldi:
- window_shift = int(sample_frequency * frame_shift * 0.001)
- window_size = int(sample_frequency * frame_length * 0.001)
Librosa hop_length (ms to samples at 44 100 Hz): 44100 * 20 / 1000 = 882 for a 20 ms hop.
With a 25 ms Hamming window every 10 ms (hop):
- n_fft = 44100 * 25 / 1000 ≈ 1102
- hop_length = 44100 * 10 / 1000 = 441
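The same arithmetic in code, converting frame length/shift in milliseconds to the sample counts librosa expects (values match the 44.1 kHz example above):

```python
sample_rate = 44_100
frame_length_ms, frame_shift_ms = 25, 10

win_length = int(sample_rate * frame_length_ms / 1000)  # 1102
hop_length = int(sample_rate * frame_shift_ms / 1000)   # 441
print(win_length, hop_length)
```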