# Spectrogram-VisionTransformer

A project that applies Vision Transformer (ViT) architectures to spectrograms for audio and speech tasks (e.g. classification, recognition). It transforms audio signals into spectrogram "images" and uses transformer models to learn from them.
## Table of Contents

- Introduction
- Features
- Installation
- Usage
- Examples
- Model / Architecture
- Experiments & Results
- Dataset
- Training
- Evaluation
- Contributing
- License
- References
## Introduction

Deep learning for audio commonly applies convolutional networks to spectrogram inputs. This project instead explores Vision Transformers (ViT) on spectrograms, using self-attention to capture global patterns and long-range dependencies in time-frequency space.

The key idea (a minimal sketch of the first step follows this list):
- Convert a raw audio waveform → spectrogram (e.g. Mel spectrogram)
- Treat the spectrogram as a 2D image
- Feed patches / tokens into a Transformer / ViT model
- Train / fine-tune for audio tasks (classification, detection, etc.)
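For concreteness, here is a minimal sketch of the waveform → spectrogram step, assuming `librosa` is used (the file name and parameters are illustrative, not this project's defaults):

```python
# Minimal sketch: waveform -> log-Mel spectrogram, assuming librosa.
# Any STFT/Mel library would work; parameters here are examples only.
import librosa
import numpy as np

waveform, sr = librosa.load("sample.wav", sr=16000)   # load and resample
mel = librosa.feature.melspectrogram(
    y=waveform, sr=sr, n_fft=1024, hop_length=256, n_mels=128
)
log_mel = librosa.power_to_db(mel, ref=np.max)        # shape: (n_mels, frames)
print(log_mel.shape)  # a 2D "image" ready to be split into patches
```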
## Features

- Pipeline for audio → spectrogram → transformer input
- Preprocessing scripts (e.g. STFT, Mel, normalization; see the sketch after this list)
- Model architectures based on ViT / Transformer
- Training and evaluation scripts
- Support for GPU / multi-GPU
- Logging, checkpoints, and result saving
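As an example of the preprocessing step mentioned above, a per-example normalization helper might look like this (a sketch; the function name is hypothetical, not part of this repo):

```python
# Hypothetical preprocessing helper: standardize a (freq, time) log-Mel
# spectrogram to zero mean and unit variance before patch embedding.
import numpy as np

def normalize_spectrogram(log_mel: np.ndarray) -> np.ndarray:
    return (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)
```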
## Installation

```bash
# Clone this repository
git clone https://github.com/Sree14hari/Spectrogram-VisionTransformer.git
cd Spectrogram-VisionTransformer

# (Optional) create a virtual environment
python -m venv venv
source venv/bin/activate   # or venv\Scripts\activate on Windows

# Install dependencies
pip install -r requirements.txt
```

You'll need packages such as `torch`, `numpy`, and `librosa` (or a similar library for computing spectrograms).
## Usage

- Place your audio files / dataset in a folder
- (Optional) add a config or script to convert them into spectrograms

Train a model:

```bash
python train.py --config configs/your_config.yaml
```

Evaluate a checkpoint:

```bash
python evaluate.py --checkpoint path/to/model.pth --data your_test_data
```

You could include a sample script `run_demo.py` that loads a sample audio file, converts it to a spectrogram, runs it through the model, and prints the results.
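A hedged sketch of what such a demo script could look like (the checkpoint format and loading call are assumptions, not this repo's actual API):

```python
# run_demo.py -- illustrative only; assumes the checkpoint was saved as a
# full torch model object, which may differ from this repo's format.
import librosa
import numpy as np
import torch

waveform, sr = librosa.load("sample.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=128)
x = torch.from_numpy(librosa.power_to_db(mel, ref=np.max)).float()
x = x.unsqueeze(0).unsqueeze(0)          # (batch=1, channels=1, freq, time)

model = torch.load("path/to/model.pth", map_location="cpu")  # assumed format
model.eval()
with torch.no_grad():
    probs = torch.softmax(model(x), dim=-1)
print("predicted class:", probs.argmax(dim=-1).item())
```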
## Model / Architecture

Describe the architecture you used:
- Patch size (e.g. 16×16)
- Number of transformer layers / heads / embedding dimension
- Any modifications made for spectrogram data (positional embeddings, time-frequency embeddings, masking, etc.)
- Loss functions, regularization, etc. (a patch-embedding sketch follows this list)
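To make the patch/token step concrete, here is an illustrative ViT-style patch embedding for a one-channel spectrogram (example values, not this repo's actual module):

```python
# Illustrative ViT-style patch embedding for a 1-channel spectrogram.
# Patch size and embedding dimension are example values.
import torch
import torch.nn as nn

class SpectrogramPatchEmbed(nn.Module):
    def __init__(self, patch_size: int = 16, embed_dim: int = 768):
        super().__init__()
        # A strided conv cuts the (freq, time) plane into non-overlapping
        # patches and projects each patch to the embedding dimension.
        self.proj = nn.Conv2d(1, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                     # (batch, embed_dim, f', t')
        return x.flatten(2).transpose(1, 2)  # (batch, num_patches, embed_dim)

tokens = SpectrogramPatchEmbed()(torch.randn(2, 1, 128, 256))
print(tokens.shape)  # torch.Size([2, 128, 768]) -> (128/16) * (256/16) patches
```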
If you adopted or adapted ideas from other works (e.g. ASiT: Audio Spectrogram vIsion Transformer on arXiv, or similar works), mention them here.
## Experiments & Results

Provide a table of your experiments, for example:
| Task / Dataset | Model Variant | Accuracy / Metric | Notes |
|---|---|---|---|
| Speech Commands | ViT-base | 95.6% | baseline |
| Your dataset | Your model | XX.X% | your experiment |
Include charts, loss curves, confusion matrices, etc.
## Dataset

Describe the dataset(s) you used:
- Name (e.g. Speech Commands, ESC-50, custom)
- Number of classes
- Preprocessing (sampling rate, window size, overlap, normalization)
- Train / val / test splits
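For illustration, a hypothetical PyTorch `Dataset` over a class-per-folder layout (the layout, names, and parameters are assumptions, not this repo's loader):

```python
# Hypothetical dataset: one subfolder of .wav files per class.
import os
import librosa
import numpy as np
import torch
from torch.utils.data import Dataset

class SpectrogramDataset(Dataset):
    def __init__(self, root: str, classes: list[str], sr: int = 16000, n_mels: int = 128):
        # Index (file path, integer label) pairs up front.
        self.items = [(os.path.join(root, c, f), i)
                      for i, c in enumerate(classes)
                      for f in os.listdir(os.path.join(root, c))]
        self.sr, self.n_mels = sr, n_mels

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, idx):
        path, label = self.items[idx]
        y, _ = librosa.load(path, sr=self.sr)
        mel = librosa.feature.melspectrogram(y=y, sr=self.sr, n_mels=self.n_mels)
        log_mel = librosa.power_to_db(mel, ref=np.max)
        # Return a (1, freq, time) tensor so the model sees a 1-channel image.
        return torch.from_numpy(log_mel).float().unsqueeze(0), label
```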
## Training

Detail your training settings, e.g.:
- Learning rate, scheduler
- Batch size
- Number of epochs
- Optimizer (Adam, SGD, etc.)
- Data augmentation (if any)
- Hardware setup
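A hedged sketch of such a setup in PyTorch (optimizer, scheduler, and all values are placeholders, not this project's defaults):

```python
# Example training configuration: AdamW + cosine schedule + cross-entropy
# with label smoothing. The nn.Linear stands in for the ViT classifier.
import torch
import torch.nn as nn

model = nn.Linear(768, 10)  # placeholder for the actual ViT model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

for epoch in range(100):
    # In practice, iterate over the spectrogram DataLoader here;
    # a random batch keeps this sketch self-contained.
    optimizer.zero_grad()
    logits = model(torch.randn(32, 768))
    loss = criterion(logits, torch.randint(0, 10, (32,)))
    loss.backward()
    optimizer.step()
    scheduler.step()
```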
## Evaluation

- How metrics are computed (accuracy, F1, AUC, etc.)
- Any special evaluation scripts
- How to reproduce results
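For example, accuracy and macro-F1 could be computed with scikit-learn (assumed here as a dependency for illustration; the label arrays are dummies):

```python
# Illustrative metric computation from predictions and ground truth.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]
print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```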
## Contributing

If you welcome contributions, you can say:
- Please open issues or pull requests
- Follow the code style
- Add tests / documentation
- Cite your work
## License

State your license (e.g. MIT, Apache, etc.):
MIT License
© Sreehari R