Skip to content

Sree14hari/Spectrogram-VisionTransformer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Spectrogram-VisionTransformer

A project that applies Vision Transformer (ViT) architectures to spectrograms for audio / speech tasks (e.g. classification, recognition).
Transforms audio signals into spectrogram “images” and uses transformer models to learn from them.

Table of Contents

Introduction

Deep learning for audio often uses convolutional networks on spectrogram inputs.
This project explores using Vision Transformers (ViT) on spectrograms to capture global patterns and long-range dependencies in time-frequency space.

The key idea:

  1. Convert a raw audio waveform → spectrogram (e.g. Mel spectrogram)
  2. Treat the spectrogram as a 2D image
  3. Feed patches / tokens into a Transformer / ViT model
  4. Train / fine-tune for audio tasks (classification, detection, etc.)

Features

  • Pipeline for audio → spectrogram → transformer input
  • Preprocessing scripts (e.g. STFT, Mel, normalization)
  • Model architectures based on ViT / Transformer
  • Training and evaluation scripts
  • Support for GPU / multi-GPU
  • Logging, checkpoints, and result saving

Installation

# Clone this repository
git clone https://github.com/Sree14hari/Spectrogram-VisionTransformer.git
cd Spectrogram-VisionTransformer

# (Optional) create virtual environment
python -m venv venv
source venv/bin/activate   # or venv\Scripts\activate on Windows

# Install dependencies
pip install -r requirements.txt

You’ll need packages like torch, numpy, librosa (or similar for spectrograms), etc.

Usage

Prepare data

  • Place your audio files / dataset in a folder
  • (Optional) a config or script to convert them into spectrograms

Train a model

python train.py --config configs/your_config.yaml

Evaluate or infer

python evaluate.py --checkpoint path/to/model.pth --data your_test_data

Example script

You could include a sample script run_demo.py that loads a sample audio, converts to spectrogram, runs through the model, and prints results.

Model / Architecture

Describe the architecture you used:

  • Patch size (e.g. 16×16)
  • Number of transformer layers / heads / embedding dimension
  • Any modifications you did for spectrogram data (positional embeddings, time-frequency embedding, masking, etc.)
  • Loss functions, regularization, etc.

If you adopted or adapted from other works (e.g. ASiT: Audio Spectrogram vIsion Transformer ([arXiv][1]) or other similar works), mention it here.

Experiments & Results

Provide a table of your experiments, for example:

Task / Dataset Model Variant Accuracy / Metric Notes
Speech Commands ViT-base 95.6% baseline
Your dataset Your model XX.X% your experiment

Include charts, loss curves, confusion matrices, etc.

Dataset

Describe the dataset(s) you used:

  • Name (e.g. Speech Commands, ESC-50, custom)
  • Number of classes
  • Preprocessing (sampling rate, window size, overlap, normalization)
  • Train / val / test splits

Training

Detail your training settings, e.g.:

  • Learning rate, scheduler
  • Batch size
  • Number of epochs
  • Optimizer (Adam, SGD, etc.)
  • Data augmentation (if any)
  • Hardware setup

Evaluation

  • How metrics are computed (accuracy, F1, AUC, etc.)
  • Any special evaluation scripts
  • How to reproduce results

Contributing

If you welcome contributions, you can say:

  • Please open issues or pull requests
  • Follow the code style
  • Add tests / documentation
  • Cite your work

License

State your license (e.g. MIT, Apache, etc.)

MIT License
© Sreehari R

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published