
πŸ” LLM Benchmarking Suite

A comprehensive benchmarking suite for evaluating Gemma and other language models on various benchmarks including MMLU (Massive Multitask Language Understanding) and GSM8K (Grade School Math 8K).


πŸš€ Features

  • βœ… Support for Gemma models (2B and 7B)
  • πŸ” Support for Mistral models
  • πŸ“Š MMLU benchmark implementation
  • πŸ”’ GSM8K benchmark implementation
  • πŸ”Œ Configurable model parameters
  • πŸ”’ Secure HuggingFace authentication
  • πŸ“ˆ Detailed results reporting and visualization
  • πŸ“Š Interactive plots and summary reports
  • πŸ§™β€β™‚οΈ Interactive setup wizard

πŸ› οΈ Setup Instructions

βœ… Prerequisites

Ensure you have the following installed:

  • Python 3.10+
  • CUDA-capable GPU (recommended)
  • HuggingFace account with access to Gemma models

πŸ“¦ Installation

  1. Clone the repository
git clone https://github.com/yourusername/gemma-benchmarking.git
cd gemma-benchmarking
  2. Run the setup wizard (Recommended)
python scripts/setup_wizard.py

The wizard will:

  • Check prerequisites
  • Set up the Python environment (conda or venv)
  • Configure models and benchmarks
  • Generate a custom configuration file
  • Guide you through the next steps
  3. Manual Installation (Alternative)
Option 1: Using Conda (Recommended)
conda env create -f environment.yml
conda activate gemma-benchmark
Option 2: Using Python venv
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
Option 3: Using Docker

Prerequisites

  • Docker installed on your system
  • NVIDIA Container Toolkit (for GPU support)

Running with Docker

  1. CPU Version
docker-compose up --build
  2. GPU Version
TARGET=gpu docker-compose up --build
  3. Running Jupyter Notebooks
COMMAND="jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root" docker-compose up --build
  4. Running Specific Scripts
COMMAND="python scripts/run_benchmark.py" docker-compose up --build

The Docker setup includes:

  • Multi-stage builds for CPU and GPU support
  • Persistent volume for HuggingFace cache
  • Jupyter notebook support
  • Security best practices (non-root user)
  • Automatic GPU detection and support
  4. Install dependencies
pip install -r requirements.txt

πŸ”’ Authentication

For models that require authentication (like Gemma), you need to log in to HuggingFace:

huggingface-cli login

This will prompt you to enter your HuggingFace token. You can get your token from HuggingFace Settings.
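
If you prefer to authenticate from Python (for example, inside a notebook or a CI job), the huggingface_hub library installed alongside transformers provides an equivalent login helper. A minimal sketch:

# Programmatic alternative to `huggingface-cli login`; uses the huggingface_hub package.
from huggingface_hub import login

login()                  # prompts for your HuggingFace token and caches it locally
# login(token="hf_xxx")  # non-interactive variant; "hf_xxx" is a placeholder, use your own token

Recent versions of huggingface_hub also pick up a token from the HF_TOKEN environment variable, which is convenient for containerized runs.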


βš™οΈ Configuration

All benchmarking settings are controlled via JSON configuration files in the configs/ directory.

  • The default configuration is available at: configs/default.json
  • You can create custom configs to tailor model selection, datasets, and evaluation settings
  • The setup wizard will help you create a custom configuration file

πŸ“ˆ Usage

Running Benchmarks

Run with the default config:

python src/main.py

Run with a custom config:

python src/main.py --config path/to/config.json

Specify models via the CLI:

python src/main.py --models gemma-2b mistral-7b

Run specific benchmarks:

python src/main.py --benchmarks mmlu gsm8k

Enable verbose output:

python src/main.py --verbose
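
Since the runner is a standard command-line tool, these flags can typically be combined in a single invocation (an assumption about the CLI, but consistent with the options shown above):

python src/main.py --config configs/default.json --models gemma-2b --benchmarks mmlu gsm8k --verbose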

Generating Reports

After running benchmarks, generate visualization reports:

python scripts/generate_report.py

Customize report generation:

python scripts/generate_report.py --results_dir custom_results --output_dir custom_reports --output_name my_report

πŸ“ Project Structure

gemma-benchmarking/
β”œβ”€β”€ configs/              # Configuration files (JSON)
β”œβ”€β”€ environment.yml       # Conda environment specification
β”œβ”€β”€ logs/                 # Log files
β”œβ”€β”€ requirements.txt      # Python dependencies
β”œβ”€β”€ results/              # Benchmark output results
β”œβ”€β”€ reports/              # Visualization reports and plots
β”œβ”€β”€ scripts/              # Utility scripts
β”‚   β”œβ”€β”€ setup_wizard.py       # Interactive setup wizard
β”‚   β”œβ”€β”€ generate_report.py    # Report generation script
β”‚   └── prepare_data.py       # Dataset preparation scripts
β”œβ”€β”€ src/                  # Source code
β”‚   β”œβ”€β”€ benchmarks/       # Benchmark task implementations
β”‚   β”‚   β”œβ”€β”€ base_benchmark.py  # Base benchmark class
β”‚   β”‚   β”œβ”€β”€ mmlu.py            # MMLU benchmark
β”‚   β”‚   └── gsm8k.py           # GSM8K benchmark
β”‚   β”œβ”€β”€ models/           # Model wrappers and loading logic
β”‚   β”œβ”€β”€ utils/            # Helper utilities and tools
β”‚   β”œβ”€β”€ visualization/    # Visualization and reporting tools
β”‚   β”‚   └── plotter.py         # Results plotting
β”‚   └── main.py           # Entry point for benchmarking
└── README.md             # You're here!

πŸ“Š Available Benchmarks

MMLU (Massive Multitask Language Understanding)

  • Evaluates models across 57 subjects
  • Supports few-shot learning
  • Configurable number of examples per subject

GSM8K (Grade School Math 8K)

  • Tests mathematical reasoning capabilities
  • Step-by-step problem solving
  • Few-shot learning support
  • Detailed accuracy metrics

πŸ“Œ Roadmap

  • Add CLI wizard for quick setup
  • Add support for additional Gemma model variants
  • Expand academic benchmark integration
  • Add HumanEval benchmark implementation
  • Improve visualization and report automation
  • Add leaderboard comparison with open models (e.g., LLaMA, Mistral)
  • Docker support and multiplatform compatibility

πŸ“„ License

This project is licensed under the MIT License.


πŸ™Œ Contributing

Pull requests, issues, and suggestions are welcome! Please open an issue or start a discussion if you'd like to contribute.


πŸ“„ Acknowledgments

  • Google for the Gemma models
  • Mistral AI for the Mistral models
  • HuggingFace for the transformers library and model hosting
  • The MMLU and GSM8K benchmark creators
