A comprehensive benchmarking suite for evaluating Gemma and other language models on standard benchmarks, including MMLU (Massive Multitask Language Understanding) and GSM8K (Grade School Math 8K).
Features:
- Support for Gemma models (2B and 7B)
- Support for Mistral models
- MMLU benchmark implementation
- GSM8K benchmark implementation
- Configurable model parameters
- Secure HuggingFace authentication
- Detailed results reporting and visualization
- Interactive plots and summary reports
- Interactive setup wizard
Ensure you have the following:
- Python 3.10+
- CUDA-capable GPU (recommended)
- HuggingFace account with access to Gemma models
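A quick way to sanity-check the first two prerequisites (standard commands; the GPU check assumes NVIDIA drivers are installed):

```bash
# Confirm the Python version (3.10 or newer is required)
python --version

# Confirm a CUDA-capable GPU is visible to the driver
nvidia-smi
```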
- Clone the repository:

```bash
git clone https://github.com/yourusername/gemma-benchmarking.git
cd gemma-benchmarking
```
- Run the setup wizard (recommended):

```bash
python scripts/setup_wizard.py
```
The wizard will:
- Check prerequisites
- Set up the Python environment (conda or venv)
- Configure models and benchmarks
- Generate a custom configuration file
- Guide you through the next steps
- Manual Installation (Alternative)
Option 1: Using Conda (Recommended)

```bash
conda env create -f environment.yml
conda activate gemma-benchmark
```
Option 2: Using Python venv

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
Option 3: Using Docker

Requires:
- Docker installed on your system
- NVIDIA Container Toolkit (for GPU support)
- CPU Version:

```bash
docker-compose up --build
```

- GPU Version:

```bash
TARGET=gpu docker-compose up --build
```

- Running Jupyter Notebooks:

```bash
COMMAND="jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root" docker-compose up --build
```

- Running Specific Scripts:

```bash
COMMAND="python scripts/run_benchmark.py" docker-compose up --build
```
The Docker setup includes (see the compose sketch after this list):
- Multi-stage builds for CPU and GPU support
- Persistent volume for HuggingFace cache
- Jupyter notebook support
- Security best practices (non-root user)
- Automatic GPU detection and support
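The `TARGET` and `COMMAND` variables used above suggest a compose file along these lines; this is a hypothetical sketch inferred from those commands, not the repository's actual file:

```yaml
# Hypothetical docker-compose.yml sketch; service names and paths are assumptions.
services:
  benchmark:
    build:
      context: .
      target: ${TARGET:-cpu}   # multi-stage build: cpu or gpu stage
    command: ${COMMAND:-python src/main.py}
    volumes:
      - hf-cache:/home/app/.cache/huggingface  # persistent cache for the non-root user
    ports:
      - "8888:8888"            # exposed for the Jupyter option
volumes:
  hf-cache:
```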
- Install dependencies:

```bash
pip install -r requirements.txt
```
For models that require authentication (like Gemma), you need to log in to HuggingFace:
```bash
huggingface-cli login
```
This will prompt you for your HuggingFace token, which you can create in your HuggingFace account settings (https://huggingface.co/settings/tokens).
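If you prefer to authenticate from Python instead of the CLI, the `huggingface_hub` library (installed alongside transformers) offers the same login:

```python
# Programmatic alternative to `huggingface-cli login`.
from huggingface_hub import login

# Pass your token directly, or call login() with no arguments
# for an interactive prompt.
login(token="hf_...")  # placeholder token
```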
All benchmarking settings are controlled via JSON configuration files in the `configs/` directory.
- The default configuration is at `configs/default.json`
- You can create custom configs to tailor model selection, datasets, and evaluation settings (see the illustrative example below)
- The setup wizard will help you create a custom configuration file
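For orientation, a custom config might look roughly like the following; the field names here are illustrative assumptions, so check `configs/default.json` for the actual schema:

```json
{
  "models": ["gemma-2b", "mistral-7b"],
  "benchmarks": {
    "mmlu": { "num_few_shot": 5 },
    "gsm8k": { "num_few_shot": 8 }
  },
  "output_dir": "results"
}
```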
Run with the default config:

```bash
python src/main.py
```

Run with a custom config:

```bash
python src/main.py --config path/to/config.json
```

Specify models via the CLI:

```bash
python src/main.py --models gemma-2b mistral-7b
```

Run specific benchmarks:

```bash
python src/main.py --benchmarks mmlu gsm8k
```

Enable verbose output:

```bash
python src/main.py --verbose
```
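Assuming these are standard argparse-style flags that can be combined (the individual commands above suggest so), a full run might look like:

```bash
# Hypothetical combined invocation; configs/custom.json is a placeholder path.
python src/main.py --config configs/custom.json \
    --models gemma-2b mistral-7b \
    --benchmarks mmlu gsm8k \
    --verbose
```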
After running benchmarks, generate visualization reports:

```bash
python scripts/generate_report.py
```

Customize report generation:

```bash
python scripts/generate_report.py --results_dir custom_results --output_dir custom_reports --output_name my_report
```
```text
gemma-benchmarking/
├── configs/                  # Configuration files (JSON)
├── environment.yml           # Conda environment specification
├── logs/                     # Log files
├── requirements.txt          # Python dependencies
├── results/                  # Benchmark output results
├── reports/                  # Visualization reports and plots
├── scripts/                  # Utility scripts
│   ├── setup_wizard.py       # Interactive setup wizard
│   ├── generate_report.py    # Report generation script
│   └── prepare_data.py       # Dataset preparation scripts
├── src/                      # Source code
│   ├── benchmarks/           # Benchmark task implementations
│   │   ├── base_benchmark.py # Base benchmark class
│   │   ├── mmlu.py           # MMLU benchmark
│   │   └── gsm8k.py          # GSM8K benchmark
│   ├── models/               # Model wrappers and loading logic
│   ├── utils/                # Helper utilities and tools
│   ├── visualization/        # Visualization and reporting tools
│   │   └── plotter.py        # Results plotting
│   └── main.py               # Entry point for benchmarking
└── README.md                 # You're here!
```
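The layout above suggests that `base_benchmark.py` defines a shared interface for the benchmark tasks. A plausible sketch of such a base class (the actual API in the repository may differ):

```python
# Hypothetical sketch of the interface base_benchmark.py might define.
from abc import ABC, abstractmethod

class BaseBenchmark(ABC):
    """Shared contract for benchmark tasks such as MMLU and GSM8K."""

    def __init__(self, config: dict):
        self.config = config

    @abstractmethod
    def load_data(self) -> list[dict]:
        """Load and return the benchmark's evaluation examples."""

    @abstractmethod
    def evaluate(self, model) -> dict:
        """Run the model over the examples and return metrics, e.g. {'accuracy': ...}."""
```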
MMLU:
- Evaluates models across all 57 subjects
- Supports few-shot learning (see the illustrative prompt sketch below)
- Configurable number of examples per subject
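The repository's exact prompt template isn't shown in this README; as an illustration of the standard MMLU few-shot format, a minimal prompt builder might look like this (all function and field names are hypothetical):

```python
# Illustrative sketch of a few-shot MMLU prompt; the actual format used by
# src/benchmarks/mmlu.py may differ.
CHOICES = ["A", "B", "C", "D"]

def format_question(question: str, options: list[str], answer: str | None = None) -> str:
    """Render one MMLU item; include the answer only for few-shot examples."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in zip(CHOICES, options)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_prompt(subject: str, shots: list[dict], test_item: dict) -> str:
    """Concatenate k solved examples followed by the unanswered test question."""
    header = f"The following are multiple choice questions about {subject}.\n\n"
    examples = [format_question(s["question"], s["options"], s["answer"]) for s in shots]
    examples.append(format_question(test_item["question"], test_item["options"]))
    return header + "\n\n".join(examples)
```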
GSM8K:
- Tests mathematical reasoning capabilities
- Step-by-step problem solving
- Few-shot learning support
- Detailed accuracy metrics (see the answer-extraction sketch below)
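GSM8K is conventionally scored by comparing the final number in the model's generated solution against the reference answer, which the dataset marks with `####`. A minimal extraction sketch following that convention (not necessarily the repository's exact logic):

```python
import re

def extract_final_answer(text: str) -> str | None:
    """Return the final number in a solution, normalizing away commas.

    GSM8K reference answers follow '#### <number>'; for free-form model
    output we fall back to the last numeric token in the text.
    """
    marked = re.search(r"####\s*([-\d,\.]+)", text)
    if marked:
        return marked.group(1).replace(",", "")
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return numbers[-1].replace(",", "") if numbers else None

# Example: both forms resolve to "72"
assert extract_final_answer("... so she makes 9 * 8 = 72 dollars.") == "72"
assert extract_final_answer("#### 72") == "72"
```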
Roadmap:
- Add CLI wizard for quick setup
- Add support for additional Gemma model variants
- Expand academic benchmark integration
- Add HumanEval benchmark implementation
- Improve visualization and report automation
- Add leaderboard comparison with open models (e.g., LLaMA, Mistral)
- Docker support and multiplatform compatibility
This project is licensed under the MIT License.
Pull requests, issues, and suggestions are welcome! Please open an issue or start a discussion if you'd like to contribute.
Thanks to:
- Google for the Gemma models
- Mistral AI for the Mistral models
- HuggingFace for the transformers library and model hosting
- The creators of the MMLU and GSM8K benchmarks