OpenArena is a Python project designed to create high-quality datasets by pitting Large Language Models (LLMs) against each other in a competitive environment. The tool uses an ELO-based rating system to rank LLMs by their performance on a set of prompts, with another LLM acting as an impartial judge.
- Asynchronous battles between multiple LLMs (see the sketch after this list)
- ELO rating system for model performance tracking
- Detailed evaluation and scoring by a judge model
- Generation of training data based on model performances
- Support for Hugging Face datasets
- YAML configuration for easy setup and customization
- Support for custom endpoints and API keys
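Concretely, "asynchronous battles" means that for each prompt, all competing models generate their responses concurrently rather than one after another. A minimal sketch of that pattern against an OpenAI-compatible endpoint, assuming `aiohttp` is installed and an Ollama server is running locally (illustrative only, not OpenArena's actual code):

```python
import asyncio
import aiohttp

URL = "http://localhost:11434/v1/chat/completions"

async def generate(session, model_id, prompt):
    # One chat-completion call against an OpenAI-compatible endpoint.
    payload = {"model": model_id, "messages": [{"role": "user", "content": prompt}]}
    async with session.post(URL, json=payload) as resp:
        data = await resp.json()
        return model_id, data["choices"][0]["message"]["content"]

async def battle(prompt, model_ids):
    # Every competitor answers the same prompt concurrently.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(generate(session, m, prompt) for m in model_ids))

responses = asyncio.run(battle("Explain ELO ratings in one sentence.", ["openhermes", "mistral"]))
```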
- Python 3.7+
- Ollama (for local model execution)
- Clone this repository:

  ```bash
  git clone https://github.com/syv-ai/OpenArena.git
  cd OpenArena
  ```
- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
- Ensure you have access to the Ollama API endpoint (default: `http://localhost:11434/v1/chat/completions`) or other custom endpoints as specified in your configuration.
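To confirm the endpoint is reachable before starting a run, a quick request works against any OpenAI-compatible chat completions endpoint. A minimal sketch, where the model name `llama3` is just an example of a model you have already pulled and `requests` is an assumed dependency:

```python
import requests

# Smoke test for an OpenAI-compatible chat completions endpoint.
url = "http://localhost:11434/v1/chat/completions"
payload = {
    "model": "llama3",  # any model available on the server, e.g. via `ollama pull llama3`
    "messages": [{"role": "user", "content": "Reply with the single word: pong"}],
}

resp = requests.post(url, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```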
- Create or modify the `arena_config.yaml` file to set up your desired models, datasets, and endpoints:

  ```yaml
  default_endpoint:
    url: "http://localhost:11434/v1/chat/completions"
    # api_key: "your_default_api_key"  # Uncomment and set if needed

  judge_model:
    name: "JudgeModel"
    model_id: "llama3"
    # endpoint:
    #   url: "http://custom-judge-endpoint.com/api/chat"
    #   api_key: "judge_model_api_key"

  models:
    - name: "Open hermes"
      model_id: "openhermes"
      # endpoint:
      #   url: "http://custom-phi3-endpoint.com/api/chat"
      #   api_key: "phi3_api_key"
    - name: "Mistral v0.3"
      model_id: "mistral"
    - name: "Phi 3 medium"
      model_id: "phi3:14b"

  datasets:
    - name: "skunkworksAI/reasoning-0.01"
      description: "Reasoning dataset"
      split: "train"
      field: "instruction"
      limit: 10
  ```

  Each `datasets` entry corresponds to one Hugging Face dataset load; see the sketch after this list for how the fields map.
- Run the script:

  ```bash
  python llm_arena.py
  ```
- The script will output:
  - Intermediate ELO rankings after each prompt
  - ELO rating progression for each model
  - Generated training data
  - Final ELO ratings
  - Total execution time
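For reference, here is how the fields of a `datasets` config entry might map onto the Hugging Face `datasets` library. The `load_prompts` helper is illustrative, not OpenArena's actual code:

```python
from datasets import load_dataset

def load_prompts(entry):
    """Turn one `datasets` config entry into a list of prompt strings."""
    ds = load_dataset(entry["name"], split=entry["split"])
    if entry.get("limit") is not None:
        ds = ds.select(range(min(entry["limit"], len(ds))))
    return [row[entry["field"]] for row in ds]

prompts = load_prompts({
    "name": "skunkworksAI/reasoning-0.01",
    "split": "train",
    "field": "instruction",
    "limit": 10,
})
print(f"Loaded {len(prompts)} prompts")
```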
- Configuration Loading: The script loads the configuration from the `arena_config.yaml` file.
- Dataset Loading: Prompts are loaded from the specified Hugging Face dataset(s).
- Response Generation: Each model generates responses for all given prompts.
- Evaluation: The judge model evaluates pairs of responses for each prompt, providing scores and explanations.
- ELO Updates: ELO ratings are updated based on the scores from each battle (see the sketch after this list).
- Training Data Generation: The results are compiled into a structured format for potential use in training or fine-tuning other models.
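For intuition about the ELO update step: after each battle, a model's rating moves toward or away from its opponent's by an amount proportional to how surprising the result was, scaled by the K-factor. A minimal sketch of the standard formula, assuming the judge's scores are reduced to a result in [0, 1]; this illustrates the math only and is not OpenArena's actual `update_elo_ratings()` implementation:

```python
def update_elo(rating_a, rating_b, score_a, k=32):
    """Standard ELO update for a single battle.

    score_a is the result from model A's perspective: 1.0 for a win,
    0.0 for a loss, 0.5 for a draw (or anything in between derived
    from the judge's scores). k controls rating volatility.
    """
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b

# A 1200-rated model beating a 1400-rated model gains ~24 points.
print(update_elo(1200, 1400, 1.0))
```

A larger `k` makes ratings swing more per battle; this is the K-factor the customization notes below refer to.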
- Modify the `arena_config.yaml` file to add or remove models, change the judge model, adjust dataset parameters, or specify custom endpoints and API keys.
- Adjust the ELO K-factor in the `update_elo_ratings()` method to change the volatility of the ratings.
- Automatically download Ollama models
- Added YAML configuration
- OpenAI support: works with any OpenAI-compatible endpoint (including vLLM)
- Hugging Face datasets integration
- Auto upload to Hugging Face
- Database to keep track of completed prompts so an interrupted run can resume where it left off
Contributions are welcome! Please feel free to submit a Pull Request.