Welcome to GuessTheRuleBench, a dynamic benchmark designed to evaluate implicit rule deduction capabilities of Large Language Models (LLMs) through "guess-the-rule" games. This repository contains:
- The code for running the benchmark via a Python library.
- A web application demo where human users can play the games or watch LLM agents interact with the system in real time.
- Experiment results and a research paper detailing the methodology and findings.
Below is a high-level system design diagram that illustrates the various components, their interactions, and the overall workflow of GuessTheRuleBench:
For a complete understanding of the methodology, experiments, and analysis, please refer to the research paper linked below.
Python 3.9 or below is required to run the Python library and backend services. Use a conda environment to avoid installing libraries globally:

```shell
conda create -n guess_the_rule_env python=3.9
conda activate guess_the_rule_env
```
The Python library provides four game classes:
- `StaticGoingOnAPicnic()` for the Static Picnic game
- `DynamicGoingOnAPicnic()` for the Dynamic Picnic game
- `CodeFunctionsPicnic()` for the Code Functions Picnic game
- `MathGuessTheRuleGame()` for the Math game
Each class exposes the following methods:
- `create_game_instance()` to request a new instance of the game.
- `get_more_examples(N)` to request N more examples.
- `validate_guess(guess)` to present the user's guess for validation.
- `get_game_summary()` to retrieve the performance summary of the current game.
- `load_game(uuid)` to load a previously generated game instance.
```python
from lib.domain.picnic.static_picnic.base import StaticGoingOnAPicnic

# Get a new object for the static picnic game
static_picnic_obj = StaticGoingOnAPicnic(
    difficulty='L1',
    num_init_examples=2
)

# Create a new game instance
static_picnic_obj.create_game_instance()

# Request more examples
static_picnic_obj.get_more_examples(n=1)
static_picnic_obj.get_more_examples(n=2)
static_picnic_obj.get_more_examples(n=3)

# Validate guess
static_picnic_obj.validate_guess(guess='Items from the category kitchen appliances')

# Get game summary
static_picnic_obj.get_game_summary()

# Load an existing game and check its summary
loaded_game = StaticGoingOnAPicnic.load_game('650499e9-a5da-4129-b426-8d6517bf65e6')
loaded_game.get_game_summary(include_rule=True)
```
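Because all four game classes expose the same interface, a single driver loop can play any of them. The sketch below illustrates that pattern; it is self-contained, and `DemoGame` is a hypothetical stand-in for a real game class, not part of the library:

```python
class DemoGame:
    """Hypothetical stand-in implementing the shared game interface."""
    def __init__(self):
        self.examples_served = 0

    def create_game_instance(self):
        self.examples_served = 2  # pretend two initial examples were shown

    def get_more_examples(self, n=1):
        self.examples_served += n
        return [f"example {i}" for i in range(n)]

    def validate_guess(self, guess):
        return guess == "items from the category fruits"

    def get_game_summary(self):
        return {"examples_served": self.examples_served}


def play(game, guesses):
    """Drive any game object through the shared interface."""
    game.create_game_instance()
    for guess in guesses:
        if game.validate_guess(guess):
            return True, game.get_game_summary()
        game.get_more_examples(n=1)  # ask for another example before retrying
    return False, game.get_game_summary()


won, summary = play(DemoGame(), ["wrong guess", "items from the category fruits"])
```

The same `play` helper would work unchanged with any object exposing the four methods above.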
Before starting, ensure you have your `OPENAI_API_KEY` set in the `.env` file at the root of this repository:

```
OPENAI_API_KEY=your-key-here
```
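A quick way to fail fast when the key is missing is to check the process environment at startup. This is a minimal sketch, assuming the `.env` contents have been loaded into the environment (e.g. via a tool such as `python-dotenv`, which is not confirmed here); `require_api_key` is an illustrative helper, not part of the repository:

```python
import os


def require_api_key(env=os.environ):
    """Return the OpenAI key, or raise with a helpful message if missing."""
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to the .env file.")
    return key
```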
- Navigate to the backend directory:
  ```shell
  cd Guess-the-Rule-LLM-Benchmark/backend
  ```
- Install the required libraries:
  ```shell
  pip install -r requirements.txt
  ```
- Run the backend server:
  ```shell
  python app/main.py
  ```
The backend will start on http://localhost:8000. Important: The frontend is configured to communicate with this port only, so ensure the backend is running on port 8000.
- Navigate to the frontend directory:
  ```shell
  cd ../frontend
  ```
- Install the required libraries:
  ```shell
  npm i
  ```
- Start the frontend server:
  ```shell
  npm run dev
  ```
The frontend will be accessible at http://localhost:8080/. Open this URL in your browser to interact with the benchmark games, either playing them yourself or observing LLM gameplay.
Below are some screenshots showcasing the web application interface and its features:
Either start a new game or load an existing game using an already generated game UUID.
Select the game configurations to start a new game.
Below is a summary of the average win rates of different models across all games and difficulty levels. Bold values highlight the best performance in each column.
If you have any suggestions, issues, or contributions, please feel free to open an issue or submit a pull request. We appreciate your interest and support in improving GuessTheRuleBench.