GuessTheRuleBench

Welcome to GuessTheRuleBench, a dynamic benchmark designed to evaluate implicit rule deduction capabilities of Large Language Models (LLMs) through "guess-the-rule" games. This repository contains:

  • The code for running the benchmark via a Python library.
  • A web application demo where human users can play the games or watch LLM agents interact with the system in real time.
  • Experiment results and a research paper detailing the methodology and findings.

High-Level System Design Diagram

Below is a high-level system design diagram that illustrates the various components, their interactions, and the overall workflow of GuessTheRuleBench:

Research Paper and Demo Presentation

For a complete understanding of the methodology, experiments, and analysis, please refer to the resources below:

  • Research Paper PDF
  • Demo Presentation Video
  • Demo Slides

System Requirements

Python 3.9 or lower is required to run the Python library and backend services. Use a conda environment to avoid installing libraries globally:

conda create -n guess_the_rule_env python=3.9
conda activate guess_the_rule_env
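
After activating the environment, you can verify that the expected interpreter is active:

python --version   # should report Python 3.9.x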

For Agentic Use: Running the Benchmark Python Library

The Python library provides four game classes:

  • StaticGoingOnAPicnic() for the Static Picnic game
  • DynamicGoingOnAPicnic() for the Dynamic Picnic game
  • CodeFunctionsPicnic() for the Code Functions Picnic game
  • MathGuessTheRuleGame() for the Math game

Each class exposes the following methods:

  • create_game_instance() to request a new instance of the game.
  • get_more_examples(N) to request N more examples.
  • validate_guess(guess) to present the user's guess for validation.
  • get_game_summary() to retrieve the performance summary of the current game.
  • load_game(uuid) to load a previously generated game instance.

Sample Code for Static Picnic Game:

from lib.domain.picnic.static_picnic.base import StaticGoingOnAPicnic

# Get a new object for the static picnic game
static_picnic_obj = StaticGoingOnAPicnic(
    difficulty='L1',
    num_init_examples=2
)

# Create a new game instance
static_picnic_obj.create_game_instance()

# Request more examples
static_picnic_obj.get_more_examples(n=1)
static_picnic_obj.get_more_examples(n=2)
static_picnic_obj.get_more_examples(n=3)

# Validate guess
static_picnic_obj.validate_guess(guess='Items from the category kitchen appliances')

# Get game summary
static_picnic_obj.get_game_summary()

# Load an existing game and check its summary
loaded_game = StaticGoingOnAPicnic.load_game('650499e9-a5da-4129-b426-8d6517bf65e6')
loaded_game.get_game_summary(include_rule=True)
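
The other game classes expose the same interface. The sketch below drives the Dynamic Picnic game the same way; note that the import path and constructor arguments here are assumptions that mirror the static game, so check the library for the exact module layout:

# NOTE: import path and constructor arguments are assumed to mirror the static game
from lib.domain.picnic.dynamic_picnic.base import DynamicGoingOnAPicnic

# Get a new object for the dynamic picnic game
dynamic_picnic_obj = DynamicGoingOnAPicnic(
    difficulty='L1',
    num_init_examples=2
)

# Same lifecycle as the static game
dynamic_picnic_obj.create_game_instance()
dynamic_picnic_obj.get_more_examples(n=2)
dynamic_picnic_obj.validate_guess(guess='Items that are red')
dynamic_picnic_obj.get_game_summary()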

Run the Web Application for Benchmark Demo Locally

Before starting, ensure you have your OPENAI_API_KEY set in the .env file at the root of this repository:

OPENAI_API_KEY=your-key-here
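
To fail fast if the key is missing, a minimal sanity check (assuming the python-dotenv package, which is not confirmed to be in requirements.txt) looks like this:

# check_env.py -- hypothetical helper script, not part of the repository
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the repository root
assert os.getenv('OPENAI_API_KEY'), 'OPENAI_API_KEY is missing from .env'
print('OPENAI_API_KEY loaded')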

Backend

  1. Navigate to the backend directory:
cd Guess-the-Rule-LLM-Benchmark/backend
  2. Install the required libraries:
pip install -r requirements.txt
  3. Run the backend server:
python app/main.py

The backend will start on http://localhost:8000. Important: The frontend is configured to communicate with this port only, so ensure the backend is running on port 8000.
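
A quick way to confirm the server is listening (the exact response body depends on the backend's routes, so treat this only as a liveness check):

curl -i http://localhost:8000/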

Frontend

  1. Navigate to the frontend directory:
cd ../frontend
  2. Install the required libraries:
npm i
  3. Start the frontend server:
npm run dev

The frontend will be accessible at http://localhost:8080/. Open this URL in your browser to interact with the benchmark games, either playing them yourself or observing LLM gameplay.

Web Application UI

Below are some screenshots showcasing the web application interface and its features:

  1. Landing Page

  2. Docs Page

  3. Game Play

Either start a new game or load an existing game using an already generated game UUID.

Select the game configurations to start a new game.

Game play UI.

Benchmark Experiments High-Level Results

Below is a summary of the average win rate of different models across all games and difficulty levels. Bold values highlight the best performances in their respective columns.

Feedback and Contribution

If you have any suggestions, issues, or contributions, please feel free to open an issue or submit a pull request. We appreciate your interest and support in improving GuessTheRuleBench.

About

A dynamically reusable benchmark designed to evaluate Large Language Models’ abstract reasoning capabilities through continuously generated “Guess the Rule” games. This repository features an automated rule generation system, customizable APIs for on-demand dataset creation, and an interactive game interface for robust and scalable LLM assessment.
