This tool replicates results from the MK1 Comprehension Benchmark, designed to evaluate the reasoning capabilities of large language models. The benchmark and its results are detailed in the accompanying blog post.
The benchmark uses a dataset comprising a single, procedurally generated, logically consistent natural-language document paired with 100 associated queries. Because the document is uniquely generated on every run, models cannot simply memorize the dataset; they must reason over the provided text.
The document consists of nested clauses structured like this example:
The canyon is equipped for hands-on training if at least one of the following is satisfied:
(1) All of the following must be true:
(a) the tambourine is licensed for special operations
(b) the talisman is proficient in energy conversion
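Conceptually, the document is a tree of such clauses. As a rough illustration, here is one way the example above could be represented in Python; the Clause class and its field names are hypothetical sketches, not taken from the benchmark's code.

from dataclasses import dataclass, field

@dataclass
class Clause:
    """A rule whose truth depends on a leaf condition or on sub-clauses."""
    text: str                       # e.g. "the canyon is equipped for hands-on training"
    mode: str = "leaf"              # "leaf", "any" (at least one), or "all"
    children: list["Clause"] = field(default_factory=list)

# The example clause above, encoded as a tree:
canyon_rule = Clause(
    text="the canyon is equipped for hands-on training",
    mode="any",
    children=[
        Clause(
            text="all of the following must be true",
            mode="all",
            children=[
                Clause("the tambourine is licensed for special operations"),
                Clause("the talisman is proficient in energy conversion"),
            ],
        ),
    ],
)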
Each query presents a set of conditions based on the document and asks a question with a provably true or false answer, formatted like this:
Given these conditions:
- the canyon is equipped for hands-on training
- the loom is synchronized with atomic clocks
- the hydra is aligned with strategic goals
Is the windmill consistently fashionably late?
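A query's ground truth can then be derived mechanically from the clause tree. The toy evaluator below (reusing the hypothetical Clause class from the sketch above) shows the idea; it simplifies by treating any leaf not listed in the conditions as false, whereas the real benchmark guarantees each query is provably true or false.

def evaluate(clause: Clause, true_conditions: set[str]) -> bool:
    """Recursively decide a clause given the set of conditions assumed true."""
    if clause.mode == "leaf":
        return clause.text in true_conditions
    results = (evaluate(child, true_conditions) for child in clause.children)
    return any(results) if clause.mode == "any" else all(results)

conditions = {
    "the tambourine is licensed for special operations",
    "the talisman is proficient in energy conversion",
}
print(evaluate(canyon_rule, conditions))  # True: the inner "all" clause is satisfied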
To assess model performance across varying input sizes, the document's context length is systematically increased by inserting clauses that are irrelevant to the core logical structure. The evaluation process involves sending the extended document and a query to a target model, then comparing the model's response against the known correct answer (true or false).
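A rough sketch of that evaluation loop is shown below; query_model is a placeholder for whichever client SDK the harness wires in (Google or OpenAI in the commands further down), not eval.py's actual interface.

def query_model(prompt: str) -> str:
    """Send the prompt to the target model and return its raw text response."""
    raise NotImplementedError("wire this to your model backend of choice")

def run_benchmark(document: str, queries: list[tuple[str, bool]]) -> float:
    """Return accuracy over (question, ground_truth) pairs for one document."""
    correct = 0
    for question, ground_truth in queries:
        prompt = f"{document}\n\nGiven the document above: {question}\nAnswer true or false."
        response = query_model(prompt).strip().lower()
        predicted = response.startswith("true")  # naive parse; eval.py may use a grader model instead
        correct += predicted == ground_truth
    return correct / len(queries)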
To install the necessary dependencies, run:

pip install -r requirements.txt

The plot below shows the performance of various models on the benchmark across different context lengths.
To run the evaluation, export your API keys and launch eval.py:

export GOOGLE_API_KEY=<Your google genai api key>
export OPENAI_API_KEY=<Your openai api key>
python eval.py --generation_model "gemini-2.0-flash" --generation_backend "google" --generation_api_key $GOOGLE_API_KEY --evaluation_api_key $OPENAI_API_KEY
The --num_samples flag is optional: set it to a small number (e.g. 5) to evaluate a small subset of the dataset, or leave it unset to evaluate the entire dataset.
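For example, a quick smoke test over five queries:

python eval.py --generation_model "gemini-2.0-flash" --generation_backend "google" --generation_api_key $GOOGLE_API_KEY --evaluation_api_key $OPENAI_API_KEY --num_samples 5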
The code repository is licensed under the MIT License.
