
Comprehensive Benchmark Example for a Coding Bot #49

Merged: 1 commit merged from leonvanbokhorst/issue48 into main on Nov 17, 2024

Conversation

@leonvanbokhorst (Owner) commented Nov 16, 2024

Fixes #48

Summary by Sourcery

New Features:

  • Introduce a comprehensive benchmarking suite for evaluating language models using various benchmark types such as MMLU, GLUE, SUPERGLUE, HUMANEVAL, GSM8K, BIGBENCH, and HELM.

sourcery-ai bot (Contributor) commented Nov 16, 2024

Reviewer's Guide by Sourcery

This PR introduces a comprehensive benchmark suite for evaluating Large Language Models (LLMs), with initial support for the MMLU (Massive Multitask Language Understanding) benchmark. The implementation uses async/await patterns for efficient execution and includes features for downloading datasets, running evaluations, and generating detailed reports.

Class diagram for the new benchmark suite

```mermaid
classDiagram
    class BenchmarkType {
        <<enumeration>>
        MMLU
        GLUE
        SUPERGLUE
        HUMANEVAL
        GSM8K
        BIGBENCH
        HELM
    }

    class BenchmarkResult {
        float score
        Dict metadata
        List strengths
        List weaknesses
    }

    class MMluQuestion {
        string question
        List choices
        string correct_answer
        string subject
    }

    class BenchmarkTask {
        <<abstract>>
        +evaluate(response: str) BenchmarkResult
        +generate_prompt() str
    }

    class MMluTask {
        -Path dataset_path
        -List questions
        -bool download_needed
        +initialize() void
        +generate_prompt() str
        +evaluate(response: str) BenchmarkResult
        +download_dataset(target_dir: str) void
        +_load_questions() List
    }

    class LLMBenchmarkSuite {
        -string model_name
        -float temperature
        -float top_p
        -int top_k
        -int questions_per_task
        -Dict tasks
        -Dict results
        -AsyncClient client
        +_get_llm_response(prompt: str) str
        +add_benchmark(benchmark_type: BenchmarkType, tasks: List) void
        +run_benchmarks(selected_types: Optional[List]) Dict
        +generate_report() str
    }

    BenchmarkTask <|-- MMluTask
    LLMBenchmarkSuite --> BenchmarkType
    LLMBenchmarkSuite --> BenchmarkTask
    LLMBenchmarkSuite --> BenchmarkResult
    MMluTask --> MMluQuestion
    MMluTask --> BenchmarkResult
```
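
For orientation, here is a minimal sketch of what the core abstractions in the diagram could look like in Python. The names mirror the diagram; the defaults and docstrings are illustrative assumptions, not the PR's actual code.

```python
# Minimal sketch of the core abstractions shown in the class diagram
# (illustrative only; defaults and docstrings are assumptions).
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Dict, List


class BenchmarkType(Enum):
    MMLU = auto()
    GLUE = auto()
    SUPERGLUE = auto()
    HUMANEVAL = auto()
    GSM8K = auto()
    BIGBENCH = auto()
    HELM = auto()


@dataclass
class BenchmarkResult:
    score: float
    metadata: Dict = field(default_factory=dict)
    strengths: List[str] = field(default_factory=list)
    weaknesses: List[str] = field(default_factory=list)


class BenchmarkTask(ABC):
    @abstractmethod
    def generate_prompt(self) -> str:
        """Return the prompt sent to the model for this task."""

    @abstractmethod
    def evaluate(self, response: str) -> BenchmarkResult:
        """Score the model's response to the generated prompt."""
```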

File-Level Changes

Implementation of the core benchmark framework using abstract base classes (src/18_llm_benchmark_suite.py)
  • Created BenchmarkType enum to define supported benchmark types
  • Defined BenchmarkResult dataclass to store evaluation results
  • Implemented abstract BenchmarkTask class with evaluate and generate_prompt methods
  • Added LLMBenchmarkSuite class to orchestrate benchmark execution
Implementation of the MMLU benchmark task with dataset handling (src/18_llm_benchmark_suite.py; a download sketch follows below)
  • Added dataset download and extraction functionality with progress bars
  • Implemented question loading from CSV files
  • Created multiple-choice question evaluation logic
  • Added support for tracking strengths and weaknesses in responses
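
A hedged sketch of the download-with-progress-bar step: the URL, archive name, chunk size, and the download_dataset helper below are assumptions for illustration, not taken from the PR.

```python
# Illustrative MMLU dataset download with a tqdm progress bar
# (URL, archive name, and chunk size are assumptions, not the PR's code).
import tarfile
import urllib.request
from pathlib import Path

from tqdm import tqdm

MMLU_URL = "https://people.eecs.berkeley.edu/~hendrycks/data.tar"  # assumed source


def download_dataset(target_dir: str) -> Path:
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    archive = target / "mmlu_data.tar"

    # Stream the archive to disk, updating the progress bar per chunk
    with urllib.request.urlopen(MMLU_URL) as response, open(archive, "wb") as out:
        total = int(response.headers.get("Content-Length", 0))
        with tqdm(total=total, unit="B", unit_scale=True, desc="MMLU") as bar:
            while chunk := response.read(64 * 1024):
                out.write(chunk)
                bar.update(len(chunk))

    # Unpack the per-subject CSV files next to the archive
    with tarfile.open(archive) as tar:
        tar.extractall(target)
    return target
```
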
Integration with the Ollama LLM client for model interaction (src/18_llm_benchmark_suite.py; a minimal call is sketched below)
  • Implemented async LLM response handling
  • Added configurable parameters for model inference (temperature, top_p, top_k)
  • Created structured message formatting for model queries
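
The Ollama interaction could look roughly like the following, using the ollama Python package's AsyncClient; the model name and sampling values are placeholders, not the PR's configured settings.

```python
# Minimal async Ollama chat call with configurable sampling parameters
# (model name and option values are placeholders).
import asyncio

from ollama import AsyncClient


async def get_llm_response(prompt: str, model: str = "llama3.1") -> str:
    client = AsyncClient()
    response = await client.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0.2, "top_p": 0.9, "top_k": 40},
    )
    return response["message"]["content"]


if __name__ == "__main__":
    answer = asyncio.run(get_llm_response("Reply with a single letter: A, B, C or D."))
    print(answer)
```
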
Implementation of reporting and result aggregation (src/18_llm_benchmark_suite.py; a reporting sketch follows below)
  • Added detailed report generation with per-task results
  • Implemented score aggregation across benchmark types
  • Created progress tracking using tqdm for long-running operations
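
Reporting and aggregation could be wired roughly as below, reusing the BenchmarkResult sketch from earlier; the report layout is invented for illustration.

```python
# Illustrative per-task report with aggregated scores per benchmark
# (layout is an assumption, not the PR's actual report format).
from typing import Dict, List


def generate_report(results: Dict[str, List["BenchmarkResult"]]) -> str:
    report = ["LLM Benchmark Report", "=" * 20]
    for benchmark_name, task_results in results.items():
        report.append(f"\n{benchmark_name}")
        for i, result in enumerate(task_results, 1):
            report.append(f"  Task {i}: score={result.score:.2f}")
        if task_results:  # guard against benchmarks with no tasks
            average = sum(r.score for r in task_results) / len(task_results)
            report.append(f"  Average: {average:.2%}")
    return "\n".join(report)
```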

Assessment against linked issues

  • #48: Implement a user interface that allows selection of multiple benchmarks (MMLU, GLUE, SuperGLUE, HumanEval, GSM8K, BIG-bench, and HELM).
  • #48: Implement task generation and evaluation for each benchmark type with appropriate scoring mechanisms. While the code provides a complete implementation for MMLU, it lacks implementations for the other benchmark types (GLUE, SuperGLUE, HumanEval, GSM8K, BIG-bench, and HELM); the base infrastructure exists, but the specific task implementations are missing.
  • #48: Implement results display with a structured format, visualizations, and interactive parameter adjustments. The code provides basic text-based reporting through generate_report(), but lacks the visualizations (such as heat maps) and interactive parameter adjustments specified in the acceptance criteria.


@sourcery-ai bot (Contributor) left a comment

Hey @leonvanbokhorst - I've reviewed your changes - here's some feedback:

Overall Comments:

  • Consider implementing pagination or streaming for dataset loading to reduce memory usage, especially for large benchmark datasets.
  • Add proper timeout handling and retry logic in _get_llm_response() to handle network issues and API failures gracefully (one possible shape is sketched below).
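
One possible shape for that, assuming the ollama AsyncClient is behind _get_llm_response(); the retry count, backoff, and timeout values below are arbitrary illustrations rather than recommendations from this review.

```python
# Sketch of timeout and retry handling around an async Ollama call
# (retry count, backoff, and timeout values are assumptions).
import asyncio

from ollama import AsyncClient


async def get_llm_response_with_retry(
    prompt: str,
    model: str = "llama3.1",
    retries: int = 3,
    timeout_s: float = 60.0,
) -> str:
    client = AsyncClient()
    last_error: Exception | None = None
    for attempt in range(1, retries + 1):
        try:
            response = await asyncio.wait_for(
                client.chat(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                ),
                timeout=timeout_s,
            )
            return response["message"]["content"]
        except Exception as exc:  # includes timeouts and connection errors
            last_error = exc
            await asyncio.sleep(2**attempt)  # exponential backoff before retrying
    raise RuntimeError(f"LLM request failed after {retries} attempts") from last_error
```
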
Here's what I looked at during the review
  • 🟡 General issues: 1 issue found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good


```python
    results.append(result)

self.results[benchmark_type] = results
scores[benchmark_type.name] = sum(r.score for r in results) / len(results)
```

issue: Missing edge case handling for empty task lists could cause division by zero

Add a check for an empty results list before calculating the average score (one possible shape is sketched below).
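
A minimal, self-contained way to express that guard; whether an empty benchmark is skipped (as here) or reported as zero is a design choice for the PR, and BenchmarkResult refers to the dataclass from the reviewer's guide.

```python
# Aggregate per-benchmark averages while skipping empty task lists
# (skip-on-empty is one possible policy, not necessarily the PR's).
from typing import Dict, List


def aggregate_scores(results_by_type: Dict[str, List["BenchmarkResult"]]) -> Dict[str, float]:
    scores: Dict[str, float] = {}
    for name, results in results_by_type.items():
        if not results:  # avoid ZeroDivisionError for benchmarks with no tasks
            continue
        scores[name] = sum(r.score for r in results) / len(results)
    return scores
```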

Comment on lines +117 to +125
```python
for _, row in df.iterrows():
    questions.append(
        MMluQuestion(
            question=row[0],
            choices=[row[1], row[2], row[3], row[4]],
            correct_answer=row[5],
            subject=subject,
        )
    )
```

issue (code-quality): Replace a for append loop with list extend (for-append-to-extend); a possible rewrite is sketched below.
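
In the style of the quoted excerpt, the loop could collapse into a single extend over a generator; this assumes the same df, questions, MMluQuestion, and subject names as above and is only a sketch of the suggested refactor.

```python
# Sketch of the for-append-to-extend rewrite (same names as the excerpt above).
questions.extend(
    MMluQuestion(
        question=row[0],
        choices=[row[1], row[2], row[3], row[4]],
        correct_answer=row[5],
        subject=subject,
    )
    for _, row in df.iterrows()
)
```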

Comment on lines +158 to +164
```python
# Find first valid answer in response
answer = None
for char in response:
    if char in valid_answers:
        answer = char
        break
```

suggestion (code-quality): Use the built-in function next instead of a for-loop (use-next)

Suggested change:

```diff
-# Find first valid answer in response
-answer = None
-for char in response:
-    if char in valid_answers:
-        answer = char
-        break
+answer = next((char for char in response if char in valid_answers), None)
```

```python
total_score = 0

for i, result in enumerate(results, 1):
    report.append(f"\nTask {i}:")
```

issue (code-quality): We've found these issues:

@leonvanbokhorst leonvanbokhorst self-assigned this Nov 16, 2024
@leonvanbokhorst leonvanbokhorst merged commit 1b15ecf into main Nov 17, 2024
1 check passed
@leonvanbokhorst leonvanbokhorst deleted the leonvanbokhorst/issue48 branch November 23, 2024 13:18