Comprehensive Benchmark Example for a Coding Bot #49
Conversation
Reviewer's Guide by Sourcery

This PR introduces a comprehensive benchmark suite for evaluating Large Language Models (LLMs), with initial support for the MMLU (Massive Multitask Language Understanding) benchmark. The implementation uses async/await patterns for efficient execution and includes features for downloading datasets, running evaluations, and generating detailed reports.

Class diagram for the new benchmark suite

classDiagram
class BenchmarkType {
<<enumeration>>
MMLU
GLUE
SUPERGLUE
HUMANEVAL
GSM8K
BIGBENCH
HELM
}
class BenchmarkResult {
float score
Dict metadata
List strengths
List weaknesses
}
class MMluQuestion {
string question
List choices
string correct_answer
string subject
}
class BenchmarkTask {
<<abstract>>
+evaluate(response: str) BenchmarkResult
+generate_prompt() str
}
class MMluTask {
-Path dataset_path
-List questions
-bool download_needed
+initialize() void
+generate_prompt() str
+evaluate(response: str) BenchmarkResult
+download_dataset(target_dir: str) void
+_load_questions() List
}
class LLMBenchmarkSuite {
-string model_name
-float temperature
-float top_p
-int top_k
-int questions_per_task
-Dict tasks
-Dict results
-AsyncClient client
+_get_llm_response(prompt: str) str
+add_benchmark(benchmark_type: BenchmarkType, tasks: List) void
+run_benchmarks(selected_types: Optional[List]) Dict
+generate_report() str
}
BenchmarkTask <|-- MMluTask
LLMBenchmarkSuite --> BenchmarkType
LLMBenchmarkSuite --> BenchmarkTask
LLMBenchmarkSuite --> BenchmarkResult
MMluTask --> MMluQuestion
MMluTask --> BenchmarkResult
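
A minimal sketch of the core data types from the diagram above, written as plain Python dataclasses (the field names follow the diagram; the dataclass layout and defaults are assumptions, not the PR's exact code):

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Dict, List


class BenchmarkType(Enum):
    """Benchmarks the suite knows about; only MMLU is implemented in this PR."""
    MMLU = auto()
    GLUE = auto()
    SUPERGLUE = auto()
    HUMANEVAL = auto()
    GSM8K = auto()
    BIGBENCH = auto()
    HELM = auto()


@dataclass
class BenchmarkResult:
    score: float                                   # normalized task score
    metadata: Dict = field(default_factory=dict)   # per-task details (subject, raw answer, ...)
    strengths: List[str] = field(default_factory=list)
    weaknesses: List[str] = field(default_factory=list)


@dataclass
class MMluQuestion:
    question: str
    choices: List[str]        # the four answer options
    correct_answer: str       # letter of the correct choice, e.g. "A"
    subject: str              # MMLU subject the question belongs to
```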
File-Level Changes
Assessment against linked issues
Possibly linked issues
Hey @leonvanbokhorst - I've reviewed your changes - here's some feedback:
Overall Comments:
- Consider implementing pagination or streaming for dataset loading to reduce memory usage, especially for large benchmark datasets.
- Add proper timeout handling and retry logic in _get_llm_response() to handle network issues and API failures gracefully.
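
A sketch of what that timeout/retry handling could look like, assuming `_get_llm_response` ultimately awaits an async client call; the helper name, backoff schedule, and exception set here are illustrative assumptions rather than the PR's actual code:

```python
import asyncio
from typing import Awaitable, Callable


async def call_with_retry(
    call_llm: Callable[[str], Awaitable[str]],  # async wrapper around the suite's client
    prompt: str,
    retries: int = 3,
    timeout_s: float = 30.0,
) -> str:
    """Run an async LLM call with a per-attempt timeout and exponential backoff."""
    last_error: Exception | None = None
    for attempt in range(retries):
        try:
            return await asyncio.wait_for(call_llm(prompt), timeout=timeout_s)
        except (asyncio.TimeoutError, OSError) as exc:
            last_error = exc
            await asyncio.sleep(2 ** attempt)  # back off 1s, 2s, 4s, ...
    raise RuntimeError(f"LLM request failed after {retries} attempts") from last_error
```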
Here's what I looked at during the review
- 🟡 General issues: 1 issue found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟢 Documentation: all looks good
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
results.append(result)

self.results[benchmark_type] = results
scores[benchmark_type.name] = sum(r.score for r in results) / len(results)
issue: Missing edge case handling for empty task lists could cause division by zero
Add a check for empty results list before calculating the average score
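
One possible shape for that guard, reusing the names from the snippet above (sketch only):

```python
self.results[benchmark_type] = results
if results:
    scores[benchmark_type.name] = sum(r.score for r in results) / len(results)
else:
    # No tasks ran for this benchmark; record 0.0 instead of dividing by zero
    scores[benchmark_type.name] = 0.0
```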
for _, row in df.iterrows():
    questions.append(
        MMluQuestion(
            question=row[0],
            choices=[row[1], row[2], row[3], row[4]],
            correct_answer=row[5],
            subject=subject,
        )
    )
issue (code-quality): Replace a for append loop with list extend (for-append-to-extend)
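
Applied to the loop above, the suggestion amounts to building all questions in a single `extend` call with a generator (sketch; assumes the same column layout as the original loop):

```python
questions.extend(
    MMluQuestion(
        question=row[0],
        choices=[row[1], row[2], row[3], row[4]],
        correct_answer=row[5],
        subject=subject,
    )
    for _, row in df.iterrows()
)
```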
# Find first valid answer in response
answer = None
for char in response:
    if char in valid_answers:
        answer = char
        break
suggestion (code-quality): Use the built-in function next instead of a for-loop (use-next)
Suggested change:

# Find first valid answer in response
answer = next((char for char in response if char in valid_answers), None)
total_score = 0

for i, result in enumerate(results, 1):
    report.append(f"\nTask {i}:")
issue (code-quality): We've found these issues:
- Merge consecutive list appends into a single extend (merge-list-appends-into-extend)
- Replace a for append loop with list extend [×3] (for-append-to-extend)
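
For the report loop above, the two refactorings could look roughly like this; the extra report line and the score accumulation are hypothetical stand-ins for whatever the original loop appends next:

```python
total_score = 0

for i, result in enumerate(results, 1):
    # Merge consecutive appends into one extend call
    report.extend([
        f"\nTask {i}:",
        f"  Score: {result.score:.2f}",  # assumed follow-up line, not from the PR
    ])
    total_score += result.score
```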
Fixes #48
Summary by Sourcery
New Features: