Title:
As an AI-driven coding assistant, I want to create a benchmark example that incorporates tasks from widely used LLM benchmarks, so developers can evaluate and understand the bot's performance across diverse domains.
Narrative:
Leon, a university educator, wants to demonstrate how to evaluate an LLM's performance comprehensively. He envisions a coding bot that not only helps generate benchmarks but also illustrates results dynamically. The bot should take input from the user to select specific benchmarks, provide detailed explanations of each, and execute example tasks. By doing so, it will enable developers and learners to understand both the methodology and the performance of their LLM.
Acceptance Criteria:
User Interface:
The bot should allow users to choose from the following benchmarks: MMLU, GLUE, SuperGLUE, HumanEval, GSM8K, BIG-bench, and HELM.
Users can select one or multiple benchmarks; a minimal selection sketch follows below.
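A minimal sketch of the selection step, assuming a simple comma-separated text input; the benchmark names come from the list above, while the function name and prompt handling are illustrative only:

```python
# Hypothetical selection helper; only the benchmark names are taken from the story above.
SUPPORTED_BENCHMARKS = ["MMLU", "GLUE", "SuperGLUE", "HumanEval", "GSM8K", "BIG-bench", "HELM"]

def select_benchmarks(raw_input: str) -> list:
    """Parse a comma-separated input such as 'MMLU, GSM8K' into known benchmark names."""
    requested = [name.strip() for name in raw_input.split(",") if name.strip()]
    unknown = [name for name in requested if name not in SUPPORTED_BENCHMARKS]
    if unknown:
        raise ValueError(f"Unsupported benchmark(s): {', '.join(unknown)}")
    return requested

# Example: select_benchmarks("MMLU, GSM8K") -> ["MMLU", "GSM8K"]
```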
Task Generation:
For each selected benchmark, generate a task of the appropriate type (a dispatch sketch follows this list):
MMLU: Generate a subject-specific multiple-choice question (e.g., law, mathematics).
GLUE/SuperGLUE: Provide a sentence- or sentence-pair classification task (e.g., sentiment analysis or textual entailment).
HumanEval: Propose a coding challenge requiring the user to write or evaluate a function.
GSM8K: Create a math word problem suitable for grade-school level.
BIG-bench: Generate a creative or reasoning task (e.g., analogy completion or abstract thinking).
HELM: Demonstrate a fairness or robustness evaluation scenario (e.g., bias detection in a given dataset).
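One possible shape for the task-generation step: a dispatch table mapping each benchmark to a generator. The task texts mirror the examples listed under "Examples of Tasks" below; the function and key names are assumptions, not a fixed API:

```python
# Hypothetical task-generation dispatch; only three benchmarks are filled in here.
def generate_task(benchmark: str) -> dict:
    generators = {
        "MMLU": lambda: {
            "type": "multiple_choice",
            "question": "What is the term for a contract requiring no mutual agreement?",
            "choices": ["Bilateral Contract", "Unilateral Contract",
                        "Express Contract", "Implied Contract"],
        },
        "HumanEval": lambda: {
            "type": "coding",
            "prompt": "Write a function to find the factorial of a number using recursion.",
        },
        "GSM8K": lambda: {
            "type": "word_problem",
            "question": "If Jane has 4 apples and buys 6 more, "
                        "how many apples does she have in total?",
        },
        # GLUE/SuperGLUE, BIG-bench, and HELM generators would follow the same pattern.
    }
    if benchmark not in generators:
        raise ValueError(f"No generator registered for {benchmark}")
    return generators[benchmark]()
```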
Evaluation Framework:
Provide a scoring mechanism that evaluates the bot's response for correctness.
Highlight specific strengths and weaknesses based on benchmark criteria; a minimal scoring sketch follows below.
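A minimal sketch of the scoring hook, assuming each task carries an expected answer and each benchmark can supply its own comparison function (e.g., unit tests for HumanEval, exact match for GSM8K); the 1/0 scoring and mean aggregation are assumptions, not prescribed metrics:

```python
from typing import Callable, Optional

def score_response(response: str, expected: str,
                   compare: Optional[Callable[[str, str], bool]] = None) -> float:
    """Return 1.0 for a correct response, 0.0 otherwise (exact match by default)."""
    if compare is None:
        compare = lambda a, b: a.strip().lower() == b.strip().lower()
    return 1.0 if compare(response, expected) else 0.0

def benchmark_score(task_scores: list) -> float:
    """Aggregate per-task scores into a single benchmark score in [0, 1]."""
    return sum(task_scores) / len(task_scores) if task_scores else 0.0
```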
Results Display:
Display the results in a structured format, including:
Scores for each benchmark.
A summary of areas of excellence and areas needing improvement.
Offer visualizations for more complex metrics, such as heat maps for HELM or task-wise scores; a sketch of the structured text output follows below.
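A sketch of the structured text output, assuming per-benchmark scores have already been computed; the layout is illustrative, and heat-map rendering (e.g., with matplotlib) is left out of the sketch:

```python
# Hypothetical results formatter: one row per benchmark plus a one-line summary.
def display_results(scores: dict) -> None:
    if not scores:
        print("No results yet.")
        return
    print(f"{'Benchmark':<12} {'Score':>6}")
    for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<12} {score:>6.2f}")
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    print(f"\nStrongest area: {best}; needs improvement: {worst}")

# Example: display_results({"MMLU": 0.82, "HumanEval": 0.71, "GSM8K": 0.64})
```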
Interactivity:
Allow users to tweak parameters (e.g., difficulty level, topic focus); a settings sketch follows this list.
Offer explanations for why certain answers were correct or incorrect.
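One way to expose tweakable parameters: a small settings object passed to the task generators. The field names and defaults below are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskSettings:
    difficulty: str = "medium"        # e.g., "easy", "medium", "hard"
    topic: Optional[str] = None       # e.g., "law" for MMLU, "algebra" for GSM8K
    explain_answers: bool = True      # include a rationale with each graded answer

# Example: TaskSettings(difficulty="hard", topic="law")
```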
Documentation:
Include inline documentation for each benchmark, explaining its purpose, relevance, and scoring methodology; a minimal documentation sketch follows below.
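A minimal sketch of the inline documentation, assuming a plain dictionary the bot surfaces on request; the one-line summaries reflect how these benchmarks are commonly described:

```python
# Hypothetical benchmark documentation shown to the user on demand.
BENCHMARK_DOCS = {
    "MMLU": "Multiple-choice questions across many academic subjects; scored by accuracy.",
    "GLUE": "A suite of sentence- and sentence-pair classification tasks; scored per task.",
    "SuperGLUE": "A harder successor to GLUE with more demanding language-understanding tasks.",
    "HumanEval": "Python programming problems checked against unit tests (pass@k).",
    "GSM8K": "Grade-school math word problems; scored by exact match on the final answer.",
    "BIG-bench": "A broad, community-contributed collection of reasoning and creativity tasks.",
    "HELM": "A holistic evaluation covering accuracy, robustness, fairness, and more.",
}

def explain(benchmark: str) -> str:
    return BENCHMARK_DOCS.get(benchmark, "No documentation available for this benchmark.")
```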
Examples of Tasks:
MMLU (Law):
"What is the term for a contract requiring no mutual agreement?
A) Bilateral Contract
B) Unilateral Contract
C) Express Contract
D) Implied Contract"
GLUE (Sentiment Analysis):
"Given the sentence: 'The movie was surprisingly entertaining,' classify the sentiment as positive or negative."
HumanEval (Python):
"Write a function to find the factorial of a number using recursion."
GSM8K (Math):
"If Jane has 4 apples and buys 6 more, how many apples does she have in total?"
BIG-bench (Creativity):
"Complete the analogy: Water is to river as blood is to ______."
HELM (Fairness):
"Analyze the sentiment of these sentences and determine if there is any bias in the evaluations based on gendered language."
By fulfilling this user story, the coding bot will empower developers to seamlessly evaluate their LLM across multiple dimensions, providing actionable insights and fostering a deeper understanding of model capabilities.