Title:
As an AI-driven coding assistant, I want to create a benchmark example that incorporates tasks from widely used LLM benchmarks, so developers can evaluate and understand the bot's performance across diverse domains.
Narrative:
Leon, a university educator, wants to demonstrate how to evaluate an LLM's performance comprehensively. He envisions a coding bot that not only helps generate benchmarks but also illustrates results dynamically. The bot should take input from the user to select specific benchmarks, provide detailed explanations of each, and execute example tasks. By doing so, it will enable developers and learners to understand both the methodology and the performance of their LLM.
Acceptance Criteria:
User Interface:
The bot should allow users to choose from the following benchmarks: MMLU, GLUE, SuperGLUE, HumanEval, GSM8K, BIG-bench, and HELM.
Users can select one or multiple benchmarks; a minimal selection sketch follows below.
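A minimal sketch of the selection step, assuming a simple comma-separated text input; the benchmark names come from the list above, while the function name and prompt handling are illustrative only:

```python
# Hypothetical selection helper; only the benchmark names are taken from the story above.
SUPPORTED_BENCHMARKS = ["MMLU", "GLUE", "SuperGLUE", "HumanEval", "GSM8K", "BIG-bench", "HELM"]

def select_benchmarks(raw_input: str) -> list:
    """Parse a comma-separated input such as 'MMLU, GSM8K' into known benchmark names."""
    requested = [name.strip() for name in raw_input.split(",") if name.strip()]
    unknown = [name for name in requested if name not in SUPPORTED_BENCHMARKS]
    if unknown:
        raise ValueError(f"Unsupported benchmark(s): {', '.join(unknown)}")
    return requested

# Example: select_benchmarks("MMLU, GSM8K") -> ["MMLU", "GSM8K"]
```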
Task Generation:
For each selected benchmark, generate a task of the appropriate type (a dispatch sketch follows this list):
MMLU: Generate a subject-specific multiple-choice question (e.g., law, mathematics).
GLUE/SuperGLUE: Provide a sentence- or sentence-pair classification task (e.g., sentiment analysis or textual entailment).
HumanEval: Propose a coding challenge requiring the user to write or evaluate a function.
GSM8K: Create a math word problem suitable for grade-school level.
BIG-bench: Generate a creative or reasoning task (e.g., analogy completion or abstract thinking).
HELM: Demonstrate a fairness or robustness evaluation scenario (e.g., bias detection in a given dataset).
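One possible shape for the task-generation step: a dispatch table mapping each benchmark to a generator. The task texts mirror the examples listed under "Examples of Tasks" below; the function and key names are assumptions, not a fixed API:

```python
# Hypothetical task-generation dispatch; only three benchmarks are filled in here.
def generate_task(benchmark: str) -> dict:
    generators = {
        "MMLU": lambda: {
            "type": "multiple_choice",
            "question": "What is the term for a contract requiring no mutual agreement?",
            "choices": ["Bilateral Contract", "Unilateral Contract",
                        "Express Contract", "Implied Contract"],
        },
        "HumanEval": lambda: {
            "type": "coding",
            "prompt": "Write a function to find the factorial of a number using recursion.",
        },
        "GSM8K": lambda: {
            "type": "word_problem",
            "question": "If Jane has 4 apples and buys 6 more, "
                        "how many apples does she have in total?",
        },
        # GLUE/SuperGLUE, BIG-bench, and HELM generators would follow the same pattern.
    }
    if benchmark not in generators:
        raise ValueError(f"No generator registered for {benchmark}")
    return generators[benchmark]()
```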
Evaluation Framework:
Provide a scoring mechanism that evaluates the bot's response for correctness.
Highlight specific strengths and weaknesses based on benchmark criteria; a minimal scoring sketch follows below.
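A minimal sketch of the scoring hook, assuming each task carries an expected answer and each benchmark can supply its own comparison function (e.g., unit tests for HumanEval, exact match for GSM8K); the 1/0 scoring and mean aggregation are assumptions, not prescribed metrics:

```python
from typing import Callable, Optional

def score_response(response: str, expected: str,
                   compare: Optional[Callable[[str, str], bool]] = None) -> float:
    """Return 1.0 for a correct response, 0.0 otherwise (exact match by default)."""
    if compare is None:
        compare = lambda a, b: a.strip().lower() == b.strip().lower()
    return 1.0 if compare(response, expected) else 0.0

def benchmark_score(task_scores: list) -> float:
    """Aggregate per-task scores into a single benchmark score in [0, 1]."""
    return sum(task_scores) / len(task_scores) if task_scores else 0.0
```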
Results Display:
Display the results in a structured format, including:
Scores for each benchmark.
A summary of areas of excellence and areas needing improvement.
Offer visualizations for more complex metrics, such as heat maps for HELM or task-wise scores; a sketch of the structured text output follows below.
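A sketch of the structured text output, assuming per-benchmark scores have already been computed; the layout is illustrative, and heat-map rendering (e.g., with matplotlib) is left out of the sketch:

```python
# Hypothetical results formatter: one row per benchmark plus a one-line summary.
def display_results(scores: dict) -> None:
    if not scores:
        print("No results yet.")
        return
    print(f"{'Benchmark':<12} {'Score':>6}")
    for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<12} {score:>6.2f}")
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    print(f"\nStrongest area: {best}; needs improvement: {worst}")

# Example: display_results({"MMLU": 0.82, "HumanEval": 0.71, "GSM8K": 0.64})
```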
Interactivity:
Allow users to tweak parameters (e.g., difficulty level, topic focus); a settings sketch follows this list.
Offer explanations for why certain answers were correct or incorrect.
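One way to expose tweakable parameters: a small settings object passed to the task generators. The field names and defaults below are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskSettings:
    difficulty: str = "medium"        # e.g., "easy", "medium", "hard"
    topic: Optional[str] = None       # e.g., "law" for MMLU, "algebra" for GSM8K
    explain_answers: bool = True      # include a rationale with each graded answer

# Example: TaskSettings(difficulty="hard", topic="law")
```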
Documentation:
Include inline documentation for each benchmark, explaining its purpose, relevance, and scoring methodology; a minimal documentation sketch follows below.
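A minimal sketch of the inline documentation, assuming a plain dictionary the bot surfaces on request; the one-line summaries reflect how these benchmarks are commonly described:

```python
# Hypothetical benchmark documentation shown to the user on demand.
BENCHMARK_DOCS = {
    "MMLU": "Multiple-choice questions across many academic subjects; scored by accuracy.",
    "GLUE": "A suite of sentence- and sentence-pair classification tasks; scored per task.",
    "SuperGLUE": "A harder successor to GLUE with more demanding language-understanding tasks.",
    "HumanEval": "Python programming problems checked against unit tests (pass@k).",
    "GSM8K": "Grade-school math word problems; scored by exact match on the final answer.",
    "BIG-bench": "A broad, community-contributed collection of reasoning and creativity tasks.",
    "HELM": "A holistic evaluation covering accuracy, robustness, fairness, and more.",
}

def explain(benchmark: str) -> str:
    return BENCHMARK_DOCS.get(benchmark, "No documentation available for this benchmark.")
```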
Examples of Tasks:
MMLU (Law):
"What is the term for a contract requiring no mutual agreement?
A) Bilateral Contract
B) Unilateral Contract
C) Express Contract
D) Implied Contract"
GLUE (Sentiment Analysis):
"Given the sentence: 'The movie was surprisingly entertaining,' classify the sentiment as positive or negative."
HumanEval (Python):
"Write a function to find the factorial of a number using recursion."
GSM8K (Math):
"If Jane has 4 apples and buys 6 more, how many apples does she have in total?"
BIG-bench (Creativity):
"Complete the analogy: Water is to river as blood is to ______."
HELM (Fairness):
"Analyze the sentiment of these sentences and determine if there is any bias in the evaluations based on gendered language."
By fulfilling this user story, the coding bot will empower developers to seamlessly evaluate their LLM across multiple dimensions, providing actionable insights and fostering a deeper understanding of model capabilities.