
Comprehensive Benchmark Example for a Coding Bot #48

Closed
leonvanbokhorst opened this issue Nov 16, 2024 · 0 comments · Fixed by #49
Labels: enhancement (New feature or request)

leonvanbokhorst commented Nov 16, 2024

Title:
As an AI-driven coding assistant, I want to create a benchmark example that incorporates tasks from widely used LLM benchmarks, so developers can evaluate and understand the bot's performance across diverse domains.


Narrative:
Leon, a university educator, wants to demonstrate how to evaluate an LLM's performance comprehensively. He envisions a coding bot that not only helps generate benchmarks but also illustrates results dynamically. The bot should take input from the user to select specific benchmarks, provide detailed explanations of each, and execute example tasks. By doing so, it will enable developers and learners to understand both the methodology and the performance of their LLM.


Acceptance Criteria:

  1. User Interface:

    • The bot should allow users to choose from the following benchmarks: MMLU, GLUE, SuperGLUE, HumanEval, GSM8K, BIG-bench, and HELM.
    • Users can select one or more benchmarks (a minimal runner sketch covering items 1-4 follows this list).
  2. Task Generation:

    • For each selected benchmark:
      • MMLU: Generate a subject-specific multiple-choice question (e.g., law, mathematics).
      • GLUE/SuperGLUE: Provide a sentence-pair classification task (e.g., sentiment analysis or textual entailment).
      • HumanEval: Propose a coding challenge requiring the user to write or evaluate a function.
      • GSM8K: Create a math word problem suitable for grade-school level.
      • BIG-bench: Generate a creative or reasoning task (e.g., analogy completion or abstract thinking).
      • HELM: Demonstrate a fairness or robustness evaluation scenario (e.g., bias detection in a given dataset).
  3. Evaluation Framework:

    • Provide a scoring mechanism that evaluates the bot's response for correctness.
    • Highlight specific strengths and weaknesses based on benchmark criteria.
  4. Results Display:

    • Display the results in a structured format, including:
      • Scores for each benchmark.
      • A summary of areas of excellence and areas needing improvement.
    • Offer visualizations for more complex metrics, such as heat maps for HELM or per-task scores.
  5. Interactivity:

    • Allow users to tweak parameters (e.g., difficulty level, topic focus).
    • Offer explanations for why certain answers were correct or incorrect.
  6. Documentation:

    • Include inline documentation for each benchmark, explaining its purpose, relevance, and scoring methodology.
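
A minimal sketch of how the selection, task generation, scoring, and results summary from items 1-4 could hang together. Everything here is hypothetical: `BenchmarkTask`, `BENCHMARKS`, `run_benchmarks`, and the exact-match scorer are illustrative stand-ins, not the official benchmark harnesses.

```python
# Hypothetical sketch of a benchmark registry with per-benchmark task
# generation and scoring. None of these names exist in the repo yet.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BenchmarkTask:
    benchmark: str   # e.g. "MMLU", "GSM8K"
    prompt: str      # the task shown to the bot/user
    reference: str   # expected answer used by the scorer


def exact_match(response: str, task: BenchmarkTask) -> float:
    """Toy scorer: 1.0 if the reference answer appears in the response."""
    return 1.0 if task.reference.lower() in response.lower() else 0.0


# Registry: each benchmark provides a task generator and a scorer (items 1 and 2).
BENCHMARKS: Dict[str, dict] = {
    "MMLU": {
        "generate": lambda: BenchmarkTask(
            "MMLU",
            "What is the term for a contract requiring no mutual agreement?\n"
            "A) Bilateral  B) Unilateral  C) Express  D) Implied",
            "B",
        ),
        "score": exact_match,
    },
    "GSM8K": {
        "generate": lambda: BenchmarkTask(
            "GSM8K",
            "If Jane has 4 apples and buys 6 more, how many apples does she have?",
            "10",
        ),
        "score": exact_match,
    },
    # GLUE, SuperGLUE, HumanEval, BIG-bench, and HELM would follow the same shape.
}


def run_benchmarks(selected: List[str], answer_fn: Callable[[str], str]) -> dict:
    """Run the selected benchmarks and return scores plus a summary (items 3 and 4)."""
    scores = {}
    for name in selected:
        task = BENCHMARKS[name]["generate"]()
        response = answer_fn(task.prompt)      # the coding bot answers here
        scores[name] = BENCHMARKS[name]["score"](response, task)
    return {
        "scores": scores,
        "summary": {
            "strengths": [n for n, s in scores.items() if s >= 0.5],
            "needs_improvement": [n for n, s in scores.items() if s < 0.5],
        },
    }
```

A call such as `run_benchmarks(["MMLU", "GSM8K"], bot.answer)` would then produce the structured output described in item 4, which the visualization layer (heat maps, per-task scores) could render.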

Examples of Tasks:

  1. MMLU (Law):
    "What is the term for a contract requiring no mutual agreement?
    A) Bilateral Contract
    B) Unilateral Contract
    C) Express Contract
    D) Implied Contract"

  2. GLUE (Sentiment Analysis):
    "Given the sentence: 'The movie was surprisingly entertaining,' classify the sentiment as positive or negative."

  3. HumanEval (Python):
    "Write a function to find the factorial of a number using recursion." (A candidate solution is sketched after this list.)

  4. GSM8K (Math):
    "If Jane has 4 apples and buys 6 more, how many apples does she have in total?"

  5. BIG-bench (Creativity):
    "Complete the analogy: Water is to river as blood is to ______."

  6. HELM (Fairness):
    "Analyze the sentiment of these sentences and determine if there is any bias in the evaluations based on gendered language."


By fulfilling this user story, the coding bot will empower developers to seamlessly evaluate their LLM across multiple dimensions, providing actionable insights and fostering a deeper understanding of model capabilities.
