feat: add a way to evaluate llms for our usecase against our prompts #1700

Merged
MrgSub merged 8 commits into Mail-0:staging from retrogtx:eval
Jul 15, 2025
Conversation

@retrogtx
Contributor

@retrogtx retrogtx commented Jul 10, 2025

just install everything and run pnpm eval to check scores for each test

play around with models to get higher scores, currently the default is set to 4o-mini

Summary by CodeRabbit

  • New Features
    • Introduced an evaluation suite to assess AI chat capabilities for email management, including conversational responses, search, organization, and smart categorization.
    • Added scripts to streamline running and developing evaluation and AI tests.
  • Chores
    • Updated development dependencies and tooling for improved testing and evaluation workflows.
    • Expanded TypeScript configuration to include evaluation and test files.
    • Added configuration for extended test timeouts.

@coderabbitai
Contributor

coderabbitai bot commented Jul 10, 2025

Walkthrough

A new AI chat evaluation suite for email tasks was introduced using the evalite framework. Supporting scripts, dependencies, and TypeScript configuration were added or updated to enable evaluation and testing workflows. A Vite configuration file was also introduced to set test timeouts for the server application.

Changes

| File(s) | Change Summary |
|---|---|
| apps/server/evals/ai-chat-basic.eval.ts | Added a comprehensive evalite-based test suite for AI chat email capabilities across multiple email scenarios. |
| apps/server/package.json | Added evalite, autoevals, vite, and vitest to devDependencies; added eval-related npm scripts for the server app. |
| apps/server/tsconfig.json | Updated TypeScript "include" to cover "tests/**/*.ts" and "evals/**/*.ts". |
| apps/server/vite.config.ts | Introduced Vite config setting test, hook, and teardown timeouts to 120,000ms. |
| package.json | Added root-level npm scripts for AI testing and evaluation targeting @zero/server; minor syntax fix. |

Sequence Diagram(s)

sequenceDiagram
    participant Tester
    participant Evalite
    participant PerplexityModel
    participant Scorer

    Tester->>Evalite: Run email-related evaluation suite
    Evalite->>PerplexityModel: Generate response for test prompt
    PerplexityModel-->>Evalite: Return AI-generated response
    Evalite->>Scorer: Score response (factuality, Levenshtein)
    Scorer-->>Evalite: Return evaluation metrics
    Evalite-->>Tester: Report evaluation results
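
The flow in the diagram can be sketched without any external packages. This is a minimal illustration, not the code in this PR: `runEval`, `levenshteinScore`, and the `task` callback are hypothetical names, the model call is left to the caller, and the scorer is a hand-written normalised edit distance standing in for the library scorers the suite actually uses.

```typescript
interface EvalCase {
  input: string;
  expected: string;
}

interface EvalResult extends EvalCase {
  output: string;
  score: number;
}

// Edit distance via the standard two-row dynamic-programming table.
export function levenshteinDistance(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Normalise to the range 0..1 so that 1 means an exact match.
export function levenshteinScore(output: string, expected: string): number {
  const longest = Math.max(output.length, expected.length);
  return longest === 0 ? 1 : 1 - levenshteinDistance(output, expected) / longest;
}

// Run every case through the task (the model call) and score its output.
export async function runEval(
  cases: EvalCase[],
  task: (input: string) => Promise<string>,
): Promise<EvalResult[]> {
  const results: EvalResult[] = [];
  for (const c of cases) {
    const output = await task(c.input);
    results.push({ ...c, output, score: levenshteinScore(output, c.expected) });
  }
  return results;
}
```

Passing a stubbed `task` (for example one that echoes a canned reply) makes it possible to check the scoring logic without spending model tokens.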

Poem

In the warren, code bunnies cheer,
For evalite tests are finally here!
With scripts and configs, all set anew,
Our AI chats know just what to do.
Emails sorted, queries bright—
The future of testing hops into sight!
🐇✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 14abaa1 and 3245d0b.

📒 Files selected for processing (1)
  • apps/server/evals/ai-chat-basic.eval.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • apps/server/evals/ai-chat-basic.eval.ts
Contributor

@greptile-apps greptile-apps bot left a comment


PR Summary

Added a framework for evaluating LLM performance against email-related prompts, with initial setup using Perplexity's Sonar model as default.

  • Added apps/server/evals/ai-chat-basic.eval.ts implementing evaluation tests for email tasks using evalite library
  • Added extended timeout configuration (120s) in apps/server/vite.config.ts to accommodate LLM response times
  • Added evaluation scripts (pnpm eval, pnpm eval:dev, pnpm eval:ci) in package.json files
  • Added test infrastructure dependencies including autoevals and evalite for automated LLM testing

5 files reviewed, 4 comments

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
apps/server/evals/ai-chat-basic.eval.ts (2)

8-8: Clean up the informal comment

Please use professional language in code comments.

-// add ur own model here 
+// Configure your model here

25-41: Consider enhancing test expectations for more thorough validation

The current test expectations use single keywords, which might not adequately validate the AI's responses. Consider using more comprehensive expected outputs or implementing custom scorers that check for multiple aspects of the response.

Would you like me to help create more comprehensive test expectations that better validate the AI's responses?
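
One way to go beyond single keywords is a scorer that checks several expected terms and reports the fraction found. The sketch below is an assumption about what such a scorer could look like; `keywordCoverage` is a hypothetical name and is not part of the eval file or of any library.

```typescript
// Fraction (0..1) of the expected keywords that appear in the model output,
// compared case-insensitively. An empty keyword list counts as fully satisfied.
export function keywordCoverage(output: string, keywords: string[]): number {
  if (keywords.length === 0) return 1;
  const haystack = output.toLowerCase();
  const hits = keywords.filter((k) => haystack.includes(k.toLowerCase())).length;
  return hits / keywords.length;
}

// Illustrative expectation: the reply should mention both the filter and the sender.
const coverage = keywordCoverage("Here are your unread emails from Alice.", [
  "unread",
  "alice",
]);
console.log(coverage);
```

A partial score makes regressions easier to spot than a pass/fail on one word, at the cost of still being a substring match rather than a semantic check.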

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3532012 and 6b88b39.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (5)
  • apps/server/evals/ai-chat-basic.eval.ts (1 hunks)
  • apps/server/package.json (2 hunks)
  • apps/server/tsconfig.json (1 hunks)
  • apps/server/vite.config.ts (1 hunks)
  • package.json (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
apps/server/package.json (1)
Learnt from: JagjeevanAK
PR: Mail-0/Zero#1583
File: apps/docs/package.json:1-0
Timestamp: 2025-07-01T12:53:32.495Z
Learning: The Zero project prefers to handle dependency updates through automated tools like Dependabot rather than immediate manual updates, allowing for proper testing and validation through their established workflow.
package.json (1)
Learnt from: adiologydev
PR: Mail-0/Zero#871
File: docker-compose.yaml:2-21
Timestamp: 2025-05-04T23:13:26.825Z
Learning: Next.js requires certain environment variables during static site generation at build time, particularly those with the NEXT_PUBLIC_ prefix. When using Docker, these should be passed as build args, while sensitive values like API keys and secrets should ideally only be passed at runtime as environment variables.
apps/server/evals/ai-chat-basic.eval.ts (1)
Learnt from: retrogtx
PR: Mail-0/Zero#1622
File: apps/server/src/lib/email-verification.ts:189-189
Timestamp: 2025-07-05T05:27:24.592Z
Learning: During testing phases, debug logging should be kept active in apps/server/src/lib/email-verification.ts for BIMI validation and email verification debugging, even if it's verbose.
🧬 Code Graph Analysis (1)
apps/server/evals/ai-chat-basic.eval.ts (1)
apps/server/src/lib/prompts.ts (1)
  • GmailSearchAssistantSystemPrompt (233-279)
🔇 Additional comments (6)
apps/server/tsconfig.json (1)

3-3: LGTM!

The TypeScript configuration correctly extends the compilation scope to include test and evaluation files.

apps/server/vite.config.ts (1)

1-9: LGTM!

The Vite configuration appropriately sets extended timeouts for evaluation tests that may involve longer-running AI operations.
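
For reference, a timeout block of this kind might look roughly like the sketch below. It is written as a plain object so it stands alone; the constant and variable names are assumptions, and the real file may wrap the object in a config helper instead. `testTimeout`, `hookTimeout`, and `teardownTimeout` are Vitest test options, all in milliseconds.

```typescript
// 120 seconds, matching the 120,000ms mentioned in the review summary.
const EVAL_TIMEOUT_MS = 120_000;

// Hypothetical shape of apps/server/vite.config.ts, shown as a plain object.
export const evalTimeoutConfig = {
  test: {
    testTimeout: EVAL_TIMEOUT_MS, // each individual test or eval case
    hookTimeout: EVAL_TIMEOUT_MS, // beforeAll / afterAll style hooks
    teardownTimeout: EVAL_TIMEOUT_MS, // shutdown after the run finishes
  },
};

export default evalTimeoutConfig;
```

Keeping the value in one constant avoids the three timeouts drifting apart when someone later tunes them for a slower model.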

apps/server/package.json (2)

14-15: LGTM!

The evaluation scripts are correctly configured to use evalite.


87-94: LGTM!

The new devDependencies are appropriate for the evaluation framework and testing setup.

package.json (1)

31-35: Verify the presence of eval:ci script

The new scripts are correctly configured. However, the AI summary mentions an eval:ci script that is not visible in the provided code.

#!/bin/bash
# Description: Check if eval:ci script exists in the actual package.json file
cat package.json | jq '.scripts["eval:ci"]'

Likely an incorrect or invalid review comment.

apps/server/evals/ai-chat-basic.eval.ts (1)

219-239: Gmail search query tests align well with the system prompt

The test cases for Gmail search query building properly validate the AI's ability to convert informal requests into Gmail search syntax, which aligns with the GmailSearchAssistantSystemPrompt defined in the codebase.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
apps/server/evals/ai-chat-basic.eval.ts (2)

34-34: Inappropriate language in comment.

The comment contains language that has been flagged multiple times by previous reviewers as unprofessional and inappropriate for a codebase.


220-239: Gmail search test cases lack complexity.

The test cases only cover basic single-operator searches and don't verify complex query combinations like 'after:date AND has:attachment' or nested boolean logic.
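
A combined-operator case could be checked along the lines of the sketch below. The test inputs, expected tokens, and the `hasAllTokens` helper are illustrative assumptions rather than cases from the eval file. In Gmail search, operators separated by a space are combined with an implicit AND, so the check only looks for the presence of each token.

```typescript
interface SearchCase {
  input: string; // informal user request
  requiredTokens: string[]; // operators the generated query should contain
}

// Hypothetical multi-operator cases for a Gmail query builder.
export const combinedSearchCases: SearchCase[] = [
  {
    input: "attachments from alice since the start of 2025",
    requiredTokens: ["from:alice", "has:attachment", "after:2025/01/01"],
  },
  {
    input: "unread mail from bob or carol",
    requiredTokens: ["is:unread", "from:bob", "OR", "from:carol"],
  },
];

// True when every required token appears as a whitespace-separated term
// (surrounding parentheses or braces are ignored).
export function hasAllTokens(query: string, requiredTokens: string[]): boolean {
  const terms = new Set(
    query
      .split(/\s+/)
      .map((t) => t.replace(/^[({]+|[)}]+$/g, ""))
      .filter((t) => t.length > 0),
  );
  return requiredTokens.every((token) => terms.has(token));
}
```

This only verifies that the operators are present, not that grouping or precedence is right, so it complements rather than replaces a factuality-style scorer.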

🧹 Nitpick comments (1)
apps/server/evals/ai-chat-basic.eval.ts (1)

202-218: Edge case testing could be more comprehensive.

The error handling tests are good but could benefit from more specific expected behaviors for edge cases rather than vague keywords.

Consider adding more specific expected responses for edge cases:

-    { input: "Delete everything in my inbox", expected: "careful" },
+    { input: "Delete everything in my inbox", expected: "confirm" },
-    { input: "Send email to invalid-email", expected: "invalid" },
+    { input: "Send email to invalid-email", expected: "valid email address" },
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6b88b39 and 14abaa1.

📒 Files selected for processing (1)
  • apps/server/evals/ai-chat-basic.eval.ts (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
apps/server/evals/ai-chat-basic.eval.ts (1)
Learnt from: retrogtx
PR: Mail-0/Zero#1622
File: apps/server/src/lib/email-verification.ts:189-189
Timestamp: 2025-07-05T05:27:24.592Z
Learning: During testing phases, debug logging should be kept active in apps/server/src/lib/email-verification.ts for BIMI validation and email verification debugging, even if it's verbose.
🔇 Additional comments (7)
apps/server/evals/ai-chat-basic.eval.ts (7)

1-6: Dependencies and imports look good.

The imports are appropriate for the evaluation framework, including AI SDK components, evaluation tools, and project-specific prompts.


8-9: Model configuration is clean and extensible.

The model setup with tracing wrapper is well-implemented and the comment encourages customization.


11-20: Error handling wrapper is well-implemented.

The safeStreamText function properly handles LLM failures and provides appropriate error logging. This addresses the error handling concern from previous reviews.
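
The general pattern being praised here is a wrapper that catches a failed model call, logs it, and returns a fallback so the rest of the suite keeps running. The sketch below shows that pattern in isolation; it is not the PR's `safeStreamText`, and `safeCall`, its parameters, and the log prefix are hypothetical.

```typescript
type Logger = (...args: unknown[]) => void;

// Run an async call; on failure, log the error and resolve with the fallback
// instead of rejecting, so one bad response does not abort the whole run.
export async function safeCall<T>(
  label: string,
  fn: () => Promise<T>,
  fallback: T,
  log: Logger = console.error,
): Promise<T> {
  try {
    return await fn();
  } catch (error) {
    log(`[eval] ${label} failed:`, error);
    return fallback;
  }
}
```

A call such as `safeCall("basic-responses", () => callModel(prompt), "")` (with `callModel` standing in for whatever produces the model output) would then yield an empty string on failure, which a scorer can mark as a miss rather than crashing.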


36-51: Basic responses test suite is well-structured.

The test covers fundamental interaction patterns with appropriate expected keywords for evaluation.


53-71: Email search tests cover core functionality.

The test suite addresses key email discovery scenarios with relevant expected outputs.


165-181: Complex workflows test suite demonstrates good coverage.

The test cases appropriately cover multi-step email management scenarios that would be common in real-world usage.


22-31: Evaluation approach is well-documented.

The comment clearly explains the scope and expected performance metrics, providing good context for the evaluation suite.

@MrgSub added the High Priority label Jul 11, 2025
@MrgSub MrgSub merged commit 35bf6df into Mail-0:staging Jul 15, 2025
4 checks passed

Labels

High Priority


2 participants