feat: add a way to evaluate llms for our usecase against our prompts by retrogtx · Pull Request #1700 · Mail-0/Zero

retrogtx · 2025-07-10T05:47:18Z

Just install everything and run pnpm eval to check the scores for each test.

Play around with models to get higher scores; the default is currently set to 4o-mini.

Summary by CodeRabbit

New Features
- Introduced an evaluation suite to assess AI chat capabilities for email management, including conversational responses, search, organization, and smart categorization.
- Added scripts to streamline running and developing evaluation and AI tests.
Chores
- Updated development dependencies and tooling for improved testing and evaluation workflows.
- Expanded TypeScript configuration to include evaluation and test files.
- Added configuration for extended test timeouts.

coderabbitai · 2025-07-10T05:47:25Z

Walkthrough

A new AI chat evaluation suite for email tasks was introduced using the evalite framework. Supporting scripts, dependencies, and TypeScript configuration were added or updated to enable evaluation and testing workflows. A Vite configuration file was also introduced to set test timeouts for the server application.

Changes

| File(s) | Change Summary |
| --- | --- |
| apps/server/evals/ai-chat-basic.eval.ts | Added a comprehensive evalite-based test suite for AI chat email capabilities across multiple email scenarios. |
| apps/server/package.json | Added evalite, autoevals, vite, and vitest to devDependencies; added eval-related npm scripts for the server app. |
| apps/server/tsconfig.json | Updated TypeScript "include" to cover "tests/**/*.ts" and "evals/**/*.ts". |
| apps/server/vite.config.ts | Introduced Vite config setting test, hook, and teardown timeouts to 120,000 ms. |
| package.json | Added root-level npm scripts for AI testing and evaluation targeting @zero/server; minor syntax fix. |

Sequence Diagram(s)

sequenceDiagram
    participant Tester
    participant Evalite
    participant PerplexityModel
    participant Scorer

    Tester->>Evalite: Run email-related evaluation suite
    Evalite->>PerplexityModel: Generate response for test prompt
    PerplexityModel-->>Evalite: Return AI-generated response
    Evalite->>Scorer: Score response (factuality, Levenshtein)
    Scorer-->>Evalite: Return evaluation metrics
    Evalite-->>Tester: Report evaluation results
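
The diagram names a Levenshtein-based scorer. As a rough illustration of what such a score measures, here is a self-contained sketch of a normalized edit-distance similarity. The function names `levenshteinDistance` and `levenshteinScore` are hypothetical, and this is not the autoevals implementation; it only shows the idea of mapping an edit distance to a score between 0 and 1.

```typescript
// Classic dynamic-programming edit distance: insert, delete, and substitute all cost 1.
function levenshteinDistance(a: string, b: string): number {
  // prev[j] is the distance between the first i-1 chars of a and the first j chars of b.
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitution = prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, substitution);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Map the distance to a similarity in [0, 1]; 1 means the strings are identical.
function levenshteinScore(output: string, expected: string): number {
  const maxLen = Math.max(output.length, expected.length);
  if (maxLen === 0) return 1; // two empty strings are identical
  return 1 - levenshteinDistance(output, expected) / maxLen;
}
```

Under this kind of scoring, a long free-form answer compared against a single expected keyword will score low even when the answer is reasonable, which is worth keeping in mind when reading the reported numbers.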

Poem

In the warren, code bunnies cheer,
For evalite tests are finally here!
With scripts and configs, all set anew,
Our AI chats know just what to do.
Emails sorted, queries bright—
The future of testing hops into sight!
🐇✨

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 14abaa1 and 3245d0b.

📒 Files selected for processing (1)

apps/server/evals/ai-chat-basic.eval.ts (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

apps/server/evals/ai-chat-basic.eval.ts

greptile-apps

PR Summary

Added a framework for evaluating LLM performance against email-related prompts, with initial setup using Perplexity's Sonar model as default.

- Added apps/server/evals/ai-chat-basic.eval.ts implementing evaluation tests for email tasks using the evalite library
- Added extended timeout configuration (120s) in apps/server/vite.config.ts to accommodate LLM response times
- Added evaluation scripts (pnpm eval, pnpm eval:dev, pnpm eval:ci) in package.json files
- Added test infrastructure dependencies including autoevals and evalite for automated LLM testing

5 files reviewed, 4 comments

apps/server/vite.config.ts

apps/server/evals/ai-chat-basic.eval.ts

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

apps/server/evals/ai-chat-basic.eval.ts (2)
8-8: Clean up the informal comment

Please use professional language in code comments.
-// add ur own model here 
+// Configure your model here
25-41: Consider enhancing test expectations for more thorough validation

The current test expectations use single keywords, which might not adequately validate the AI's responses. Consider using more comprehensive expected outputs or implementing custom scorers that check for multiple aspects of the response.

Would you like me to help create more comprehensive test expectations that better validate the AI's responses?
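
One way to act on this suggestion is a scorer that checks several expected aspects instead of a single keyword. The sketch below is only an illustration: `ScoreResult`, `scoreAspects`, and the sample strings are hypothetical names and data, and the result shape is an assumption rather than the scorer interface evalite or autoevals actually expects.

```typescript
// Hypothetical result shape; real evalite/autoevals scorers may differ.
interface ScoreResult {
  name: string;
  score: number; // fraction of expected aspects found, in [0, 1]
  missing: string[]; // aspects the response did not mention
}

// Checks that the response mentions every expected aspect (case-insensitive).
function scoreAspects(output: string, aspects: string[]): ScoreResult {
  const text = output.toLowerCase();
  const missing = aspects.filter((aspect) => !text.includes(aspect.toLowerCase()));
  const score =
    aspects.length === 0 ? 1 : (aspects.length - missing.length) / aspects.length;
  return { name: "aspect-coverage", score, missing };
}

// Example: two of the three aspects are present, so "label" is reported as missing.
const result = scoreAspects(
  "I can search your inbox and archive old newsletters.",
  ["search", "archive", "label"],
);
console.log(result.score, result.missing);
```

Returning the missing aspects alongside the score makes a low number easier to debug than a bare keyword match.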

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3532012 and 6b88b39.

⛔ Files ignored due to path filters (1)

pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml

📒 Files selected for processing (5)

apps/server/evals/ai-chat-basic.eval.ts (1 hunks)
apps/server/package.json (2 hunks)
apps/server/tsconfig.json (1 hunks)
apps/server/vite.config.ts (1 hunks)
package.json (1 hunks)

🧰 Additional context used

🧠 Learnings (3)

apps/server/package.json (1)

Learnt from: JagjeevanAK
PR: Mail-0/Zero#1583
File: apps/docs/package.json:1-0
Timestamp: 2025-07-01T12:53:32.495Z
Learning: The Zero project prefers to handle dependency updates through automated tools like Dependabot rather than immediate manual updates, allowing for proper testing and validation through their established workflow.

package.json (1)

Learnt from: adiologydev
PR: Mail-0/Zero#871
File: docker-compose.yaml:2-21
Timestamp: 2025-05-04T23:13:26.825Z
Learning: Next.js requires certain environment variables during static site generation at build time, particularly those with the NEXT_PUBLIC_ prefix. When using Docker, these should be passed as build args, while sensitive values like API keys and secrets should ideally only be passed at runtime as environment variables.

apps/server/evals/ai-chat-basic.eval.ts (1)

Learnt from: retrogtx
PR: Mail-0/Zero#1622
File: apps/server/src/lib/email-verification.ts:189-189
Timestamp: 2025-07-05T05:27:24.592Z
Learning: During testing phases, debug logging should be kept active in apps/server/src/lib/email-verification.ts for BIMI validation and email verification debugging, even if it's verbose.

🧬 Code Graph Analysis (1)

apps/server/evals/ai-chat-basic.eval.ts (1)

apps/server/src/lib/prompts.ts (1)

GmailSearchAssistantSystemPrompt (233-279)

🔇 Additional comments (6)

apps/server/tsconfig.json (1)

3-3: LGTM!

The TypeScript configuration correctly extends the compilation scope to include test and evaluation files.

apps/server/vite.config.ts (1)

1-9: LGTM!

The Vite configuration appropriately sets extended timeouts for evaluation tests that may involve longer-running AI operations.
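
The file's contents are not shown in this thread. Assuming it uses Vitest's `defineConfig` and the standard `testTimeout`, `hookTimeout`, and `teardownTimeout` options, a configuration matching the description (all three set to 120,000 ms) might look roughly like this:

```typescript
// Sketch only: the actual apps/server/vite.config.ts may be structured differently.
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    // LLM calls can be slow, so each limit is raised to two minutes.
    testTimeout: 120_000,
    hookTimeout: 120_000,
    teardownTimeout: 120_000,
  },
});
```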

apps/server/package.json (2)

14-15: LGTM!

The evaluation scripts are correctly configured to use evalite.

87-94: LGTM!

The new devDependencies are appropriate for the evaluation framework and testing setup.
package.json (1)

31-35: Verify the presence of eval:ci script

The new scripts are correctly configured. However, the AI summary mentions an eval:ci script that is not visible in the provided code.
#!/bin/bash
# Description: Check if eval:ci script exists in the actual package.json file
jq '.scripts["eval:ci"]' package.json
Likely an incorrect or invalid review comment.
apps/server/evals/ai-chat-basic.eval.ts (1)

219-239: Gmail search query tests align well with the system prompt

The test cases for Gmail search query building properly validate the AI's ability to convert informal requests into Gmail search syntax, which aligns with the GmailSearchAssistantSystemPrompt defined in the codebase.

apps/server/evals/ai-chat-basic.eval.ts

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (2)

apps/server/evals/ai-chat-basic.eval.ts (2)

34-34: Inappropriate language in comment.

The comment contains language that has been flagged multiple times by previous reviewers as unprofessional and inappropriate for a codebase.

220-239: Gmail search test cases lack complexity.

The test cases only cover basic single-operator searches and don't verify complex query combinations like 'after:date AND has:attachment' or nested boolean logic.
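
To make the point concrete, here is a sketch of what multi-operator cases and a matching check could look like. `SearchCase`, `complexCases`, and `operatorCoverage` are hypothetical names, the addresses are placeholders, and none of this is taken from the PR's eval file; the Gmail operators used (`from:`, `has:attachment`, `after:`, `is:unread`, `OR`) are standard search syntax.

```typescript
// Hypothetical test-case shape: a natural-language request plus the Gmail
// operators the generated query is expected to contain.
interface SearchCase {
  input: string;
  requiredOperators: string[];
}

const complexCases: SearchCase[] = [
  {
    input: "emails with attachments from alice@example.com after January 1, 2025",
    requiredOperators: ["from:alice@example.com", "has:attachment", "after:2025/01/01"],
  },
  {
    input: "unread mail from bob@example.com or carol@example.com",
    requiredOperators: ["is:unread", "from:bob@example.com", "OR", "from:carol@example.com"],
  },
];

// Returns the fraction of required operators present in the generated query.
// Grouping characters are stripped so "(from:a OR from:b)" tokenizes cleanly.
function operatorCoverage(query: string, required: string[]): number {
  const tokens = query.replace(/[(){}]/g, " ").split(/\s+/).filter(Boolean);
  const found = required.filter((op) => tokens.includes(op));
  return required.length === 0 ? 1 : found.length / required.length;
}
```

A coverage score like this checks that every operator appears, but it does not verify grouping or precedence, so nested boolean logic would still need a stricter comparison.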

🧹 Nitpick comments (1)

apps/server/evals/ai-chat-basic.eval.ts (1)
202-218: Edge case testing could be more comprehensive.

The error handling tests are good but could benefit from more specific expected behaviors for edge cases rather than vague keywords.

Consider adding more specific expected responses for edge cases:
-    { input: "Delete everything in my inbox", expected: "careful" },
+    { input: "Delete everything in my inbox", expected: "confirm" },
-    { input: "Send email to invalid-email", expected: "invalid" },
+    { input: "Send email to invalid-email", expected: "valid email address" },

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6b88b39 and 14abaa1.

📒 Files selected for processing (1)

apps/server/evals/ai-chat-basic.eval.ts (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

apps/server/evals/ai-chat-basic.eval.ts (1)

Learnt from: retrogtx
PR: Mail-0/Zero#1622
File: apps/server/src/lib/email-verification.ts:189-189
Timestamp: 2025-07-05T05:27:24.592Z
Learning: During testing phases, debug logging should be kept active in apps/server/src/lib/email-verification.ts for BIMI validation and email verification debugging, even if it's verbose.

🔇 Additional comments (7)

apps/server/evals/ai-chat-basic.eval.ts (7)

1-6: Dependencies and imports look good.

The imports are appropriate for the evaluation framework, including AI SDK components, evaluation tools, and project-specific prompts.

8-9: Model configuration is clean and extensible.

The model setup with tracing wrapper is well-implemented and the comment encourages customization.

11-20: Error handling wrapper is well-implemented.

The safeStreamText function properly handles LLM failures and provides appropriate error logging. This addresses the error handling concern from previous reviews.
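
The body of `safeStreamText` is not shown in this thread. As a generic illustration of the pattern described (catch the failure, log it, return something the eval can still score), here is a minimal sketch; `safeGenerate` and `FALLBACK` are hypothetical names, and the wrapper takes any async function rather than calling the AI SDK directly.

```typescript
// Hypothetical fallback marker returned when the model call fails.
const FALLBACK = "ERROR: model call failed";

// Wraps any async text-producing call so that a failure is logged and turned
// into a fallback string instead of aborting the whole evaluation run.
async function safeGenerate(
  call: () => Promise<string>,
  label: string,
): Promise<string> {
  try {
    return await call();
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    console.error(`[eval] ${label} failed: ${message}`);
    return FALLBACK;
  }
}
```

Returning a fixed marker keeps a single provider error from failing the suite, at the cost of that case scoring poorly; whether that trade-off matches the PR's behavior is an assumption.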

36-51: Basic responses test suite is well-structured.

The test covers fundamental interaction patterns with appropriate expected keywords for evaluation.

53-71: Email search tests cover core functionality.

The test suite addresses key email discovery scenarios with relevant expected outputs.

165-181: Complex workflows test suite demonstrates good coverage.

The test cases appropriately cover multi-step email management scenarios that would be common in real-world usage.

22-31: Evaluation approach is well-documented.

The comment clearly explains the scope and expected performance metrics, providing good context for the evaluation suite.

retrogtx added 6 commits July 7, 2025 22:25

add eval

566b0bf

add evalite and autoevals scripts to package.json

b44a464

Merge branch 'staging' into eval

840b61a

update packages

2acecf6

add custom timeout

c95865b

add context using our pre defined prompts to increase scores, add sonar as model

6b88b39

graphite-app bot assigned retrogtx Jul 10, 2025

greptile-apps bot reviewed Jul 10, 2025

coderabbitai bot reviewed Jul 10, 2025

error handling

14abaa1

coderabbitai bot reviewed Jul 10, 2025

MrgSub added the High Priority High Priority Work label Jul 11, 2025

utilize dynamic test case generation for gmail search queries and other tasks

3245d0b

MrgSub approved these changes Jul 15, 2025

MrgSub merged commit 35bf6df into Mail-0:staging Jul 15, 2025
4 checks passed

coderabbitai bot mentioned this pull request Jul 20, 2025

fix: eval lint err for extra arguments and autofix failing test #1768

Merged

MrgSub merged 8 commits into Mail-0:staging from retrogtx:eval

2 participants
