feat: add a way to evaluate llms for our usecase against our prompts#1700
MrgSub merged 8 commits into Mail-0:staging
Walkthrough
A new AI chat evaluation suite for email tasks was introduced.
Sequence Diagram(s)
sequenceDiagram
participant Tester
participant Evalite
participant PerplexityModel
participant Scorer
Tester->>Evalite: Run email-related evaluation suite
Evalite->>PerplexityModel: Generate response for test prompt
PerplexityModel-->>Evalite: Return AI-generated response
Evalite->>Scorer: Score response (factuality, Levenshtein)
Scorer-->>Evalite: Return evaluation metrics
Evalite-->>Tester: Report evaluation results
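The scoring step in the diagram mentions a Levenshtein metric. As a rough illustration of what such a metric measures, here is a minimal, dependency-free sketch of a normalized edit-distance score. The function names `levenshteinDistance` and `levenshteinScore` are hypothetical and are not taken from the PR or from any scoring library; the normalization (one minus distance divided by the longer length) is an assumption about how a 0 to 1 score might be derived.

```typescript
// Hypothetical helper: classic dynamic-programming edit distance.
// Counts the insertions, deletions, and substitutions needed to turn a into b.
function levenshteinDistance(a: string, b: string): number {
  // prev[j] = distance between the first (i - 1) chars of a and the first j chars of b.
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost, // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical normalization (an assumption): 1 means identical, 0 means nothing in common.
function levenshteinScore(output: string, expected: string): number {
  const longest = Math.max(output.length, expected.length);
  if (longest === 0) return 1; // two empty strings are identical
  return 1 - levenshteinDistance(output, expected) / longest;
}
```

Under this normalization, a long model reply compared against a single expected keyword scores close to 0 even when the reply contains the keyword, which is worth keeping in mind when reading the reported numbers.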
PR Summary
Added a framework for evaluating LLM performance against email-related prompts, with initial setup using Perplexity's Sonar model as default.
- Added apps/server/evals/ai-chat-basic.eval.ts implementing evaluation tests for email tasks using evalite library
- Added extended timeout configuration (120s) in apps/server/vite.config.ts to accommodate LLM response times
- Added evaluation scripts (pnpm eval, pnpm eval:dev, pnpm eval:ci) in package.json files
- Added test infrastructure dependencies including autoevals and evalite for automated LLM testing
5 files reviewed, 4 comments
Actionable comments posted: 1
🧹 Nitpick comments (2)
apps/server/evals/ai-chat-basic.eval.ts (2)
8-8: Clean up the informal comment
Please use professional language in code comments.
-// add ur own model here
+// Configure your model here
25-41: Consider enhancing test expectations for more thorough validation
The current test expectations use single keywords, which might not adequately validate the AI's responses. Consider using more comprehensive expected outputs or implementing custom scorers that check for multiple aspects of the response.
Would you like me to help create more comprehensive test expectations that better validate the AI's responses?
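One possible shape for a scorer that checks several aspects of a response is sketched below. It is only an illustration of the suggestion above: `KeywordScore` and `keywordCoverage` are hypothetical names, the case-insensitive substring match is an assumption, and wiring it into the eval framework's scorer interface is left out because that interface is not shown here.

```typescript
// Hypothetical result shape: a 0..1 score plus the keywords that were not found.
interface KeywordScore {
  score: number;
  missing: string[];
}

// Hypothetical scorer: fraction of expected keywords present in the model output.
// Matching is a case-insensitive substring check (an assumption, not the PR's logic).
function keywordCoverage(output: string, keywords: string[]): KeywordScore {
  const haystack = output.toLowerCase();
  const missing = keywords.filter((k) => !haystack.includes(k.toLowerCase()));
  const score =
    keywords.length === 0 ? 1 : (keywords.length - missing.length) / keywords.length;
  return { score, missing };
}

// Usage: two of the three expected keywords appear in the reply.
const result = keywordCoverage("I found 3 unread emails from Alice", [
  "unread",
  "alice",
  "archive",
]);
console.log(result.score.toFixed(2), result.missing);
```

Returning the missing keywords alongside the score makes a low result easier to debug than a bare number.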
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (5)
apps/server/evals/ai-chat-basic.eval.ts (1 hunks)
apps/server/package.json (2 hunks)
apps/server/tsconfig.json (1 hunks)
apps/server/vite.config.ts (1 hunks)
package.json (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
apps/server/package.json (1)
Learnt from: JagjeevanAK
PR: Mail-0/Zero#1583
File: apps/docs/package.json:1-0
Timestamp: 2025-07-01T12:53:32.495Z
Learning: The Zero project prefers to handle dependency updates through automated tools like Dependabot rather than immediate manual updates, allowing for proper testing and validation through their established workflow.
package.json (1)
Learnt from: adiologydev
PR: Mail-0/Zero#871
File: docker-compose.yaml:2-21
Timestamp: 2025-05-04T23:13:26.825Z
Learning: Next.js requires certain environment variables during static site generation at build time, particularly those with the NEXT_PUBLIC_ prefix. When using Docker, these should be passed as build args, while sensitive values like API keys and secrets should ideally only be passed at runtime as environment variables.
apps/server/evals/ai-chat-basic.eval.ts (1)
Learnt from: retrogtx
PR: Mail-0/Zero#1622
File: apps/server/src/lib/email-verification.ts:189-189
Timestamp: 2025-07-05T05:27:24.592Z
Learning: During testing phases, debug logging should be kept active in apps/server/src/lib/email-verification.ts for BIMI validation and email verification debugging, even if it's verbose.
🧬 Code Graph Analysis (1)
apps/server/evals/ai-chat-basic.eval.ts (1)
apps/server/src/lib/prompts.ts (1)
GmailSearchAssistantSystemPrompt (233-279)
🔇 Additional comments (6)
apps/server/tsconfig.json (1)
3-3: LGTM!
The TypeScript configuration correctly extends the compilation scope to include test and evaluation files.
apps/server/vite.config.ts (1)
1-9: LGTM!
The Vite configuration appropriately sets extended timeouts for evaluation tests that may involve longer-running AI operations.
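For readers without the diff at hand, a configuration of this kind might look roughly like the sketch below. It assumes the evals run through Vitest and that the 120s value from the PR summary is applied via `testTimeout`; the actual keys used in apps/server/vite.config.ts are not shown in this conversation, and `hookTimeout` is included purely as an assumption.

```typescript
// Sketch only: the real apps/server/vite.config.ts may differ.
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    // 120 seconds per test, matching the value quoted in the PR summary.
    testTimeout: 120_000,
    // Assumption: setup/teardown hooks get the same budget.
    hookTimeout: 120_000,
  },
});
```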
apps/server/package.json (2)
14-15: LGTM!
The evaluation scripts are correctly configured to use evalite.
87-94: LGTM!
The new devDependencies are appropriate for the evaluation framework and testing setup.
package.json (1)
31-35: Verify the presence of eval:ci script
The new scripts are correctly configured. However, the AI summary mentions an eval:ci script that is not visible in the provided code.
#!/bin/bash
# Description: Check if eval:ci script exists in the actual package.json file
cat package.json | jq '.scripts["eval:ci"]'
Likely an incorrect or invalid review comment.
apps/server/evals/ai-chat-basic.eval.ts (1)
219-239: Gmail search query tests align well with the system prompt
The test cases for Gmail search query building properly validate the AI's ability to convert informal requests into Gmail search syntax, which aligns with the GmailSearchAssistantSystemPrompt defined in the codebase.
Actionable comments posted: 0
♻️ Duplicate comments (2)
apps/server/evals/ai-chat-basic.eval.ts (2)
34-34: Inappropriate language in comment.
The comment contains language that has been flagged multiple times by previous reviewers as unprofessional and inappropriate for a codebase.
220-239: Gmail search test cases lack complexity.
The test cases only cover basic single-operator searches and don't verify complex query combinations like 'after:date AND has:attachment' or nested boolean logic.
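To make the point about compound queries concrete, the sketch below shows one way multi-operator cases could be expressed and checked. Everything here is hypothetical: the `GmailQueryCase` shape, the sample inputs, and the `missingOperators` helper are illustrations and do not come from the eval file. The check is deliberately loose, since it only confirms that each required operator appears as a token and does not validate operator values or grouping.

```typescript
// Hypothetical test-case shape: a natural-language request plus the
// Gmail search operators a good answer is expected to contain.
interface GmailQueryCase {
  input: string;
  requiredOperators: string[];
}

// Hypothetical sample cases combining several operators.
const compoundQueryCases: GmailQueryCase[] = [
  {
    input: "Emails from alice with attachments since the start of 2025",
    requiredOperators: ["from:", "has:attachment", "after:"],
  },
  {
    input: "Unread mail from bob or carol",
    requiredOperators: ["is:unread", "from:", "OR"],
  },
];

// Returns the required operators that do not appear in the generated query.
// Operators ending in ":" match by token prefix (e.g. "from:" matches "from:bob");
// everything else must match a whole token (e.g. "OR", "has:attachment").
function missingOperators(query: string, required: string[]): string[] {
  const tokens = query.split(/[\s()]+/).filter((t) => t.length > 0);
  return required.filter((op) =>
    !tokens.some((t) => (op.endsWith(":") ? t.startsWith(op) : t === op)),
  );
}

// Usage: a query that covers every operator of the second case.
console.log(
  missingOperators("is:unread (from:bob OR from:carol)", compoundQueryCases[1].requiredOperators),
);
```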
🧹 Nitpick comments (1)
apps/server/evals/ai-chat-basic.eval.ts (1)
202-218: Edge case testing could be more comprehensive.
The error handling tests are good but could benefit from more specific expected behaviors for edge cases rather than vague keywords.
Consider adding more specific expected responses for edge cases:
- { input: "Delete everything in my inbox", expected: "careful" },
+ { input: "Delete everything in my inbox", expected: "confirm" },
- { input: "Send email to invalid-email", expected: "invalid" },
+ { input: "Send email to invalid-email", expected: "valid email address" },
📒 Files selected for processing (1)
apps/server/evals/ai-chat-basic.eval.ts (1 hunks)
🔇 Additional comments (7)
apps/server/evals/ai-chat-basic.eval.ts (7)
1-6: Dependencies and imports look good.
The imports are appropriate for the evaluation framework, including AI SDK components, evaluation tools, and project-specific prompts.
8-9: Model configuration is clean and extensible.
The model setup with tracing wrapper is well-implemented and the comment encourages customization.
11-20: Error handling wrapper is well-implemented.
The safeStreamText function properly handles LLM failures and provides appropriate error logging. This addresses the error handling concern from previous reviews.
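The body of safeStreamText is not reproduced in this conversation, so the following is only a generic sketch of the pattern the comment describes: wrap an async model call, log the failure, and return a value the caller can inspect instead of letting the error abort the whole run. The names `SafeResult` and `safeCall` are hypothetical, and the real function may differ in both signature and behavior.

```typescript
// Hypothetical result type: either the value or a captured error message.
type SafeResult<T> = { ok: true; value: T } | { ok: false; error: string };

// Hypothetical wrapper: runs an async call, logs any failure with a label,
// and resolves to a SafeResult instead of rejecting.
async function safeCall<T>(label: string, fn: () => Promise<T>): Promise<SafeResult<T>> {
  try {
    return { ok: true, value: await fn() };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    console.error(`[${label}] model call failed: ${message}`);
    return { ok: false, error: message };
  }
}

// Usage: a failing call is logged and reported, not thrown.
safeCall("demo", async () => {
  throw new Error("rate limited");
}).then((r) => console.log(r.ok ? r.value : `fallback after: ${r.error}`));
```

Returning a tagged result rather than an empty string keeps a failed call distinguishable from a model that genuinely answered with nothing, which matters when scores are aggregated.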
36-51: Basic responses test suite is well-structured.
The test covers fundamental interaction patterns with appropriate expected keywords for evaluation.
53-71: Email search tests cover core functionality.
The test suite addresses key email discovery scenarios with relevant expected outputs.
165-181: Complex workflows test suite demonstrates good coverage.
The test cases appropriately cover multi-step email management scenarios that would be common in real-world usage.
22-31: Evaluation approach is well-documented.
The comment clearly explains the scope and expected performance metrics, providing good context for the evaluation suite.
Just install everything and run pnpm eval to check the scores for each test.
Play around with models to get higher scores; the default is currently set to 4o-mini.