
Feature | TUI and the great refactor #24

Merged
yuval-qf merged 96 commits into main from feature/tui
Aug 17, 2025
Conversation

@drorIvry
Contributor

@drorIvry drorIvry commented Jul 30, 2025

Summary by CodeRabbit

  • New Features

    • FastAPI server with evaluation, health, LLM (scenarios/summary), and interview endpoints plus WebSocket live updates.
    • CLI gains “server” mode; evaluation/reporting now SDK-driven.
    • UI integrates SDK for real-time progress, chat, parallel workers, and server-generated summaries.
    • Released Python and TypeScript SDKs; added higher-level library API exports.
  • Changes

    • CLI flag renamed: --judge-llm-model → --judge-llm; config keys updated to evaluated_agent_*.
  • Documentation

    • Added AGENTS.md, TESTING_GUIDE.md, SDK READMEs, and updated README/CLI docs.
  • Chores

    • Added packaging manifest, new dependencies, CI workflow updates, pre-commit hook, and lint config tweaks.

@coderabbitai
Contributor

coderabbitai bot commented Jul 30, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

Adds a FastAPI server with REST and WebSocket APIs, a full Python and TypeScript SDK, and migrates CLI/UI flows to use the server via SDK. Introduces structured logging, new evaluation orchestration/services, interview and LLM endpoints, packaging/CI updates, and significant model/type moves to rogue_sdk.types with corresponding test and doc updates.

Changes

Cohort / File(s) Summary
Lint & Packaging
/.flake8, /.pre-commit-config.yaml, /.gitignore, /MANIFEST.in, /pyproject.toml, /sdks/python/pyproject.toml, /packages/sdk/package.json, /packages/sdk/tsconfig.json, /sdks/python/LICENSE.md
Ignore E203; add trailing comma hook; ignore dist; add MANIFEST; add runtime deps and setuptools config; add Python SDK pyproject; add TS SDK package metadata/tsconfig; add Elastic License 2.0.
Docs
/AGENTS.md, /docs/README.md, /README.md, /cli_docs.md, /TESTING_GUIDE.md, /tui_plan.md, /TUI_IMPLEMENTATION_SUMMARY.md, /packages/sdk/README.md, /sdks/python/README.md
New/updated guides for agents, CLI option rename (judge_llm), testing, TUI plan/summary, and SDK READMEs; minor whitespace fix.
Server Bootstrap
/rogue/run_server.py, /rogue/__main__.py, /rogue/server/main.py, /rogue/server/__init__.py
Adds CLI server subcommand; starts FastAPI app with routers and websocket; exposes server package.
Server APIs
/rogue/server/api/__init__.py, /rogue/server/api/health.py, /rogue/server/api/evaluation.py, /rogue/server/api/llm.py, /rogue/server/api/interview.py
Adds health, evaluations CRUD/async execution, LLM scenario/summary generation, and interview session endpoints; DI for EvaluationService.
Server Core & Services
/rogue/server/core/__init__.py, /rogue/server/core/evaluation_orchestrator.py, /rogue/server/services/evaluation_service.py, /rogue/server/services/evaluation_library.py, /rogue/server/services/interviewer_service.py, /rogue/server/services/llm_service.py, /rogue/server/services/scenario_evaluation_service.py, /rogue/server/services/__init__.py, /rogue/server/websocket/manager.py, /rogue/server/websocket/__init__.py
Introduces orchestrator, in-memory job service with WS broadcasting, evaluation library, interviewer helpers, LLM service type moves, async scenario evaluation, and WS manager/router.
Python SDK
/sdks/python/rogue_sdk/__init__.py, /sdks/python/rogue_sdk/types.py, /sdks/python/rogue_sdk/client.py, /sdks/python/rogue_sdk/sdk.py, /sdks/python/rogue_sdk/websocket.py
New SDK: typed models, async HTTP client, WebSocket client, high-level SDK with evaluation, updates, scenarios, summary, interview utilities.
TypeScript SDK
/packages/sdk/src/types.ts, /packages/sdk/src/client.ts, /packages/sdk/src/websocket.ts, /packages/sdk/src/sdk.ts, /packages/sdk/src/index.ts
Adds TS SDK: types, HTTP client with retries, WS client with reconnection, functional SDK, index exports.
CLI
/rogue/run_cli.py, /.github/actions/rogue/action.yml, /.github/workflows/rogue.yml, /.github/workflows/test.yml, /.github/actions/run-tests/action.yml
CLI now uses RogueSDK and server URL; create_report async via server; rename flag --judge-llm; CI sets up venv, installs SDK/server, starts server, updates action inputs.
UI
/rogue/ui/app.py, /rogue/ui/components/scenario_runner.py, /rogue/ui/components/report_generator.py, /rogue/ui/components/interviewer.py, /rogue/ui/components/scenario_generator.py, /rogue/ui/components/config_screen.py
UI migrates to SDK for eval, summary, scenarios, and interview; updates config keys to evaluated_agent_*; removes report refresh button; enhanced logging and streaming updates.
Evaluator Agent
/rogue/evaluator_agent/evaluator_agent.py, /rogue/evaluator_agent/policy_evaluation.py, /rogue/evaluator_agent/run_evaluator_agent.py, /rogue/prompt_injection_evaluator/run_prompt_injection_evaluator.py, /rogue/evaluator_agent/__main__.py
Constructor/signature changes to judge_llm/auth; richer logging; streaming pipeline; type moves to SDK; remove legacy main harness.
Common Logging
/rogue/common/logging/__init__.py, /rogue/common/logging/config.py, /rogue/common/logging/context.py
Adds Loguru-based structured logging with contextvars, configuration, and context helpers.
Models Refactor
/rogue/models/__init__.py, /rogue/models/config.py, /rogue/models/chat_history.py, /rogue/models/scenario.py, /rogue/models/evaluation_result.py, /rogue/models/prompt_injection.py, /rogue/models/cli_input.py
Moves types to rogue_sdk.types; removes config/chat_history/scenario modules; simplifies evaluation_result to PolicyEvaluationResult; updates CLIInput fields/defaults.
Core API Exposure
/rogue/__init__.py
Exposes EvaluationLibrary and top-level evaluate_* callables; adds metadata; exports run_server.
Common Utils
/rogue/common/workdir_utils.py, /rogue/common/agent_model_wrapper.py
Update type imports to SDK; minor import reorder.
Tests
/rogue/tests/test_run_cli.py, /rogue/tests/models/test_cli_input.py, /rogue/tests/models/test_config.py, /rogue/tests/models/test_evaluation_result.py, /sdks/python/rogue_sdk/tests/test_types.py
Update to new types/fields; add SDK type tests; remove config tests tied to deleted module.
Examples
/examples/tshirt_store_agent/tshirt_store_agent.py, /examples/tshirt_store_langgraph_agent/shirtify_langgraph_agent_executor.py
Add type ignore for tools arg; switch to logger.exception in executor.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant CLI
  participant RogueSDK
  participant Server as FastAPI Server
  participant EvalAPI as /api/v1/evaluations
  participant WS as WebSocket /ws/{job_id}
  participant LLM as LLM API

  User->>CLI: rogue run ... --rogue-server-url
  CLI->>RogueSDK: run_evaluation(...)
  RogueSDK->>Server: POST EvalAPI (create)
  Server-->>RogueSDK: job_id (PENDING)
  RogueSDK->>WS: connect(job_id)
  Server->>WS: stream job_update/chat_update
  RogueSDK-->>CLI: on_update/on_chat callbacks
  Server->>LLM: judge/summary calls (as needed)
  RogueSDK->>Server: GET EvalAPI/{job_id} (wait)
  Server-->>RogueSDK: COMPLETED + results
  RogueSDK-->>CLI: results
  CLI->>RogueSDK: generate_summary(results, model)
  RogueSDK->>Server: POST /api/v1/llm/summary
  Server-->>RogueSDK: summary
  RogueSDK-->>CLI: summary
  CLI-->>User: Write JSON + print summary
sequenceDiagram
  participant Client
  participant EvalAPI as POST /evaluations
  participant EvalSvc as EvaluationService
  participant Orchestrator
  participant Agent as EvaluatorAgent
  participant WS as WebSocketManager

  Client->>EvalAPI: create_evaluation(request)
  EvalAPI->>EvalSvc: add_job + run_job(job_id)
  EvalSvc->>Orchestrator: run_evaluation()
  Orchestrator->>Agent: arun_evaluator_agent(...)
  Agent-->>Orchestrator: stream updates (chat/status/results)
  Orchestrator-->>EvalSvc: forward updates
  EvalSvc->>WS: broadcast chat/job_update
  EvalSvc-->>Client: GET job status/progress
  EvalSvc-->>Client: final results (COMPLETED/FAILED)

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

A rabbit taps the server drum,
WebSockets hum, results now come.
The SDKs bloom in twin arrays,
Types in line, in tidy ways.
Hop to CLI, the flags feel new—
Judge the judge, and test the crew.
Thump! The logs all tell what’s true. 🐇✨


@drorIvry drorIvry marked this pull request as draft July 30, 2025 13:21
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (6)
rogue/server/__main__.py (1)

1-13: LGTM! Well-structured server entry point with minor suggestions.

The implementation follows good practices with secure defaults and environment variable configuration.

Consider these improvements for production readiness:

 import uvicorn
 import os
+import sys
 from .main import app

 if __name__ == "__main__":
     host = os.getenv("HOST", "127.0.0.1")  # Default to localhost for security
-    port = int(os.getenv("PORT", "8000"))
+    try:
+        port = int(os.getenv("PORT", "8000"))
+    except ValueError:
+        print("Error: PORT environment variable must be a valid integer")
+        sys.exit(1)
+    
+    # Disable reload in production
+    reload = os.getenv("ENVIRONMENT", "development") == "development"
+    
     uvicorn.run(
         app,
         host=host,
         port=port,
-        reload=True,
+        reload=reload,
     )
rogue/ui/components/scenario_runner.py (1)

200-205: Clean refactor removing file operations from UI, but remove commented code.

The removal of filesystem operations aligns well with the server-client architecture refactor. Results are properly maintained in memory and state for UI consumption.

Remove the commented-out code entirely rather than leaving it as comments:

-        # workdir = state.get("workdir")
-
-        # final_output_path = (
-        #     workdir / f"evaluation_results_{datetime.now().isoformat()}.json"
-        # )
-        # final_output_path.write_text(all_results.model_dump_json(indent=2))
-        # state["evaluation_results_output_path"] = final_output_path

Also applies to: 215-215

rogue/server/api/health.py (1)

23-23: Consider making the version configurable.

The hardcoded version "1.0.0" should ideally be read from a configuration file, environment variable, or package metadata to avoid manual updates during releases.

You could replace the hardcoded version with:

-        version="1.0.0",
+        version=os.getenv("APP_VERSION", "1.0.0"),

Or read from package metadata if available.
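As a self-contained sketch of the package-metadata approach (the package name "rogue" and the APP_VERSION variable are assumptions, not confirmed project conventions):

```python
import os
from importlib import metadata


def resolve_version(package: str = "rogue", default: str = "1.0.0") -> str:
    """Resolve the app version from installed package metadata,
    falling back to an APP_VERSION env var, then a hardcoded default."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return os.getenv("APP_VERSION", default)
```

The health endpoint would then pass `version=resolve_version()` instead of the literal, so releases pick up the version from packaging automatically.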

rogue/server/services/evaluation_service.py (3)

21-32: Consider caching for better performance with large job datasets.

The method recreates the jobs list, filters, and sorts on every call, which could become expensive with many jobs.

For better performance with large datasets, consider:

  1. Maintaining pre-sorted indices
  2. Using database-backed storage for persistence and efficient querying
  3. Caching frequently accessed job lists
def get_jobs(
    self,
    status: Optional[EvaluationStatus] = None,
    limit: int = 50,
    offset: int = 0,
) -> List[EvaluationJob]:
-    jobs = list(self.jobs.values())
+    # Consider maintaining a sorted index or using database queries
+    jobs = sorted(self.jobs.values(), key=lambda x: x.created_at, reverse=True)
    if status:
        jobs = [job for job in jobs if job.status == status]
-    
-    jobs.sort(key=lambda x: x.created_at, reverse=True)
    return jobs[offset : offset + limit]
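To illustrate the pre-sorted-index suggestion, here is a minimal standalone sketch using `bisect` (the `JobIndex` class and its fields are hypothetical, not part of the PR):

```python
import bisect
from dataclasses import dataclass, field


@dataclass(order=True)
class JobIndexEntry:
    # Negate the timestamp so ascending list order means newest-first.
    sort_key: float
    job_id: str = field(compare=False)


class JobIndex:
    """Keeps job ids sorted by creation time (newest first),
    so pagination is a slice instead of a sort on every call."""

    def __init__(self) -> None:
        self._entries: list[JobIndexEntry] = []

    def add(self, job_id: str, created_at: float) -> None:
        # O(n) insert, but reads become O(1) slices.
        bisect.insort(self._entries, JobIndexEntry(-created_at, job_id))

    def page(self, limit: int = 50, offset: int = 0) -> list[str]:
        return [e.job_id for e in self._entries[offset : offset + limit]]
```

This trades a little insert cost for cheap repeated listing; a database-backed store remains the better option once jobs need persistence.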

98-99: Consider error handling for fire-and-forget tasks.

The async task is created without error handling, which could lead to silent failures in WebSocket broadcasts.

Add error handling for the async task:

def _notify_job_update(self, job: EvaluationJob):
-    asyncio.create_task(websocket_manager.broadcast_job_update(job))
+    task = asyncio.create_task(websocket_manager.broadcast_job_update(job))
+    task.add_done_callback(lambda t: logger.error(f"WebSocket broadcast failed: {t.exception()}") if t.exception() else None)

Or use a more robust background task manager for production use.
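For reference, the done-callback pattern above can be demonstrated in isolation (a self-contained sketch, not the project's actual broadcast code; the error sink replaces the logger for testability):

```python
import asyncio


def log_task_errors(task: asyncio.Task, sink: list) -> None:
    """Done-callback that surfaces exceptions from fire-and-forget tasks.

    Retrieving task.exception() here also suppresses asyncio's
    'exception was never retrieved' warning."""
    if not task.cancelled() and task.exception() is not None:
        sink.append(repr(task.exception()))


async def main() -> list:
    errors: list = []

    async def failing_broadcast():
        # Stand-in for a WebSocket broadcast that fails.
        raise RuntimeError("connection dropped")

    task = asyncio.create_task(failing_broadcast())
    task.add_done_callback(lambda t: log_task_errors(t, errors))

    # Yield to the event loop a few times so both the task
    # and its done-callback get a chance to run.
    for _ in range(3):
        await asyncio.sleep(0)
    return errors


errors = asyncio.run(main())
```

Without the callback, the RuntimeError would only appear as a warning at interpreter shutdown, which is exactly the silent-failure mode the review flags.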


101-109: Move import to module level and add error handling.

The local import and fire-and-forget pattern have the same concerns as the previous method.

+from ..models.api_models import WebSocketMessage

def _notify_chat_update(self, job_id: str, chat_data: Any):
    """Send real-time chat updates via WebSocket"""
-    from ..models.api_models import WebSocketMessage

    message = WebSocketMessage(
        type="chat_update", job_id=job_id, data={"message": str(chat_data)}
    )

-    asyncio.create_task(websocket_manager.broadcast_to_job(job_id, message))
+    task = asyncio.create_task(websocket_manager.broadcast_to_job(job_id, message))
+    task.add_done_callback(lambda t: logger.error(f"Chat broadcast failed: {t.exception()}") if t.exception() else None)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Lite

📥 Commits

Reviewing files that changed from the base of the PR and between 7cdcc57 and c3cb1ab.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (17)
  • .flake8 (1 hunks)
  • AGENTS.md (1 hunks)
  • docs/README.md (0 hunks)
  • pyproject.toml (2 hunks)
  • refactor_plan.md (1 hunks)
  • rogue/__init__.py (1 hunks)
  • rogue/run_cli.py (1 hunks)
  • rogue/server/__main__.py (1 hunks)
  • rogue/server/api/evaluation.py (1 hunks)
  • rogue/server/api/health.py (1 hunks)
  • rogue/server/main.py (1 hunks)
  • rogue/server/models/api_models.py (1 hunks)
  • rogue/server/services/evaluation_service.py (1 hunks)
  • rogue/server/websocket/manager.py (1 hunks)
  • rogue/services/evaluation_library.py (1 hunks)
  • rogue/services/scenario_evaluation_service.py (0 hunks)
  • rogue/ui/components/scenario_runner.py (2 hunks)
💤 Files with no reviewable changes (2)
  • docs/README.md
  • rogue/services/scenario_evaluation_service.py
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/clean-code.mdc:0-0
Timestamp: 2025-07-01T11:52:43.736Z
Learning: Refactor continuously
pyproject.toml (5)

Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/python.mdc:0-0
Timestamp: 2025-07-01T11:54:30.386Z
Learning: Applies to {requirements.txt,pyproject.toml} : Use proper package versions

Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/python.mdc:0-0
Timestamp: 2025-07-01T11:54:30.386Z
Learning: Applies to {requirements.txt,pyproject.toml} : Pin dependency versions

Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/python.mdc:0-0
Timestamp: 2025-07-01T11:54:30.386Z
Learning: Applies to {requirements.txt,pyproject.toml} : Separate dev dependencies

Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/python.mdc:0-0
Timestamp: 2025-07-01T11:54:30.386Z
Learning: Applies to {requirements.txt,pyproject.toml} : Store requirements in requirements.txt or pyproject.toml

Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/tech-stack.mdc:0-0
Timestamp: 2025-07-01T11:55:48.082Z
Learning: Use FastAPI exclusively

AGENTS.md (1)

Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/self_improve.mdc:0-0
Timestamp: 2025-07-01T11:54:49.379Z
Learning: Document breaking changes

rogue/server/api/health.py (1)

Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/tech-stack.mdc:0-0
Timestamp: 2025-07-01T11:55:48.082Z
Learning: Use FastAPI exclusively

rogue/server/main.py (1)

Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/tech-stack.mdc:0-0
Timestamp: 2025-07-01T11:55:48.082Z
Learning: Use FastAPI exclusively

🧬 Code Graph Analysis (5)
rogue/run_cli.py (1)
rogue/services/scenario_evaluation_service.py (1)
  • evaluate_scenarios (35-66)
rogue/server/models/api_models.py (2)
rogue/models/scenario.py (1)
  • Scenario (12-51)
rogue/models/config.py (1)
  • AgentConfig (14-45)
rogue/server/websocket/manager.py (1)
rogue/server/models/api_models.py (2)
  • EvaluationJob (26-35)
  • WebSocketMessage (49-52)
rogue/services/evaluation_library.py (5)
rogue/models/config.py (2)
  • AgentConfig (14-45)
  • AuthType (7-11)
rogue/models/scenario.py (3)
  • Scenarios (54-70)
  • Scenario (12-51)
  • ScenarioType (7-9)
rogue/models/evaluation_result.py (1)
  • EvaluationResults (21-38)
rogue/services/scenario_evaluation_service.py (2)
  • ScenarioEvaluationService (13-66)
  • evaluate_scenarios (35-66)
rogue/server/services/evaluation_service.py (1)
  • progress_callback (65-76)
rogue/server/api/evaluation.py (2)
rogue/server/models/api_models.py (5)
  • EvaluationRequest (11-15)
  • EvaluationResponse (38-41)
  • EvaluationJob (26-35)
  • EvaluationStatus (18-23)
  • JobListResponse (44-46)
rogue/server/services/evaluation_service.py (7)
  • EvaluationService (11-109)
  • add_job (15-16)
  • run_evaluation (51-96)
  • get_jobs (21-32)
  • get_job_count (34-37)
  • get_job (18-19)
  • cancel_job (39-49)
🔇 Additional comments (32)
.flake8 (1)

11-11: LGTM! Standard practice for black compatibility.

Adding E203 to the ignore list is a common and recommended practice when using black formatter, as it conflicts with black's slice formatting style.

pyproject.toml (1)

10-10: LGTM! Appropriate dependencies for FastAPI server architecture.

The addition of fastapi>=0.115.0 and uvicorn>=0.32.0 aligns well with the server refactor objectives. The version constraints are reasonable and follow the project's dependency management practices.

Also applies to: 19-19

rogue/run_cli.py (1)

119-134: LGTM! Clean separation of concerns in the refactor.

The modification correctly separates evaluation logic from file output responsibilities. The CLI maintains backward compatibility by handling file output after evaluation completes, while the core service focuses purely on evaluation.

The logic flow is sound:

  1. Collect results from the async generator
  2. Write to file for CLI compatibility
  3. Return results for further processing
  4. Proper error handling when results are missing
rogue/server/api/health.py (1)

10-26: LGTM! Well-structured health check endpoint.

The implementation follows FastAPI best practices with proper Pydantic modeling, async handler, and comprehensive system information in the response.

rogue/server/main.py (2)

13-17: LGTM! Proper lifespan management implementation.

The async context manager for lifespan events is correctly implemented with appropriate logging.


20-41: LGTM! Well-structured FastAPI application factory.

The create_app factory pattern is excellent for testing and modularity. Router integration and app configuration are properly structured.

AGENTS.md (1)

1-66: LGTM! Comprehensive and well-structured documentation.

This documentation provides excellent guidance for AI agents and developers working in the repository. It covers all essential aspects including dependencies, tooling, code style, project structure, and execution commands. The consistent use of uv and clear formatting makes it easy to follow.

rogue/server/models/api_models.py (4)

11-16: LGTM! Well-structured request model with proper reuse.

The EvaluationRequest model properly reuses existing domain models (AgentConfig, Scenario) and includes sensible defaults for retry and timeout parameters.


18-24: LGTM! Comprehensive status enumeration.

The EvaluationStatus enum covers all necessary job lifecycle states including error and cancellation handling.


26-36: LGTM! Complete job model with proper timestamps and progress tracking.

The EvaluationJob model includes all necessary fields for job lifecycle management, with proper Optional typing for fields that may not be set initially.


49-53: LGTM! Generic WebSocket message structure.

The WebSocketMessage model provides a flexible structure for real-time communication with proper typing.

rogue/server/api/evaluation.py (4)

19-41: LGTM! Well-structured job creation endpoint.

The endpoint properly creates jobs with UUID generation, uses background tasks for async execution, and returns appropriate response models.


43-53: LGTM! Good pagination and filtering support.

The list endpoint includes proper pagination parameters and optional status filtering with sensible defaults.


55-61: LGTM! Proper error handling for job retrieval.

The get endpoint includes appropriate 404 error handling when jobs are not found.


63-69: LGTM! Proper cancellation endpoint with error handling.

The cancel endpoint correctly handles job cancellation with appropriate error responses for non-existent jobs.

rogue/server/services/evaluation_service.py (2)

39-49: LGTM! Clean cancellation logic.

The method correctly validates job existence and cancellable states before updating status and timestamps. The boolean return value provides clear success indication.


91-96: LGTM! Comprehensive error handling.

The exception handling properly updates job status, captures error messages, and ensures completion timestamps are always set via the finally block.

rogue/__init__.py (3)

1-15: LGTM! Clean modular API design.

The restructured __init__.py provides a clean, library-oriented API while maintaining backward compatibility by importing legacy submodules. The approach supports both new evaluation workflows and existing usage patterns.


27-29: Excellent use of function aliases for clean API.

Creating top-level aliases for EvaluationLibrary methods provides a clean, functional API that's easy to use and discover.


32-57: Well-structured all list with clear categorization.

The explicit __all__ list with categorized comments makes the public API surface clear and maintainable.

refactor_plan.md (3)

1-48: Comprehensive architectural plan with clear inspiration.

The refactor plan is well-structured and clearly inspired by OpenCode's successful patterns. The high-level architecture diagram effectively communicates the separation of concerns between frontends, server, and ADK agent.


122-176: Well-phased implementation approach.

The implementation phases are logically sequenced, starting with core infrastructure and progressing through SDKs to frontends. The timeline appears realistic for the scope of work.


419-434: Thoughtful migration strategy prioritizing backward compatibility.

The migration approach properly balances innovation with user continuity by maintaining existing functionality during the transition period.

rogue/services/evaluation_library.py (5)

14-23: LGTM! Clear class documentation and purpose.

The docstring clearly explains the library's role as a core evaluation engine used by multiple components, which aligns well with the modular architecture goals.


25-67: Solid async evaluation method with good error handling.

The method properly handles the async generator from the service, provides progress callbacks, and includes error handling for missing results.


69-98: Clean streaming interface.

The streaming method provides a clean async generator interface that directly exposes the underlying service's updates.


119-126: Proper synchronous wrapper implementation.

Using asyncio.run() is the correct approach for providing a synchronous interface to async functionality.
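To make the pattern concrete, a minimal sketch of such a synchronous facade (the function names are illustrative, not the library's actual API):

```python
import asyncio


async def evaluate_async(scenario_count: int) -> dict:
    """Hypothetical async evaluation returning a summary dict."""
    await asyncio.sleep(0)  # stand-in for real async work
    return {"scenarios": scenario_count, "status": "COMPLETED"}


def evaluate(scenario_count: int) -> dict:
    """Synchronous facade: asyncio.run() creates and owns an event loop
    for the duration of the call.

    Caveat: asyncio.run() raises RuntimeError if invoked from code that
    is already running inside an event loop, so this wrapper is only
    safe to call from synchronous contexts."""
    return asyncio.run(evaluate_async(scenario_count))
```

The caveat is the main thing to document for SDK users: calling the sync wrapper from within an async framework (e.g. a FastAPI handler) will fail, and the async API should be used there instead.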


152-175: No field name mismatch in AgentConfig instantiation

The aliases used in quick_evaluate (evaluated_agent_url, evaluated_agent_auth_type, evaluated_agent_credentials, judge_llm_model, and judge_llm_api_key) exactly match the Field(alias=…) definitions in AgentConfig. You can safely ignore the original comment.

Likely an incorrect or invalid review comment.

rogue/server/websocket/manager.py (4)

12-27: LGTM! Clean connection management with proper grouping.

The connection management properly groups WebSockets by job ID and handles global connections. The logic for adding connections to the appropriate lists is correct.


47-59: Excellent error handling with connection cleanup.

The broadcast methods properly handle connection failures and clean up disconnected WebSockets, preventing memory leaks and stale connections.


75-94: Well-structured job update broadcasting.

The broadcast_job_update method properly constructs structured messages and broadcasts to both job-specific and global connections, enabling flexible client subscription patterns.


101-120: WebSocket endpoints follow FastAPI patterns correctly.

Both endpoints properly handle connection lifecycle, use appropriate decorators, and implement the keep-alive pattern with proper exception handling.

Comment on lines 11 to 13
class EvaluationService:
def __init__(self):
self.jobs: Dict[str, EvaluationJob] = {}
Contributor

@coderabbitai coderabbitai bot Jul 30, 2025


🛠️ Refactor suggestion

Consider thread safety for concurrent access.

The in-memory jobs dictionary lacks synchronization mechanisms, which could lead to race conditions when multiple concurrent requests access or modify jobs simultaneously.

Consider using asyncio.Lock or threading.RLock to protect the jobs dictionary:

class EvaluationService:
    def __init__(self):
        self.jobs: Dict[str, EvaluationJob] = {}
+        self._lock = asyncio.Lock()

Then protect all dictionary operations:

    def add_job(self, job: EvaluationJob):
+        async with self._lock:
            self.jobs[job.job_id] = job

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In rogue/server/services/evaluation_service.py around lines 11 to 13, the
in-memory jobs dictionary is accessed without synchronization, risking race
conditions in concurrent scenarios. Introduce a lock such as asyncio.Lock or
threading.RLock as a class attribute, then wrap all accesses and modifications
to the jobs dictionary within the lock's context to ensure thread safety.

Collaborator


let's do that

Contributor


🧩 Analysis chain

I'll help implement thread safety for the EvaluationService. Let me first examine the complete implementation to identify all methods that need protection.

@yuval-qf Here's the complete implementation with thread safety using asyncio.Lock:

import asyncio
from typing import Dict, List, Optional, Any
from datetime import datetime, timezone

from ..models.api_models import EvaluationJob, EvaluationStatus
from ..websocket.manager import websocket_manager
from ...services.evaluation_library import EvaluationLibrary
from ...models.scenario import Scenarios


class EvaluationService:
    def __init__(self):
        self.jobs: Dict[str, EvaluationJob] = {}
        self._lock = asyncio.Lock()

    async def add_job(self, job: EvaluationJob):
        async with self._lock:
            self.jobs[job.job_id] = job

    async def get_job(self, job_id: str) -> Optional[EvaluationJob]:
        async with self._lock:
            return self.jobs.get(job_id)

    async def get_jobs(self, status: Optional[EvaluationStatus] = None, limit: int = 50, offset: int = 0) -> List[EvaluationJob]:
        async with self._lock:
            jobs = list(self.jobs.values())
        
        if status:
            jobs = [job for job in jobs if job.status == status]
        
        # Preserve the original newest-first ordering before paginating
        jobs.sort(key=lambda x: x.created_at, reverse=True)
        return jobs[offset:offset + limit]

    async def get_job_count(self, status: Optional[EvaluationStatus] = None) -> int:
        async with self._lock:
            jobs = list(self.jobs.values())
        
        if status:
            jobs = [job for job in jobs if job.status == status]
        
        return len(jobs)

    async def cancel_job(self, job_id: str) -> bool:
        async with self._lock:
            job = self.jobs.get(job_id)
            if job and job.status in [EvaluationStatus.pending, EvaluationStatus.running]:
                job.status = EvaluationStatus.cancelled
                job.completed_at = datetime.now(timezone.utc)
                return True
            return False

    async def run_evaluation(self, job_id: str):
        # Get job outside the lock since we'll be updating it during evaluation
        async with self._lock:
            job = self.jobs.get(job_id)
        
        if not job:
            return

        # Update job status to running
        async with self._lock:
            job.status = EvaluationStatus.running
            job.started_at = datetime.now(timezone.utc)

        try:
            # Convert scenarios
            scenarios = Scenarios([scenario for scenario in job.request.scenarios])
            
            # Define progress callback
            async def progress_callback(status: str, data: Any):
                if status == "progress":
                    async with self._lock:
                        job.progress = data.get("progress", 0.0)
                elif status == "chat":
                    await self._notify_chat_update(job_id, data)
                
                await self._notify_job_update(job)

            # Run evaluation
            results = await EvaluationLibrary.evaluate_agent(
                agent_config=job.request.agent_config,
                scenarios=scenarios,
                business_context="You are evaluating an AI agent's performance across various scenarios.",
                progress_callback=progress_callback
            )

            # Update job with results
            async with self._lock:
                job.results = results
                job.status = EvaluationStatus.completed
                job.progress = 1.0

        except Exception as e:
            async with self._lock:
                job.status = EvaluationStatus.failed
                job.error_message = str(e)
        finally:
            async with self._lock:
                job.completed_at = datetime.now(timezone.utc)
            
            await self._notify_job_update(job)

    async def _notify_job_update(self, job: EvaluationJob):
        """Send job update via WebSocket"""
        asyncio.create_task(websocket_manager.broadcast_job_update(job))

    async def _notify_chat_update(self, job_id: str, chat_data: Any):
        """Send chat update via WebSocket"""
        from ..models.api_models import WebSocketMessage
        message = WebSocketMessage(
            type="chat",
            job_id=job_id,
            data=chat_data
        )
        asyncio.create_task(websocket_manager.broadcast_to_job(job_id, message))

Key changes made:

  1. Added asyncio.Lock: Added self._lock = asyncio.Lock() to the constructor
  2. Protected all dictionary access: Wrapped all reads/writes to self.jobs with async with self._lock:
  3. Made methods async: All methods that access the jobs dictionary are now async
  4. Optimized filtering: In get_jobs and get_job_count, we copy the jobs list inside the lock, then do filtering outside to minimize lock time
  5. Careful lock management in run_evaluation: Used multiple smaller lock acquisitions instead of holding the lock during the entire evaluation process

The API endpoints that call these methods will also need to be updated to use await when calling the service methods.
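The copy-inside-the-lock, filter-outside pattern described in points 2 and 4 can be exercised standalone; this sketch uses illustrative `Job`/`JobStore` stand-ins rather than the real models:

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    job_id: str
    status: str  # "pending" | "running" | "completed"


class JobStore:
    def __init__(self) -> None:
        self.jobs: Dict[str, Job] = {}
        self._lock = asyncio.Lock()

    async def add_job(self, job: Job) -> None:
        async with self._lock:
            self.jobs[job.job_id] = job

    async def get_jobs(
        self, status: Optional[str] = None, limit: int = 50, offset: int = 0
    ) -> List[Job]:
        # Hold the lock only long enough to snapshot the dict values.
        async with self._lock:
            jobs = list(self.jobs.values())
        # Filtering and slicing happen outside the lock.
        if status:
            jobs = [j for j in jobs if j.status == status]
        return jobs[offset : offset + limit]


async def main() -> List[Job]:
    store = JobStore()
    await store.add_job(Job("a", "pending"))
    await store.add_job(Job("b", "running"))
    await store.add_job(Job("c", "pending"))
    return await store.get_jobs(status="pending")


pending = asyncio.run(main())
```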


🏁 Script executed:

#!/bin/bash
# Get the complete EvaluationService implementation to see all methods that access the jobs dictionary
cat rogue/server/services/evaluation_service.py

Length of output: 9516


@yuval-qf Here's the complete updated EvaluationService implementation with thread safety using asyncio.Lock:

import asyncio
from typing import Dict, List, Optional, Any
from datetime import datetime, timezone

from ..models.api_models import EvaluationJob, EvaluationStatus
from ..websocket.manager import websocket_manager
from ...services.evaluation_library import EvaluationLibrary
from ...models.scenario import Scenarios
from ...common.logging import get_logger, set_job_context


class EvaluationService:
    def __init__(self):
        self.jobs: Dict[str, EvaluationJob] = {}
        self.logger = get_logger(__name__)
        self._lock = asyncio.Lock()

    async def add_job(self, job: EvaluationJob):
        async with self._lock:
            self.jobs[job.job_id] = job

    async def get_job(self, job_id: str) -> Optional[EvaluationJob]:
        async with self._lock:
            return self.jobs.get(job_id)

    async def get_jobs(
        self,
        status: Optional[EvaluationStatus] = None,
        limit: int = 50,
        offset: int = 0,
    ) -> List[EvaluationJob]:
        async with self._lock:
            jobs = list(self.jobs.values())
        
        if status:
            jobs = [job for job in jobs if job.status == status]

        jobs.sort(key=lambda x: x.created_at, reverse=True)
        return jobs[offset : offset + limit]

    async def get_job_count(self, status: Optional[EvaluationStatus] = None) -> int:
        async with self._lock:
            if status:
                return len([job for job in self.jobs.values() if job.status == status])
            return len(self.jobs)

    async def cancel_job(self, job_id: str) -> bool:
        cancelled = False
        async with self._lock:
            job = self.jobs.get(job_id)
            if not job:
                return False

            if job.status in [EvaluationStatus.PENDING, EvaluationStatus.RUNNING]:
                job.status = EvaluationStatus.CANCELLED
                job.completed_at = datetime.now(timezone.utc)
                cancelled = True

        # Notify outside the lock to avoid blocking other callers;
        # only report success when this call performed the cancellation.
        if cancelled:
            self._notify_job_update(job)

        return cancelled

    async def run_evaluation(self, job_id: str):
        # Get job with lock protection
        async with self._lock:
            job = self.jobs.get(job_id)
        
        if not job:
            return

        try:
            # Set job context for logging
            set_job_context(
                job_id=job_id,
                agent_url=str(job.request.agent_config.evaluated_agent_url),
            )

            self.logger.info(
                "Starting evaluation job",
                extra={
                    "job_status": "running",
                    "scenario_count": len(job.request.scenarios),
                    "agent_url": str(job.request.agent_config.evaluated_agent_url),
                    "judge_llm": job.request.agent_config.judge_llm_model,
                },
            )

            # Update job status with lock protection
            async with self._lock:
                job.status = EvaluationStatus.RUNNING
                job.started_at = datetime.now(timezone.utc)
            
            self._notify_job_update(job)

            # Convert SDK scenarios to legacy scenarios for the evaluation service
            from ...models.scenario import (
                Scenario as LegacyScenario,
                ScenarioType as LegacyScenarioType,
            )

            self.logger.info(
                f"Converting {len(job.request.scenarios)} SDK scenarios to "
                "legacy format"
            )

            legacy_scenarios = []
            for i, sdk_scenario in enumerate(job.request.scenarios):
                # Convert SDK scenario to legacy scenario

                legacy_scenario_type = (
                    LegacyScenarioType.POLICY
                    if sdk_scenario.scenario_type.value == "policy"
                    else LegacyScenarioType.PROMPT_INJECTION
                )

                legacy_scenario = LegacyScenario(
                    scenario=sdk_scenario.scenario,
                    scenario_type=legacy_scenario_type,
                    dataset=sdk_scenario.dataset,
                    dataset_sample_size=sdk_scenario.dataset_sample_size,
                    expected_outcome=sdk_scenario.expected_outcome,
                )

                legacy_scenarios.append(legacy_scenario)

            scenarios = Scenarios(scenarios=legacy_scenarios)
            self.logger.info(
                (
                    "Successfully created Scenarios object with "
                    f"{len(scenarios.scenarios)} scenarios"
                )
            )

            # Create progress callback for real-time updates
            def progress_callback(update_type: str, data: Any):
                self.logger.debug(
                    "Evaluation progress update",
                    extra={
                        "update_type": update_type,
                        "data_preview": str(data)[:100] if data else None,
                    },
                )

                if update_type == "status":
                    # Update progress based on status messages
                    if "Running scenarios" in str(data):
                        # Use asyncio.create_task to handle async lock in callback
                        asyncio.create_task(self._update_job_progress(job_id, 0.1))
                    elif "conversation" in str(data).lower():
                        # Estimate progress based on conversations
                        asyncio.create_task(self._increment_job_progress(job_id, 0.1, 0.9))
                elif update_type == "chat":
                    # Real-time chat updates via WebSocket
                    self.logger.info(f"Received chat update for job {job_id}: {data}")
                    self._notify_chat_update(job_id, data)

            # Use the library interface directly
            self.logger.info(
                "Calling evaluation library",
                extra={
                    "business_context": "The agent provides customer service.",
                    "agent_url": str(job.request.agent_config.evaluated_agent_url),
                    "judge_llm": job.request.agent_config.judge_llm_model,
                    "scenario_count": len(scenarios.scenarios),
                },
            )

            evaluation_results = await EvaluationLibrary.evaluate_agent(
                agent_config=job.request.agent_config,
                scenarios=scenarios,
                # TODO: FIX THIS!@!!!!!!

                business_context="",
                progress_callback=progress_callback,
            )

            self.logger.info("Evaluation library returned results")

            # Update job with results using lock protection
            async with self._lock:
                job.results = evaluation_results.results
                job.status = EvaluationStatus.COMPLETED
                job.progress = 1.0

            self.logger.info(
                "Evaluation job completed successfully",
                extra={
                    "job_status": "completed",
                    "results_count": len(evaluation_results.results),
                    "duration_seconds": (
                        (datetime.now(timezone.utc) - job.started_at).total_seconds()
                        if job.started_at
                        else None
                    ),
                },
            )

        except Exception as e:
            error_type = type(e).__name__
            error_msg = str(e)

            # Provide user-friendly error messages
            if "APIError" in error_type and "Connection error" in error_msg:
                user_error = (
                    "LLM API Connection Error: Cannot connect to "
                    f"{job.request.agent_config.judge_llm_model}. "
                    "Please check your API key and network connection."
                )
            elif "APIError" in error_type and "authentication" in error_msg.lower():
                user_error = (
                    "LLM Authentication Error: Invalid API key for "
                    f"{job.request.agent_config.judge_llm_model}. "
                    "Please verify your judge_llm_api_key."
                )
            elif "timeout" in error_msg.lower():
                user_error = (
                    "Timeout Error: The evaluation took too long. "
                    "The agent under test may not be responding."
                )
            else:
                user_error = f"Evaluation error: {error_msg}"

            # Update job status with lock protection
            async with self._lock:
                job.status = EvaluationStatus.FAILED
                job.error_message = user_error

            self.logger.error(
                "Evaluation job failed",
                extra={
                    "job_status": "failed",
                    "error": error_msg,
                    "error_type": error_type,
                    "user_error": user_error,
                    "duration_seconds": (
                        (datetime.now(timezone.utc) - job.started_at).total_seconds()
                        if job.started_at
                        else None
                    ),
                },
            )
        finally:
            # Update completion time with lock protection
            async with self._lock:
                job.completed_at = datetime.now(timezone.utc)
            
            self._notify_job_update(job)

    async def _update_job_progress(self, job_id: str, progress: float):
        """Helper method to update job progress with lock protection"""
        async with self._lock:
            job = self.jobs.get(job_id)
            if job:
                job.progress = progress
        if job:
            self._notify_job_update(job)

    async def _increment_job_progress(self, job_id: str, increment: float, max_progress: float):
        """Helper method to increment job progress with lock protection"""
        async with self._lock:
            job = self.jobs.get(job_id)
            if job:
                job.progress = min(max_progress, job.progress + increment)
        if job:
            self._notify_job_update(job)

    def _notify_job_update(self, job: EvaluationJob):
        asyncio.create_task(websocket_manager.broadcast_job_update(job))

    def _notify_chat_update(self, job_id: str, chat_data: Any):
        """Send real-time chat updates via WebSocket"""
        from ..models.api_models import WebSocketMessage

        # Ensure chat_data is in the expected format
        if isinstance(chat_data, dict):
            # Chat data is already a dict with role/content
            data = chat_data
        else:
            # Convert string to dict format
            data = {"role": "assistant", "content": str(chat_data)}

        message = WebSocketMessage(
            type="chat_update",
            job_id=job_id,
            data=data,
        )

        self.logger.debug(f"Sending chat update via WebSocket: {data}")
        asyncio.create_task(websocket_manager.broadcast_to_job(job_id, message))

Key changes made:

  1. Added asyncio.Lock: Added self._lock = asyncio.Lock() to the constructor
  2. Made methods async: All methods that access self.jobs are now async and use await
  3. Protected all dictionary access: Wrapped all reads/writes to self.jobs with async with self._lock:
  4. Added helper methods: Created _update_job_progress and _increment_job_progress to handle progress updates in callbacks with proper locking
  5. Optimized lock usage: In get_jobs, we copy the jobs list inside the lock, then do filtering/sorting outside to minimize lock time
  6. Careful notification handling: Moved _notify_job_update calls outside the lock context to avoid potential deadlocks

Important: The API endpoints in rogue/server/api/evaluation.py will need to be updated to use await when calling these service methods since they're now async.
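The call-site change can be demonstrated without FastAPI; this sketch uses an illustrative stub in place of the real `EvaluationService` to show why handlers must become `async` and `await` the service:

```python
import asyncio


class EvaluationServiceStub:
    """Stands in for the now-async EvaluationService."""

    def __init__(self) -> None:
        self.jobs = {}
        self._lock = asyncio.Lock()

    async def get_job(self, job_id):
        async with self._lock:
            return self.jobs.get(job_id)


# A handler that previously called service.get_job(job_id) synchronously
# would now receive a coroutine object instead of a job. Endpoints must
# be async and await the call:
async def get_job_endpoint(service, job_id):
    job = await service.get_job(job_id)
    if job is None:
        return {"status": 404}
    return {"status": 200, "job": job}


async def main():
    service = EvaluationServiceStub()
    service.jobs["j1"] = {"job_id": "j1", "state": "running"}
    found = await get_job_endpoint(service, "j1")
    missing = await get_job_endpoint(service, "nope")
    return found, missing


response, missing = asyncio.run(main())
```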


🧠 Learnings used
Learnt from: CR
PR: qualifire-dev/qualifire#0
File: .cursor/rules/clean-code.mdc:0-0
Timestamp: 2025-07-01T11:52:43.736Z
Learning: Refactor continuously

Comment on lines 147 to 155
### Custom HTTP Client

```python
from rogue_client import RogueHttpClient

async with RogueHttpClient(config) as http_client:
health = await http_client.health()
response = await http_client.create_evaluation(request)
```
Collaborator

This means nothing without reading the code. What's custom about the HTTP client if it is imported from Rogue?

Comment on lines 290 to 301
### Type Checking

```bash
python -m mypy rogue_client/
```

### Code Formatting

```bash
python -m black rogue_client/
python -m flake8 rogue_client/
```
Collaborator

can be removed

Comment on lines 11 to 13
class EvaluationService:
def __init__(self):
self.jobs: Dict[str, EvaluationJob] = {}
Collaborator

let's do that

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

🔭 Outside diff range comments (9)
.github/actions/rogue/action.yml (1)

83-85: Avoid echoing sensitive arguments in CI logs

The echo prints the full command including potential secrets (e.g., --evaluated-agent-credentials, --judge-llm-api-key). Even with GitHub’s masking, inputs might not be provided as “secrets”, so values could leak.

Apply this diff to prevent leakage and improve masking:

-        echo "Running: 'uv run -m rogue cli "${ARGS[@]}"'"
-        uv run -m rogue cli "${ARGS[@]}"
+        # Mask sensitive inputs if provided
+        if [[ -n "${{ inputs.judge_llm_api_key }}" ]]; then
+          echo "::add-mask::${{ inputs.judge_llm_api_key }}"
+        fi
+        if [[ -n "${{ inputs.evaluated_agent_credentials }}" ]]; then
+          echo "::add-mask::${{ inputs.evaluated_agent_credentials }}"
+        fi
+
+        echo "Running: 'uv run -m rogue cli [args redacted]'"
+        uv run -m rogue cli "${ARGS[@]}"
rogue/models/cli_input.py (1)

40-44: Ensure non-empty credentials when auth is required

The current check allows SecretStr("") to pass, since a SecretStr instance is truthy. Validate the underlying secret string is not empty.

Apply this diff:

-        if auth_type != AuthType.NO_AUTH and not auth_credentials:
+        if auth_type != AuthType.NO_AUTH and (
+            not auth_credentials
+            or not auth_credentials.get_secret_value().strip()
+        ):
             raise ValueError(
                 "Authentication Credentials cannot be empty for the selected auth type."
             )
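The fix above can be checked in isolation; this is a plain-string analogue of the `SecretStr` check (the `auth_type` values here are illustrative):

```python
from typing import Optional


def validate_credentials(auth_type: str, credentials: Optional[str]) -> None:
    """Reject empty or whitespace-only credentials when auth is required."""
    if auth_type != "no_auth" and (credentials is None or not credentials.strip()):
        raise ValueError(
            "Authentication Credentials cannot be empty for the selected auth type."
        )


validate_credentials("no_auth", None)  # no auth: nothing required
validate_credentials("bearer", "token-123")  # non-empty secret passes

try:
    validate_credentials("bearer", "   ")  # whitespace-only is rejected
    empty_rejected = False
except ValueError:
    empty_rejected = True
```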
sdks/python/rogue_sdk/types.py (1)

382-387: Fix HttpUrl coercion: HttpUrl(v) is invalid in Pydantic v2 and will raise.

HttpUrl isn’t directly callable with a URL string. Use TypeAdapter(HttpUrl).validate_python(v) (or simplify by making base_url: HttpUrl and dropping the validator). As-is, constructing RogueClientConfig with a string base_url is likely to crash.

Apply this diff:

@@
-    @field_validator("base_url", mode="after")
-    def validate_base_url(cls, v: str | HttpUrl) -> HttpUrl:
-        if isinstance(v, str):
-            return HttpUrl(v)
-        return v
+    @field_validator("base_url", mode="after")
+    def validate_base_url(cls, v: str | HttpUrl) -> HttpUrl:
+        if isinstance(v, str):
+            from pydantic import TypeAdapter
+            return TypeAdapter(HttpUrl).validate_python(v)
+        return v

Alternatively (simpler and preferred):

  • Change base_url: HttpUrl | str to base_url: HttpUrl
  • Remove the validator entirely and let Pydantic coerce from str.
rogue/tests/models/test_evaluation_result.py (1)

38-48: Fix: calling a @staticmethod in class body raises (staticmethod object not callable).

Within the class body, get_evaluation_result is a staticmethod object, not directly callable. This will fail at import time. Move the helper to module level (preferred) or build the EvaluationResult inline.

Apply this diff to move the helper out of the class and use it:

@@
-class TestEvaluationResults:
+def get_evaluation_result(
+    scenario: Scenario,
+    conversation: ConversationEvaluation,
+) -> EvaluationResult:
+    return EvaluationResult(
+        scenario=scenario,
+        conversations=[conversation],
+        passed=conversation.passed,
+    )
+
+
+class TestEvaluationResults:
@@
-    @staticmethod
-    def get_evaluation_result(
-        scenario: Scenario,
-        conversation: ConversationEvaluation,
-    ) -> EvaluationResult:
-        return EvaluationResult(
-            scenario=scenario,
-            conversations=[conversation],
-            passed=conversation.passed,
-        )
+    # helper moved to module level (see above)
rogue/common/workdir_utils.py (1)

29-35: Critical: credentials are not excluded; PII may be written to disk.

sanitized_config filters keys ending with _key and auth_credentials, but the actual field is evaluated_agent_credentials. This risks persisting credentials to user_config.json.

Apply this diff to robustly exclude keys and None values:

-    config_dict = config.model_dump(mode="json")
-    # Not storing any api keys or credentials
-    sanitized_config = {
-        k: v
-        for k, v in config_dict.items()
-        if not k.endswith("_key") and k != "auth_credentials"
-    }
+    # Not storing any API keys or credentials
+    sanitized_config = config.model_dump(
+        mode="json",
+        exclude={
+            "evaluated_agent_credentials",
+            "judge_llm_api_key",
+        },
+        exclude_none=True,
+    )

Optionally also exclude any key ending with _credentials generically.
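That generic exclusion can be sketched over a plain dict; the field names below mirror the config but are illustrative:

```python
def sanitize_config(config_dict: dict) -> dict:
    """Drop secret-bearing keys and None values before writing to disk."""
    return {
        k: v
        for k, v in config_dict.items()
        if v is not None
        and not k.endswith("_key")
        and not k.endswith("_credentials")
    }


config = {
    "evaluated_agent_url": "http://localhost:10001",
    "evaluated_agent_credentials": "s3cr3t",
    "judge_llm_api_key": "sk-...",
    "deep_test_mode": None,
    "judge_llm": "openai/gpt-4o",
}
sanitized = sanitize_config(config)
```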

rogue/ui/components/config_screen.py (2)

134-139: Error label keys use outdated field names; validation hints won't display correctly.

Pydantic errors will reference evaluated_* fields (e.g., evaluated_agent_url), but error_labels maps "agent_url"/"auth_credentials", so UI won't show the specific field errors.

-        error_labels = {
-            "agent_url": agent_url_error,
-            "auth_credentials": auth_credentials_error,
-            "judge_llm_api_key": judge_llm_api_key_error,
-            # "huggingface_api_key": huggingface_api_key_error,
-        }
+        error_labels = {
+            "evaluated_agent_url": agent_url_error,
+            "evaluated_agent_credentials": auth_credentials_error,
+            "judge_llm_api_key": judge_llm_api_key_error,
+            # "huggingface_api_key": huggingface_api_key_error,
+        }

146-157: State update keys are inconsistent with the new evaluated_ schema.*

You read evaluated_* from config but write legacy keys ("agent_url", "auth_type", "auth_credentials"). Align keys to avoid confusion and ensure downstream consumers rely on one schema.

-    for component, key in [
-        (agent_url, "agent_url"),
+    for component, key in [
+        (agent_url, "evaluated_agent_url"),
         (interview_mode, "interview_mode"),
-        (auth_type, "auth_type"),
-        (auth_credentials, "auth_credentials"),
+        (auth_type, "evaluated_agent_auth_type"),
+        (auth_credentials, "evaluated_agent_credentials"),
         (service_llm, "service_llm"),
         (judge_llm, "judge_llm"),
         (judge_llm_api_key, "judge_llm_api_key"),
         # (huggingface_api_key, "huggingface_api_key"),
         (deep_test_mode, "deep_test_mode"),
         (parallel_runs, "parallel_runs"),
     ]:
rogue/run_cli.py (2)

33-39: Fix argparse type/choices mismatch for --evaluated-agent-auth-type

type=AuthType + choices as strings breaks validation. Use strings for parsing and let Pydantic convert later.

-    parser.add_argument(
+    parser.add_argument(
         "--evaluated-agent-auth-type",
         required=False,
-        type=AuthType,
-        choices=[e.value for e in AuthType],
+        type=str,
+        choices=[e.value for e in AuthType],
         help="How to authenticate with the evaluated agent (if needed)."
         f"Valid options are: {[e.value for e in AuthType]}",
     )

317-319: Avoid logging full CLI input; may leak credentials

cli_input.model_dump() can include API keys and agent credentials. Log only safe metadata.

[security]

-    logger.debug("Running CLI", extra=cli_input.model_dump())
+    logger.debug(
+        "Running CLI",
+        extra={
+            "evaluated_agent_url": cli_input.evaluated_agent_url.encoded_string(),
+            "deep_test_mode": cli_input.deep_test_mode,
+            "judge_llm": cli_input.judge_llm,
+        },
+    )
♻️ Duplicate comments (38)
.github/workflows/rogue.yml (3)

29-34: Consolidate installs into one editable install transaction

Running two separate editable installs can lead to resolution/order drift. Install both in one uv pip call from repo root.

-      - name: Install rogue sdk
-        run: source .venv/bin/activate && uv pip install -e sdks/python
-
-      - name: Install rogue server
-        run: source .venv/bin/activate && uv pip install -e .
+      - name: Install rogue server and SDK (editable)
+        run: source .venv/bin/activate && uv pip install -e . -e sdks/python

40-42: Remove trailing ampersand; background-action handles daemonizing

The background-action manages lifecycle and log tailing. Keeping “&” can orphan processes and break cancellation.

-          run: uv run examples/tshirt_store_agent --host 0.0.0.0 --port 10001 &
+          run: uv run examples/tshirt_store_agent --host 0.0.0.0 --port 10001

44-51: Use background-action without ‘&’; server CLI path is valid in this PR

  • Remove “&” as above.
  • The earlier advice to switch to -m rogue.server is now optional: since this PR adds a top-level server subcommand in rogue/__main__.py, uv run -m rogue server … is valid. Keep it if you prefer unified CLI UX.

Apply:

-          run: uv run -m rogue server --host 0.0.0.0 --port 8000 &
+          run: uv run -m rogue server --host 0.0.0.0 --port 8000

Quick verification script (confirms the server subcommand exists and args are wired):

#!/bin/bash
set -euo pipefail
echo "=== Rogue CLI server subparser and args ==="
rg -n -C3 'add_parser\(\s*"server"' rogue/__main__.py
rg -n -C3 'def set_server_args' rogue/run_server.py
rg -n -C2 '\-\-host|\-\-port' rogue/run_server.py
rogue/server/main.py (2)

56-62: CORS is overly permissive; restrict by environment

Allowing any origin/headers with credentials is risky in production. Restrict origins based on environment and narrow headers/methods as needed.

Example tightening:

+    import os
+    allowed_origins = (
+        os.getenv("ALLOWED_ORIGINS", "").split(",")
+        if os.getenv("ALLOWED_ORIGINS")
+        else ["http://localhost:3000"]
+    )
     app.add_middleware(
         CORSMiddleware,
-        allow_origins=["*"],
-        allow_credentials=True,
-        allow_methods=["GET", "POST", "PUT", "DELETE"],
-        allow_headers=["*"],
+        allow_origins=allowed_origins,
+        allow_credentials=False,
+        allow_methods=["GET", "POST", "PUT", "DELETE"],
+        allow_headers=["Content-Type", "Authorization"],
     )

Note: ensure your chosen allow_methods covers actual usage; CORS middleware will still handle preflight OPTIONS automatically.
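The environment-driven origin list from the suggestion can be factored into a small helper; `ALLOWED_ORIGINS` and the localhost fallback are taken from the diff above:

```python
import os
from typing import List


def allowed_origins(default: str = "http://localhost:3000") -> List[str]:
    """Parse a comma-separated ALLOWED_ORIGINS env var, falling back to default."""
    raw = os.getenv("ALLOWED_ORIGINS", "")
    origins = [o.strip() for o in raw.split(",") if o.strip()]
    return origins or [default]


os.environ.pop("ALLOWED_ORIGINS", None)
dev = allowed_origins()  # no env var set: localhost fallback

os.environ["ALLOWED_ORIGINS"] = "https://app.example.com, https://admin.example.com"
prod = allowed_origins()  # env var set: explicit origin list
```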


103-107: Avoid multiple entrypoints; remove main guard here or in rogue/server/main.py

There’s already a CLI entry via rogue/__main__.py and likely a module entry via rogue/server/__main__.py. Keeping a second if __name__ == "__main__" here can diverge behavior (env vs CLI flags).

Remove the block here and consolidate on the module or CLI entrypoint:

-if __name__ == "__main__":
-    host = os.getenv("HOST", "127.0.0.1")
-    port = int(os.getenv("PORT", "8000"))
-    reload = os.getenv("RELOAD", "false").lower() in ("true", "1", "yes")
-    start_server(host, port, reload)
rogue/ui/components/interviewer.py (3)

38-41: Don't hardcode the server URL; read from config/state.

Use a configurable base URL (from state["config"]["server_url"] or a sensible default) to support different environments.

-                sdk_config = RogueClientConfig(
-                    base_url="http://localhost:8000",
-                    timeout=600.0,
-                )
+                base_url = (config or {}).get("server_url", "http://localhost:8000")
+                sdk_config = RogueClientConfig(
+                    base_url=base_url,
+                    timeout=600.0,
+                )

60-64: Return only the bot text; avoid returning a tuple.

The Chatbot history expects a single string for the bot message. Returning a tuple will leak implementation details into the UI and break downstream usage (e.g., finalize_context stores the tuple as business context).

-                    return (
-                        response.response,
-                        response.is_complete,
-                        response.message_count,
-                    )
+                    return response.response

65-68: Provide a safe fallback on exceptions (avoid returning None).

Currently on failure this returns None, which later gets assigned to history[-1][1]. Return a user-friendly string instead.

                 except Exception:
                     logger.exception("Failed to send interview message")
+                    return "I encountered an error processing your message. Please try again."
                 finally:
                     await sdk.close()
rogue/ui/components/scenario_generator.py (4)

40-45: Don't hardcode the server URL; read from config/state.

Pull base_url from state["config"]["server_url"] (or default) to support non-local environments.

-            sdk_config = RogueClientConfig(
-                base_url="http://localhost:8000",
-                timeout=600.0,
-            )
+            base_url = (config or {}).get("server_url", "http://localhost:8000")
+            sdk_config = RogueClientConfig(
+                base_url=base_url,
+                timeout=600.0,
+            )

51-54: Improve error semantics: log accurately and re-raise to outer handler.

Returning None on error causes downstream None handling issues. Let the outer try/except display the error UI.

-            except Exception:
-                logger.exception("Failed to generate scenarios from LLM response.")
+            except Exception:
+                logger.exception("Failed to generate scenarios via SDK.")
+                raise
             finally:
                 await sdk.close()

56-57: Avoid asyncio.run() inside Gradio; make the handler async and await.

This can raise "asyncio.run() cannot be called from a running event loop".

-        try:
-            scenarios = asyncio.run(generate_scenarios_async())
+        try:
+            scenarios = await generate_scenarios_async()

Outside this hunk, update the callback signature to async:

# Change this:
def generate_and_display_scenarios(state, current_context):

# To this:
async def generate_and_display_scenarios(state, current_context):

46-50: Guard against None model overriding SDK default.

If service_llm is missing, passing model=None can break the call. Provide a fallback.

-                return await sdk.generate_scenarios(
-                    business_context=current_context,
-                    model=service_llm,
-                    api_key=api_key,
-                )
+                return await sdk.generate_scenarios(
+                    business_context=current_context,
+                    model=service_llm or "openai/gpt-4o-mini",
+                    api_key=api_key,
+                )
sdks/python/rogue_sdk/websocket.py (1)

152-172: Track async handler tasks and log exceptions correctly

create_task without tracking can leak tasks; also logger.exception(exc_info=t.exception()) doesn’t attach the traceback in stdlib logging. Track tasks, add a done-callback that logs the exception with a proper exc_info tuple, and cancel/gather on disconnect.

 class RogueWebSocketClient:
@@
         self._stop_event = asyncio.Event()
         self._message_handler_task: Optional[asyncio.Task] = None
+        self._event_handler_tasks: List[asyncio.Task] = []
@@
     async def disconnect(self) -> None:
         """Disconnect from WebSocket."""
         self._stop_event.set()
         self.is_connected = False
+        # Cancel any pending event handler tasks
+        for task in list(self._event_handler_tasks):
+            if not task.done():
+                task.cancel()
+        if self._event_handler_tasks:
+            await asyncio.gather(*self._event_handler_tasks, return_exceptions=True)
+        self._event_handler_tasks.clear()
@@
     def _emit(self, event: WebSocketEventType, data: Any) -> None:
         """Emit event to handlers."""
         handlers = self.event_handlers.get(event, [])
-        for handler in handlers:
+        for handler in handlers:
             try:
                 if asyncio.iscoroutinefunction(handler):
-                    task = asyncio.create_task(handler(event, data))
-                    task.add_done_callback(
-                        lambda t: (
-                            logger.exception(
-                                "WS handler error",
-                                exc_info=t.exception(),
-                            )
-                            if t.exception()
-                            else None
-                        )
-                    )
+                    task = asyncio.create_task(handler(event, data))
+                    self._event_handler_tasks.append(task)
+                    def _on_task_done(t: asyncio.Task):
+                        exc = t.exception()
+                        if exc:
+                            logger.exception(
+                                "WS handler error",
+                                exc_info=(type(exc), exc, exc.__traceback__),
+                            )
+                        # prune finished tasks
+                        self._event_handler_tasks = [
+                            et for et in self._event_handler_tasks if not et.done()
+                        ]
+                    task.add_done_callback(_on_task_done)
                 else:
                     handler(event, data)
             except Exception:
                 logger.exception("Error in event handler")
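Standalone, the track-and-log pattern recommended above can be sketched like this (class and method names are illustrative, not the SDK's):

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class TaskTracker:
    """Tracks fire-and-forget asyncio tasks so they can be cancelled/awaited later."""

    def __init__(self):
        self._tasks: set[asyncio.Task] = set()

    def spawn(self, coro) -> asyncio.Task:
        task = asyncio.create_task(coro)
        self._tasks.add(task)

        def _on_done(t: asyncio.Task) -> None:
            self._tasks.discard(t)  # prune finished tasks
            if not t.cancelled() and t.exception() is not None:
                exc = t.exception()
                # Pass a full (type, value, traceback) tuple so the traceback is rendered
                logger.error(
                    "handler task failed",
                    exc_info=(type(exc), exc, exc.__traceback__),
                )

        task.add_done_callback(_on_done)
        return task

    async def drain(self) -> None:
        """Cancel and await all outstanding tasks (e.g. on disconnect)."""
        for t in list(self._tasks):
            t.cancel()
        if self._tasks:
            await asyncio.gather(*self._tasks, return_exceptions=True)


async def main():
    tracker = TaskTracker()

    async def boom():
        raise ValueError("handler error")

    tracker.spawn(boom())
    await asyncio.sleep(0)  # let the task run and fail; failure is logged, not lost
    await tracker.drain()
    return len(tracker._tasks)


print(asyncio.run(main()))  # → 0
```

The done-callback prunes the set, so `drain()` only has to deal with tasks still in flight at disconnect time.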
rogue/server/services/evaluation_service.py (7)

13-16: Prefer importing the shared websocket_manager singleton directly

You can import websocket_manager instead of get_websocket_manager() to avoid confusion about multiple instances (the manager module already exposes a cached singleton).

-from ..websocket.manager import get_websocket_manager
+from ..websocket.manager import websocket_manager
@@
-        self.websocket_manager = get_websocket_manager()
+        self.websocket_manager = websocket_manager

Also applies to: 24-25


211-213: Guard completion timestamp with the lock

Serialize writes to completed_at.

-            job.completed_at = datetime.now(timezone.utc)
+            async with self._lock:
+                job.completed_at = datetime.now(timezone.utc)
             self._notify_job_update(job)

148-157: Guard completion updates with the lock

Avoid concurrent writes to results/status/progress fields.

-            # Update job with results
-            if final_results and final_results.results:
-                job.results = final_results.results
-                job.status = EvaluationStatus.COMPLETED
-                job.progress = 1.0
-            else:
-                # No results - mark as failed
-                job.status = EvaluationStatus.FAILED
-                job.error_message = "Evaluation completed but no results were generated"
+            # Update job with results
+            async with self._lock:
+                if final_results and final_results.results:
+                    job.results = final_results.results
+                    job.status = EvaluationStatus.COMPLETED
+                    job.progress = 1.0
+                else:
+                    # No results - mark as failed
+                    job.status = EvaluationStatus.FAILED
+                    job.error_message = "Evaluation completed but no results were generated"

90-93: Protect job state updates with the async lock

run_job mutates shared job state without holding the lock, risking races with concurrent readers/writers (e.g., cancel_job).

-            job.status = EvaluationStatus.RUNNING
-            job.started_at = datetime.now(timezone.utc)
+            async with self._lock:
+                job.status = EvaluationStatus.RUNNING
+                job.started_at = datetime.now(timezone.utc)
             self._notify_job_update(job)

196-198: Guard failure updates with the lock

Serialize failure state assignment.

-            job.status = EvaluationStatus.FAILED
-            job.error_message = user_error
+            async with self._lock:
+                job.status = EvaluationStatus.FAILED
+                job.error_message = user_error

131-142: Guard progress mutations with the lock

These updates should be serialized to avoid lost updates or inconsistent views.

                 if update_type == "status":
                     # Update progress based on status messages
                     if "Running scenarios" in str(data):
-                        job.progress = 0.1
+                        async with self._lock:
+                            job.progress = 0.1
                         self._notify_job_update(job)
                 elif update_type == "chat":
                     # Real-time chat updates via WebSocket
                     logger.info(f"Received chat update for job {job_id}: {data}")
                     self._notify_chat_update(job_id, data)
-                    job.progress = min(0.9, job.progress + 0.1)
+                    async with self._lock:
+                        job.progress = min(0.9, job.progress + 0.1)
                     self._notify_job_update(job)

56-67: Fix cancel semantics: return True only when a cancellation happened

Currently returns True even if no state change occurred. Guard with a lock and return based on actual transition.

-    async def cancel_job(self, job_id: str) -> bool:
-        job = await self.get_job(job_id)
-        if not job:
-            return False
-
-        if job.status in [EvaluationStatus.PENDING, EvaluationStatus.RUNNING]:
-            job.status = EvaluationStatus.CANCELLED
-            job.completed_at = datetime.now(timezone.utc)
-            self._notify_job_update(job)
-
-        return True
+    async def cancel_job(self, job_id: str) -> bool:
+        cancelled = False
+        async with self._lock:
+            job = self.jobs.get(job_id)
+            if not job:
+                return False
+            if job.status in [EvaluationStatus.PENDING, EvaluationStatus.RUNNING]:
+                job.status = EvaluationStatus.CANCELLED
+                job.completed_at = datetime.now(timezone.utc)
+                cancelled = True
+        if cancelled:
+            self._notify_job_update(job)
+        return cancelled
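The lock-guarded transition can be exercised in isolation; a minimal sketch, with the job shape invented for illustration:

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Job:
    status: str = "pending"  # pending | running | completed | cancelled


class JobStore:
    def __init__(self):
        self.jobs: dict[str, Job] = {}
        self._lock = asyncio.Lock()

    async def cancel_job(self, job_id: str) -> bool:
        cancelled = False
        async with self._lock:
            job = self.jobs.get(job_id)
            if job is None:
                return False
            if job.status in ("pending", "running"):
                job.status = "cancelled"
                cancelled = True
        # a real implementation would notify listeners here, outside the lock,
        # so slow handlers don't block other mutations
        return cancelled


async def demo():
    store = JobStore()
    store.jobs["a"] = Job(status="running")
    store.jobs["b"] = Job(status="completed")
    return (
        await store.cancel_job("a"),  # True: actually transitioned
        await store.cancel_job("a"),  # False: already cancelled
        await store.cancel_job("b"),  # False: terminal state
        await store.cancel_job("x"),  # False: unknown job
    )


print(asyncio.run(demo()))  # → (True, False, False, False)
```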
sdks/python/rogue_sdk/__init__.py (1)

6-40: Update example in sdks/python/rogue_sdk/__init__.py to use run_evaluation and correctly construct EvaluationRequest

The docstring’s sample still references the removed quick_evaluate method and passes raw dicts to run_evaluation_with_updates, which no longer matches the SDK’s API. Please update the example to use run_evaluation and to wrap parameters in the appropriate Pydantic models.

• File: sdks/python/rogue_sdk/__init__.py
• Lines: 6–40

Suggested diff:

-    import asyncio
-    from rogue_sdk import RogueSDK, RogueClientConfig, AuthType, ScenarioType
+    import asyncio
+    from rogue_sdk import (
+        RogueSDK,
+        RogueClientConfig,
+        AuthType,
+        ScenarioType,
+        EvaluationRequest,
+        AgentConfig,
+        Scenario,
+        Scenarios,
+    )

@@
-        async with RogueSDK(config) as client:
-            # Quick evaluation
-            result = await client.quick_evaluate(
-                agent_url="http://localhost:3000",
-                scenarios=["The agent should be polite", "No discounts allowed"]
-            )
-            print(f"Evaluation result: {result.status}")
+        async with RogueSDK(config) as client:
+            # Quick evaluation
+            result = await client.run_evaluation(
+                agent_url="http://localhost:3000",
+                scenarios=Scenarios(
+                    scenarios=[
+                        Scenario(scenario="The agent should be polite", scenario_type=ScenarioType.POLICY),
+                        Scenario(scenario="No discounts allowed", scenario_type=ScenarioType.POLICY),
+                    ]
+                ),
+                business_context="Customer service context",
+            )
+            print(f"Evaluation status: {result.status}")

@@
-            # Evaluation with real-time updates
-            job = await client.run_evaluation_with_updates(
-                agent_config=AgentConfig(
-                    evaluated_agent_url="http://localhost:3000",
-                    evaluated_agent_auth_type=AuthType.NO_AUTH,
-                    judge_llm="openai/gpt-4o-mini",
-                    business_context="Customer service context",
-                ),
-                scenarios=Scenarios(scenarios=[
-                    Scenario(scenario="Test scenario", scenario_type=ScenarioType.POLICY)
-                ]),
-                on_update=lambda update: print(f"Status: {update.get('status')}"),
-                on_chat=lambda chat: print(f"Chat: {chat}"),
-            )
+            # Evaluation with real-time updates
+            job = await client.run_evaluation_with_updates(
+                request=EvaluationRequest(
+                    agent_config=AgentConfig(
+                        evaluated_agent_url="http://localhost:3000",
+                        evaluated_agent_auth_type=AuthType.NO_AUTH,
+                        judge_llm="openai/gpt-4o-mini",
+                        business_context="Customer service context",
+                    ),
+                    scenarios=[
+                        Scenario(scenario="Test scenario", scenario_type=ScenarioType.POLICY)
+                    ],
+                ),
+                on_update=lambda update: print(f"Status: {update.get('status')}"),
+                on_chat=lambda chat: print(f"Chat: {chat}"),
+            )
rogue/ui/components/scenario_runner.py (6)

134-140: Clamp parallel_runs to UI capacity and enforce >0

Avoid creating more groups than the UI renders and prevent zero/negative batches. This also avoids empty batches and index math issues.

-        parallel_runs = config.get("parallel_runs", 1)
+        parallel_runs = max(1, int(config.get("parallel_runs", 1)))
+        if parallel_runs > MAX_PARALLEL_RUNS:
+            logger.warning(
+                f"parallel_runs={parallel_runs} exceeds UI capacity; clamping to {MAX_PARALLEL_RUNS}"
+            )
+            parallel_runs = MAX_PARALLEL_RUNS
         logger.info(f"Setting up {parallel_runs} parallel runs")

231-250: Synchronous callback schedules async task without error handling

Uncaught exceptions in the created task get lost. At minimum, wrap create_task in try/except or attach a done callback to log task exceptions.

                 def on_chat_update(chat_data):
                     """Handle real-time chat updates from SDK"""
                     logger.info(
                         f"Worker {worker_id}: Received chat update: {chat_data}",
                     )
-                    # Use asyncio.create_task to schedule the async operation
-                    asyncio.create_task(
-                        update_queue.put(
-                            (
-                                worker_id,
-                                "chat",
-                                {
-                                    "role": chat_data.get("role", "assistant"),
-                                    "content": chat_data.get("content", ""),
-                                },
-                            )
-                        )
-                    )
+                    try:
+                        task = asyncio.create_task(
+                            update_queue.put(
+                                (
+                                    worker_id,
+                                    "chat",
+                                    {
+                                        "role": chat_data.get("role", "assistant"),
+                                        "content": chat_data.get("content", ""),
+                                    },
+                                )
+                            )
+                        )
+                        task.add_done_callback(
+                            lambda t: logger.error(f"chat update task failed: {t.exception()}")
+                            if t.exception()
+                            else None
+                        )
+                    except Exception as e:
+                        logger.error(f"Failed to queue chat update: {e}")

252-269: Same issue for status callback; handle task errors

Mirror the error handling pattern used for chat updates.

                 def on_status_update(status_data):
@@
-                    # Use asyncio.create_task to schedule the async operation
-                    asyncio.create_task(
-                        update_queue.put((worker_id, "status", status_msg))
-                    )
+                    try:
+                        task = asyncio.create_task(
+                            update_queue.put((worker_id, "status", status_msg))
+                        )
+                        task.add_done_callback(
+                            lambda t: logger.error(f"status update task failed: {t.exception()}")
+                            if t.exception()
+                            else None
+                        )
+                    except Exception as e:
+                        logger.error(f"Failed to queue status update: {e}")

195-200: Hardcoded server URL; make configurable

Using HttpUrl("http://localhost:8000") hardcodes the server endpoint and breaks non-local deployments. Read from worker_config (or env) with a sensible default.

             logger.info(f"🔌 Worker {worker_id}: Initializing SDK connection")
-            sdk_config = RogueClientConfig(
-                base_url=HttpUrl("http://localhost:8000"),
-                timeout=600.0,
-            )
+            server_url = worker_config.get("server_url", os.getenv("ROGUE_SERVER_URL", "http://localhost:8000"))
+            sdk_config = RogueClientConfig(
+                base_url=server_url,
+                timeout=600.0,
+            )
             sdk = RogueSDK(sdk_config)

Note: add at top of file:

import os

214-221: scenarios is undefined here; pass the batch wrapped in Scenarios

This will raise NameError. Use the current worker’s batch and wrap it with the SDK Scenarios type.

-                # Create evaluation request
-                request = EvaluationRequest(
-                    agent_config=agent_config,
-                    scenarios=scenarios,
-                )
+                # Create evaluation request from this worker's batch
+                sdk_scenarios = Scenarios(scenarios=batch)
+                request = EvaluationRequest(
+                    agent_config=agent_config,
+                    scenarios=sdk_scenarios,
+                )

477-493: Summary generation: hardcoded URL, wrong model, resource leak, and undefined ‘summary’ on failure

  • Hardcoded base_url makes deployments brittle.
  • Model should be the judge model; using service_llm is misleading.
  • sdk.close() should be tied to a context or finally.
  • ‘summary’ is used even if an exception occurs → UnboundLocalError.
-        # Generate summary using SDK (server-based)
-        try:
-            sdk_config = RogueClientConfig(
-                base_url="http://localhost:8000",
-                timeout=600.0,
-            )
-            sdk = RogueSDK(sdk_config)
-
-            summary = await sdk.generate_summary(
-                results=all_results,
-                model=config.get("service_llm"),
-                api_key=config.get("judge_llm_api_key"),
-            )
-
-            await sdk.close()
-        except Exception:
-            logger.exception("Summary generation failed")
+        # Generate summary using SDK (server-based)
+        summary = "Failed to generate summary."
+        try:
+            server_url = config.get("server_url", os.getenv("ROGUE_SERVER_URL", "http://localhost:8000"))
+            sdk_config = RogueClientConfig(
+                base_url=server_url,
+                timeout=600.0,
+            )
+            async with RogueSDK(sdk_config) as sdk:
+                summary = await sdk.generate_summary(
+                    results=all_results,
+                    model=(config.get("judge_llm") or config.get("judge_llm_model") or config.get("service_llm")),
+                    api_key=config.get("judge_llm_api_key"),
+                )
+        except Exception:
+            logger.exception("Summary generation failed")

Note: Ensure import os is present at the top.

rogue/evaluator_agent/evaluator_agent.py (1)

559-572: Return the agent’s response text, not the serialized object

The docstring promises "response" is a string. Returning response.model_dump_json() breaks that contract and confuses downstream consumers.

             logger.info(
                 "✅ A2A call successful - received response from evaluated agent",
                 extra={
                     "response_length": len(agent_response_text),
                     "response_preview": (
                         agent_response_text[:100] + "..."
                         if len(agent_response_text) > 100
                         else agent_response_text
                     ),
                     "context_id": context_id,
                 },
             )
-            return {"response": response.model_dump_json()}
+            return {"response": agent_response_text}
rogue/server/api/evaluation.py (2)

27-33: Use HTTP 202 for async job creation and include Location header

This endpoint enqueues background work. 202 Accepted with a Location to the job resource is more appropriate.

-@router.post("", response_model=EvaluationResponse)
+@router.post("", response_model=EvaluationResponse, status_code=202)
 async def create_evaluation(
     request: EvaluationRequest,
     background_tasks: BackgroundTasks,
+    response: Response,
     evaluation_service: EvaluationService = Depends(get_evaluation_service),
 ):
@@
     await evaluation_service.add_job(job)
     background_tasks.add_task(evaluation_service.run_job, job_id)
@@
-    return EvaluationResponse(
+    # Point clients to the job status endpoint
+    try:
+        response.headers["Location"] = f"/evaluations/{job_id}"
+    except Exception:
+        # best-effort; do not fail the call on header issues
+        pass
+    return EvaluationResponse(
         job_id=job_id,
         status=EvaluationStatus.PENDING,
         message="Evaluation job created successfully",
     )

Add missing import at top:

from fastapi import Response

and annotate response: Response in the signature.

Also applies to: 64-76


101-105: Do not log entire job objects; risk of leaking secrets/PII

Serializing the full job can expose credentials from agent_config or API keys. Log only safe metadata.

[security]

-    job = await evaluation_service.get_job(job_id)
-    logger.info(f"Job: {job}")
+    job = await evaluation_service.get_job(job_id)
+    logger.info(
+        "Retrieved evaluation job",
+        extra={"job_id": job_id, "status": getattr(job, "status", None)},
+    )
rogue/run_cli.py (3)

147-151: Handle health check failures gracefully

Awaiting sdk.health() can raise; exit with a clear message instead of a stack trace.

-        # Check if server is available
-        await sdk.health()
+        # Check if server is available
+        try:
+            await sdk.health()
+        except Exception as e:
+            logger.error(f"Unable to reach Rogue server at {rogue_server_url}: {e}")
+            return None

261-263: Default judge model likely has a typo ("o4-mini" → "gpt-4o-mini")

Other parts of the codebase and providers use gpt-4o-mini. Align default for consistency and fewer surprises.

-    if data.get("judge_llm") is None:
-        data["judge_llm"] = "openai/o4-mini"
+    if data.get("judge_llm") is None:
+        data["judge_llm"] = "openai/gpt-4o-mini"

171-176: Ensure output directory exists before writing results

Without this, write_text can raise FileNotFoundError when parent dirs are missing.

-            evaluation_results_output_path.write_text(
+            evaluation_results_output_path.parent.mkdir(parents=True, exist_ok=True)
+            evaluation_results_output_path.write_text(
                 results.model_dump_json(indent=2, exclude_none=True),
                 encoding="utf-8",
             )
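The parent-directory guard in isolation (paths and payload are placeholders):

```python
import json
import tempfile
from pathlib import Path


def write_results(output_path: Path, results: dict) -> None:
    # Create any missing parent directories before writing,
    # so write_text never fails with FileNotFoundError
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(json.dumps(results, indent=2), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "nested" / "dir" / "results.json"
    write_results(target, {"status": "completed"})
    print(target.exists())  # → True
```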
rogue/server/services/evaluation_library.py (2)

148-161: Add error handling to streaming evaluator

The streaming variant lacks the robust try/except of evaluate_agent. Mirror that to emit/log failures consistently.

-        async for update_type, data in service.evaluate_scenarios():
-            yield update_type, data
+        try:
+            async for update_type, data in service.evaluate_scenarios():
+                yield update_type, data
+        except Exception:
+            logger.exception("💥 EvaluationLibrary.evaluate_agent_streaming failed")
+            raise

181-188: Guard against running inside an existing event loop

asyncio.run() will raise in Jupyter/async contexts. Detect and provide a clearer error.

-        return asyncio.run(
+        try:
+            asyncio.get_running_loop()
+        except RuntimeError:
+            pass  # no running loop; safe to proceed
+        else:
+            raise RuntimeError(
+                "evaluate_agent_sync cannot run inside an active event loop. "
+                "Use evaluate_agent instead."
+            )
+        return asyncio.run(
             EvaluationLibrary.evaluate_agent(
                 agent_config=agent_config,
                 scenarios=scenarios,
                 business_context=business_context,
                 progress_callback=progress_callback,
             )
         )
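A standalone sketch of the running-loop guard, stdlib only (function names are illustrative):

```python
import asyncio


def run_sync(coro):
    """Run a coroutine to completion, refusing to nest inside a live event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop running in this thread; asyncio.run is safe
        return asyncio.run(coro)
    coro.close()  # avoid a "coroutine was never awaited" warning
    raise RuntimeError(
        "run_sync cannot be called from inside an event loop; await the coroutine instead"
    )


async def add(a, b):
    return a + b


print(run_sync(add(2, 3)))  # → 5


async def outer():
    try:
        run_sync(add(1, 1))  # nested call: rejected with a clear error
    except RuntimeError:
        return "refused"


print(asyncio.run(outer()))  # → refused
```

Note the raise happens outside the try block, so the guard's own RuntimeError is not swallowed by the `except RuntimeError` that detects the missing loop.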
sdks/python/rogue_sdk/sdk.py (3)

100-101: Normalize base_url for WebSocket client; strip trailing slash (and drop the no-op + "").

Inconsistent trailing slash handling can break WS URL composition and differs from the HTTP client's normalization.

-        self.ws_client = RogueWebSocketClient(str(self.config.base_url) + "", job_id)
+        self.ws_client = RogueWebSocketClient(str(self.config.base_url).rstrip("/"), job_id)

248-257: Bug: Incorrect use of Pydantic HttpUrl — do not call it as a constructor.

HttpUrl is a type for validation, not a callable. This will raise at runtime. Pass the raw string and let AgentConfig validate it.

-        agent_config = AgentConfig(
-            evaluated_agent_url=HttpUrl(agent_url),
+        agent_config = AgentConfig(
+            evaluated_agent_url=agent_url,
             evaluated_agent_auth_type=auth_type,
             evaluated_agent_credentials=auth_credentials,
             judge_llm=judge_model,
             deep_test_mode=deep_test,
             interview_mode=True,
             parallel_runs=1,
             business_context=business_context,
         )

324-333: Docstring is incorrect; method returns SendMessageResponse, not a tuple.

This is user-facing API documentation and will mislead consumers.

     async def send_interview_message(
         self,
         session_id: str,
         message: str,
     ) -> SendMessageResponse:
-        """
-        Send a message in an interview session.
-
-        Returns:
-            tuple: (response, is_complete, message_count)
-        """
+        """
+        Send a message in an interview session.
+
+        Returns:
+            SendMessageResponse
+        """
         return await self.http_client.send_interview_message(
             session_id=session_id,
             message=message,
         )

Comment on lines +23 to +25
evaluate_agent = EvaluationLibrary.evaluate_agent
evaluate_agent_streaming = EvaluationLibrary.evaluate_agent_streaming
evaluate_agent_sync = EvaluationLibrary.evaluate_agent_sync
Contributor

⚠️ Potential issue

Exported callables are unbound methods; will require a self argument at call time.

Assigning class methods directly exposes unbound functions that still expect self. Callers invoking evaluate_agent(...) will fail.

Use an instance-bound interface:

-from .server.services.evaluation_library import EvaluationLibrary
+from .server.services.evaluation_library import EvaluationLibrary
+_evaluation_library = EvaluationLibrary()
@@
-evaluate_agent = EvaluationLibrary.evaluate_agent
-evaluate_agent_streaming = EvaluationLibrary.evaluate_agent_streaming
-evaluate_agent_sync = EvaluationLibrary.evaluate_agent_sync
+evaluate_agent = _evaluation_library.evaluate_agent
+evaluate_agent_streaming = _evaluation_library.evaluate_agent_streaming
+evaluate_agent_sync = _evaluation_library.evaluate_agent_sync
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-evaluate_agent = EvaluationLibrary.evaluate_agent
-evaluate_agent_streaming = EvaluationLibrary.evaluate_agent_streaming
-evaluate_agent_sync = EvaluationLibrary.evaluate_agent_sync
+from .server.services.evaluation_library import EvaluationLibrary
+_evaluation_library = EvaluationLibrary()
+evaluate_agent = _evaluation_library.evaluate_agent
+evaluate_agent_streaming = _evaluation_library.evaluate_agent_streaming
+evaluate_agent_sync = _evaluation_library.evaluate_agent_sync
🤖 Prompt for AI Agents
In rogue/__init__.py around lines 23-25 the module assigns class methods
directly (evaluate_agent = EvaluationLibrary.evaluate_agent etc.), which exports
unbound functions that still expect a self argument; instantiate
EvaluationLibrary and export bound instance methods instead (e.g., create a
single EvaluationLibrary() object and set evaluate_agent,
evaluate_agent_streaming, evaluate_agent_sync to that object's methods) or
alternatively convert those methods to @staticmethod/@classmethod in the
class—pick the instance-bound approach to avoid changing class semantics.
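The binding pitfall in isolation: accessing a plain method through the class yields a function that still expects `self`, while accessing it through an instance yields a bound method (the class below is invented for illustration):

```python
class Library:
    def evaluate(self, x):
        return x * 2


# Unbound: attribute access on the class yields the plain function
unbound = Library.evaluate
try:
    unbound(21)  # 21 is consumed as `self`; `x` is missing
except TypeError:
    print("TypeError")  # → TypeError

# Bound: attribute access on an instance pre-fills `self`
bound = Library().evaluate
print(bound(21))  # → 42
```

Module-level aliases should therefore be taken from an instance (or the methods converted to `@staticmethod`) before export.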

Comment on lines 13 to 14
evaluated_agent_url: HttpUrl
evaluated_agent_auth_type: AuthType = AuthType.NO_AUTH
Contributor

⚠️ Potential issue

Use AnyHttpUrl to allow localhost/IPs

HttpUrl is strict and often rejects localhost/IPs. Your docs and examples use http://localhost:10001, which will likely fail validation with HttpUrl in Pydantic. Switch to AnyHttpUrl.

Apply this diff:

-from pydantic import BaseModel, Field, HttpUrl, SecretStr, model_validator
+from pydantic import BaseModel, Field, AnyHttpUrl, SecretStr, model_validator
@@
-    evaluated_agent_url: HttpUrl
+    evaluated_agent_url: AnyHttpUrl
@@
-    evaluated_agent_url: HttpUrl | None = Field(default=None)
+    evaluated_agent_url: AnyHttpUrl | None = Field(default=None)

Also applies to: 54-54

🤖 Prompt for AI Agents
In rogue/models/cli_input.py around lines 13-14 (and also update the similar
declaration at line 54), the field typed as HttpUrl is too strict for
localhost/IP addresses used in docs; change the type annotation from HttpUrl to
AnyHttpUrl (importing AnyHttpUrl from pydantic if not already imported) for
evaluated_agent_url (and the other occurrence) so validation accepts localhost
and IP forms.

Comment on lines 18 to 19
input_scenarios_file: Path = workdir / "scenarios.json"
output_report_file: Path = workdir / "report.md"
Contributor

🛠️ Refactor suggestion

Make file path defaults derive from the actual workdir

input_scenarios_file and output_report_file are computed from the class-level default workdir, so if a caller provides a custom workdir, these won’t update accordingly. Mirror the PartialCLIInput approach so they depend on the instance’s workdir.

Apply this diff to make them optional and set them post-init:

-    input_scenarios_file: Path = workdir / "scenarios.json"
-    output_report_file: Path = workdir / "report.md"
+    input_scenarios_file: Path | None = None
+    output_report_file: Path | None = None

Then add this method (outside the selected range) to CLIInput:

def model_post_init(self, __context):
    if self.input_scenarios_file is None:
        self.input_scenarios_file = self.workdir / "scenarios.json"
    if self.output_report_file is None:
        self.output_report_file = self.workdir / "report.md"

Also applies to: 65-74

🤖 Prompt for AI Agents
In rogue/models/cli_input.py around lines 18-19 (and similarly for the fields
defined at lines 65-74), the default paths are computed from the class-level
workdir so they won't reflect a caller-provided workdir; make
input_scenarios_file and output_report_file Optional[Path] with a default of
None, remove the class-level Path joins, and implement a post-init method on
CLIInput named model_post_init(self, __context) that sets each field to
self.workdir / "scenarios.json" and self.workdir / "report.md" respectively only
if the field is None; apply the same pattern to the other occurrence at lines
65-74.

from pydantic import BaseModel, Field

from .chat_history import ChatHistory
from rogue_sdk.types import ChatHistory
Contributor

💡 Verification agent

🧩 Analysis chain

Switch to centralized ChatHistory import looks correct; verify packaging to avoid runtime ImportError

Good move consolidating types to rogue_sdk.types. Please confirm the rogue package declares a runtime dependency on the SDK (e.g., rogue-sdk) so that from rogue_sdk.types import ChatHistory works for end-users who install only the server/package (outside CI where both are installed in editable mode).

Run this script to verify that the root package declares a dependency on the SDK:

If rogue-sdk (or the actual package name for the SDK) isn’t listed in the root pyproject.toml under [project.dependencies], users doing pip install . will hit an ImportError here.


🏁 Script executed:

#!/bin/bash
# Check for 'rogue-sdk' (or similar SDK package name) in project dependencies
echo "Searching for SDK dependency in pyproject files..."
rg -n -C2 -g 'pyproject.toml' -P '(?i)^\s*dependencies\s*=\s*\[|^\s*optional-dependencies|rogue[-_ ]sdk'

# Check that the SDK importable module exists in the repo (sanity)
echo "Verifying SDK module presence..."
fd -H 'types.py' | rg -n 'sdks/python/.*/rogue_sdk/types\.py' -n

Length of output: 1138


Ensure the SDK is declared as a runtime dependency
The root pyproject.toml currently lists

[project]
dependencies = [
  "a2a-sdk==0.2.10",
  "backoff>=2.2.1",
  …
]

but does not include "rogue-sdk". Since rogue/models/prompt_injection.py does

from rogue_sdk.types import ChatHistory

users installing your package (outside of editable mode) will get an ImportError.

• In pyproject.toml under [project.dependencies], add the SDK, e.g.:

[project]
dependencies = [
+  "rogue-sdk>=1.0.0",
   "a2a-sdk==0.2.10",
   "backoff>=2.2.1",
   …
]

• Verify the version constraint matches the SDK’s release (see sdks/python/pyproject.toml).
• Run pip install . in a clean venv to confirm from rogue_sdk.types import ChatHistory succeeds at runtime.

🤖 Prompt for AI Agents
In rogue/models/prompt_injection.py around line 2 the module imports ChatHistory
from rogue_sdk but the root pyproject.toml does not list "rogue-sdk" as a
runtime dependency; add an appropriate rogue-sdk entry under
[project.dependencies] in the root pyproject.toml (use the version constraint
consistent with sdks/python/pyproject.toml), save, then verify by creating a
clean virtualenv and running pip install . and importing from rogue_sdk.types to
confirm the ImportError is resolved.

Comment on lines +6 to +13
Usage:
python -m rogue.server

Environment Variables:
HOST: Server host (default: 127.0.0.1)
PORT: Server port (default: 8000)
RELOAD: Enable auto-reload for development (default: False)
"""

💡 Verification agent

🧩 Analysis chain

Docstring usage may mislead

This file can’t be executed via python -m rogue.server; that command targets rogue/server/__main__.py, which doesn’t exist here. Either add a __main__.py that delegates to this module (preferred) or revise the usage text to python -m rogue.server.main.

Quick check:


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "=== Does rogue/server/__main__.py exist? ==="
fd -a '__main__.py' | rg -n 'rogue/server/__main__\.py' || true

Length of output: 181


Fix module entrypoint in docstring or add __main__.py
The current usage text in rogue/server/main.py claims you can run the server via:

python -m rogue.server

—but there is no rogue/server/__main__.py, so that command will error. To resolve this, choose one of the following:

• Add an entrypoint under rogue/server/__main__.py that imports and invokes main.py, for example:

# rogue/server/__main__.py
from .main import main

if __name__ == "__main__":
    main()

• Or, update the usage section in rogue/server/main.py (lines 6–13) to reference the actual module path:

-    python -m rogue.server
+    python -m rogue.server.main
🤖 Prompt for AI Agents
In rogue/server/main.py around lines 6 to 13, the docstring claims you can run
the server with "python -m rogue.server" but there is no
rogue/server/__main__.py entrypoint; either add a simple __main__.py that
imports and calls main() from main.py (so python -m rogue.server works), or
change the usage text to a correct command (for example "python -m
rogue.server.main" or "python rogue/server/main.py") to reflect the actual
entrypoint; implement one of these two fixes and update the docstring
accordingly.

Comment on lines 81 to 86
cli_input = CLIInputWithMockScenarios(
    evaluated_agent_url=HttpUrl("https://example.com"),
-   judge_llm_model="example-model",
+   judge_llm="example-model",
    business_context="example-context",
    input_scenarios_file=input_scenarios_file_mock,
)

⚠️ Potential issue

Don’t instantiate HttpUrl directly; pass a string and let Pydantic parse.

HttpUrl("https://example.com") is invalid for Pydantic v2 and will raise. Provide a plain string; the model will validate/coerce it.

-        cli_input = CLIInputWithMockScenarios(
-            evaluated_agent_url=HttpUrl("https://example.com"),
+        cli_input = CLIInputWithMockScenarios(
+            evaluated_agent_url="https://example.com",
             judge_llm="example-model",
             business_context="example-context",
             input_scenarios_file=input_scenarios_file_mock,
         )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:
cli_input = CLIInputWithMockScenarios(
    evaluated_agent_url=HttpUrl("https://example.com"),
    judge_llm="example-model",
    business_context="example-context",
    input_scenarios_file=input_scenarios_file_mock,
)

After:
cli_input = CLIInputWithMockScenarios(
    evaluated_agent_url="https://example.com",
    judge_llm="example-model",
    business_context="example-context",
    input_scenarios_file=input_scenarios_file_mock,
)
🤖 Prompt for AI Agents
In rogue/tests/models/test_cli_input.py around lines 81 to 86, the test
instantiates HttpUrl directly which is incompatible with Pydantic v2; instead
pass a plain string (e.g., "https://example.com") for evaluated_agent_url so
Pydantic can validate/parse it when constructing CLIInputWithMockScenarios.

Comment on lines +70 to +71
# Run async function in sync context
bot_message = asyncio.run(handle_interview_message())

🛠️ Refactor suggestion

Avoid asyncio.run() inside a running event loop; make the handler async.

Gradio supports async handlers. Using asyncio.run here is prone to "asyncio.run() cannot be called from a running event loop".

-            # Run async function in sync context
-            bot_message = asyncio.run(handle_interview_message())
+            # Run async in Gradio's event loop
+            bot_message = await handle_interview_message()

Outside this hunk, change the callback signature to async so Gradio can await it:

# Change this:
def respond(message, history, state):

# To this:
async def respond(message, history, state):
🤖 Prompt for AI Agents
In rogue/ui/components/interviewer.py around lines 70-71, the handler uses
asyncio.run(handle_interview_message()) which will fail if an event loop is
already running; change the callback to an async function (e.g., async def
respond(...)) and remove asyncio.run, instead await handle_interview_message()
directly; ensure any call sites and Gradio registration use the async handler
and that handle_interview_message and any downstream functions are awaitable
(add awaits or make them async as needed).
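The failure mode the comment describes is easy to reproduce with the stdlib alone. A minimal sketch (handler names are hypothetical stand-ins for the Gradio callback, not the project's actual functions) contrasting the two styles:

```python
import asyncio

async def handle_message():
    # Stand-in for the SDK call the real handler awaits
    return "bot reply"

async def respond_sync_style():
    # What the current code does: asyncio.run() inside a running loop
    try:
        asyncio.run(handle_message())
    except RuntimeError as err:
        print(f"RuntimeError: {err}")

async def respond_async_style():
    # What the suggestion does: await directly in the already-running loop
    return await handle_message()

async def main():
    await respond_sync_style()
    print(await respond_async_style())

asyncio.run(main())
```

The first call fails because `asyncio.run()` refuses to start a loop while one is running; awaiting the coroutine in the existing loop works, which is why making the Gradio callback `async` is the clean fix.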

Comment on lines +49 to +55
# All types
from .types import *
from .websocket import RogueWebSocketClient

# Version
__version__ = "1.0.0"


🛠️ Refactor suggestion

Avoid wildcard import; export symbols explicitly and keep __all__ in sync

Explicit imports make the public API clear and prevent namespace pollution. Also, __all__ is missing Scenarios, EvaluationResults, and WebSocketEventType.

-# All types
-from .types import *
-from .websocket import RogueWebSocketClient
+# All types - explicit exports
+from .types import (
+    RogueClientConfig,
+    EvaluationRequest,
+    EvaluationResponse,
+    EvaluationJob,
+    EvaluationResult,
+    EvaluationResults,
+    AgentConfig,
+    Scenario,
+    Scenarios,
+    AuthType,
+    ScenarioType,
+    EvaluationStatus,
+    HealthResponse,
+    JobListResponse,
+    WebSocketMessage,
+    WebSocketEventType,
+)
+from .websocket import RogueWebSocketClient
@@
 __all__ = [
     "RogueSDK",
     "RogueHttpClient",
     "RogueWebSocketClient",
     "RogueClientConfig",
     "EvaluationRequest",
     "EvaluationResponse",
     "EvaluationJob",
     "EvaluationResult",
+    "EvaluationResults",
     "AgentConfig",
     "Scenario",
+    "Scenarios",
     "AuthType",
     "ScenarioType",
     "EvaluationStatus",
     "HealthResponse",
     "JobListResponse",
     "WebSocketMessage",
+    "WebSocketEventType",
 ]

Also applies to: 56-74

🤖 Prompt for AI Agents
In sdks/python/rogue_sdk/__init__.py around lines 49 to 55 (and similarly apply
to lines 56 to 74), remove the wildcard import from .types and explicitly import
the public symbols you intend to export (e.g., import Scenarios,
EvaluationResults, WebSocketEventType, and any other public types) and keep an
explicit __all__ list that includes those symbols plus RogueWebSocketClient;
update the top-level exports to only the named imports so the package API is
explicit and __all__ stays in sync with the exported names.
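The mechanics behind this suggestion can be shown with a throwaway module (module and symbol names below are illustrative, not the SDK's): `from pkg import *` exports exactly what `__all__` lists, so any public type omitted from it silently disappears for wildcard importers.

```python
import sys
import types

# Build a throwaway module with one public and one private name,
# and an __all__ that exports only the public one.
mod = types.ModuleType("demo_types")
exec(
    "Scenario = object()\n"
    "_internal = object()\n"
    "EvaluationResults = object()\n"
    "__all__ = ['Scenario']\n",  # EvaluationResults deliberately left out
    mod.__dict__,
)
sys.modules["demo_types"] = mod

ns = {}
exec("from demo_types import *", ns)
# Only names listed in __all__ survive the wildcard import
print(sorted(k for k in ns if not k.startswith("__")))
```

Here `EvaluationResults` is lost to wildcard importers despite being public, which is exactly the drift the explicit-export list prevents.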

Comment on lines +80 to +83
response = await self._client.request(method, endpoint, **kwargs)
response.raise_for_status()
return response.json()


⚠️ Potential issue

Handle 204/empty and non-JSON responses to avoid runtime errors.

response.json() will raise for 204 No Content or non-JSON content types (e.g., some DELETE endpoints). This can break cancel/end flows.

-            response = await self._client.request(method, endpoint, **kwargs)
-            response.raise_for_status()
-            return response.json()
+            response = await self._client.request(method, endpoint, **kwargs)
+            response.raise_for_status()
+
+            # Gracefully handle no-content and non-JSON responses
+            content_type = (response.headers.get("content-type") or "").lower()
+            if response.status_code == 204 or not response.content:
+                return {}
+            if "application/json" in content_type:
+                return response.json()
+            # Fallback: return empty dict to keep downstream parsing safe
+            return {}
📝 Committable suggestion


Suggested change
response = await self._client.request(method, endpoint, **kwargs)
response.raise_for_status()
return response.json()
response = await self._client.request(method, endpoint, **kwargs)
response.raise_for_status()
# Gracefully handle no-content and non-JSON responses
content_type = (response.headers.get("content-type") or "").lower()
if response.status_code == 204 or not response.content:
return {}
if "application/json" in content_type:
return response.json()
# Fallback: return empty dict to keep downstream parsing safe
return {}
🤖 Prompt for AI Agents
In sdks/python/rogue_sdk/client.py around lines 80 to 83, response.json() can
raise for 204 No Content or non-JSON responses; change the return logic to
safely handle those cases by: after response.raise_for_status(), check if
response.status_code == 204 or not response.content and return None (or an empty
value appropriate for callers), otherwise inspect
response.headers.get("Content-Type","") for "application/json" and attempt to
parse JSON inside a try/except (catch JSON decoding errors and return None or
response.text), so non-JSON or empty responses don’t raise runtime errors.


@yuval-qf yuval-qf left a comment

LGTD

@yuval-qf yuval-qf enabled auto-merge (squash) August 17, 2025 14:48
@yuval-qf yuval-qf disabled auto-merge August 17, 2025 14:48
@yuval-qf yuval-qf enabled auto-merge (squash) August 17, 2025 14:52
@yuval-qf yuval-qf merged commit 980dee1 into main Aug 17, 2025
8 of 9 checks passed
@yuval-qf yuval-qf deleted the feature/tui branch August 17, 2025 14:53