FIRE-751 - report to BFF #37

Merged
drorIvry merged 26 commits into main from feature/FIRE-751-rogue-report-to-bff
Sep 11, 2025

Conversation


@drorIvry drorIvry commented Sep 4, 2025

This reverts commit bb3fb1c.

Summary by CodeRabbit

  • New Features

    • Standardized API format for evaluation results and a new report endpoint to send structured summaries to Qualifire.
    • Clients (SDKs, CLI, UIs, TUI) now return and persist a StructuredSummary alongside legacy text.
  • Improvements

    • Summary generation produces structured JSON (overall, key findings, recommendations, detailed breakdown).
    • Added deep_test, judge_model, job_id and robust timestamp handling; Qualifire reporting flows integrated.
  • Tests

    • Added tests validating conversion to the new API format.


linear bot commented Sep 4, 2025

FIRE-751 Feature | rogue Client side reporting the experiments to the server

If the env var / TOML value is set, populate it in the input field.


coderabbitai bot commented Sep 4, 2025

Walkthrough

Adds Pydantic API models and conversion utilities for ApiEvaluationResult; shifts LLM summaries to a StructuredSummary JSON shape with parsing and fallbacks; introduces Qualifire reporting and a /report_summary endpoint; updates SDKs, CLI, UI, TUI, and tests to produce, consume, and display structured summaries.
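As a rough illustration of the StructuredSummary shape the walkthrough describes, the field names below are assumptions based on the summary bullets (overall, key findings, recommendations, detailed breakdown); the PR defines these as Pydantic models, shown here as a stdlib dataclass for brevity:

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class StructuredSummary:
    # Field names are assumptions from the walkthrough above; the actual
    # models live in rogue/server/models/api_format.py as Pydantic classes.
    overall_summary: str
    key_findings: List[str] = field(default_factory=list)
    recommendations: List[str] = field(default_factory=list)
    detailed_breakdown: List[Dict[str, str]] = field(default_factory=list)


summary = StructuredSummary(
    overall_summary="All scenarios passed.",
    key_findings=["No prompt-injection attempts succeeded"],
    recommendations=["Add deeper tests for tool-call abuse"],
)
print(json.dumps(asdict(summary)))
```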

Changes

Cohort / File(s) Summary
Server API models & responses
rogue/server/models/api_format.py, rogue/server/models/__init__.py, rogue/server/api/llm.py
New Api* Pydantic models and ServerSummaryGenerationResponse; re-exports server models; /summary now returns structured summary type and depends on EvaluationService; added /report_summary endpoint and Qualifire reporting flow.
API-format conversion service & exports
rogue/server/services/api_format_service.py, rogue/server/services/__init__.py
New conversion functions (convert_to_api_format, convert_with_structured_summary) mapping EvaluationResults → ApiEvaluationResult with timezone-aware timestamp parsing and StructuredSummary → summary/keyFindings/recommendation mapping; exported via services package.
Qualifire reporting service
rogue/server/services/qualifire_service.py
New QualifireService.report_summary builds API-format payload and POSTs to Qualifire with X-qualifire-key header, 300s timeout, error logging and exception on non-OK responses.
LLM summary generation
rogue/server/services/llm_service.py
Prompts now request structured JSON; generate_summary_from_results strips code fences, parses JSON into StructuredSummary, handles JSONDecodeError with deterministic fallback StructuredSummary, and returns StructuredSummary.
Types & conversion in SDK
sdks/python/rogue_sdk/types.py, rogue/evaluator_agent/evaluator_agent.py
Added StructuredSummary and Api* models; added convert_to_api_format; ApiChatMessage.timestamp uses datetime; scenario shape includes expected_outcome; evaluator logging accepts optional scenario_type.
Python SDK client & public SDK
sdks/python/rogue_sdk/client.py, sdks/python/rogue_sdk/sdk.py
generate_summary gains parameters (qualifire_api_key, job_id, deep_test, judge_model) and now returns (markdown_str, StructuredSummary); client populates new request fields and constructs StructuredSummary from responses.
UI and CLI integration
rogue/ui/components/report_generator.py, rogue/ui/components/scenario_runner.py, rogue/run_cli.py
UI calls convert_with_structured_summary and serializes API-format JSON for display; scenario_runner and CLI forward new summary parameters, handle multi-value SDK return, and persist structured_summary in state.
Server evaluation wiring
rogue/server/api/evaluation.py, rogue/server/api/llm.py
deep_test and judge_model stored on new EvaluationJob fields at creation; server will report to Qualifire when request fields supplied.
TUI (Go) updates
packages/tui/internal/tui/*.go, packages/tui/go.mod
Adds StructuredSummary/SummaryResp types, adapts GenerateSummary/ReportSummary signatures, formats Markdown from structured summary, adds detailedBreakdown table rows, clipboard paste handling, layout adjustments, and adds bubbles table dependency.
TUI parsing & scenarios
packages/tui/internal/tui/eval_ui.go, packages/tui/internal/tui/evaluation.go
EvaluationViewState.Scenarios is now []EvalScenario; scenario loader parses objects with scenario_type and expected_outcome; StartEvaluation/GenerateSummary signatures updated to accept structured scenarios and qualifire/deep_test params.
Tests
rogue/tests/models/test_evaluation_result.py, sdks/python/rogue_sdk/tests/*, rogue/tests/models/test_cli_input.py
Adds test_convert_to_api_format verifying conversion and message timestamps as datetime; other tests primarily reformatting.
Minor formatting & hygiene
examples/*/package.json, rogue/common/*, rogue/server/services/interviewer_service.py, rogue/server/services/scenario_evaluation_service.py, sdks/python/rogue_sdk/websocket.py, ...
Predominantly trailing-newline/trailing-comma and argument reformatting; no behavioral changes.
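For orientation, the Qualifire reporting flow in the table above (POST to {qualifire_url}/api/rogue/v1/report with an X-qualifire-key header and a 300s timeout) could be sketched roughly like this; the function name, return shape, and example URL are illustrative, not the PR's actual QualifireService code:

```python
import json


def build_qualifire_report_request(
    qualifire_url: str,
    api_key: str,
    api_result: dict,
) -> dict:
    # Endpoint path, header name, and timeout come from the change summary
    # above; everything else here is a sketch, not the real implementation.
    return {
        "method": "POST",
        "url": f"{qualifire_url}/api/rogue/v1/report",
        "headers": {
            "X-qualifire-key": api_key,
            "Content-Type": "application/json",
        },
        "timeout": 300.0,
        "content": json.dumps(api_result),
    }


# Example URL is a placeholder; the real base URL comes from configuration.
req = build_qualifire_report_request(
    "https://example.qualifire.invalid", "qk-demo", {"scenarios": []}
)
```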

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant UI as Frontend UI
  participant SDK as Python SDK
  participant Server as Rogue Server
  participant Conv as API Format Service
  participant Qual as Qualifire

  User->>UI: Request report / select tab
  UI->>SDK: generate_summary(results, qualifire_api_key?, job_id?, deep_test?, judge_model?)
  SDK-->>Server: POST /api/v1/llm/summary {job_id, deep_test, qualifire_api_key, judge_model}
  Server->>Server: LLM produces StructuredSummary JSON (parse/validate)
  Server-->>SDK: (markdown, StructuredSummary)
  SDK-->>UI: (markdown, StructuredSummary)
  UI->>Conv: convert_with_structured_summary(results, StructuredSummary, deep_test, start_time, judge_model)
  Conv-->>UI: ApiEvaluationResult
  UI->>UI: render ApiEvaluationResult.model_dump_json(indent=2, exclude_none=True)

  alt report to Qualifire requested
    Server->>Qual: POST {qualifire_url}/api/rogue/v1/report (X-qualifire-key, ApiEvaluationResult JSON)
    Qual-->>Server: response or error
  end
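The "parse/validate" step in the diagram (and the llm_service row above) can be sketched with plain json; the real code strips markdown fences the same way but builds a StructuredSummary model rather than a dict, and the fallback text below is illustrative:

```python
import json


def parse_structured_summary(raw: str) -> dict:
    """Sketch of the fence-strip + parse-with-fallback flow described above."""
    content = raw.strip()
    # Remove markdown code fences the LLM may wrap around the JSON.
    if content.startswith("```json"):
        content = content[7:]
    if content.endswith("```"):
        content = content[:-3]
    try:
        return json.loads(content.strip())
    except json.JSONDecodeError:
        # Deterministic fallback keeps the API shape stable on bad LLM output.
        return {
            "overall_summary": "Summary generation failed to produce valid JSON.",
            "key_findings": [],
            "recommendations": [],
            "detailed_breakdown": [],
        }


parsed = parse_structured_summary('```json\n{"overall_summary": "ok"}\n```')
```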

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks (3 passed)

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "FIRE-751 - report to BFF" is concise and directly related to the PR’s main purpose—adding reporting/integration to the backend (BFF/Qualifire)—so it accurately reflects the primary change being introduced.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.

Poem

I hopped through tests and JSON light,
I stitched timestamps tidy and bright,
Scenarios lined up in order tight,
Carrots of fallback held on tight,
A rabbit cheers — conversions done right. 🥕


📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1e5c5a0 and 7693f96.

📒 Files selected for processing (2)
  • .github/workflows/rogue.yml (1 hunks)
  • packages/tui/go.mod (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/tui/go.mod
🧰 Additional context used
📓 Path-based instructions (1)
.github/**

📄 CodeRabbit inference engine (AGENTS.md)

Keep CI/CD workflow files under .github/

Files:

  • .github/workflows/rogue.yml
🔇 Additional comments (1)
.github/workflows/rogue.yml (1)

58-61: Verify composite action input name and consider event-based model selection

Verification failed — the supplied script produced no output; cannot confirm whether the composite action input is judge_llm or judge_model.

  • Confirm composite action inputs: show .github/actions/rogue/action.{yml,yaml} and verify it defines input "judge_llm".
  • Find all usages: run rg -n --hidden -e 'judge_(llm|model)' . || grep -R -n 'judge_' . and paste results.
  • If "judge_llm" is correct, prefer event-based model selection (PRs → openai/gpt-4.1-mini; main → openai/gpt-4.1) or increase job timeout-minutes if keeping openai/gpt-4.1 for all events.

Return the action file contents and search output for re-verification.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (2)
sdks/python/rogue_sdk/types.py (1)

88-89: Nitpick: Line too long.

Line 88 exceeds the typical 88-character limit for Python code. Consider breaking it into multiple lines.

-                "Authentication Credentials cannot be empty for the selected auth type.",  # noqa: E501
+                "Authentication Credentials cannot be empty for the selected auth type.",
rogue/ui/components/report_generator.py (1)

159-165: Consider using model_dump_json for fallback as well.

The fallback uses str(results) which may not provide a consistent JSON format.

         except Exception as e:
             logger.warning(
                 f"Failed to convert results to API format: {e}",
                 extra={
                     "results": results,
                 },
             )
-            results_json = str(results)
+            results_json = results.model_dump_json(indent=2)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Lite

📥 Commits

Reviewing files that changed from the base of the PR and between bb3fb1c and 3574a98.

📒 Files selected for processing (3)
  • rogue/tests/models/test_evaluation_result.py (8 hunks)
  • rogue/ui/components/report_generator.py (2 hunks)
  • sdks/python/rogue_sdk/types.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
{rogue,tests,examples}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

{rogue,tests,examples}/**/*.py: Format Python code with Black
Organize imports according to isort conventions
Use type hints for all function signatures
Follow PEP 8 naming (snake_case for variables and functions, PascalCase for classes)
Use try/except blocks for code that may raise exceptions
Ensure code passes flake8 linting
Ensure code passes mypy type checks per .mypy.ini
Address Bandit-reported security issues in Python code

Files:

  • rogue/ui/components/report_generator.py
  • rogue/tests/models/test_evaluation_result.py
rogue/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Place application/package code under the rogue/ directory

Files:

  • rogue/ui/components/report_generator.py
  • rogue/tests/models/test_evaluation_result.py
🧬 Code graph analysis (1)
rogue/tests/models/test_evaluation_result.py (1)
rogue/ui/components/report_generator.py (2)
  • convert_to_api_format (44-94)
  • ApiEvaluationResult (38-41)
🔇 Additional comments (3)
rogue/ui/components/report_generator.py (1)

151-166: Good error handling for API conversion.

The try-except block properly handles conversion failures with logging and fallback behavior.

rogue/tests/models/test_evaluation_result.py (2)

168-191: Well-structured API conversion test.

The test thoroughly validates the conversion to the new API format, including all nested structures and data types. Good coverage of the conversion logic.


62-63: Formatting change: trailing commas.

The addition of trailing commas is a good practice for better version control diffs.

Also applies to: 68-69, 75-76, 81-82, 94-95, 100-101, 113-114, 119-120, 132-133, 138-139, 154-155

Comment on lines +244 to +303
# Conversion functions for new API format
def convert_to_api_format(evaluation_results: EvaluationResults) -> ApiEvaluationResult:
    """Convert legacy EvaluationResults to new API format.

    Args:
        evaluation_results: Legacy evaluation results to convert

    Returns:
        ApiEvaluationResult: New format evaluation result
    """
    api_scenarios = []

    for result in evaluation_results.results:
        # Convert conversations to new format
        api_conversations = []
        for conv_eval in result.conversations:
            # Convert ChatHistory messages to ApiChatMessage
            api_messages = []
            for msg in conv_eval.messages.messages:
                timestamp = datetime.now(timezone.utc)
                if msg.timestamp:
                    try:
                        if isinstance(msg.timestamp, str):
                            timestamp = datetime.fromisoformat(
                                msg.timestamp.replace("Z", "+00:00"),
                            )
                        else:
                            timestamp = msg.timestamp
                    except (ValueError, AttributeError):
                        timestamp = datetime.now(timezone.utc)

                api_messages.append(
                    ApiChatMessage(
                        role=msg.role,
                        content=msg.content,
                        timestamp=timestamp,
                    ),
                )

            api_conversations.append(
                ApiConversationEvaluation(
                    passed=conv_eval.passed,
                    messages=api_messages,
                    reason=conv_eval.reason if conv_eval.reason else None,
                ),
            )

        api_scenarios.append(
            ApiScenarioResult(
                description=result.scenario.scenario,
                totalConversations=len(api_conversations),
                flaggedConversations=len(
                    [c for c in api_conversations if not c.passed],
                ),
                conversations=api_conversations,
            ),
        )

    return ApiEvaluationResult(scenarios=api_scenarios)



⚠️ Potential issue

Code duplication detected: convert_to_api_format function.

The convert_to_api_format function is duplicated between sdks/python/rogue_sdk/types.py and rogue/ui/components/report_generator.py. This creates a significant maintenance burden.

The function should exist in only one location. Since the SDK is the source of truth for types, keep it in sdks/python/rogue_sdk/types.py and import it in the UI component.

🤖 Prompt for AI Agents
In sdks/python/rogue_sdk/types.py around lines 244-303 the convert_to_api_format
implementation is duplicated elsewhere; keep the canonical implementation here
and remove the duplicate in rogue/ui/components/report_generator.py, then make
the function importable from the SDK (ensure it is top-level in types.py and
exported via the package __init__ if needed), update report_generator.py to
import convert_to_api_format from sdks.python.rogue_sdk.types (or from the
package export) and remove the local copy, and run tests/flake8 to ensure
imports and references are correct.



# Conversion functions for new API format
def convert_to_api_format(evaluation_results: EvaluationResults) -> ApiEvaluationResult:


Design suggestion:
For each class API...(BaseModel) create a classmethod to convert from the regular type. For example:

class ApiChatMessage(BaseModel):
    """Chat message for new API format with datetime timestamp."""

    role: str
    content: str
    timestamp: datetime
    
    @classmethod
    def from_chat_message(cls, message: ChatMessage) -> Self:
        timestamp = datetime.now(timezone.utc)
        if message.timestamp:
            try:
                if isinstance(message.timestamp, str):
                    timestamp = datetime.fromisoformat(
                        message.timestamp.replace("Z", "+00:00"),
                    )
                else:
                    timestamp = message.timestamp
            except (ValueError, AttributeError):
                timestamp = datetime.now(timezone.utc)

        return cls(
            role=message.role,
            content=message.content,
            timestamp=timestamp,
        )

The same for the rest


@yuval-qf yuval-qf left a comment


Left a question and a design suggestion


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
rogue/ui/components/scenario_runner.py (1)

484-501: Always close SDK in finally to avoid connection leak.

If generate_summary raises, sdk.close() is skipped.

Apply:

-        # Generate summary using SDK (server-based)
-        summary = "Summary generation failed."
-        try:
+        # Generate summary using SDK (server-based)
+        summary = "Summary generation failed."
+        sdk: RogueSDK | None = None
+        try:
             sdk_config = RogueClientConfig(
                 base_url=state.get("rogue_server_url", "http://localhost:8000"),
                 timeout=600.0,
             )
-            sdk = RogueSDK(sdk_config)
+            sdk = RogueSDK(sdk_config)
 
             summary = await sdk.generate_summary(
                 results=all_results,
                 model=config.get("service_llm"),
                 api_key=config.get("judge_llm_api_key"),
             )
 
-            await sdk.close()
         except Exception:
             logger.exception("Summary generation failed")
+        finally:
+            if sdk:
+                await sdk.close()
sdks/python/rogue_sdk/websocket.py (1)

156-169: Avoid CancelledError in done-callback and duplicate exception lookups.

Task.exception() raises on cancelled tasks; the lambda also calls it twice.

Apply:

-                if asyncio.iscoroutinefunction(handler):
-                    task = asyncio.create_task(handler(event, data))
-                    task.add_done_callback(
-                        lambda t: (
-                            logger.exception(
-                                "WS handler error",
-                                exc_info=t.exception(),
-                            )
-                            if t.exception()
-                            else None
-                        ),
-                    )
+                if asyncio.iscoroutinefunction(handler):
+                    task = asyncio.create_task(handler(event, data))
+                    def _on_done(t: asyncio.Task) -> None:
+                        try:
+                            exc = t.exception()
+                        except asyncio.CancelledError:
+                            return
+                        if exc:
+                            logger.exception("WS handler error", exc_info=exc)
+                    task.add_done_callback(_on_done)
rogue/server/api/llm.py (1)

23-24: Add an explicit return type to satisfy typing guidelines.

The function lacks a return annotation under rogue/**/*.py rules.

-@router.post("/scenarios", response_model=ScenarioGenerationResponse)
-async def generate_scenarios(request: ScenarioGenerationRequest):
+@router.post("/scenarios", response_model=ScenarioGenerationResponse)
+async def generate_scenarios(
+    request: ScenarioGenerationRequest,
+) -> ScenarioGenerationResponse:
♻️ Duplicate comments (2)
sdks/python/rogue_sdk/types.py (1)

210-242: Consolidate API models to a single source of truth.

Api* models are duplicated here and in rogue/server/models/api_format.py. Keep one canonical set (SDK or server) and import from it to avoid drift.

#!/bin/bash
rg -nP '(class\s+Api(ChatMessage|ConversationEvaluation|ScenarioResult|EvaluationResult)\b)' \
  sdks/python/rogue_sdk/types.py rogue/server/models/api_format.py rogue/ui/components -S
rogue/tests/models/test_evaluation_result.py (1)

11-12: Import from the SDK types to avoid server/SDK duplication.

Use the canonical SDK exports to keep tests decoupled from server internals (matches prior feedback).

-from rogue.server.services.api_format_service import convert_to_api_format
-from rogue.server.models.api_format import ApiEvaluationResult, StructuredSummary
+from rogue_sdk.types import (
+    convert_to_api_format,
+    ApiEvaluationResult,
+    StructuredSummary,
+)
🧹 Nitpick comments (7)
rogue/server/services/__init__.py (1)

2-2: Export looks good. Consider aligning with isort order.

Optional: keep the import group alphabetized to satisfy isort consistently.

rogue/server/models/api_format.py (2)

41-57: PEP 8 naming vs API casing: prefer snake_case with aliases.

Fields like totalConversations, flaggedConversations, startTime, judgeModel break snake_case. Consider snake_case internals with Pydantic v2 aliases to keep API JSON camelCase.

Example:

-from pydantic import BaseModel
+from pydantic import BaseModel, Field, ConfigDict

 class ApiScenarioResult(BaseModel):
+    model_config = ConfigDict(populate_by_name=True)
-    totalConversations: Optional[int] = None
-    flaggedConversations: Optional[int] = None
+    total_conversations: Optional[int] = Field(
+        default=None, serialization_alias="totalConversations", validation_alias="totalConversations"
+    )
+    flagged_conversations: Optional[int] = Field(
+        default=None, serialization_alias="flaggedConversations", validation_alias="flaggedConversations"
+    )

Same approach for ApiEvaluationResult.startTime -> start_time, judgeModel -> judge_model, and keyFindings/recommendation if needed.


16-20: Use a stronger type for detailed_breakdown.

Replace List[dict] with a dedicated model (e.g., BreakdownRow) to get validation and IDE help.
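A minimal sketch of the suggested BreakdownRow, shown as a stdlib dataclass for brevity (in the PR it would be a Pydantic BaseModel); the field names are assumptions taken from the scenario/status/outcome columns used elsewhere in this review:

```python
from dataclasses import dataclass


@dataclass
class BreakdownRow:
    # Field names assumed from the scenario/status/outcome table columns
    # rendered by the SDK; a real version would be a Pydantic BaseModel.
    scenario: str
    status: str
    outcome: str


row = BreakdownRow(
    scenario="PII leakage probe",
    status="passed",
    outcome="No sensitive data disclosed",
)
```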

sdks/python/rogue_sdk/types.py (1)

263-274: Log parse failures before falling back to now().

Silent fallback hides data issues; add a warning with the offending value and error.

Apply:

-                    except (ValueError, AttributeError):
-                        timestamp = datetime.now(timezone.utc)
+                    except (ValueError, AttributeError) as parse_err:
+                        logger.warning(
+                            f"Failed to parse timestamp: {msg.timestamp!r}; "
+                            f"falling back to now. error={parse_err}"
+                        )
+                        timestamp = datetime.now(timezone.utc)
sdks/python/rogue_sdk/sdk.py (1)

299-324: Include “Detailed Breakdown” in markdown output (optional).

Add a section rendering the per-scenario breakdown to improve UI completeness.

         structured_summary = response_data.summary
         if hasattr(structured_summary, "overall_summary"):
             # Format as markdown string for UI display
             summary_parts = [
                 f"# Evaluation Results Summary\n\n## Overall Summary\n"
                 f"{structured_summary.overall_summary}",
             ]

             if structured_summary.key_findings:
                 findings = "\n".join(
                     f"- {finding}" for finding in structured_summary.key_findings
                 )
                 summary_parts.append(f"\n---\n\n## Key Findings\n{findings}")

             if structured_summary.recommendations:
                 recommendations = "\n".join(
                     f"{i + 1}. {rec}"
                     for i, rec in enumerate(structured_summary.recommendations)
                 )
                 summary_parts.append(f"\n---\n\n## Recommendations\n{recommendations}")
+
+            if getattr(structured_summary, "detailed_breakdown", None):
+                rows = [
+                    "| Scenario | Status | Outcome |",
+                    "|---|---|---|",
+                ]
+                for item in structured_summary.detailed_breakdown:
+                    scenario = item.get("scenario", "")
+                    status = item.get("status", "")
+                    outcome = item.get("outcome", "")
+                    rows.append(f"| {scenario} | {status} | {outcome} |")
+                table = "\n".join(rows)
+                summary_parts.append(f"\n---\n\n## Detailed Breakdown\n{table}")
 
             return "\n".join(summary_parts)
         else:
             # Fallback for string response
             return str(structured_summary)
rogue/tests/models/test_evaluation_result.py (1)

213-217: Remove duplicated assertions.

These repeat checks already made on Lines 197–201.

-        # Test new fields
-        assert api_format.summary == "Test summary for overall evaluation"
-        assert api_format.keyFindings == "• Key finding 1\n• Key finding 2"
-        assert api_format.recommendation == "• Recommendation 1\n• Recommendation 2"
-        assert api_format.deepTest is True
-        assert api_format.judgeModel == "openai/gpt-4o-mini"
-        assert isinstance(api_format.startTime, datetime)
+        assert isinstance(api_format.startTime, datetime)
rogue/server/services/llm_service.py (1)

253-263: Deduplicate code-fence stripping.

Factor into a tiny helper to avoid repeating logic and to keep parsing clear.

+def _strip_json_fences(content: str) -> str:
+    content = content.strip()
+    if content.startswith("```json"):
+        content = content[7:]
+    if content.endswith("```"):
+        content = content[:-3]
+    return content.strip()
+
             # Parse the JSON response from the LLM
-            content = response.choices[0].message.content.strip()
-
-            # Remove markdown code blocks if present
-            if content.startswith("```json"):
-                content = content[7:]
-            if content.endswith("```"):
-                content = content[:-3]
-            content = content.strip()
+            content = _strip_json_fences(response.choices[0].message.content)
 
             # Parse JSON and create StructuredSummary
             summary_data = json.loads(content)
             return StructuredSummary(**summary_data)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Lite

📥 Commits

Reviewing files that changed from the base of the PR and between 3574a98 and dc2f6e9.

📒 Files selected for processing (23)
  • examples/js/cli/package.json (1 hunks)
  • examples/js/langgraph-js-example/package.json (1 hunks)
  • examples/js/vercel-ai-example/package.json (1 hunks)
  • examples/tshirt_store_agent/tshirt_store_agent_executor.py (4 hunks)
  • packages/tui/internal/theme/themes/vesper.json (1 hunks)
  • rogue/common/generic_agent_executor.py (4 hunks)
  • rogue/common/remote_agent_connection.py (2 hunks)
  • rogue/server/api/llm.py (3 hunks)
  • rogue/server/models/api_format.py (1 hunks)
  • rogue/server/services/__init__.py (1 hunks)
  • rogue/server/services/interviewer_service.py (3 hunks)
  • rogue/server/services/llm_service.py (6 hunks)
  • rogue/server/services/scenario_evaluation_service.py (2 hunks)
  • rogue/tests/models/test_cli_input.py (1 hunks)
  • rogue/tests/models/test_evaluation_result.py (8 hunks)
  • rogue/ui/components/config_screen.py (4 hunks)
  • rogue/ui/components/report_generator.py (2 hunks)
  • rogue/ui/components/scenario_runner.py (1 hunks)
  • sdks/python/rogue_sdk/client.py (3 hunks)
  • sdks/python/rogue_sdk/sdk.py (5 hunks)
  • sdks/python/rogue_sdk/tests/test_types.py (2 hunks)
  • sdks/python/rogue_sdk/types.py (5 hunks)
  • sdks/python/rogue_sdk/websocket.py (2 hunks)
✅ Files skipped from review due to trivial changes (11)
  • rogue/tests/models/test_cli_input.py
  • examples/js/langgraph-js-example/package.json
  • examples/js/vercel-ai-example/package.json
  • packages/tui/internal/theme/themes/vesper.json
  • sdks/python/rogue_sdk/tests/test_types.py
  • rogue/ui/components/config_screen.py
  • examples/js/cli/package.json
  • rogue/server/services/interviewer_service.py
  • rogue/server/services/scenario_evaluation_service.py
  • examples/tshirt_store_agent/tshirt_store_agent_executor.py
  • rogue/common/remote_agent_connection.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • rogue/ui/components/report_generator.py
🧰 Additional context used
📓 Path-based instructions (2)
{rogue,tests,examples}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

{rogue,tests,examples}/**/*.py: Format Python code with Black
Organize imports according to isort conventions
Use type hints for all function signatures
Follow PEP 8 naming (snake_case for variables and functions, PascalCase for classes)
Use try/except blocks for code that may raise exceptions
Ensure code passes flake8 linting
Ensure code passes mypy type checks per .mypy.ini
Address Bandit-reported security issues in Python code

Files:

  • rogue/server/services/__init__.py
  • rogue/server/models/api_format.py
  • rogue/ui/components/scenario_runner.py
  • rogue/server/api/llm.py
  • rogue/tests/models/test_evaluation_result.py
  • rogue/server/services/llm_service.py
  • rogue/common/generic_agent_executor.py
rogue/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Place application/package code under the rogue/ directory

Files:

  • rogue/server/services/__init__.py
  • rogue/server/models/api_format.py
  • rogue/ui/components/scenario_runner.py
  • rogue/server/api/llm.py
  • rogue/tests/models/test_evaluation_result.py
  • rogue/server/services/llm_service.py
  • rogue/common/generic_agent_executor.py
🔇 Additional comments (6)
rogue/ui/components/scenario_runner.py (1)

486-491: Sanity-check RogueClientConfig.base_url handling.

The SDK’s field validator may attempt to construct HttpUrl(v) from str, which can be invalid in Pydantic v2. Verify it accepts a plain string safely.

Run:

#!/bin/bash
rg -n -C3 'class RogueClientConfig' sdks/python/rogue_sdk/types.py
rg -n -C3 'def validate_base_url' sdks/python/rogue_sdk/types.py
rogue/common/generic_agent_executor.py (1)

151-156: LGTM — formatting-only changes.

No semantic impact detected.

Also applies to: 159-163, 186-190, 195-201

sdks/python/rogue_sdk/types.py (1)

448-449: Confirm all callers updated for structured summary.

SummaryGenerationResponse.summary now uses StructuredSummary. Ensure UI/report code doesn’t assume a plain str.

sdks/python/rogue_sdk/client.py (1)

167-186: Structured-summary handling looks correct and backward-compatible.

Graceful dict-check + legacy fallback is solid. No action needed.

rogue/server/api/llm.py (1)

61-65: API switch to ServerSummaryGenerationResponse looks good.

Return type aligns with server-side StructuredSummary.

rogue/server/services/llm_service.py (1)

264-283: Good fallbacks on parse/system errors.

Deterministic StructuredSummary on failure keeps API stable.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
rogue/server/models/__init__.py (1)

3-15: Consider re-exporting StructuredSummary for a single, stable import path.

Several callers (e.g., services) need StructuredSummary; exposing it here avoids importing from a submodule directly and aligns with “centralize public API surface”.

 from .api_format import (
     ApiChatMessage,
     ApiConversationEvaluation,
     ApiEvaluationResult,
     ApiScenarioResult,
+    StructuredSummary,
 )
 
 __all__ = [
     "ApiChatMessage",
     "ApiConversationEvaluation",
     "ApiEvaluationResult",
     "ApiScenarioResult",
+    "StructuredSummary",
 ]
rogue/server/services/api_format_service.py (3)

84-86: Avoid building a temporary list when counting flagged conversations.

Use a generator with sum for lower memory and slightly cleaner intent.

-                flaggedConversations=len(
-                    [c for c in api_conversations if not c.passed],
-                ),
+                flaggedConversations=sum(1 for c in api_conversations if not c.passed),

12-18: Optional: Import via the package’s public surface once StructuredSummary is re-exported.

Keeps imports consistent and respects the “single import path” intent.

-from ..models.api_format import (
+from ..models import (
     ApiChatMessage,
     ApiConversationEvaluation,
     ApiEvaluationResult,
     ApiScenarioResult,
     StructuredSummary,
 )

Please confirm that rogue/server/models/__init__.py re-exports StructuredSummary before applying.


91-104: Minor: Preserve list structure for key findings and recommendations if the API model supports list[str].

If ApiEvaluationResult.keyFindings/recommendation are lists, emitting newline-joined strings forces the UI to split. If the model expects strings, ignore this.

If lists are supported:

-        key_findings = "\n".join(
-            f"• {finding}" for finding in structured_summary.key_findings
-        )
-        recommendation = "\n".join(
-            f"• {rec}" for rec in structured_summary.recommendations
-        )
+        key_findings = list(structured_summary.key_findings)
+        recommendation = list(structured_summary.recommendations)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Lite

📥 Commits

Reviewing files that changed from the base of the PR and between dc2f6e9 and d682161.

📒 Files selected for processing (2)
  • rogue/server/models/__init__.py (1 hunks)
  • rogue/server/services/api_format_service.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
{rogue,tests,examples}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

{rogue,tests,examples}/**/*.py: Format Python code with Black
Organize imports according to isort conventions
Use type hints for all function signatures
Follow PEP 8 naming (snake_case for variables and functions, PascalCase for classes)
Use try/except blocks for code that may raise exceptions
Ensure code passes flake8 linting
Ensure code passes mypy type checks per .mypy.ini
Address Bandit-reported security issues in Python code

Files:

  • rogue/server/services/api_format_service.py
  • rogue/server/models/__init__.py
rogue/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Place application/package code under the rogue/ directory

Files:

  • rogue/server/services/api_format_service.py
  • rogue/server/models/__init__.py
🔇 Additional comments (1)
rogue/server/models/__init__.py (1)

1-2: Good module surface + docstring.

Clean re-export hub; no runtime logic.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
sdks/python/rogue_sdk/sdk.py (1)

287-335: Fix return type and robustly coerce summary into StructuredSummary

The second return value can be None or a non-model (e.g., dict/str). The current signature Tuple[str, StructuredSummary] is unsound and will fail mypy/consumers. Also, prefer model_validate/isinstance over hasattr.

Apply:

@@
-    ) -> Tuple[str, StructuredSummary]:
+    ) -> Tuple[str, Optional[StructuredSummary]]:
@@
-        # Convert structured summary back to string format for backward compatibility
-        structured_summary = response_data.summary
-        if hasattr(structured_summary, "overall_summary"):
+        # Try to coerce the server response into StructuredSummary
+        raw_summary = response_data.summary
+        structured_summary: Optional[StructuredSummary]
+        try:
+            structured_summary = (
+                raw_summary
+                if isinstance(raw_summary, StructuredSummary)
+                else StructuredSummary.model_validate(raw_summary)
+                if raw_summary is not None
+                else None
+            )
+        except Exception:
+            structured_summary = None
+
+        if structured_summary is not None:
@@
-            return "\n".join(summary_parts), structured_summary
-        else:
-            # Fallback for string response
-            return str(structured_summary), structured_summary
+            return "\n".join(summary_parts), structured_summary
+        else:
+            # Fallback for non-structured/None response
+            return ("" if raw_summary is None else str(raw_summary)), None

Run to find other call sites that still expect a single return (and update them to destructure):

#!/bin/bash
# Find generate_summary usages and show surrounding lines
rg -nC2 --type=py '\bgenerate_summary\s*\('
# Heuristic: calls not destructuring assignment on same line
rg -nP --type=py '(^|\s)(await\s+)?(\w+\s*=\s*)?sdk\.generate_summary\s*\('
rogue/run_cli.py (1)

206-210: Avoid flake8 F841: unused structured_summary

You’re not using the second return value. Assign to underscore.

Apply:

-        summary, structured_summary = await sdk.generate_summary(
+        summary, _ = await sdk.generate_summary(
             results=results,
             model=judge_llm,
             api_key=judge_llm_api_key,
         )
🧹 Nitpick comments (1)
rogue/run_cli.py (1)

214-216: Specify encoding when writing the report

Match the UTF-8 encoding you use for results to avoid platform-default discrepancies.

Apply:

-    output_report_file.write_text(summary)
+    output_report_file.write_text(summary, encoding="utf-8")
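A quick sketch of why the explicit encoding matters; without it, `write_text` falls back to `locale.getpreferredencoding()`, which can be cp1252 on Windows and reject non-Latin-1 characters (the file name below is illustrative):

```python
import tempfile
from pathlib import Path

report = "résumé: 3 scenarios passed ✓"

with tempfile.TemporaryDirectory() as d:
    out = Path(d) / "report.md"
    # Explicit UTF-8 makes the bytes identical on every platform;
    # the platform default could raise UnicodeEncodeError here.
    out.write_text(report, encoding="utf-8")
    assert out.read_text(encoding="utf-8") == report
```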
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Lite

📥 Commits

Reviewing files that changed from the base of the PR and between d682161 and 1d7e28a.

📒 Files selected for processing (4)
  • rogue/run_cli.py (1 hunks)
  • rogue/ui/components/report_generator.py (2 hunks)
  • rogue/ui/components/scenario_runner.py (1 hunks)
  • sdks/python/rogue_sdk/sdk.py (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • rogue/ui/components/scenario_runner.py
🧰 Additional context used
📓 Path-based instructions (2)
{rogue,tests,examples}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

{rogue,tests,examples}/**/*.py: Format Python code with Black
Organize imports according to isort conventions
Use type hints for all function signatures
Follow PEP 8 naming (snake_case for variables and functions, PascalCase for classes)
Use try/except blocks for code that may raise exceptions
Ensure code passes flake8 linting
Ensure code passes mypy type checks per .mypy.ini
Address Bandit-reported security issues in Python code

Files:

  • rogue/run_cli.py
  • rogue/ui/components/report_generator.py
rogue/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Place application/package code under the rogue/ directory

Files:

  • rogue/run_cli.py
  • rogue/ui/components/report_generator.py
🔇 Additional comments (5)
sdks/python/rogue_sdk/sdk.py (4)

118-120: LGTM: clearer error message formatting

Trailing comma is fine under Black; message text reads well.


196-198: LGTM: clearer exception path

Good use of trailing comma and specific message.


214-215: LGTM: explicit WebSocket error copy-through

Message conveys source error succinctly.


231-233: LGTM: TimeoutError message

Clear and actionable timeout text.

rogue/ui/components/report_generator.py (1)

8-9: Fix relative import depth: use two dots, not three.

from ...server... attempts to import beyond the top-level package from rogue/ui/components/. This will raise "attempted relative import beyond top-level package" at runtime.

Apply:

-from ...server.services.api_format_service import convert_with_structured_summary
+from ..server.services.api_format_service import convert_with_structured_summary

Likely an incorrect or invalid review comment.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (6)
packages/tui/internal/tui/evaluation.go (3)

95-107: Name the nested types for reuse and clarity

Defining anonymous structs makes reuse/tests harder. Introduce named items for the summary block and breakdown entries.

Apply:

-type SummaryResp struct {
-	Summary struct {
-		OverallSummary    string   `json:"overall_summary"`
-		KeyFindings       []string `json:"key_findings"`
-		Recommendations   []string `json:"recommendations"`
-		DetailedBreakdown []struct {
-			Scenario string `json:"scenario"`
-			Status   string `json:"status"`
-			Outcome  string `json:"outcome"`
-		} `json:"detailed_breakdown"`
-	} `json:"summary"`
-	Message string `json:"message"`
-}
+type DetailedBreakdownItem struct {
+	Scenario string `json:"scenario"`
+	Status   string `json:"status"`
+	Outcome  string `json:"outcome"`
+}
+
+type SummaryBlock struct {
+	OverallSummary    string                 `json:"overall_summary"`
+	KeyFindings       []string               `json:"key_findings"`
+	Recommendations   []string               `json:"recommendations"`
+	DetailedBreakdown []DetailedBreakdownItem `json:"detailed_breakdown"`
+}
+
+type SummaryResp struct {
+	Summary SummaryBlock `json:"summary"`
+	Message string       `json:"message"`
+}

462-474: Guard empty results and omit api_key when empty

Avoid sending an empty results envelope and exclude api_key when not provided.

Apply:

-	if job.Results == nil {
-		return nil, fmt.Errorf("no results available for job %s", jobID)
-	}
+	if len(job.Results) == 0 {
+		return nil, fmt.Errorf("no results available for job %s", jobID)
+	}
@@
-	summaryReq := map[string]interface{}{
-		"model":   model,
-		"api_key": apiKey,
-		"results": map[string]interface{}{
-			"results": job.Results,
-		},
-	}
+	summaryReq := map[string]interface{}{
+		"model": model,
+		"results": map[string]interface{}{
+			"results": job.Results,
+		},
+	}
+	if apiKey != "" {
+		summaryReq["api_key"] = apiKey
+	}

484-485: Set Accept header for explicit JSON negotiation

Small hardening for content-type expectations.

Apply:

 req.Header.Set("Content-Type", "application/json")
+req.Header.Set("Accept", "application/json")
packages/tui/internal/tui/report_view.go (1)

25-25: Main container height likely truncates content

Using m.height-12 can clip background/layout. Prefer full height unless there’s a deliberate reserved region.

Apply:

-		Height(m.height - 12).
+		Height(m.height).
packages/tui/internal/tui/app.go (2)

160-174: New detailedBreakdown state is unused

The Model field detailedBreakdown []table.Row isn’t populated or rendered yet. Either wire it into ReportScreen (e.g., maintain a table.Model with headers/rows) or defer adding the field until the table view lands to avoid dead state.

Would you like a follow-up patch to populate rows from structuredSummary.Summary.DetailedBreakdown and render a table in renderReport?


269-269: Report viewport height reduced to 15 lines

This tightens visible report space; verify it plays well with typical summary lengths and doesn’t regress usability on smaller terminals. Consider making this responsive to terminal height (min/max) if needed.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Lite

📥 Commits

Reviewing files that changed from the base of the PR and between 1d7e28a and 6e0d9ee.

⛔ Files ignored due to path filters (1)
  • packages/tui/go.sum is excluded by !**/*.sum
📒 Files selected for processing (4)
  • packages/tui/go.mod (1 hunks)
  • packages/tui/internal/tui/app.go (4 hunks)
  • packages/tui/internal/tui/evaluation.go (4 hunks)
  • packages/tui/internal/tui/report_view.go (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
packages/tui/internal/tui/report_view.go (1)
packages/tui/internal/styles/styles.go (3)
  • NewStyle (45-47)
  • Border (23-23)
  • Style (40-42)
packages/tui/internal/tui/evaluation.go (2)
packages/tui/internal/tui/app.go (1)
  • Scenario (203-207)
packages/sdk/src/types.ts (1)
  • Scenario (41-47)
packages/tui/internal/tui/app.go (5)
packages/sdk/src/types.ts (1)
  • Scenario (41-47)
packages/tui/internal/components/command_line.go (1)
  • CommandInput (21-30)
packages/tui/internal/components/dialog.go (1)
  • Dialog (38-51)
packages/tui/internal/components/scenario_editor.go (1)
  • ScenarioEditor (33-71)
packages/tui/internal/components/viewport.go (1)
  • NewViewport (74-93)
🔇 Additional comments (5)
packages/tui/go.mod (1)

26-26: Confirm v2 imports and table API compatibility

  • All bubbles, bubbletea, and lipgloss imports reference /v2 (no lingering v1 paths). Please manually verify that the github.com/charmbracelet/bubbles/v2/table usage in packages/tui/internal/tui/app.go matches the v2.0.0-beta.1 API.
packages/tui/internal/tui/evaluation.go (1)

455-509: Signature change ripple check complete

All call sites of GenerateSummary now correctly handle the new *SummaryResp return:

  • app.go: uses structuredSummary.Summary.OverallSummary (line 92) before assigning to m.evalState.Summary
  • Error path in app.go still assigns a string to m.evalState.Summary
  • report_view.go and eval_view.go continue to treat m.evalState.Summary as a string for rendering

No remaining changes required.

packages/tui/internal/tui/report_view.go (2)

105-112: Request: share the viewport and container setup (the lines immediately before 105, where viewportHeight is computed and contentArea is initialized) so we can confirm the height passed to lipgloss.Place matches the container.


74-75: No changes required for BorderBackground usage

Style.BorderBackground(...) is available in lipgloss/v2 v2.0.0-beta.3, so the existing call is valid.

packages/tui/internal/tui/app.go (1)

11-13: I've added steps to locate the module file and verify that only /v2 imports are used across the codebase.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
packages/tui/internal/tui/evaluation.go (2)

423-428: Strengthen URL validation (ParseRequestURI + scheme/host check)

url.Parse accepts many invalid inputs; enforce scheme/host.

- // Validate URLs
- if _, err := url.Parse(serverURL); err != nil {
-   return nil, nil, fmt.Errorf("invalid server url: %w", err)
- }
- if _, err := url.Parse(agentURL); err != nil {
-   return nil, nil, fmt.Errorf("invalid agent url: %w", err)
- }
+ // Validate URLs strictly (require scheme + host)
+ if u, err := url.ParseRequestURI(serverURL); err != nil || u.Scheme == "" || u.Host == "" {
+   return nil, nil, fmt.Errorf("invalid server url: %q", serverURL)
+ }
+ if u, err := url.ParseRequestURI(agentURL); err != nil || u.Scheme == "" || u.Host == "" {
+   return nil, nil, fmt.Errorf("invalid agent url: %q", agentURL)
+ }

468-479: Nil deref on qualifireAPIKey; hardcoded Qualifire URL

Dereferencing a nil *string will panic. Also defaulting to localhost is risky in prod builds.

-  summaryReq := map[string]interface{}{
-    "model":   model,
-    "api_key": apiKey,
-    "results": map[string]interface{}{
-      "results": job.Results,
-    },
-    "job_id":            jobID,
-    "qualifire_api_key": *qualifireAPIKey,
-    "qualifire_url":     "http://localhost:3000",
-  }
+  summaryReq := map[string]interface{}{
+    "model":   model,
+    "api_key": apiKey,
+    "results": map[string]interface{}{"results": job.Results},
+    "job_id":  jobID,
+    // Prefer production default; make configurable if needed.
+    "qualifire_url": "https://app.qualifire.ai",
+  }
+  if qualifireAPIKey != nil && *qualifireAPIKey != "" {
+    summaryReq["qualifire_api_key"] = *qualifireAPIKey
+  }

Optional: also treat empty results as no-op instead of error:

- if job.Results == nil {
+ if len(job.Results) == 0 {
   return nil, fmt.Errorf("no results available for job %s", jobID)
 }

Also applies to: 507-514

sdks/python/rogue_sdk/types.py (1)

431-442: Unify Qualifire URL default across SDKs
Detected two hardcoded defaults in Python (sdks/python/rogue_sdk/types.py lines 441 & 510) set to "https://app.qualifire.ai" and a mismatched Go default ("http://localhost:3000" in packages/tui/internal/tui/evaluation.go line 477). Extract qualifire_url into a shared config or environment variable to prevent drift.

rogue/evaluator_agent/evaluator_agent.py (1)

361-369: Tool signature mismatch breaks FunctionTool invocation.

The LLM instructions call _log_evaluation with 4 params, but the function requires a 5th (scenario_type). This will raise at runtime. Make scenario_type optional with a default or remove it.

-    def _log_evaluation(
-        self,
-        scenario: dict[str, str],
-        context_id: str,
-        evaluation_passed: bool,
-        reason: str,
-        scenario_type: Optional[str],
-    ) -> None:
+    def _log_evaluation(
+        self,
+        scenario: dict[str, str],
+        context_id: str,
+        evaluation_passed: bool,
+        reason: str,
+        scenario_type: Optional[str] = None,
+    ) -> None:
@@
                 "evaluation_passed (from agent)": evaluation_passed,
                 "reason (from agent)": reason,
-                "scenario_type": scenario_type,
+                "scenario_type": scenario_type,
             },
         )

Option B (safer): drop the extra param entirely and read type from the parsed scenario.

Also applies to: 384-399

♻️ Duplicate comments (4)
packages/tui/internal/tui/report_view.go (1)

61-67: Normalize viewport sizing; use one source of truth (and add guards).

Inconsistent subtractions (-2, -8) cause clipping/off-by-one scroll issues and misaligned centering. Compute width/height once, guard for small terminals, and reuse those exact values for SetSize and Style.Height.

-	viewportWidth := m.width - 8   // Leave margins
-	viewportHeight := m.height - 8 // title(3) + help(1) + margins(4)
+	viewportWidth := m.width - 8   // leave margins
+	viewportHeight := m.height - 4 // title(3) + help(1)
+	if viewportWidth < 20 {
+		viewportWidth = 20
+	}
+	if viewportHeight < 5 {
+		viewportHeight = 5
+	}
@@
-	viewport.SetSize(viewportWidth, viewportHeight-2)
+	viewport.SetSize(viewportWidth, viewportHeight)
@@
-		Height(viewportHeight - 8).
+		Height(viewportHeight).
@@
-		Width(viewportWidth).
-		Height(viewportHeight-8).
+		Width(viewportWidth).
+		Height(viewportHeight).
@@
-		Height(viewportHeight - 8).
+		Height(viewportHeight).
@@
-			viewportHeight-8,
+			viewportHeight,

Also applies to: 71-76, 81-83, 106-113

packages/tui/internal/tui/app.go (1)

90-114: Fix panic risk (err not checked) and avoid O(n²) string concat; only emit non-empty sections.

Dereferencing structuredSummary before checking err can panic on failures; repeated + concatenation is inefficient.

-		structuredSummary, err := sdk.GenerateSummary(ctx, m.evalState.JobID, judgeModel, apiKey, &m.config.QualifireAPIKey)
-
-		overallSummary := structuredSummary.Summary.OverallSummary
-		keyFindings := structuredSummary.Summary.KeyFindings
-		parsedKeyFindings := ""
-		for _, finding := range keyFindings {
-			parsedKeyFindings += "- " + finding + "\n"
-		}
-		recommendations := structuredSummary.Summary.Recommendations
-		parsedRecommendations := ""
-		for _, recommendation := range recommendations {
-			parsedRecommendations += "- " + recommendation + "\n"
-		}
-
-		detailedBreakdown := structuredSummary.Summary.DetailedBreakdown
-		parsedDetailedBreakdown := ""
-		for _, breakdown := range detailedBreakdown {
-			parsedDetailedBreakdown += "- " + breakdown.Scenario + " - " + breakdown.Status + " - " + breakdown.Outcome + "\n"
-		}
-
-		summary := "## Overall Summary\n\n" + overallSummary +
-			"\n\n" + "## Key Findings\n\n" + parsedKeyFindings +
-			"\n\n" + "## Recommendations\n\n" + parsedRecommendations +
-			"\n\n" + "## Detailed Breakdown\n\n" + parsedDetailedBreakdown
+		structuredSummary, err := sdk.GenerateSummary(ctx, m.evalState.JobID, judgeModel, apiKey, &m.config.QualifireAPIKey)
+		if err != nil {
+			return SummaryGeneratedMsg{Summary: "", Err: err}
+		}
+
+		sum := structuredSummary.Summary
+		var b strings.Builder
+		if sum.OverallSummary != "" {
+			b.WriteString("## Overall Summary\n\n")
+			b.WriteString(sum.OverallSummary)
+			b.WriteString("\n\n")
+		}
+		if len(sum.KeyFindings) > 0 {
+			b.WriteString("## Key Findings\n\n")
+			for _, f := range sum.KeyFindings {
+				if f == "" {
+					continue
+				}
+				b.WriteString("- ")
+				b.WriteString(f)
+				b.WriteByte('\n')
+			}
+			b.WriteByte('\n')
+		}
+		if len(sum.Recommendations) > 0 {
+			b.WriteString("## Recommendations\n\n")
+			for _, r := range sum.Recommendations {
+				if r == "" {
+					continue
+				}
+				b.WriteString("- ")
+				b.WriteString(r)
+				b.WriteByte('\n')
+			}
+			b.WriteByte('\n')
+		}
+		if len(sum.DetailedBreakdown) > 0 {
+			b.WriteString("## Detailed Breakdown\n\n")
+			for _, br := range sum.DetailedBreakdown {
+				b.WriteString("- ")
+				b.WriteString(br.Scenario)
+				b.WriteString(" - ")
+				b.WriteString(br.Status)
+				b.WriteString(" - ")
+				b.WriteString(br.Outcome)
+				b.WriteByte('\n')
+			}
+		}
+		summary := strings.TrimSpace(b.String())
sdks/python/rogue_sdk/types.py (1)

264-275: Log timestamp parse failures instead of silent fallback

Retain current fallback but log once per failure for observability.

-            except (ValueError, AttributeError):
-                timestamp = datetime.now(timezone.utc)
+            except (ValueError, AttributeError) as exc:
+                logger.warning(
+                    f"Failed to parse timestamp: {msg.timestamp!r}; using now()",
+                    exc_info=exc,
+                )
+                timestamp = datetime.now(timezone.utc)
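The `replace("Z", "+00:00")` dance exists because `datetime.fromisoformat()` only accepts a trailing `Z` from Python 3.11 on; a self-contained sketch of the parse-with-fallback (the helper name is illustrative):

```python
from datetime import datetime, timezone


def parse_ts(raw: str) -> datetime:
    """Parse an ISO-8601 timestamp, tolerating a trailing 'Z'.

    On Python < 3.11, fromisoformat() rejects 'Z', so it is rewritten
    to '+00:00' first; any failure falls back to now() in UTC.
    """
    try:
        return datetime.fromisoformat(raw.replace("Z", "+00:00"))
    except (ValueError, AttributeError):
        return datetime.now(timezone.utc)


ts = parse_ts("2025-09-04T12:30:00Z")
assert ts.tzinfo is not None and ts.utcoffset().total_seconds() == 0
```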
rogue/server/services/api_format_service.py (1)

51-61: Normalize message timestamps to timezone-aware UTC (repeat of prior review).

A naive parsed datetime leaks through when msg.timestamp lacks tzinfo. Always coerce to aware UTC.

Apply this diff:

-                timestamp = datetime.now(timezone.utc)
-                if msg.timestamp:
-                    try:
-                        if isinstance(msg.timestamp, str):
-                            timestamp = datetime.fromisoformat(
-                                msg.timestamp.replace("Z", "+00:00"),
-                            )
-                        else:
-                            timestamp = msg.timestamp
-                    except (ValueError, AttributeError):
-                        timestamp = datetime.now(timezone.utc)
+                timestamp = datetime.now(timezone.utc)
+                if msg.timestamp:
+                    try:
+                        if isinstance(msg.timestamp, str):
+                            parsed = datetime.fromisoformat(
+                                msg.timestamp.replace("Z", "+00:00"),
+                            )
+                            timestamp = (
+                                parsed.replace(tzinfo=timezone.utc)
+                                if parsed.tzinfo is None
+                                else parsed.astimezone(timezone.utc)
+                            )
+                        elif isinstance(msg.timestamp, datetime):
+                            timestamp = (
+                                msg.timestamp.replace(tzinfo=timezone.utc)
+                                if msg.timestamp.tzinfo is None
+                                else msg.timestamp.astimezone(timezone.utc)
+                            )
+                        else:
+                            timestamp = datetime.now(timezone.utc)
+                    except (ValueError, AttributeError, TypeError):
+                        timestamp = datetime.now(timezone.utc)
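The aware/naive branches in the diff above reduce to one small helper; a sketch of the coercion on its own, assuming naive values are meant as UTC:

```python
from datetime import datetime, timezone


def to_aware_utc(ts: datetime) -> datetime:
    """Coerce any datetime to timezone-aware UTC.

    Naive values are assumed to already represent UTC and just get
    tagged; aware values are converted via astimezone().
    """
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)


naive = datetime(2025, 9, 4, 12, 0, 0)
assert to_aware_utc(naive).tzinfo is timezone.utc
```

Funneling every timestamp through one helper like this avoids the mixed naive/aware comparisons that raise TypeError downstream.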
🧹 Nitpick comments (21)
packages/tui/internal/tui/report_view.go (1)

25-25: Main container should fill full height.

Using Height(m.height - 12) can leave a background gap. Prefer full height.

-		Height(m.height - 12).
+		Height(m.height).
packages/tui/internal/tui/app.go (1)

115-118: Propagate nil error when summary builds successfully.

After the refactor above, return Err: nil to avoid implying partial failure.

-		return SummaryGeneratedMsg{
-			Summary: summary,
-			Err:     err,
-		}
+		return SummaryGeneratedMsg{Summary: summary, Err: nil}
packages/tui/internal/tui/eval_ui.go (2)

38-49: Harden parsing: default scenario_type when missing; minor robustness.

Provide a sane default when scenario_type is empty to match SDK defaults (e.g., "policy").

-				for _, s := range v.Scenarios {
+				for _, s := range v.Scenarios {
+					st := s.ScenarioType
+					if st == "" {
+						st = "policy"
+					}
 					if s.Scenario != "" {
 						out = append(out, EvalScenario{
 							Scenario:        s.Scenario,
-							ScenarioType:    ScenarioType(s.ScenarioType),
+							ScenarioType:    ScenarioType(st),
 							ExpectedOutcome: s.ExpectedOutcome,
 						})
 					}
 				}

Also applies to: 52-52, 56-59


39-41: Handle Getwd error and return empty slice instead of nil.

Avoids surprises for callers and early-exits cleanly.

-	wd, _ := os.Getwd()
+	wd, err := os.Getwd()
+	if err != nil {
+		return []EvalScenario{}
+	}
@@
-	return nil
+	return []EvalScenario{}

Also applies to: 71-71

rogue/server/services/__init__.py (1)

2-9: Alphabetize imports to satisfy isort/flake8.

Keep submodule imports sorted lexicographically.

 from . import (
-    api_format_service,
-    evaluation_library,
-    evaluation_service,
-    interviewer_service,
-    llm_service,
-    scenario_evaluation_service,
-    qualifire_service,
+    api_format_service,
+    evaluation_library,
+    evaluation_service,
+    interviewer_service,
+    llm_service,
+    qualifire_service,
+    scenario_evaluation_service,
 )
rogue/evaluator_agent/evaluator_agent.py (1)

120-126: Keep tool docstring in sync with callable signature.

If you keep scenario_type as a parameter, reflect it in the instructions; otherwise, remove it from the signature and leave the docstring as-is.

-3. `_log_evaluation(scenario: dict, context_id: str, evaluation_passed: bool, reason: str)`
+3. `_log_evaluation(scenario: dict, context_id: str, evaluation_passed: bool, reason: str, scenario_type: Optional[str] = None)`
rogue/server/services/qualifire_service.py (1)

10-13: Add explicit return type for report_summary.

Annotate return for mypy and readability.

-    def report_summary(
+    def report_summary(
         request: ReportSummaryRequest,
         evaluation_result: EvaluationResult,
-    ):
+    ) -> dict:
rogue/run_cli.py (3)

191-194: Plumb Qualifire key through CLI/config.

You added qualifire_api_key_secret to create_report but it isn’t sourced from CLI/config here. Ensure CLIInput/args support and forward it.


220-222: Specify encoding when writing the report.

Avoid locale-dependent defaults.

-    output_report_file.write_text(summary)
+    output_report_file.write_text(summary, encoding="utf-8")

361-363: Forward Qualifire key to create_report if available.

Pipe the secret from CLIInput/config to create_report.

rogue/server/models/api_format.py (2)

10-12: Import Field/ConfigDict for aliases and stricter validation

To support aliasing (camelCase in JSON) and extra="forbid" configs in models below, extend imports.

-from pydantic import BaseModel
+from pydantic import BaseModel, Field, ConfigDict

14-20: Forbid unknown fields and require timezone-aware timestamps

  • Add extra="forbid" to block unexpected data.
  • Enforce/verify timestamps are timezone-aware (UTC) to avoid ambiguity.
 class ApiChatMessage(BaseModel):
     """Chat message for new API format with datetime timestamp."""
 
+    model_config = ConfigDict(extra="forbid")
     role: str
     content: str
     timestamp: datetime

Would you like me to add a validator that rejects naive datetimes, or is upstream guaranteed to send RFC 3339 with Z?

rogue/server/services/llm_service.py (4)

254-259: Harden code-fence stripping (handles case/whitespace/newlines)

Use a single regex to remove leading/trailing fences in more edge cases.

-            # Remove markdown code blocks if present
-            if content.startswith("```json"):
-                content = content[7:]
-            if content.endswith("```"):
-                content = content[:-3]
-            content = content.strip()
+            # Remove leading/trailing ```/```json fences robustly
+            content = re.sub(
+                r'^\s*```(?:json)?\s*|\s*```\s*$',
+                '',
+                content.strip(),
+                flags=re.IGNORECASE,
+            ).strip()

(Remember to import re at the top.)

-import json
+import json
+import re
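The suggested regex handles the edge cases the startswith/endswith pair misses (uppercase fences, stray whitespace, fence-only input); a runnable sketch with the same pattern:

```python
import re


def strip_fences(content: str) -> str:
    """Remove a leading ```/```json fence and a trailing ``` fence."""
    return re.sub(
        r"^\s*```(?:json)?\s*|\s*```\s*$",
        "",
        content.strip(),
        flags=re.IGNORECASE,
    ).strip()


assert strip_fences('```json\n{"ok": true}\n```') == '{"ok": true}'
assert strip_fences("```JSON\n{}\n```") == "{}"
assert strip_fences("{}") == "{}"
```

The `^`/`$` anchors (no MULTILINE flag) keep the substitution at the string edges, so fenced snippets inside the JSON payload itself are left alone.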

261-263: Prefer model_validate for clarity

Minor readability: use Pydantic’s validator for dict input.

-            summary_data = json.loads(content)
-            return StructuredSummary(**summary_data)
+            summary_data = json.loads(content)
+            return StructuredSummary.model_validate(summary_data)

156-163: Externalize STATIC_SCENARIOS to config

Hardcoding dataset ids/sample sizes makes deployments rigid. Consider moving to a config/env flag or settings module.


244-248: Enforce JSON-mode in LiteLLM completions
Add response_format={'type': 'json_object'} to the completion call (LiteLLM supports this natively) and guard it with supports_response_schema(model) or via litellm.get_supported_openai_params() to ensure the target provider honors JSON schemas.

File: rogue/server/services/llm_service.py
@@ lines 244-248
             response = completion(
                 model=model,
                 messages=messages,
                 api_key=api_key,
+                response_format={'type': 'json_object'},
             )
sdks/python/rogue_sdk/sdk.py (1)

308-343: Improve Markdown formatting for detailed_breakdown

Current f-string prints raw dicts. Format rows for readability and guard non-dict entries.

-        if hasattr(structured_summary, "overall_summary"):
+        if hasattr(structured_summary, "overall_summary"):
             # Format as markdown string for UI display
@@
-            if structured_summary.detailed_breakdown:
-                breakdown = "\n".join(
-                    f"{i + 1}. {row}"
-                    for i, row in enumerate(structured_summary.detailed_breakdown)
-                )
-                summary_parts.append(f"\n---\n\n## Detailed Breakdown\n{breakdown}")
+            if structured_summary.detailed_breakdown:
+                def _fmt_row(i, row):
+                    if isinstance(row, dict):
+                        scenario = row.get("scenario", "N/A")
+                        status = row.get("status", "")
+                        outcome = row.get("outcome", "")
+                        return f"{i + 1}. {status} {scenario} — {outcome}"
+                    return f"{i + 1}. {row}"
+                breakdown = "\n".join(
+                    _fmt_row(i, row)
+                    for i, row in enumerate(structured_summary.detailed_breakdown)
+                )
+                summary_parts.append(f"\n---\n\n## Detailed Breakdown\n{breakdown}")
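Extracted as a standalone function, the row formatter is trivially testable; this sketch mirrors the suggested `_fmt_row` (key names match the diff, a hyphen stands in for the dash separator):

```python
def fmt_row(i: int, row) -> str:
    """Render one detailed-breakdown entry as a numbered Markdown line.

    Dict rows get a readable "status scenario - outcome" layout;
    anything else falls back to its plain string form.
    """
    if isinstance(row, dict):
        scenario = row.get("scenario", "N/A")
        status = row.get("status", "")
        outcome = row.get("outcome", "")
        return f"{i + 1}. {status} {scenario} - {outcome}"
    return f"{i + 1}. {row}"


rows = [
    {"scenario": "prompt injection", "status": "PASS", "outcome": "refused"},
    "legacy plain-text row",
]
lines = [fmt_row(i, r) for i, r in enumerate(rows)]
assert lines[0] == "1. PASS prompt injection - refused"
assert lines[1] == "2. legacy plain-text row"
```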
rogue/server/services/api_format_service.py (1)

39-41: Coerce start_time to aware UTC.

If a naive datetime is supplied, normalize to UTC to avoid mixed-aware/naive issues.

Apply this diff:

     if start_time is None:
         start_time = datetime.now(timezone.utc)
+    else:
+        start_time = (
+            start_time.replace(tzinfo=timezone.utc)
+            if start_time.tzinfo is None
+            else start_time.astimezone(timezone.utc)
+        )
rogue/server/api/llm.py (3)

126-129: Reduce sensitive/verbose logging of summary/results.

Avoid dumping full summary and entire results; log counts/metadata instead.

Apply this diff:

-            logger.info(
-                "Summary",
-                extra={"summary": summary, "results": request.results},
-            )
+            logger.info(
+                "Summary generated",
+                extra={
+                    "key_findings_count": len(getattr(summary, "key_findings", []) or []),
+                    "results_count": len(request.results.results),
+                },
+            )

136-141: Ensure start_time is timezone-aware UTC when using job.created_at.

job.created_at may be naive; normalize before sending to Qualifire.

Apply this diff:

-                    start_time=(
-                        job.created_at
-                        if job is not None
-                        else datetime.now(timezone.utc)
-                    ),
+                    start_time=(
+                        (
+                            job.created_at.replace(tzinfo=timezone.utc)
+                            if job.created_at.tzinfo is None
+                            else job.created_at.astimezone(timezone.utc)
+                        )
+                        if job is not None and getattr(job, "created_at", None)
+                        else datetime.now(timezone.utc)
+                    ),

90-95: Avoid blocking the event loop with sync LLM calls.

LLMService.generate_summary_from_results is sync; offload to a threadpool.

You can replace the call with:

from fastapi.concurrency import run_in_threadpool
summary = await run_in_threadpool(
    LLMService.generate_summary_from_results,
    model=request.model,
    results=request.results,
    llm_provider_api_key=request.api_key,
)
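FastAPI's `run_in_threadpool` is a thin wrapper over the same idea as the standard library's `asyncio.to_thread`; a framework-free sketch of the offloading pattern (`slow_summary` is a stand-in for the synchronous LLM call, not a real API):

```python
import asyncio
import time


def slow_summary(model: str) -> str:
    # Stand-in for a blocking, synchronous LLM call.
    time.sleep(0.1)
    return f"summary from {model}"


async def handler() -> str:
    # Offload the blocking call so the event loop keeps serving other requests.
    return await asyncio.to_thread(slow_summary, "gpt-4o-mini")


result = asyncio.run(handler())
print(result)  # summary from gpt-4o-mini
```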
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Lite

📥 Commits

Reviewing files that changed from the base of the PR and between 6e0d9ee and 45c037f.

📒 Files selected for processing (18)
  • packages/tui/internal/tui/app.go (4 hunks)
  • packages/tui/internal/tui/eval_ui.go (2 hunks)
  • packages/tui/internal/tui/evaluation.go (6 hunks)
  • packages/tui/internal/tui/report_view.go (3 hunks)
  • rogue/evaluator_agent/evaluator_agent.py (3 hunks)
  • rogue/run_cli.py (3 hunks)
  • rogue/server/api/__init__.py (1 hunks)
  • rogue/server/api/evaluation.py (1 hunks)
  • rogue/server/api/llm.py (4 hunks)
  • rogue/server/models/api_format.py (1 hunks)
  • rogue/server/services/__init__.py (1 hunks)
  • rogue/server/services/api_format_service.py (1 hunks)
  • rogue/server/services/llm_service.py (6 hunks)
  • rogue/server/services/qualifire_service.py (1 hunks)
  • rogue/ui/components/scenario_runner.py (1 hunks)
  • sdks/python/rogue_sdk/client.py (4 hunks)
  • sdks/python/rogue_sdk/sdk.py (7 hunks)
  • sdks/python/rogue_sdk/types.py (8 hunks)
✅ Files skipped from review due to trivial changes (1)
  • rogue/server/api/__init__.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • rogue/ui/components/scenario_runner.py
🧰 Additional context used
📓 Path-based instructions (2)
{rogue,tests,examples}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

{rogue,tests,examples}/**/*.py: Format Python code with Black
Organize imports according to isort conventions
Use type hints for all function signatures
Follow PEP 8 naming (snake_case for variables and functions, PascalCase for classes)
Use try/except blocks for code that may raise exceptions
Ensure code passes flake8 linting
Ensure code passes mypy type checks per .mypy.ini
Address Bandit-reported security issues in Python code

Files:

  • rogue/server/services/__init__.py
  • rogue/server/services/qualifire_service.py
  • rogue/server/services/api_format_service.py
  • rogue/server/models/api_format.py
  • rogue/server/api/evaluation.py
  • rogue/evaluator_agent/evaluator_agent.py
  • rogue/server/services/llm_service.py
  • rogue/run_cli.py
  • rogue/server/api/llm.py
rogue/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Place application/package code under the rogue/ directory

Files:

  • rogue/server/services/__init__.py
  • rogue/server/services/qualifire_service.py
  • rogue/server/services/api_format_service.py
  • rogue/server/models/api_format.py
  • rogue/server/api/evaluation.py
  • rogue/evaluator_agent/evaluator_agent.py
  • rogue/server/services/llm_service.py
  • rogue/run_cli.py
  • rogue/server/api/llm.py
🧬 Code graph analysis (13)
rogue/server/services/qualifire_service.py (2)
rogue/server/services/api_format_service.py (1)
  • convert_with_structured_summary (116-141)
sdks/python/rogue_sdk/client.py (1)
  • request (79-83)
rogue/server/services/api_format_service.py (1)
rogue/server/models/api_format.py (4)
  • ApiChatMessage (14-19)
  • ApiConversationEvaluation (22-27)
  • ApiEvaluationResult (40-49)
  • ApiScenarioResult (30-37)
packages/tui/internal/tui/app.go (3)
sdks/python/rogue_sdk/types.py (1)
  • Scenario (94-133)
packages/sdk/src/types.ts (1)
  • Scenario (41-47)
packages/tui/internal/components/viewport.go (1)
  • NewViewport (74-93)
rogue/server/models/api_format.py (1)
sdks/python/rogue_sdk/types.py (1)
  • StructuredSummary (444-450)
packages/tui/internal/tui/eval_ui.go (3)
sdks/python/rogue_sdk/types.py (3)
  • Scenarios (136-154)
  • Scenario (94-133)
  • ScenarioType (45-49)
packages/tui/internal/tui/evaluation.go (2)
  • EvalScenario (44-48)
  • ScenarioType (27-27)
packages/sdk/src/types.ts (1)
  • Scenario (41-47)
rogue/server/services/llm_service.py (1)
sdks/python/rogue_sdk/types.py (5)
  • EvaluationResults (192-208)
  • Scenario (94-133)
  • Scenarios (136-154)
  • ScenarioType (45-49)
  • StructuredSummary (444-450)
sdks/python/rogue_sdk/client.py (1)
sdks/python/rogue_sdk/types.py (3)
  • StructuredSummary (444-450)
  • SummaryGenerationResponse (453-457)
  • SummaryGenerationRequest (431-441)
sdks/python/rogue_sdk/sdk.py (3)
sdks/python/rogue_sdk/types.py (1)
  • StructuredSummary (444-450)
rogue/server/api/llm.py (1)
  • generate_summary (72-158)
sdks/python/rogue_sdk/client.py (1)
  • generate_summary (147-194)
sdks/python/rogue_sdk/types.py (2)
rogue/server/models/api_format.py (4)
  • ApiChatMessage (14-19)
  • ApiConversationEvaluation (22-27)
  • ApiScenarioResult (30-37)
  • ApiEvaluationResult (40-49)
rogue/server/services/api_format_service.py (1)
  • convert_to_api_format (20-113)
rogue/run_cli.py (3)
rogue/server/api/llm.py (1)
  • generate_summary (72-158)
sdks/python/rogue_sdk/client.py (1)
  • generate_summary (147-194)
sdks/python/rogue_sdk/sdk.py (1)
  • generate_summary (287-342)
packages/tui/internal/tui/evaluation.go (3)
sdks/python/rogue_sdk/types.py (2)
  • Scenario (94-133)
  • ScenarioType (45-49)
packages/sdk/src/types.ts (1)
  • Scenario (41-47)
sdks/python/rogue_sdk/sdk.py (1)
  • RogueSDK (34-391)
rogue/server/api/llm.py (9)
sdks/python/rogue_sdk/types.py (3)
  • SummaryGenerationRequest (431-441)
  • ReportSummaryResponse (513-516)
  • ReportSummaryRequest (501-510)
rogue/server/api/evaluation.py (1)
  • get_evaluation_service (23-24)
rogue/server/services/evaluation_service.py (2)
  • EvaluationService (20-234)
  • get_job (31-33)
rogue/server/models/api_format.py (1)
  • ServerSummaryGenerationResponse (52-56)
rogue/common/logging/config.py (1)
  • get_logger (155-167)
rogue/server/services/llm_service.py (1)
  • LLMService (166-283)
rogue/server/services/qualifire_service.py (2)
  • QualifireService (8-40)
  • report_summary (10-40)
sdks/python/rogue_sdk/client.py (2)
  • generate_summary (147-194)
  • request (79-83)
sdks/python/rogue_sdk/sdk.py (1)
  • generate_summary (287-342)
packages/tui/internal/tui/report_view.go (1)
packages/tui/internal/styles/styles.go (3)
  • NewStyle (45-47)
  • Border (23-23)
  • Style (40-42)
🔇 Additional comments (23)
packages/tui/internal/tui/app.go (4)

160-174: detailedBreakdown field addition looks good.

The field's type uses the new import, so there is no unused-import build issue.


269-269: Report viewport initial height changed to 15.

Verify this still fits the new layout; otherwise align with standardized sizing or compute dynamically.


11-13: Import path confirmed
The import path github.com/charmbracelet/bubbles/v2/table is correct for Bubbles v2 according to pkg.go.dev and the official Charmbracelet repo.


382-387: Incorrect sizing refactor suggestion
The report view in report_view.go uses viewport.SetSize(viewportWidth, viewportHeight-2), not msg.Width-8/msg.Height-4, and other components apply context-specific offsets (e.g., height-12 for headers, –4 for help/interview). Switching to the proposed diff will break those layouts. Instead, extract and use shared constants for horizontal and vertical margins if you need consistent sizing.

Likely an incorrect or invalid review comment.

packages/tui/internal/tui/eval_ui.go (1)

17-17: Shift to structured scenarios is solid.

Aligns UI with API/SDK models.

packages/tui/internal/tui/evaluation.go (4)

45-47: Adds expected_outcome to scenarios — OK

JSON tag + omitempty are correct; aligns with SDK/server contracts.


96-108: Structured summary response type — OK

Fields map cleanly to the server’s StructuredSummary shape.


447-450: Scenario conversion — OK

Scenario, ScenarioType, ExpectedOutcome are forwarded correctly.


419-421: Signature change verified. The TUI caller in eval_ui.go has been updated to pass []EvalScenario; no further updates required.

sdks/python/rogue_sdk/types.py (4)

80-81: AgentConfig.qualifire_api_key — OK

Matches new flows; optional and safe default.


389-391: EvaluationJob.deep_test/judge_model — OK

Captures flags required downstream in reports.


444-457: StructuredSummary model — OK

Field names and shapes match the response used by clients.


501-517: Report summary models — OK

Surface looks minimal and sufficient for BFF reporting.

rogue/server/api/evaluation.py (1)

62-64: Persist deep_test and judge_model on job — OK

Enables downstream formatting/reporting without re-deriving flags.

sdks/python/rogue_sdk/client.py (3)

147-156: API surface extension looks good; confirm server supports new fields.

Params (qualifire_api_key, job_id, deep_test, judge_model) align with SummaryGenerationRequest. OK.


175-194: Structured response handling LGTM.

Good fallback to legacy string; preserves backward compatibility.


269-270: Minor style change is fine.

Trailing comma in exception message is harmless and Black-compatible.

rogue/server/models/api_format.py (2)

52-56: LGTM: ServerSummaryGenerationResponse aligns with SDK types

Returning a StructuredSummary here matches the client expectations and reduces ad-hoc parsing downstream.


43-47: No changes needed: keyFindings is intentionally a string for the API contract
The api_format_service.py joins StructuredSummary.key_findings into a single string and the existing tests assert this behavior, so retaining Optional[str] aligns with the legacy BFF shape.

rogue/server/services/llm_service.py (1)

173-185: Docstring/API contract improvements look good

Clearer args/returns/raises improve maintainability. No action needed.

sdks/python/rogue_sdk/sdk.py (2)

118-119: LGTM: clearer error message with no semantic change

Minor formatting-only tweak reads better and stays lint-clean.


292-297: Call sites updated for the new Tuple return signature. Callers in rogue/run_cli.py and rogue/ui/components/scenario_runner.py already unpack both values; no further action required.

rogue/server/api/llm.py (1)

178-189: Report only the first evaluation result. results[0] ignores all other results in the job. Confirm the signature of QualifireService.report_summary: if it should handle multiple results, either loop over job.results and call report_summary for each, or pass the full EvaluationResults container instead of just the first element.

Comment on lines +98 to +101
logger.info(
"Qualifire API key",
extra={"qualifire_api_key": request.qualifire_api_key},
)

⚠️ Potential issue

Do not log secrets (qualifire_api_key).

This logs a raw API key. Remove or only log presence.

Apply this diff:

-        logger.info(
-            "Qualifire API key",
-            extra={"qualifire_api_key": request.qualifire_api_key},
-        )
+        logger.info(
+            "Qualifire API key provided",
+            extra={"present": bool(request.qualifire_api_key)},
+        )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
logger.info(
"Qualifire API key",
extra={"qualifire_api_key": request.qualifire_api_key},
)
logger.info(
"Qualifire API key provided",
extra={"present": bool(request.qualifire_api_key)},
)
🤖 Prompt for AI Agents
In rogue/server/api/llm.py around lines 98 to 101, the code logs the raw
qualifire_api_key which is a secret; remove the secret from logs and instead log
only its presence or a boolean/masked value. Replace the logger.info call so it
does not include request.qualifire_api_key (e.g., log "qualifire_api_key
present": True or "qualifire_api_key": "*****"), or remove the log entirely if
not needed, ensuring no secret value is ever written to logs.
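A reusable masking helper makes the safe pattern easy to apply consistently; a minimal sketch (the `mask_secret` name is illustrative):

```python
from typing import Optional


def mask_secret(value: Optional[str], visible: int = 4) -> str:
    """Mask a secret for logging, keeping only a short suffix for correlation."""
    if not value:
        return "<absent>"
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


print(mask_secret("qk_live_1234567890abcd"))  # mostly stars, ends in "abcd"
print(mask_secret(None))                      # <absent>
```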

Comment on lines +30 to +38
class ApiScenarioResult(BaseModel):
"""Result of evaluating a single scenario in new API format."""

description: Optional[str] = None
expectedOutcome: Optional[str] = None
totalConversations: Optional[int] = None
flaggedConversations: Optional[int] = None
conversations: List[ApiConversationEvaluation]


🛠️ Refactor suggestion

Use snake_case fields with serialization aliases to meet PEP 8 while emitting camelCase JSON

Internal snake_case keeps mypy/flake8 happy and avoids mixed styles. Emit camelCase via serialization_alias so the wire format is unchanged.

 class ApiScenarioResult(BaseModel):
     """Result of evaluating a single scenario in new API format."""
 
-    description: Optional[str] = None
-    expectedOutcome: Optional[str] = None
-    totalConversations: Optional[int] = None
-    flaggedConversations: Optional[int] = None
-    conversations: List[ApiConversationEvaluation]
+    model_config = ConfigDict(extra="forbid")
+    description: Optional[str] = None
+    expected_outcome: Optional[str] = Field(None, serialization_alias="expectedOutcome")
+    total_conversations: Optional[int] = Field(None, serialization_alias="totalConversations")
+    flagged_conversations: Optional[int] = Field(None, serialization_alias="flaggedConversations")
+    conversations: List[ApiConversationEvaluation] = Field(default_factory=list)
 
 class ApiEvaluationResult(BaseModel):
     """New API format for evaluation results."""
 
-    scenarios: List[ApiScenarioResult]
-    summary: Optional[str] = None
-    keyFindings: Optional[str] = None
-    recommendation: Optional[str] = None
-    deepTest: bool = False
-    startTime: datetime
-    judgeModel: Optional[str] = None
+    model_config = ConfigDict(extra="forbid")
+    scenarios: List[ApiScenarioResult]
+    summary: Optional[str] = None
+    key_findings: Optional[str] = Field(None, serialization_alias="keyFindings")
+    recommendation: Optional[str] = None
+    deep_test: bool = Field(False, serialization_alias="deepTest")
+    start_time: datetime = Field(..., serialization_alias="startTime")
+    judge_model: Optional[str] = Field(None, serialization_alias="judgeModel")

Note: This preserves the API surface (camelCase) while making codebase PEP 8 compliant. You may need to update construction sites to use snake_case names (aliases still serialize to camelCase). I can bulk-update call sites if helpful.

Also applies to: 40-49
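The alias mapping that suggestion relies on is mechanical; independent of Pydantic, the snake_case-to-camelCase serialization step can be sketched as (helper names are illustrative):

```python
def to_camel(name: str) -> str:
    """Convert a snake_case identifier to camelCase."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def dump_camel(data: dict) -> dict:
    """Serialize an internal snake_case dict to the camelCase wire format."""
    return {to_camel(k): v for k, v in data.items()}


internal = {"expected_outcome": "refuses", "total_conversations": 3}
print(dump_camel(internal))  # {'expectedOutcome': 'refuses', 'totalConversations': 3}
```

Pydantic v2's `serialization_alias` performs this mapping declaratively, so internal code stays PEP 8 compliant while the JSON surface is unchanged.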

Comment on lines +79 to +88
api_scenarios.append(
ApiScenarioResult(
description=result.scenario.scenario,
expectedOutcome=result.scenario.expected_outcome,
totalConversations=len(api_conversations),
flaggedConversations=len(
[c for c in api_conversations if not c.passed],
),
conversations=api_conversations,
),

⚠️ Potential issue

Don’t pass unknown field expectedOutcome to ApiScenarioResult (guard or remove).

ApiScenarioResult (per models/api_format.py) doesn’t declare expectedOutcome. Passing it will either raise a validation error (if extras=forbid) or be dropped silently. Gate this add at runtime to avoid breakage.

Apply this diff to make it safe:

-        api_scenarios.append(
-            ApiScenarioResult(
-                description=result.scenario.scenario,
-                expectedOutcome=result.scenario.expected_outcome,
-                totalConversations=len(api_conversations),
-                flaggedConversations=len(
-                    [c for c in api_conversations if not c.passed],
-                ),
-                conversations=api_conversations,
-            ),
-        )
+        scenario_kwargs = dict(
+            description=result.scenario.scenario,
+            totalConversations=len(api_conversations),
+            flaggedConversations=len([c for c in api_conversations if not c.passed]),
+            conversations=api_conversations,
+        )
+        expected = getattr(result.scenario, "expected_outcome", None)
+        if (
+            expected is not None
+            and hasattr(ApiScenarioResult, "model_fields")
+            and "expectedOutcome" in ApiScenarioResult.model_fields
+        ):
+            scenario_kwargs["expectedOutcome"] = expected
+
+        api_scenarios.append(ApiScenarioResult(**scenario_kwargs))
📝 Committable suggestion


Suggested change
api_scenarios.append(
ApiScenarioResult(
description=result.scenario.scenario,
expectedOutcome=result.scenario.expected_outcome,
totalConversations=len(api_conversations),
flaggedConversations=len(
[c for c in api_conversations if not c.passed],
),
conversations=api_conversations,
),
# Build kwargs for ApiScenarioResult, omitting unknown fields
scenario_kwargs = dict(
description=result.scenario.scenario,
totalConversations=len(api_conversations),
flaggedConversations=len([c for c in api_conversations if not c.passed]),
conversations=api_conversations,
)
# Only include expectedOutcome if the model actually declares it
expected = getattr(result.scenario, "expected_outcome", None)
if (
expected is not None
and hasattr(ApiScenarioResult, "model_fields")
and "expectedOutcome" in ApiScenarioResult.model_fields
):
scenario_kwargs["expectedOutcome"] = expected
api_scenarios.append(ApiScenarioResult(**scenario_kwargs))
🤖 Prompt for AI Agents
In rogue/server/services/api_format_service.py around lines 79-88, the code is
passing an unknown field expectedOutcome to ApiScenarioResult which will be
rejected or dropped; fix this by removing the expectedOutcome argument (or
alternatively only include it when the ApiScenarioResult dataclass/model
actually defines that field). Concretely, stop passing
expectedOutcome=result.scenario.expected_outcome into the ApiScenarioResult
constructor, or build the kwargs dict dynamically and add expectedOutcome only
if hasattr(ApiScenarioResult, "expectedOutcome") or the model's field list
contains it, then pass that dict to ApiScenarioResult.
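The guard generalizes to a small kwargs filter: keep only keys the target model declares. A minimal sketch using a plain allow-set (in Pydantic v2 the set would come from `Model.model_fields`):

```python
def filter_kwargs(kwargs: dict, allowed: set) -> dict:
    """Drop keys the target model does not declare, avoiding validation errors."""
    return {k: v for k, v in kwargs.items() if k in allowed}


# Fields the hypothetical model declares.
declared = {"description", "totalConversations", "flaggedConversations", "conversations"}
candidate = {
    "description": "prompt injection",
    "expectedOutcome": "agent refuses",  # not declared on the model -> dropped
    "totalConversations": 2,
    "conversations": [],
}
safe = filter_kwargs(candidate, declared)
print(sorted(safe))  # ['conversations', 'description', 'totalConversations']
```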

Comment on lines 18 to 24
api_evaluation_result = convert_with_structured_summary(
evaluation_results=evaluation_result,
structured_summary=request.structured_summary,
deep_test=request.deep_test,
start_time=request.start_time,
judge_model=request.judge_model,
)

⚠️ Potential issue

Wrong type passed to converter (EvaluationResult vs EvaluationResults).

convert_with_structured_summary expects EvaluationResults, but a single EvaluationResult is passed. This will mis-shape the payload.

-from rogue_sdk.types import EvaluationResult, ReportSummaryRequest
+from rogue_sdk.types import EvaluationResult, EvaluationResults, ReportSummaryRequest
@@
-        api_evaluation_result = convert_with_structured_summary(
-            evaluation_results=evaluation_result,
+        api_evaluation_result = convert_with_structured_summary(
+            evaluation_results=EvaluationResults(results=[evaluation_result]),
             structured_summary=request.structured_summary,
             deep_test=request.deep_test,
             start_time=request.start_time,
             judge_model=request.judge_model,
         )
🤖 Prompt for AI Agents
In rogue/server/services/qualifire_service.py around lines 18 to 24, the call
passes a single EvaluationResult (evaluation_result) to
convert_with_structured_summary which expects an EvaluationResults collection;
replace the single-item argument with an EvaluationResults instance (e.g. use
the existing evaluation_results collection variable or construct an
EvaluationResults wrapping the single result) so the converter receives the
correct type and payload shape, and update any variable names accordingly.

Comment on lines +26 to +31
response = requests.post(
f"{request.qualifire_url}/api/rogue/v1/report",
headers={"X-qualifire-key": request.qualifire_api_key},
json=api_evaluation_result.model_dump(mode="json"),
timeout=300,
)

🛠️ Refactor suggestion

Harden network/error handling and avoid json() on error bodies.

requests.post can raise; response.json() may fail for non-JSON errors. Add try/except and prefer response.text on failures.

-        response = requests.post(
-            f"{request.qualifire_url}/api/rogue/v1/report",
-            headers={"X-qualifire-key": request.qualifire_api_key},
-            json=api_evaluation_result.model_dump(mode="json"),
-            timeout=300,
-        )
-
-        if not response.ok:
-            logger.error(
-                "Failed to report summary to Qualifire",
-                extra={"response": response.json()},
-            )
-            raise Exception(f"Failed to report summary to Qualifire: {response.json()}")
+        try:
+            headers = {}
+            if request.qualifire_api_key:
+                headers["X-Qualifire-Key"] = request.qualifire_api_key
+            response = requests.post(
+                f"{request.qualifire_url}/api/rogue/v1/report",
+                headers=headers,
+                json=api_evaluation_result.model_dump(mode="json"),
+                timeout=300,
+            )
+        except requests.RequestException as e:
+            logger.exception("Network error reporting summary to Qualifire")
+            raise
+
+        if not response.ok:
+            body = None
+            try:
+                body = response.json()
+            except Exception:
+                body = response.text
+            logger.error(
+                "Failed to report summary to Qualifire",
+                extra={"status": response.status_code, "response": body},
+            )
+            raise Exception(f"Failed to report summary to Qualifire: {body}")

Also applies to: 33-39

🤖 Prompt for AI Agents
In rogue/server/services/qualifire_service.py around lines 26-31 (and similarly
for lines 33-39), the code calls requests.post and then assumes a JSON body;
wrap the network call and response parsing in a try/except that catches
requests.RequestException and handles timeouts/connection errors, check
response.ok before calling response.json(), and if parsing fails fall back to
response.text; log or raise a clear error including status_code and
response.text when response.ok is False or JSON decoding fails, and ensure the
timeout is preserved and resources are cleaned up.
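The parse-body-safely step can be isolated in a helper that never raises on a non-JSON error page (name illustrative; a real `requests.Response` exposes `.json()` and `.text` for the same purpose):

```python
import json


def safe_body(raw_text: str):
    """Prefer parsed JSON, fall back to the raw text for HTML/plain error pages."""
    try:
        return json.loads(raw_text)
    except (ValueError, TypeError):
        return raw_text


print(safe_body('{"error": "unauthorized"}'))     # {'error': 'unauthorized'}
print(safe_body("<html>502 Bad Gateway</html>"))  # <html>502 Bad Gateway</html>
```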

Comment on lines 158 to 165
data = SummaryGenerationRequest(
results=results,
model=model,
qualifire_api_key=qualifire_api_key,
job_id=job_id,
deep_test=deep_test,
judge_model=judge_model,
)

🛠️ Refactor suggestion

Avoid sending None for non-optional job_id.

SummaryGenerationRequest.job_id is str; passing None overrides the default and may fail validation.

-        data = SummaryGenerationRequest(
+        data = SummaryGenerationRequest(
             results=results,
             model=model,
-            qualifire_api_key=qualifire_api_key,
-            job_id=job_id,
+            qualifire_api_key=qualifire_api_key,
+            job_id=job_id or "",
             deep_test=deep_test,
             judge_model=judge_model,
         )
📝 Committable suggestion


Suggested change
data = SummaryGenerationRequest(
results=results,
model=model,
qualifire_api_key=qualifire_api_key,
job_id=job_id,
deep_test=deep_test,
judge_model=judge_model,
)
data = SummaryGenerationRequest(
results=results,
model=model,
qualifire_api_key=qualifire_api_key,
job_id=job_id or "",
deep_test=deep_test,
judge_model=judge_model,
)
🤖 Prompt for AI Agents
In sdks/python/rogue_sdk/client.py around lines 158 to 165, the code passes
job_id which may be None into SummaryGenerationRequest.job_id (a non-optional
str) causing validation errors; change the call to only supply job_id when it is
not None (e.g., build kwargs and conditionally add 'job_id' or default to a
valid string) so None is never passed into the SummaryGenerationRequest
constructor.
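The conditional-kwargs pattern described above can be sketched as (function and field names are illustrative, not the SDK's actual signature):

```python
def build_request_kwargs(results, model, job_id=None, deep_test=False):
    """Include optional fields only when set, so model defaults stay in effect."""
    kwargs = {"results": results, "model": model, "deep_test": deep_test}
    if job_id is not None:
        kwargs["job_id"] = job_id
    return kwargs


print(build_request_kwargs([], "gpt-4o"))                    # no 'job_id' key
print(build_request_kwargs([], "gpt-4o", job_id="job-123"))  # 'job_id' included
```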

Comment on lines +239 to +244
class ApiEvaluationResult(BaseModel):
"""New API format for evaluation results."""

scenarios: List[ApiScenarioResult]



🛠️ Refactor suggestion

ApiEvaluationResult missing metadata fields

Add optional fields to align with server API surface.

 class ApiEvaluationResult(BaseModel):
     """New API format for evaluation results."""

     scenarios: List[ApiScenarioResult]
+    summary: Optional[str] = None
+    keyFindings: Optional[str] = None
+    recommendation: Optional[str] = None
+    deepTest: bool = False
+    startTime: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
+    judgeModel: Optional[str] = None

Optional: extend convert_to_api_format(...) to accept structured_summary/deep_test/start_time/judge_model and populate these.

📝 Committable suggestion


Suggested change
class ApiEvaluationResult(BaseModel):
"""New API format for evaluation results."""
scenarios: List[ApiScenarioResult]
class ApiEvaluationResult(BaseModel):
"""New API format for evaluation results."""
scenarios: List[ApiScenarioResult]
summary: Optional[str] = None
keyFindings: Optional[str] = None
recommendation: Optional[str] = None
deepTest: bool = False
startTime: datetime = Field(
default_factory=lambda: datetime.now(timezone.utc)
)
judgeModel: Optional[str] = None
🤖 Prompt for AI Agents
In sdks/python/rogue_sdk/types.py around lines 239 to 244, ApiEvaluationResult
currently only declares scenarios but is missing optional metadata fields
expected by the server API; add optional attributes for structured_summary
(e.g., Optional[Dict[str, Any]] or Optional[StructuredSummaryType]), deep_test
(Optional[bool]), start_time (Optional[datetime] or Optional[str] ISO), and
judge_model (Optional[str]) to the Pydantic model, marking them Optional and
with sensible defaults (None); optionally update the convert_to_api_format(...)
helper to accept and pass through structured_summary, deep_test, start_time, and
judge_model parameters so ApiEvaluationResult instances are populated with these
fields when available.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
packages/tui/internal/tui/app.go (1)

90-114: Panic risk: dereference before err check; also O(n²) string building

Check err before accessing structuredSummary; build output with strings.Builder and only add non-empty sections.

Apply:

-		structuredSummary, err := sdk.GenerateSummary(ctx, m.evalState.JobID, judgeModel, apiKey, &m.config.QualifireAPIKey)
-
-		overallSummary := structuredSummary.Summary.OverallSummary
-		keyFindings := structuredSummary.Summary.KeyFindings
-		parsedKeyFindings := ""
-		for _, finding := range keyFindings {
-			parsedKeyFindings += "- " + finding + "\n"
-		}
-		recommendations := structuredSummary.Summary.Recommendations
-		parsedRecommendations := ""
-		for _, recommendation := range recommendations {
-			parsedRecommendations += "- " + recommendation + "\n"
-		}
-
-		detailedBreakdown := structuredSummary.Summary.DetailedBreakdown
-		parsedDetailedBreakdown := ""
-		for _, breakdown := range detailedBreakdown {
-			parsedDetailedBreakdown += "- " + breakdown.Scenario + " - " + breakdown.Status + " - " + breakdown.Outcome + "\n"
-		}
-
-		summary := "## Overall Summary\n\n" + overallSummary +
-			"\n\n" + "## Key Findings\n\n" + parsedKeyFindings +
-			"\n\n" + "## Recommendations\n\n" + parsedRecommendations +
-			"\n\n" + "## Detailed Breakdown\n\n" + parsedDetailedBreakdown
+		structuredSummary, err := sdk.GenerateSummary(ctx, m.evalState.JobID, judgeModel, apiKey, &m.config.QualifireAPIKey)
+		if err != nil {
+			return SummaryGeneratedMsg{Summary: "", Err: err}
+		}
+
+		sum := structuredSummary.Summary
+		var b strings.Builder
+
+		if sum.OverallSummary != "" {
+			b.WriteString("## Overall Summary\n\n")
+			b.WriteString(sum.OverallSummary)
+			b.WriteString("\n\n")
+		}
+		if len(sum.KeyFindings) > 0 {
+			b.WriteString("## Key Findings\n\n")
+			for _, f := range sum.KeyFindings {
+				if f == "" {
+					continue
+				}
+				b.WriteString("- ")
+				b.WriteString(f)
+				b.WriteByte('\n')
+			}
+			b.WriteByte('\n')
+		}
+		if len(sum.Recommendations) > 0 {
+			b.WriteString("## Recommendations\n\n")
+			for _, r := range sum.Recommendations {
+				if r == "" {
+					continue
+				}
+				b.WriteString("- ")
+				b.WriteString(r)
+				b.WriteByte('\n')
+			}
+			b.WriteByte('\n')
+		}
+		if len(sum.DetailedBreakdown) > 0 {
+			b.WriteString("## Detailed Breakdown\n\n")
+			for _, br := range sum.DetailedBreakdown {
+				b.WriteString("- ")
+				b.WriteString(br.Scenario)
+				b.WriteString(" - ")
+				b.WriteString(br.Status)
+				b.WriteString(" - ")
+				b.WriteString(br.Outcome)
+				b.WriteByte('\n')
+			}
+		}
+
+		summary := strings.TrimSpace(b.String())
#!/bin/bash
# Confirm all call sites match the updated SDK signature and there’s exactly one definition.
rg -nP 'GenerateSummary\s*\(' -C2
rg -nP 'func\s*\(\s*\*?RogueSDK\s*\)\s*GenerateSummary\s*\(' -C3
🧹 Nitpick comments (1)
packages/tui/internal/tui/app.go (1)

316-333: Paste into dialogs: insert at cursor, not always append

For consistency with LLM dialog, insert at InputCursor; also normalize CRLF.

Apply:

-			// Clean the clipboard text (remove newlines and trim whitespace)
-			cleanText := strings.TrimSpace(strings.ReplaceAll(clipboardText, "\n", ""))
+			// Clean the clipboard text (strip CRLF/newlines and trim)
+			cleanText := strings.TrimSpace(strings.ReplaceAll(strings.ReplaceAll(clipboardText, "\r\n", ""), "\n", ""))
@@
-			m.dialog.Input += cleanText
-			m.dialog.InputCursor = len(m.dialog.Input)
+			before := m.dialog.Input[:m.dialog.InputCursor]
+			after := m.dialog.Input[m.dialog.InputCursor:]
+			m.dialog.Input = before + cleanText + after
+			m.dialog.InputCursor += len(cleanText)
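The insert-at-cursor splice suggested above is plain slicing; a language-neutral sketch in Python (function name is illustrative, not from the codebase):

```python
def insert_at_cursor(text: str, cursor: int, pasted: str) -> tuple[str, int]:
    # Splice the pasted text at the cursor and advance the cursor past it,
    # mirroring the before/after split in the Go diff above.
    before, after = text[:cursor], text[cursor:]
    return before + pasted + after, cursor + len(pasted)


print(insert_at_cursor("helloworld", 5, ", "))  # ('hello, world', 7)
```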
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Lite

📥 Commits

Reviewing files that changed from the base of the PR and between 45c037f and 7036bb0.

📒 Files selected for processing (2)
  • packages/tui/internal/components/llm_config_dialog.go (2 hunks)
  • packages/tui/internal/tui/app.go (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
packages/tui/internal/tui/app.go (3)
packages/tui/internal/components/command_line.go (1)
  • CommandInput (21-30)
packages/tui/internal/components/dialog.go (1)
  • Dialog (38-51)
packages/tui/internal/components/viewport.go (1)
  • NewViewport (74-93)
🔇 Additional comments (5)
packages/tui/internal/components/llm_config_dialog.go (2)

552-556: Good: now using the exported clipboard helper

This aligns with reuse across the components package.


572-600: Ignore duplicate removal suggestion: no duplicate GetClipboardContent definition exists; only the one in packages/tui/internal/components/llm_config_dialog.go was found.

Likely an incorrect or invalid review comment.

packages/tui/internal/tui/app.go (3)

160-175: Model fields extension looks fine

detailedBreakdown typed to table.Row aligns with the new import.


269-269: Viewport height tweak acknowledged

Reduces report panel height to 15; no concerns.


11-13: No action required: dependency already declared in packages/tui/go.mod

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
rogue/server/api/llm.py (1)

91-96: Offload sync LLM summary generation to threadpool

Prevent event-loop blocking by running the synchronous LLM call in a threadpool.

-        summary = LLMService.generate_summary_from_results(
-            model=request.model,
-            results=request.results,
-            llm_provider_api_key=request.api_key,
-        )
+        summary = await run_in_threadpool(
+            LLMService.generate_summary_from_results,
+            model=request.model,
+            results=request.results,
+            llm_provider_api_key=request.api_key,
+        )
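The threadpool offload shown in the diff can be sketched with the stdlib's `asyncio.to_thread`, which serves the same purpose as Starlette's `run_in_threadpool` here (a minimal sketch; `generate_summary` is a stand-in for the real blocking LLM call):

```python
import asyncio
import time


def generate_summary(model: str) -> str:
    # Stand-in for the synchronous LLM call; sleeps to simulate latency.
    time.sleep(0.1)
    return f"summary from {model}"


async def handler() -> str:
    # Offload the sync call so the event loop stays free to serve
    # other requests while the LLM call runs in a worker thread.
    return await asyncio.to_thread(generate_summary, "gpt-4o")


print(asyncio.run(handler()))  # summary from gpt-4o
```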
♻️ Duplicate comments (3)
rogue/server/api/llm.py (3)

99-102: Do not log secrets (qualifire_api_key)

This logs a raw API key. Log only presence.

-        logger.info(
-            "Qualifire API key",
-            extra={"qualifire_api_key": request.qualifire_api_key},
-        )
+        logger.info(
+            "Qualifire API key provided",
+            extra={"present": bool(request.qualifire_api_key)},
+        )

132-147: Run Qualifire reporting in threadpool and fix type mismatch to EvaluationResults

  1. Offload the blocking requests.post.
  2. QualifireService.report_summary expects EvaluationResult but callers pass EvaluationResults; align kwarg and type to evaluation_results. This comment complements prior review feedback.

Proposed change here (assumes updating QualifireService signature to accept evaluation_results: EvaluationResults):

-            QualifireService.report_summary(
-                ReportSummaryRequest(
-                    job_id=request.job_id,
-                    structured_summary=summary,
-                    deep_test=job.deep_test if job else request.deep_test,
-                    start_time=(
-                        job.created_at
-                        if job is not None
-                        else datetime.now(timezone.utc)
-                    ),
-                    judge_model=job.judge_model if job else request.judge_model,
-                    qualifire_url=request.qualifire_url,
-                    qualifire_api_key=request.qualifire_api_key,
-                ),
-                evaluation_result=request.results,
-            )
+            await run_in_threadpool(
+                QualifireService.report_summary,
+                ReportSummaryRequest(
+                    job_id=request.job_id,
+                    structured_summary=summary,
+                    deep_test=job.deep_test if job else request.deep_test,
+                    start_time=(
+                        job.created_at
+                        if job is not None
+                        else datetime.now(timezone.utc)
+                    ),
+                    judge_model=job.judge_model if job else request.judge_model,
+                    qualifire_url=request.qualifire_url,
+                    qualifire_api_key=request.qualifire_api_key,
+                ),
+                evaluation_results=request.results,
+            )

Verification script to locate and confirm the service signature and all call sites:

#!/bin/bash
# Find QualifireService.report_summary signature and usages
rg -nP 'class\s+QualifireService\b|def\s+report_summary\s*\(' -C3 rogue
rg -nP 'QualifireService\.report_summary\s*\(' -n -C2 rogue

I can prep the matching patch for rogue/server/services/qualifire_service.py if you want.


187-190: Same two issues here: blocking call and EvaluationResults type kwarg

Mirror the fixes from the other call site.

-        QualifireService.report_summary(
-            request,
-            evaluation_result=EvaluationResults(results=results),
-        )
+        await run_in_threadpool(
+            QualifireService.report_summary,
+            request,
+            evaluation_results=EvaluationResults(results=results),
+        )
🧹 Nitpick comments (3)
rogue/server/api/llm.py (3)

127-131: Avoid dumping large/PII payloads into logs

Logging full summary and results can bloat logs and risk sensitive data exposure. Log sizes/ids instead.

-            logger.info(
-                "Summary",
-                extra={"summary": summary, "results": request.results},
-            )
+            logger.info(
+                "Summary ready",
+                extra={
+                    "results_count": len(request.results.results),
+                },
+            )

162-166: Add return type annotation

Annotate the handler to satisfy mypy and repo guidelines.

-async def report_summary_handler(
+async def report_summary_handler(
     request: ReportSummaryRequest,
     evaluation_service: EvaluationService = Depends(get_evaluation_service),
-):
+) -> ReportSummaryResponse:

195-200: Preserve HTTPException details; don’t wrap

Re-raise HTTPException to keep its detail/status unmodified; wrap only unexpected exceptions.

-    except Exception as e:
-        logger.exception("Failed to report summary")
-        raise HTTPException(
-            status_code=e.status_code if hasattr(e, "status_code") else 500,
-            detail=f"Failed to report summary: {str(e)}",
-        )
+    except HTTPException:
+        raise
+    except Exception as e:
+        logger.exception("Failed to report summary")
+        raise HTTPException(status_code=500, detail=f"Failed to report summary: {e}")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 64ff464 and 2f23574.

📒 Files selected for processing (2)
  • rogue/evaluator_agent/evaluator_agent.py (3 hunks)
  • rogue/server/api/llm.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • rogue/evaluator_agent/evaluator_agent.py
🧰 Additional context used
📓 Path-based instructions (2)
{rogue,tests,examples}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

{rogue,tests,examples}/**/*.py: Format Python code with Black
Organize imports according to isort conventions
Use type hints for all function signatures
Follow PEP 8 naming (snake_case for variables and functions, PascalCase for classes)
Use try/except blocks for code that may raise exceptions
Ensure code passes flake8 linting
Ensure code passes mypy type checks per .mypy.ini
Address Bandit-reported security issues in Python code

Files:

  • rogue/server/api/llm.py
rogue/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Place application/package code under the rogue/ directory

Files:

  • rogue/server/api/llm.py
🧬 Code graph analysis (1)
rogue/server/api/llm.py (7)
sdks/python/rogue_sdk/types.py (4)
  • EvaluationResults (192-208)
  • SummaryGenerationRequest (431-441)
  • ReportSummaryResponse (513-516)
  • ReportSummaryRequest (501-510)
rogue/server/api/evaluation.py (1)
  • get_evaluation_service (23-24)
rogue/server/services/evaluation_service.py (2)
  • EvaluationService (20-234)
  • get_job (31-33)
rogue/server/models/api_format.py (1)
  • ServerSummaryGenerationResponse (52-56)
rogue/common/logging/config.py (1)
  • get_logger (103-115)
rogue/server/services/llm_service.py (1)
  • LLMService (166-283)
rogue/server/services/qualifire_service.py (2)
  • QualifireService (8-40)
  • report_summary (10-40)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: codestyle
🔇 Additional comments (2)
rogue/server/api/llm.py (2)

69-76: Endpoint shape and DI look good

Switch to ServerSummaryGenerationResponse and injecting EvaluationService via Depends is clean.


149-152: Response construction looks good

Returning the structured summary and success message via the server model is consistent.

Comment on lines +181 to +185
if not results or len(results) == 0:
raise HTTPException(
status_code=404,
detail="Evaluation results not found or empty",
)

⚠️ Potential issue

Bug: len() called on EvaluationResults (no `__len__`)

This will raise TypeError. Check the underlying list instead.

-        if not results or len(results) == 0:
+        if not results or not results.results:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if not results or len(results) == 0:
raise HTTPException(
status_code=404,
detail="Evaluation results not found or empty",
)
if not results or not results.results:
raise HTTPException(
status_code=404,
detail="Evaluation results not found or empty",
)
🤖 Prompt for AI Agents
In rogue/server/api/llm.py around lines 181-185, the code calls len(results) on
an EvaluationResults object which lacks __len__; replace that check by
inspecting the underlying list attribute instead. Change the condition to check
the actual list field on the EvaluationResults instance (for example
results.evaluations or results.items — use the real attribute name in this
class) and call len() on that list or simply test its truthiness (e.g., if not
results or not results.<list_attr>: raise HTTPException(...)). Ensure you do not
call len() directly on the EvaluationResults object.
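The failure mode is easy to reproduce with a stand-in container class (a plain dataclass here, standing in for the Pydantic EvaluationResults model):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EvaluationResults:
    # Stand-in for the Pydantic model: it wraps a list but defines no __len__.
    results: List[str] = field(default_factory=list)


r = EvaluationResults(results=[])
try:
    len(r)  # raises TypeError: no len() on the wrapper object
except TypeError:
    pass

# The suggested fix inspects the underlying list instead:
assert not r.results  # empty -> treated as "not found or empty"
assert EvaluationResults(results=["ok"]).results
```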

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
sdks/python/rogue_sdk/types.py (1)

484-499: Bug: invalid HttpUrl construction in validator

HttpUrl(v) is not a constructor in Pydantic v2 and will raise. Either drop this validator or use TypeAdapter.

-from pydantic import (
+from pydantic import (
     BaseModel,
     ConfigDict,
     Field,
     HttpUrl,
     field_validator,
     model_validator,
 )
+from pydantic import TypeAdapter
@@
-    @field_validator("base_url", mode="after")
-    def validate_base_url(cls, v: str | HttpUrl) -> HttpUrl:
-        if isinstance(v, str):
-            return HttpUrl(v)
-        return v
+    @field_validator("base_url", mode="before")
+    def validate_base_url(cls, v):
+        # Accept str or HttpUrl; coerce strings into HttpUrl
+        return TypeAdapter(HttpUrl).validate_python(v)
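For reference, the coercion the validator wants can also be checked without Pydantic; this dependency-free sketch with `urllib.parse` mirrors the intent (the function name is illustrative, not from the codebase):

```python
from urllib.parse import urlparse


def coerce_base_url(v: str) -> str:
    # Accept a string and require an http(s) scheme plus a host,
    # roughly what TypeAdapter(HttpUrl).validate_python enforces.
    parsed = urlparse(v)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"invalid base_url: {v!r}")
    return v


print(coerce_base_url("https://api.example.com"))  # https://api.example.com
```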
packages/tui/internal/tui/evaluation.go (1)

29-32: Align ScenarioType with server API (replace "custom").

Python/TS SDKs expose "policy" and "prompt_injection" (no "custom"). Using "custom" will likely 400 the request when validated server-side.

Apply:

-const (
-	ScenarioTypePolicy ScenarioType = "policy"
-	ScenarioTypeCustom ScenarioType = "custom"
-)
+const (
+	ScenarioTypePolicy          ScenarioType = "policy"
+	ScenarioTypePromptInjection ScenarioType = "prompt_injection"
+)
♻️ Duplicate comments (3)
packages/tui/internal/tui/app.go (2)

101-106: Good: error-checked before dereference

Early-return on err prevents panics when accessing structuredSummary. Matches earlier guidance.


110-131: Avoid O(n²) string concatenations; use strings.Builder and skip empty items

The loops build large strings via +=. Prefer strings.Builder and write sections conditionally.

-        overallSummary := structuredSummary.Summary.OverallSummary
-        keyFindings := structuredSummary.Summary.KeyFindings
-        parsedKeyFindings := ""
-        for _, finding := range keyFindings {
-            parsedKeyFindings += "- " + finding + "\n"
-        }
-        recommendations := structuredSummary.Summary.Recommendations
-        parsedRecommendations := ""
-        for _, recommendation := range recommendations {
-            parsedRecommendations += "- " + recommendation + "\n"
-        }
-        detailedBreakdown := structuredSummary.Summary.DetailedBreakdown
-        parsedDetailedBreakdown := ""
-        for _, breakdown := range detailedBreakdown {
-            parsedDetailedBreakdown += "- " + breakdown.Scenario + " - " + breakdown.Status + " - " + breakdown.Outcome + "\n"
-        }
-
-        summary := "## Overall Summary\n\n" + overallSummary +
-            "\n\n" + "## Key Findings\n\n" + parsedKeyFindings +
-            "\n\n" + "## Recommendations\n\n" + parsedRecommendations +
-            "\n\n" + "## Detailed Breakdown\n\n" + parsedDetailedBreakdown
+        sum := structuredSummary.Summary
+        var b strings.Builder
+        if sum.OverallSummary != "" {
+            b.WriteString("## Overall Summary\n\n")
+            b.WriteString(sum.OverallSummary)
+            b.WriteString("\n\n")
+        }
+        if len(sum.KeyFindings) > 0 {
+            b.WriteString("## Key Findings\n\n")
+            for _, f := range sum.KeyFindings {
+                if f == "" {
+                    continue
+                }
+                b.WriteString("- ")
+                b.WriteString(f)
+                b.WriteByte('\n')
+            }
+            b.WriteByte('\n')
+        }
+        if len(sum.Recommendations) > 0 {
+            b.WriteString("## Recommendations\n\n")
+            for _, r := range sum.Recommendations {
+                if r == "" {
+                    continue
+                }
+                b.WriteString("- ")
+                b.WriteString(r)
+                b.WriteByte('\n')
+            }
+            b.WriteByte('\n')
+        }
+        if len(sum.DetailedBreakdown) > 0 {
+            b.WriteString("## Detailed Breakdown\n\n")
+            for _, br := range sum.DetailedBreakdown {
+                b.WriteString("- ")
+                b.WriteString(br.Scenario)
+                b.WriteString(" - ")
+                b.WriteString(br.Status)
+                b.WriteString(" - ")
+                b.WriteString(br.Outcome)
+                b.WriteByte('\n')
+            }
+        }
+        summary := strings.TrimSpace(b.String())
rogue/server/api/llm.py (1)

99-103: Do not log secrets (qualifire_api_key)

This prints the raw API key to logs.

-        logger.info(
-            "Qualifire API key",
-            extra={"qualifire_api_key": request.qualifire_api_key},
-        )
+        logger.info("Qualifire API key provided", extra={"present": bool(request.qualifire_api_key)})
🧹 Nitpick comments (10)
packages/tui/internal/tui/app.go (1)

334-351: Paste handling: avoid re-reading clipboard; sanitize minimally

On tea.PasteMsg, you re-read the OS clipboard (can be slow/error-prone) and strip all newlines. Prefer handling the paste event’s content (if available in your version) or at least replace \n with spaces to preserve word boundaries.

- cleanText := strings.TrimSpace(strings.ReplaceAll(clipboardText, "\n", ""))
+ cleaned := strings.ReplaceAll(clipboardText, "\r\n", "\n")
+ cleaned = strings.TrimSpace(strings.ReplaceAll(cleaned, "\n", " "))
sdks/python/rogue_sdk/types.py (2)

264-275: Surface timestamp parse failures

Silently falling back to now() hides data issues. Log a warning when parsing fails.

-                if msg.timestamp:
+                if msg.timestamp:
                     try:
                         if isinstance(msg.timestamp, str):
                             timestamp = datetime.fromisoformat(
                                 msg.timestamp.replace("Z", "+00:00"),
                             )
                         else:
                             timestamp = msg.timestamp
                     except (ValueError, AttributeError):
-                        timestamp = datetime.now(timezone.utc)
+                        logger.warning("Failed to parse timestamp: %r", msg.timestamp)
+                        timestamp = datetime.now(timezone.utc)
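The `replace("Z", "+00:00")` shim exists because `datetime.fromisoformat` rejected a trailing `Z` before Python 3.11; a quick sketch of the parse-with-fallback path:

```python
from datetime import datetime, timezone


def parse_timestamp(ts: str) -> datetime:
    # "Z" suffixes are rejected by fromisoformat before Python 3.11,
    # hence the replace; fall back to "now" on unparseable input.
    try:
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))
    except ValueError:
        return datetime.now(timezone.utc)


print(parse_timestamp("2025-09-04T12:30:00Z"))  # 2025-09-04 12:30:00+00:00
```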

444-451: Consider typing detailed_breakdown rows

List[dict] loses structure. Optional: model it to match server/TUI: {scenario,status,outcome}.

-class StructuredSummary(BaseModel):
+class StructuredBreakdownItem(BaseModel):
+    scenario: str
+    status: str
+    outcome: str
+
+class StructuredSummary(BaseModel):
@@
-    detailed_breakdown: List[dict]  # Table rows for scenario breakdown
+    detailed_breakdown: List[StructuredBreakdownItem]
rogue/server/api/llm.py (1)

127-131: Reduce log verbosity for large payloads

Logging the entire summary and results at INFO can bloat logs. Use DEBUG or drop payload fields.

-        logger.info(
-            "Summary",
-            extra={"summary": summary, "results": request.results},
-        )
+        logger.debug("Summary generated", extra={"results_count": len(request.results.results)})
packages/tui/internal/tui/evaluation.go (6)

96-105: Name the breakdown row type for reuse and clarity.

Inline anon struct hampers reuse/validation.

 type StructuredSummary struct {
 	OverallSummary    string   `json:"overall_summary"`
 	KeyFindings       []string `json:"key_findings"`
 	Recommendations   []string `json:"recommendations"`
-	DetailedBreakdown []struct {
-		Scenario string `json:"scenario"`
-		Status   string `json:"status"`
-		Outcome  string `json:"outcome"`
-	} `json:"detailed_breakdown"`
+	DetailedBreakdown []DetailedBreakdownRow `json:"detailed_breakdown"`
 }

Add below (outside selected range):

type DetailedBreakdownRow struct {
	Scenario string `json:"scenario"`
	Status   string `json:"status"`
	Outcome  string `json:"outcome"`
}

431-437: Stricter URL validation.

url.Parse accepts many invalid inputs. Check scheme+host or use ParseRequestURI.

- if _, err := url.Parse(serverURL); err != nil {
+ u1, err := url.ParseRequestURI(serverURL)
+ if err != nil || (u1.Scheme != "http" && u1.Scheme != "https") || u1.Host == "" {
   return nil, nil, fmt.Errorf("invalid server url: %w", err)
 }
- if _, err := url.Parse(agentURL); err != nil {
+ u2, err := url.ParseRequestURI(agentURL)
+ if err != nil || (u2.Scheme != "http" && u2.Scheme != "https") || u2.Host == "" {
   return nil, nil, fmt.Errorf("invalid agent url: %w", err)
 }

453-461: Pre-allocate scenarios slice.

Avoids reallocations on large lists.

-// Convert scenarios
-for _, s := range scenarios {
+// Convert scenarios
+request.Scenarios = make([]EvalScenario, 0, len(scenarios))
+for _, s := range scenarios {
   request.Scenarios = append(request.Scenarios, EvalScenario{
     Scenario:        s.Scenario,
     ScenarioType:    s.ScenarioType,
     ExpectedOutcome: s.ExpectedOutcome,
   })
}

520-523: Be lenient on success status codes.

APIs often return 200/201/202. Treat 2xx as success.

-if resp.StatusCode != http.StatusOK {
+if resp.StatusCode < 200 || resp.StatusCode >= 300 {
   body, _ := io.ReadAll(resp.Body)
   return nil, fmt.Errorf("summary generation failed: %d %s", resp.StatusCode, string(body))
 }

534-574: ReportSummary: allow 2xx and consider longer timeout.

The call may wait on downstream BFF; 30s can be tight.

- resp, err := sdk.httpClient.Do(req)
+ client := &http.Client{Timeout: 2 * time.Minute}
+ resp, err := client.Do(req)
 ...
-if resp.StatusCode != http.StatusOK {
+if resp.StatusCode < 200 || resp.StatusCode >= 300 {
   body, _ := io.ReadAll(resp.Body)
   return fmt.Errorf("report summary failed: %d %s", resp.StatusCode, string(body))
 }

465-474: Comment no longer matches behavior.

Function returns structured JSON, not “markdown summary”.

-// GenerateSummary generates a markdown summary from evaluation results
+// GenerateSummary generates a structured summary (LLM-backed) from evaluation results.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2f23574 and d9ce9e2.

📒 Files selected for processing (5)
  • packages/tui/internal/tui/app.go (7 hunks)
  • packages/tui/internal/tui/eval_ui.go (2 hunks)
  • packages/tui/internal/tui/evaluation.go (6 hunks)
  • rogue/server/api/llm.py (4 hunks)
  • sdks/python/rogue_sdk/types.py (8 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/tui/internal/tui/eval_ui.go
🧰 Additional context used
📓 Path-based instructions (2)
{rogue,tests,examples}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

{rogue,tests,examples}/**/*.py: Format Python code with Black
Organize imports according to isort conventions
Use type hints for all function signatures
Follow PEP 8 naming (snake_case for variables and functions, PascalCase for classes)
Use try/except blocks for code that may raise exceptions
Ensure code passes flake8 linting
Ensure code passes mypy type checks per .mypy.ini
Address Bandit-reported security issues in Python code

Files:

  • rogue/server/api/llm.py
rogue/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Place application/package code under the rogue/ directory

Files:

  • rogue/server/api/llm.py
🧬 Code graph analysis (4)
rogue/server/api/llm.py (6)
sdks/python/rogue_sdk/types.py (4)
  • EvaluationResults (192-208)
  • SummaryGenerationRequest (431-441)
  • ReportSummaryResponse (513-516)
  • ReportSummaryRequest (501-510)
rogue/server/api/evaluation.py (1)
  • get_evaluation_service (23-24)
rogue/server/services/evaluation_service.py (2)
  • EvaluationService (20-234)
  • get_job (31-33)
rogue/server/models/api_format.py (1)
  • ServerSummaryGenerationResponse (52-56)
rogue/server/services/qualifire_service.py (2)
  • QualifireService (8-40)
  • report_summary (10-40)
sdks/python/rogue_sdk/client.py (2)
  • generate_summary (147-194)
  • request (79-83)
packages/tui/internal/tui/evaluation.go (2)
sdks/python/rogue_sdk/types.py (3)
  • Scenario (94-133)
  • ScenarioType (45-49)
  • StructuredSummary (444-450)
packages/sdk/src/types.ts (1)
  • Scenario (41-47)
sdks/python/rogue_sdk/types.py (3)
rogue/server/models/api_format.py (4)
  • ApiChatMessage (14-19)
  • ApiConversationEvaluation (22-27)
  • ApiScenarioResult (30-37)
  • ApiEvaluationResult (40-49)
rogue/server/services/api_format_service.py (1)
  • convert_to_api_format (20-113)
packages/tui/internal/tui/evaluation.go (1)
  • StructuredSummary (96-105)
packages/tui/internal/tui/app.go (6)
packages/tui/internal/tui/evaluation.go (2)
  • StructuredSummary (96-105)
  • NewRogueSDK (112-119)
packages/tui/internal/components/command_line.go (1)
  • CommandInput (21-30)
packages/tui/internal/components/dialog.go (3)
  • Dialog (38-51)
  • NewInfoDialog (65-78)
  • PrimaryButton (32-32)
packages/tui/internal/components/llm_config_dialog.go (2)
  • LLMConfigDialog (132-152)
  • GetClipboardContent (573-600)
packages/tui/internal/components/viewport.go (1)
  • NewViewport (74-93)
packages/tui/internal/components/dialog_helpers.go (1)
  • ShowErrorDialog (6-19)
🔇 Additional comments (3)
packages/tui/internal/tui/app.go (1)

178-193: New detailedBreakdown field appears unused

Model.detailedBreakdown []table.Row isn’t read or rendered here. Either wire it into views or remove to avoid dead state.

packages/tui/internal/tui/evaluation.go (2)

441-448: The directory paths and struct definitions need verification before proceeding. I ran scripts to list the repo root and locate EvaluationRequest and its JSON tags; once you share the output, we can determine whether the server expects judge_llm or judge_model.


491-496: Guard optional fields and confirm payload JSON schema
Only set qualifire_api_key when qualifireAPIKey is non-nil/non-empty; verify the server’s SummaryGenerationRequest struct JSON tags (especially the results field nesting) to ensure you’re sending the expected shape.

Comment on lines 90 to 99
structuredSummary, err := sdk.GenerateSummary(
ctx,
m.evalState.JobID,
judgeModel,
apiKey,
&m.config.QualifireAPIKey,
m.evalState.DeepTest,
judgeModel,
m.config.ServerURL,
)

💡 Verification agent

🧩 Analysis chain

Possible incorrect SDK call: duplicate judgeModel argument; verify signature

GenerateSummary receives judgeModel twice (Lines 93 and 97). Confirm the parameter order and types to avoid passing a wrong value silently.

Run to locate the Go SDK method signature and all callsites:


🏁 Script executed:

#!/bin/bash
rg -nP 'func\s+\(\s*\*\s*RogueSDK\s*\)\s*GenerateSummary\(' -C3
rg -nP '\bGenerateSummary\(' -n -C2

Length of output: 662


Replace duplicate judgeModel argument in GenerateSummary call
GenerateSummary’s signature expects two distinct model parameters; the 7th argument here reuses judgeModel but should be the intended model (e.g. summaryModel or m.config.AnswerModel). [packages/tui/internal/tui/app.go:90]

🤖 Prompt for AI Agents
In packages/tui/internal/tui/app.go around lines 90 to 99, the call to
sdk.GenerateSummary mistakenly passes judgeModel twice (as the 4th and 7th
arguments); replace the 7th argument with the intended summary/answer model (for
example summaryModel or m.config.AnswerModel) so the function receives the two
distinct model parameters expected by its signature, and ensure you import or
reference the correct model variable that exists in the surrounding scope.

Comment on lines 552 to 611
case "save_qualifire_and_report":
// Handle Qualifire API key save and report persistence
if m.dialog != nil && m.dialog.Title == "Configure Qualifire API Key" {
// Save the API key to config (allow empty to clear the key)
m.config.QualifireAPIKey = msg.Input
// Only enable integration if there's an API key
if msg.Input != "" {
m.config.QualifireEnabled = true
if m.configState != nil {
m.configState.QualifireEnabled = true
m.configState.HasChanges = true
}
}

// immediately report the summary
if m.evalState != nil && m.evalState.Completed {
sdk := NewRogueSDK(m.config.ServerURL)
err := sdk.ReportSummary(
context.Background(),
m.evalState.JobID,
m.evalState.StructuredSummary,
m.evalState.DeepTest,
m.evalState.JudgeModel,
m.config.QualifireAPIKey,
)
if err != nil {
// Show error dialog
errorDialog := components.ShowErrorDialog(
"Report Summary Error",
fmt.Sprintf("Failed to report summary: %v", err),
)
m.dialog = &errorDialog
}

err = m.saveConfig()
if err != nil {
// Show error dialog
errorDialog := components.ShowErrorDialog(
"Configuration Error",
fmt.Sprintf("Failed to save Qualifire configuration: %v", err),
)
m.dialog = &errorDialog
return m, nil
} else {
// Show appropriate success dialog
var message string
if msg.Input != "" {
message = "Qualifire API key has been successfully saved and integration is now enabled. Your evaluation report will now be automatically persisted."
} else {
message = "Qualifire API key has been cleared and integration is now disabled."
}
successDialog := components.NewInfoDialog(
"Qualifire Configured",
message,
)
m.dialog = &successDialog
return m, nil
}
}
}

🛠️ Refactor suggestion

Do not report when API key is empty; also avoid blocking UI with network I/O

This path attempts to report even when msg.Input == "", and performs a synchronous HTTP call inside Update, freezing the TUI during the request.

  • Guard against empty key and skip reporting.
  • Offload reporting into a tea.Cmd (follow the pattern used by summaryGenerationCmd) and show a spinner.

Minimal immediate guard:

-                    // immediately report the summary
-                    if m.evalState != nil && m.evalState.Completed {
+                    // immediately report the summary if we have a key
+                    if m.evalState != nil && m.evalState.Completed && m.config.QualifireAPIKey != "" {
                         sdk := NewRogueSDK(m.config.ServerURL)
                         err := sdk.ReportSummary(

Follow-up: happy to sketch a reportSummaryCmd(jobID, summary, deepTest, judgeModel, apiKey) if you want it non-blocking.

📝 Committable suggestion


Suggested change
case "save_qualifire_and_report":
	// Handle Qualifire API key save and report persistence
	if m.dialog != nil && m.dialog.Title == "Configure Qualifire API Key" {
		// Save the API key to config (allow empty to clear the key)
		m.config.QualifireAPIKey = msg.Input
		// Only enable integration if there's an API key
		if msg.Input != "" {
			m.config.QualifireEnabled = true
			if m.configState != nil {
				m.configState.QualifireEnabled = true
				m.configState.HasChanges = true
			}
		}
		// immediately report the summary
		if m.evalState != nil && m.evalState.Completed {
			sdk := NewRogueSDK(m.config.ServerURL)
			err := sdk.ReportSummary(
				context.Background(),
				m.evalState.JobID,
				m.evalState.StructuredSummary,
				m.evalState.DeepTest,
				m.evalState.JudgeModel,
				m.config.QualifireAPIKey,
			)
			if err != nil {
				// Show error dialog
				errorDialog := components.ShowErrorDialog(
					"Report Summary Error",
					fmt.Sprintf("Failed to report summary: %v", err),
				)
				m.dialog = &errorDialog
			}
			err = m.saveConfig()
			if err != nil {
				// Show error dialog
				errorDialog := components.ShowErrorDialog(
					"Configuration Error",
					fmt.Sprintf("Failed to save Qualifire configuration: %v", err),
				)
				m.dialog = &errorDialog
				return m, nil
			} else {
				// Show appropriate success dialog
				var message string
				if msg.Input != "" {
					message = "Qualifire API key has been successfully saved and integration is now enabled. Your evaluation report will now be automatically persisted."
				} else {
					message = "Qualifire API key has been cleared and integration is now disabled."
				}
				successDialog := components.NewInfoDialog(
					"Qualifire Configured",
					message,
				)
				m.dialog = &successDialog
				return m, nil
			}
		}
	}
	// immediately report the summary if we have a key
	if m.evalState != nil && m.evalState.Completed && m.config.QualifireAPIKey != "" {
		sdk := NewRogueSDK(m.config.ServerURL)
		err := sdk.ReportSummary(
			context.Background(),
			m.evalState.JobID,
			m.evalState.StructuredSummary,
			m.evalState.DeepTest,
			m.evalState.JudgeModel,
			m.config.QualifireAPIKey,
		)
🤖 Prompt for AI Agents
In packages/tui/internal/tui/app.go around lines 552 to 611, the code currently
attempts to report the summary synchronously even when msg.Input is empty and
performs a blocking HTTP call inside Update; change it so that if msg.Input ==
"" you skip any reporting entirely, and if non-empty, do NOT call
sdk.ReportSummary inline—instead spawn a non-blocking tea.Cmd (follow the
existing summaryGenerationCmd pattern) that runs ReportSummary in a goroutine,
dispatches progress/response messages to update (so you can show a spinner and
success/error dialogs), and only call m.saveConfig() either before enqueuing the
report cmd or handle save errors through the cmd flow; ensure Update returns
immediately with the cmd rather than performing network I/O synchronously.

drorIvry and others added 4 commits September 11, 2025 13:49
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
rogue/run_cli.py (1)

210-214: Correct: unwrap SecretStr before passing to SDK

Converting SecretStr to a raw str before calling sdk.generate_summary addresses the earlier serialization issue.

🧹 Nitpick comments (1)
rogue/run_cli.py (1)

219-221: Consider passing job_id to enable Qualifire reporting

Server-side reporting to Qualifire only happens when both qualifire_api_key and job_id are provided. You’re passing qualifire_api_key but not job_id, so reporting will be skipped. If intended, ignore; otherwise, wire job_id through.

Add job_id to the SDK call:

         summary, _ = await sdk.generate_summary(
             results=results,
             model=judge_llm,
             api_key=judge_llm_api_key,
             qualifire_api_key=qualifire_api_key,
             deep_test=deep_test_mode,
             judge_model=judge_model,
+            job_id=job_id,
         )

Apply supporting changes outside this hunk:

  • Extend create_report signature:
-    judge_model: str | None = None,
+    judge_model: str | None = None,
+    job_id: str | None = None,
  • Return and propagate job_id from scenario run:
-async def run_scenarios(... ) -> EvaluationResults | None:
+async def run_scenarios(... ) -> tuple[EvaluationResults, str] | None:

-async def _run_scenarios_with_sdk(... ) -> EvaluationResults | None:
+async def _run_scenarios_with_sdk(... ) -> tuple[EvaluationResults, str] | None:

-            return results
+            return results, final_job.job_id
  • Unpack and pass job_id in run_cli:
-    results = await run_scenarios(
+    run_out = await run_scenarios(
...
-    if not results:
+    if not run_out:
         raise ValueError(...)
+    results, job_id = run_out

     report_summary = await create_report(
         rogue_server_url=args.rogue_server_url,
         judge_llm=cli_input.judge_llm,
         results=results,
         output_report_file=cli_input.output_report_file,
         judge_llm_api_key_secret=cli_input.judge_llm_api_key,
         deep_test_mode=cli_input.deep_test_mode,
         judge_model=cli_input.judge_llm,
+        job_id=job_id,
     )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c5134b3 and 38ccedd.

📒 Files selected for processing (1)
  • rogue/run_cli.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
{rogue,tests,examples}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

{rogue,tests,examples}/**/*.py: Format Python code with Black
Organize imports according to isort conventions
Use type hints for all function signatures
Follow PEP 8 naming (snake_case for variables and functions, PascalCase for classes)
Use try/except blocks for code that may raise exceptions
Ensure code passes flake8 linting
Ensure code passes mypy type checks per .mypy.ini
Address Bandit-reported security issues in Python code

Files:

  • rogue/run_cli.py
rogue/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Place application/package code under the rogue/ directory

Files:

  • rogue/run_cli.py
🧬 Code graph analysis (1)
rogue/run_cli.py (3)
rogue/server/api/llm.py (1)
  • generate_summary (73-159)
sdks/python/rogue_sdk/client.py (1)
  • generate_summary (147-194)
sdks/python/rogue_sdk/sdk.py (1)
  • generate_summary (287-342)
🪛 Flake8 (7.2.0)
rogue/run_cli.py

[error] 209-209: IndentationError: expected an indented block after 'try' statement on line 208

(E999)

🪛 GitHub Actions: pr
rogue/run_cli.py

[error] 208-209: mypy: syntax error - expected an indented block after 'try' statement in rogue/run_cli.py (line 208). Command: 'uv run mypy --config-file .mypy.ini'.

🪛 GitHub Actions: Test
rogue/run_cli.py

[error] 208-208: IndentationError: expected an indented block after 'try' statement on line 208.

🪛 GitHub Actions: Rogue
rogue/run_cli.py

[error] 208-209: IndentationError: expected an indented block after 'try' statement on line 208 in rogue/run_cli.py (during startup of server; command: uv run -m rogue server --host 0.0.0.0 --port 8000 &).

🔇 Additional comments (1)
rogue/run_cli.py (1)

191-193: Signature extension looks good

Adding qualifire_api_key_secret, deep_test_mode, and judge_model to create_report is appropriate and correctly typed with safe defaults.

Comment on lines +367 to 369
        deep_test_mode=cli_input.deep_test_mode,
        judge_model=cli_input.judge_llm,
    )

💡 Verification agent

🧩 Analysis chain

Wire Qualifire API key from CLI/config (if available)

create_report now accepts qualifire_api_key_secret but the call site doesn’t pass it. If CLIInput exposes it, forward it.

     report_summary = await create_report(
         rogue_server_url=args.rogue_server_url,
         judge_llm=cli_input.judge_llm,
         results=results,
         output_report_file=cli_input.output_report_file,
         judge_llm_api_key_secret=cli_input.judge_llm_api_key,
+        qualifire_api_key_secret=cli_input.qualifire_api_key,
         deep_test_mode=cli_input.deep_test_mode,
         judge_model=cli_input.judge_llm,
     )

Run to confirm the field name on CLIInput/PartialCLIInput:


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n --type=py -C2 'class\s+CLIInput|class\s+PartialCLIInput'
rg -n --type=py -C2 'qualifire.*api.*key'
rg -n --type=py -C2 '\bjob_id\b'

Length of output: 32349


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "=== rogue/models/cli_input.py ==="
sed -n '1,240p' rogue/models/cli_input.py || true
echo
echo "=== rogue/run_cli.py (first 520 lines with numbers) ==="
nl -ba -v1 -w4 -s': ' rogue/run_cli.py | sed -n '1,520p' || true
echo
echo "=== create_report call sites ==="
rg -n --type py 'create_report\(' -S || true
rg -n --type py 'create_report\(' -S | while IFS=: read -r file lineno rest; do
  echo "---- $file:$lineno ----"
  sed -n "$((lineno-3)),$((lineno+3))p" "$file" || true
done

Length of output: 3650


Wire Qualifire API key from CLI/config and forward it to create_report

create_report accepts qualifire_api_key_secret but run_cli doesn't pass it and CLIInput/PartialCLIInput do not expose the key — add the field and forward it.

  • Add qualifire_api_key: SecretStr | None = None to CLIInput and PartialCLIInput (rogue/models/cli_input.py — next to judge_llm_api_key).
  • Forward it at the create_report call site in rogue/run_cli.py (call around line 361): add qualifire_api_key_secret=cli_input.qualifire_api_key.

Suggested change:

     report_summary = await create_report(
         rogue_server_url=args.rogue_server_url,
         judge_llm=cli_input.judge_llm,
         results=results,
         output_report_file=cli_input.output_report_file,
         judge_llm_api_key_secret=cli_input.judge_llm_api_key,
+        qualifire_api_key_secret=cli_input.qualifire_api_key,
         deep_test_mode=cli_input.deep_test_mode,
         judge_model=cli_input.judge_llm,
     )
  • Update tests that construct CLIInput (rogue/tests/models/test_cli_input.py) if they need to supply the new field.
📝 Committable suggestion


Suggested change
-        deep_test_mode=cli_input.deep_test_mode,
-        judge_model=cli_input.judge_llm,
-    )
+        qualifire_api_key_secret=cli_input.qualifire_api_key,
+        deep_test_mode=cli_input.deep_test_mode,
+        judge_model=cli_input.judge_llm,
+    )
🤖 Prompt for AI Agents
In rogue/run_cli.py around lines 367-369, the call to create_report does not
pass the qualifire API key; add qualifire_api_key: SecretStr | None = None to
CLIInput and PartialCLIInput in rogue/models/cli_input.py (place it next to
judge_llm_api_key), then forward it in the create_report call by adding
qualifire_api_key_secret=cli_input.qualifire_api_key; finally update any tests
that construct CLIInput (e.g., rogue/tests/models/test_cli_input.py) to include
the new field where necessary.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
rogue/run_cli.py (1)

369-371: Wire Qualifire API key (SecretStr) and job_id into create_report call

Currently, create_report isn’t given qualifire_api_key_secret or job_id, so reporting won’t trigger. This was flagged earlier and remains. Pass both.

     report_summary = await create_report(
         rogue_server_url=args.rogue_server_url,
         judge_llm=cli_input.judge_llm,
         results=results,
         output_report_file=cli_input.output_report_file,
         judge_llm_api_key_secret=cli_input.judge_llm_api_key,
+        qualifire_api_key_secret=cli_input.qualifire_api_key,
+        job_id=job_id,
         deep_test_mode=cli_input.deep_test_mode,
         judge_model=cli_input.judge_llm,
     )

Follow-up (outside this hunk): plumb job_id from the evaluation step.

  • Return job_id from _run_scenarios_with_sdk and run_scenarios:
# signatures
async def _run_scenarios_with_sdk(...) -> tuple[EvaluationResults | None, str | None]:
async def run_scenarios(...) -> tuple[EvaluationResults | None, str | None]:
  • Inside _run_scenarios_with_sdk, after wait_for_evaluation:
if final_job.results:
    results = EvaluationResults(results=final_job.results)
    evaluation_results_output_path.write_text(
        results.model_dump_json(indent=2, exclude_none=True),
        encoding="utf-8",
    )
    return results, final_job.job_id
else:
    logger.error("Scenario evaluation completed but no results found.")
    return None, final_job.job_id
  • In run_cli:
results, job_id = await run_scenarios(...)

if not results:
    raise ValueError(f"No scenarios were evaluated for {cli_input.evaluated_agent_url}")

report_summary = await create_report(
    ...,
    qualifire_api_key_secret=cli_input.qualifire_api_key,
    job_id=job_id,
    ...
)
#!/usr/bin/env bash
# Verify SDK and models expose needed fields
set -euo pipefail
rg -n --type=py -C2 'def generate_summary\(' sdks/python/rogue_sdk | sed -n '1,200p'
rg -n --type=py -C2 '\bqualifire_api_key\b|\bjob_id\b' sdks/python/rogue_sdk
rg -n --type=py -C3 'class\s+(CLIInput|PartialCLIInput)\b' rogue/models | sed -n '1,240p'
rg -n --type=py -C2 '\bqualifire.*api.*key\b' rogue/models
🧹 Nitpick comments (1)
rogue/run_cli.py (1)

222-224: Re-raise with bare raise to preserve traceback

raise e re-raises the exception but appends the re-raise site to the traceback, obscuring the origin; a bare raise re-raises the active exception unchanged.

-    except Exception as e:
-        logger.exception("Failed to generate summary")
-        raise e
+    except Exception:
+        logger.exception("Failed to generate summary")
+        raise
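The difference is easy to demonstrate with the standard traceback module. A minimal standalone sketch (the function names here are illustrative, not from the codebase):

```python
import traceback


def inner():
    raise ValueError("boom")


def outer():
    try:
        inner()
    except Exception:
        # Bare raise re-raises the active exception; the traceback
        # still begins at the original raise site inside inner().
        raise


try:
    outer()
except ValueError:
    tb = traceback.format_exc()

# The formatted traceback still names inner(), where the error originated.
assert "inner" in tb
```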
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 38ccedd and 4d18d18.

📒 Files selected for processing (1)
  • rogue/run_cli.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
{rogue,tests,examples}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

{rogue,tests,examples}/**/*.py: Format Python code with Black
Organize imports according to isort conventions
Use type hints for all function signatures
Follow PEP 8 naming (snake_case for variables and functions, PascalCase for classes)
Use try/except blocks for code that may raise exceptions
Ensure code passes flake8 linting
Ensure code passes mypy type checks per .mypy.ini
Address Bandit-reported security issues in Python code

Files:

  • rogue/run_cli.py
rogue/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Place application/package code under the rogue/ directory

Files:

  • rogue/run_cli.py
🧬 Code graph analysis (1)
rogue/run_cli.py (3)
rogue/server/api/llm.py (1)
  • generate_summary (73-159)
sdks/python/rogue_sdk/client.py (1)
  • generate_summary (147-194)
sdks/python/rogue_sdk/sdk.py (1)
  • generate_summary (287-342)
🪛 GitHub Actions: Rogue
rogue/run_cli.py

[error] 178-178: Scenario evaluation completed but no results found.


[error] 358-358: No scenarios were evaluated for http://localhost:10001/ (ValueError) during rogue CLI run. Command: 'uv run -m rogue cli --workdir ./examples/tshirt_store_agent/.rogue --evaluated-agent-url http://localhost:10001 --judge-llm openai/gpt-4.1-mini'

🔇 Additional comments (1)
rogue/run_cli.py (1)

209-213: LGTM: SecretStr correctly unwrapped before passing to SDK

This addresses the earlier serialization concern and avoids leaking the secret via logs.

Comment on lines +191 to 194
    qualifire_api_key_secret: SecretStr | None = None,
    deep_test_mode: bool = False,
    judge_model: str | None = None,
) -> str:

⚠️ Potential issue

Plumb job_id through create_report to enable Qualifire reporting

Without job_id, server-side reporting is skipped even if qualifire_api_key is provided. Add job_id to the function signature so it can be forwarded to the SDK call.

 async def create_report(
     rogue_server_url: str,
     judge_llm: str,
     results: EvaluationResults,
     output_report_file: Path,
     judge_llm_api_key_secret: SecretStr | None = None,
-    qualifire_api_key_secret: SecretStr | None = None,
-    deep_test_mode: bool = False,
-    judge_model: str | None = None,
+    qualifire_api_key_secret: SecretStr | None = None,
+    job_id: str | None = None,
+    deep_test_mode: bool = False,
+    judge_model: str | None = None,
 ) -> str:
📝 Committable suggestion


Suggested change
-    qualifire_api_key_secret: SecretStr | None = None,
-    deep_test_mode: bool = False,
-    judge_model: str | None = None,
-) -> str:
+async def create_report(
+    rogue_server_url: str,
+    judge_llm: str,
+    results: EvaluationResults,
+    output_report_file: Path,
+    judge_llm_api_key_secret: SecretStr | None = None,
+    qualifire_api_key_secret: SecretStr | None = None,
+    job_id: str | None = None,
+    deep_test_mode: bool = False,
+    judge_model: str | None = None,
+) -> str:
🤖 Prompt for AI Agents
In rogue/run_cli.py around lines 191 to 194, create_report currently lacks a
job_id parameter so Qualifire reporting is skipped; add a job_id: str | None =
None parameter to the function signature and propagate that job_id into the
downstream SDK call that submits the report (e.g., pass job_id=job_id when
calling the Qualifire client method). Also update any local calls to
create_report in this module to pass the new job_id where available, and adjust
type hints and tests/imports as needed.

Comment on lines +214 to 221
        summary, _ = await sdk.generate_summary(
            results=results,
            model=judge_llm,
            api_key=judge_llm_api_key,
            qualifire_api_key=qualifire_api_key,
            deep_test=deep_test_mode,
            judge_model=judge_model,
        )

⚠️ Potential issue

Forward job_id to SDK so BFF reporting actually happens

The SDK/client and server expect job_id for reporting; omitting it disables the new reporting path.

         summary, _ = await sdk.generate_summary(
             results=results,
             model=judge_llm,
             api_key=judge_llm_api_key,
             qualifire_api_key=qualifire_api_key,
+            job_id=job_id,
             deep_test=deep_test_mode,
             judge_model=judge_model,
         )
📝 Committable suggestion


Suggested change
-        summary, _ = await sdk.generate_summary(
-            results=results,
-            model=judge_llm,
-            api_key=judge_llm_api_key,
-            qualifire_api_key=qualifire_api_key,
-            deep_test=deep_test_mode,
-            judge_model=judge_model,
-        )
+        summary, _ = await sdk.generate_summary(
+            results=results,
+            model=judge_llm,
+            api_key=judge_llm_api_key,
+            qualifire_api_key=qualifire_api_key,
+            job_id=job_id,
+            deep_test=deep_test_mode,
+            judge_model=judge_model,
+        )
🤖 Prompt for AI Agents
In rogue/run_cli.py around lines 214 to 221, the call to sdk.generate_summary
omits the job_id parameter so the SDK/server reporting path is not triggered;
update the call to forward the current job_id (pass job_id=job_id) into
sdk.generate_summary alongside the existing args so the SDK can perform BFF
reporting. Ensure the job_id variable in scope is used and run tests to verify
reporting is enabled.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
sdks/python/rogue_sdk/types.py (1)

494-499: Fix HttpUrl validator (current code constructs HttpUrl incorrectly)

HttpUrl is a type, not a constructor. Use TypeAdapter to validate strings.

-from pydantic import (
+from pydantic import (
     BaseModel,
     ConfigDict,
     Field,
     HttpUrl,
+    TypeAdapter,
     field_validator,
     model_validator,
 )
@@
 @field_validator("base_url", mode="after")
 def validate_base_url(cls, v: str | HttpUrl) -> HttpUrl:
-        if isinstance(v, str):
-            return HttpUrl(v)
-        return v
+        if isinstance(v, str):
+            return TypeAdapter(HttpUrl).validate_python(v)
+        return v
rogue/server/api/llm.py (2)

48-52: Offload scenario generation to threadpool

LLMService is sync; run it off the event loop.

-        scenarios = LLMService.generate_scenarios(
-            model=request.model,
-            context=request.business_context,
-            llm_provider_api_key=request.api_key,
-        )
+        scenarios = await run_in_threadpool(
+            LLMService.generate_scenarios,
+            model=request.model,
+            context=request.business_context,
+            llm_provider_api_key=request.api_key,
+        )

91-95: Offload summary generation to threadpool

Prevents blocking with provider SDK calls.

-        summary = LLMService.generate_summary_from_results(
-            model=request.model,
-            results=request.results,
-            llm_provider_api_key=request.api_key,
-        )
+        summary = await run_in_threadpool(
+            LLMService.generate_summary_from_results,
+            model=request.model,
+            results=request.results,
+            llm_provider_api_key=request.api_key,
+        )
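FastAPI's run_in_threadpool and the stdlib's asyncio.to_thread have the same effect: synchronous work runs on a worker thread while the event loop stays responsive. A minimal sketch under that assumption (blocking_summary is a hypothetical stand-in for the provider call, not the project's API):

```python
import asyncio
import time


def blocking_summary(model: str) -> str:
    # Stand-in for a synchronous provider SDK call
    time.sleep(0.01)
    return f"summary from {model}"


async def handler() -> str:
    # Offloading keeps the event loop free to serve other requests,
    # the same effect run_in_threadpool gives a FastAPI route handler.
    return await asyncio.to_thread(blocking_summary, "judge-model")


result = asyncio.run(handler())
```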
♻️ Duplicate comments (10)
sdks/python/rogue_sdk/types.py (4)

263-275: Warn on timestamp parse failures (don’t silently fall back)

Silent fallback hides data issues. Log and continue.

-                    except (ValueError, AttributeError):
-                        timestamp = datetime.now(timezone.utc)
+                    except (ValueError, AttributeError) as e:
+                        logger.warning(
+                            "Failed to parse message timestamp; defaulting to now",
+                            extra={"timestamp": msg.timestamp, "error": str(e)},
+                        )
+                        timestamp = datetime.now(timezone.utc)
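The warn-and-fall-back pattern can be isolated in a small helper. A sketch with stdlib logging only (parse_timestamp is illustrative, not the SDK's actual function):

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger(__name__)


def parse_timestamp(raw):
    """Parse an ISO-8601 timestamp; warn loudly instead of failing silently."""
    try:
        ts = datetime.fromisoformat(raw)
        if ts.tzinfo is None:
            # Treat naive timestamps as UTC rather than guessing a local zone
            ts = ts.replace(tzinfo=timezone.utc)
        return ts
    except (ValueError, TypeError, AttributeError) as e:
        logger.warning("Failed to parse timestamp %r; defaulting to now: %s", raw, e)
        return datetime.now(timezone.utc)
```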

230-237: Align ApiScenarioResult with server: add expectedOutcome

Field is present server-side; omitting it drops data.

 class ApiScenarioResult(BaseModel):
     """Result of evaluating a single scenario in new API format."""

     description: Optional[str] = None
+    expectedOutcome: Optional[str] = None
     totalConversations: Optional[int] = None
     flaggedConversations: Optional[int] = None
     conversations: List[ApiConversationEvaluation]

239-244: Add summary/metadata fields to ApiEvaluationResult

These are required by the server contract and downstream consumers.

 class ApiEvaluationResult(BaseModel):
     """New API format for evaluation results."""

     scenarios: List[ApiScenarioResult]
+    summary: Optional[str] = None
+    keyFindings: Optional[str] = None
+    recommendation: Optional[str] = None
+    deepTest: bool = False
+    startTime: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
+    judgeModel: Optional[str] = None

245-304: Extend converter signature and map structured summary + metadata

Keep SDK converter in sync with server’s convert_to_api_format behavior.

-def convert_to_api_format(evaluation_results: EvaluationResults) -> ApiEvaluationResult:
+def convert_to_api_format(
+    evaluation_results: EvaluationResults,
+    structured_summary: Optional["StructuredSummary"] = None,
+    deep_test: bool = False,
+    start_time: Optional[datetime] = None,
+    judge_model: Optional[str] = None,
+) -> ApiEvaluationResult:
@@
-    api_scenarios = []
+    if start_time is None:
+        start_time = datetime.now(timezone.utc)
+    api_scenarios: List[ApiScenarioResult] = []
@@
-        api_scenarios.append(
-            ApiScenarioResult(
-                description=result.scenario.scenario,
-                totalConversations=len(api_conversations),
-                flaggedConversations=len(
-                    [c for c in api_conversations if not c.passed],
-                ),
-                conversations=api_conversations,
-            ),
-        )
+        api_scenarios.append(
+            ApiScenarioResult(
+                description=result.scenario.scenario,
+                expectedOutcome=result.scenario.expected_outcome,
+                totalConversations=len(api_conversations),
+                flaggedConversations=sum(1 for c in api_conversations if not c.passed),
+                conversations=api_conversations,
+            )
+        )
@@
-    return ApiEvaluationResult(scenarios=api_scenarios)
+    summary_text = None
+    key_findings_text = None
+    recommendation_text = None
+    if structured_summary:
+        summary_text = structured_summary.overall_summary
+        key_findings_text = "\n".join(f"• {f}" for f in structured_summary.key_findings)
+        recommendation_text = "\n".join(f"• {r}" for r in structured_summary.recommendations)
+
+    return ApiEvaluationResult(
+        scenarios=api_scenarios,
+        summary=summary_text,
+        keyFindings=key_findings_text,
+        recommendation=recommendation_text,
+        deepTest=deep_test,
+        startTime=start_time,
+        judgeModel=judge_model,
+    )
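The bullet-joining used above for keyFindings and recommendation is a plain newline join. As a standalone sketch (the finding strings are made up for illustration):

```python
def join_bullets(items):
    """Render a list of findings as newline-separated bullet lines."""
    return "\n".join(f"• {item}" for item in items)


findings = join_bullets(
    ["Agent leaked its system prompt", "Refunds issued without verification"]
)
# An empty list yields an empty string, which maps cleanly to an optional field.
```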
rogue/server/services/qualifire_service.py (1)

26-39: Harden network call and error handling; avoid json() on error bodies

Wrap requests.post, gate header on presence, and prefer response.text on failures.

-        response = requests.post(
-            f"{request.qualifire_url}/api/rogue/v1/report",
-            headers={"X-qualifire-key": request.qualifire_api_key},
-            json=api_evaluation_result.model_dump(mode="json"),
-            timeout=300,
-        )
-
-        if not response.ok:
-            logger.error(
-                "Failed to report summary to Qualifire",
-                extra={"response": response.json()},
-            )
-            raise Exception(f"Failed to report summary to Qualifire: {response.json()}")
+        try:
+            headers = {}
+            if request.qualifire_api_key:
+                headers["X-Qualifire-Key"] = request.qualifire_api_key
+            response = requests.post(
+                f"{request.qualifire_url}/api/rogue/v1/report",
+                headers=headers,
+                json=api_evaluation_result.model_dump(mode="json"),
+                timeout=300,
+            )
+        except requests.RequestException:
+            logger.exception("Network error reporting summary to Qualifire")
+            raise
+
+        if not response.ok:
+            try:
+                body = response.json()
+            except Exception:
+                body = response.text
+            logger.error(
+                "Failed to report summary to Qualifire",
+                extra={"status": response.status_code, "response": body},
+            )
+            raise Exception(f"Failed to report summary to Qualifire: {body}")
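The prefer-text-on-failure idea is easiest to see in isolation. In this sketch FakeResponse is a minimal stand-in for requests.Response (only .json() and .text are assumed):

```python
import json


def safe_error_body(response):
    """Return the parsed JSON body if possible, else the raw text."""
    try:
        return response.json()
    except Exception:
        # Error pages are often HTML or plain text, not JSON
        return response.text


class FakeResponse:
    # Minimal stand-in for requests.Response for this sketch
    def __init__(self, text):
        self.text = text

    def json(self):
        return json.loads(self.text)


json_body = safe_error_body(FakeResponse('{"error": "bad key"}'))
html_body = safe_error_body(FakeResponse("<html>502 Bad Gateway</html>"))
```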
rogue/server/api/llm.py (5)

181-185: Results emptiness check is correct now

Using the underlying list avoids the earlier len(EvaluationResults) bug.


99-102: Do not log secrets

Log only presence, not the raw API key.

-        logger.info(
-            "Qualifire API key",
-            extra={"qualifire_api_key": request.qualifire_api_key},
-        )
+        logger.info("Qualifire API key provided", extra={"present": bool(request.qualifire_api_key)})
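A small helper makes the presence-only convention reusable across handlers (log_key_presence is illustrative, not existing project code):

```python
import logging

logger = logging.getLogger(__name__)


def log_key_presence(name, api_key):
    """Log only whether a secret was supplied, never its value."""
    present = bool(api_key)
    logger.info("%s provided", name, extra={"present": present})
    return present
```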

7-8: Import run_in_threadpool for sync I/O offloading

Avoid blocking the event loop.

-from fastapi import APIRouter, Depends, HTTPException
+from fastapi import APIRouter, Depends, HTTPException
+from fastapi.concurrency import run_in_threadpool

187-198: Offload Qualifire reporting in /report_summary to threadpool

Avoid blocking the request handler.

-        QualifireService.report_summary(
+        await run_in_threadpool(
+            QualifireService.report_summary,
             ReportSummaryRequest(
@@
-        )
+            evaluation_results=EvaluationResults(results=results),
+        )

132-147: Offload Qualifire reporting to threadpool

requests is sync; avoid blocking.

-            QualifireService.report_summary(
+            await run_in_threadpool(
+                QualifireService.report_summary,
                 ReportSummaryRequest(
@@
-            )
+                evaluation_results=request.results,
+            )
🧹 Nitpick comments (6)
sdks/python/rogue_sdk/types.py (3)

255-255: Type annotate local list for mypy clarity

Minor readability and typing aid.

-    api_scenarios = []
+    api_scenarios: List[ApiScenarioResult] = []

444-451: Avoid duplicating StructuredSummary across packages

Prefer a single definition (server models vs SDK). If duplication is required, add explicit JSON contract tests to ensure parity.


80-81: Mark API keys as secrets

Use SecretStr to avoid accidental logging/serialization of raw keys.

-    qualifire_api_key: Optional[str] = None
+    qualifire_api_key: Optional[SecretStr] = None

Note: add `from pydantic import SecretStr` to the module-level imports, and adjust get_auth_header callers to use .get_secret_value() when needed.

rogue/server/services/qualifire_service.py (1)

10-13: Add return type annotation

Clarify return type for mypy/IDE tooling.

-    def report_summary(
+    def report_summary(
         request: ReportSummaryRequest,
         evaluation_results: EvaluationResults,
-    ):
+    ) -> dict:
rogue/server/api/llm.py (2)

127-131: Reduce verbose/PII-risk logging

Avoid dumping full results/summary objects in logs; log sizes instead.

-        logger.info(
-            "Summary",
-            extra={"summary": summary, "results": request.results},
-        )
+        logger.info(
+            "Summary generated",
+            extra={
+                "overall_summary_len": len(getattr(summary, "overall_summary", "") or ""),
+                "results_count": len(request.results.results),
+            },
+        )

154-159: Preserve HTTPException status; only wrap unknown errors

Current code returns 500 even for deliberate HTTPException.

-    except Exception as e:
+    except HTTPException:
+        raise
+    except Exception as e:
         logger.exception("Failed to generate summary")
         raise HTTPException(
             status_code=500,
             detail=f"Failed to generate summary: {str(e)}",
         )
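The pass-through pattern, sketched without FastAPI (the HTTPException class below is a minimal stand-in for fastapi.HTTPException, and run_endpoint is illustrative):

```python
class HTTPException(Exception):
    """Minimal stand-in for fastapi.HTTPException in this sketch."""

    def __init__(self, status_code, detail):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def run_endpoint(work):
    try:
        return work()
    except HTTPException:
        # Deliberate HTTP errors keep their original status code
        raise
    except Exception as e:
        # Only unexpected failures are wrapped as 500s
        raise HTTPException(status_code=500, detail=f"Failed to generate summary: {e}")
```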
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4d18d18 and de3af0b.

📒 Files selected for processing (3)
  • rogue/server/api/llm.py (4 hunks)
  • rogue/server/services/qualifire_service.py (1 hunks)
  • sdks/python/rogue_sdk/types.py (8 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
{rogue,tests,examples}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

{rogue,tests,examples}/**/*.py: Format Python code with Black
Organize imports according to isort conventions
Use type hints for all function signatures
Follow PEP 8 naming (snake_case for variables and functions, PascalCase for classes)
Use try/except blocks for code that may raise exceptions
Ensure code passes flake8 linting
Ensure code passes mypy type checks per .mypy.ini
Address Bandit-reported security issues in Python code

Files:

  • rogue/server/services/qualifire_service.py
  • rogue/server/api/llm.py
rogue/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Place application/package code under the rogue/ directory

Files:

  • rogue/server/services/qualifire_service.py
  • rogue/server/api/llm.py
🧬 Code graph analysis (3)
rogue/server/services/qualifire_service.py (2)
rogue/server/services/api_format_service.py (1)
  • convert_with_structured_summary (116-141)
sdks/python/rogue_sdk/types.py (2)
  • EvaluationResults (192-208)
  • ReportSummaryRequest (501-510)
sdks/python/rogue_sdk/types.py (3)
rogue/server/models/api_format.py (4)
  • ApiChatMessage (14-19)
  • ApiConversationEvaluation (22-27)
  • ApiScenarioResult (30-37)
  • ApiEvaluationResult (40-49)
rogue/server/services/api_format_service.py (1)
  • convert_to_api_format (20-113)
packages/tui/internal/tui/evaluation.go (1)
  • StructuredSummary (96-105)
rogue/server/api/llm.py (7)
sdks/python/rogue_sdk/types.py (4)
  • EvaluationResults (192-208)
  • SummaryGenerationRequest (431-441)
  • ReportSummaryResponse (513-516)
  • ReportSummaryRequest (501-510)
rogue/server/api/evaluation.py (1)
  • get_evaluation_service (23-24)
rogue/server/services/evaluation_service.py (2)
  • EvaluationService (20-234)
  • get_job (31-33)
rogue/server/models/api_format.py (1)
  • ServerSummaryGenerationResponse (52-56)
rogue/server/services/llm_service.py (1)
  • LLMService (166-283)
rogue/server/services/qualifire_service.py (2)
  • QualifireService (8-40)
  • report_summary (10-40)
sdks/python/rogue_sdk/client.py (2)
  • generate_summary (147-194)
  • request (79-83)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: codestyle
🔇 Additional comments (3)
sdks/python/rogue_sdk/types.py (3)

389-391: LGTM: job metadata fields

deep_test and judge_model on EvaluationJob align with reporting requirements.


501-511: LGTM: ReportSummaryRequest shape

Matches usage in routes/services.


513-517: LGTM: ReportSummaryResponse

Minimal, sufficient for API.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/tui/internal/tui/eval_ui.go (1)

119-120: Update summary check to include StructuredSummary.

The code only checks the plain Summary field when determining if a summary already exists, but should also consider StructuredSummary.

 // Don't regenerate if we already have a summary
-if m.evalState.Summary != "" {
+if m.evalState.Summary != "" || m.evalState.StructuredSummary.OverallSummary != "" {
     return
 }
🧹 Nitpick comments (2)
packages/tui/internal/tui/eval_ui.go (2)

27-31: Consider consolidating summary fields.

Having both Summary (string) and StructuredSummary fields could lead to confusion about which one to use. Consider deprecating the plain Summary field in favor of the structured version, or clearly document when each should be used.
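
One way to reduce that ambiguity while keeping both fields is a single accessor that prefers the structured summary and falls back to the legacy string. This is a minimal sketch, not the package's actual API: `DisplaySummary` is a hypothetical helper name, and `StructuredSummary` is trimmed to one field for brevity.

```go
package main

import "fmt"

// StructuredSummary mirrors the shape discussed in the review; fields
// beyond OverallSummary are omitted here for brevity.
type StructuredSummary struct {
	OverallSummary string
}

// EvalState keeps both the legacy string and the structured summary.
type EvalState struct {
	Summary           string
	StructuredSummary StructuredSummary
}

// DisplaySummary is a hypothetical accessor: callers read summaries
// through it, so the precedence rule lives in exactly one place.
func (s EvalState) DisplaySummary() string {
	if s.StructuredSummary.OverallSummary != "" {
		return s.StructuredSummary.OverallSummary
	}
	return s.Summary
}

func main() {
	st := EvalState{Summary: "legacy text"}
	fmt.Println(st.DisplaySummary()) // prints "legacy text"

	st.StructuredSummary.OverallSummary = "structured text"
	fmt.Println(st.DisplaySummary()) // prints "structured text"
}
```

Centralizing the fallback like this would also make it easy to later deprecate the plain `Summary` field without touching every call site.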


47-49: Consider using the EvalScenario struct directly for unmarshaling.

The anonymous struct duplicates the field definitions from EvalScenario. You could simplify by using the actual type.

 var v struct {
-    Scenarios []struct {
-        Scenario        string `json:"scenario"`
-        ScenarioType    string `json:"scenario_type"`
-        ExpectedOutcome string `json:"expected_outcome"`
-    } `json:"scenarios"`
+    Scenarios []json.RawMessage `json:"scenarios"`
 }
 if json.Unmarshal(b, &v) == nil {
     out := make([]EvalScenario, 0, len(v.Scenarios))
-    for _, s := range v.Scenarios {
-        if s.Scenario != "" {
-            out = append(out, EvalScenario{
-                Scenario:        s.Scenario,
-                ScenarioType:    ScenarioType(s.ScenarioType),
-                ExpectedOutcome: s.ExpectedOutcome,
-            })
+    for _, raw := range v.Scenarios {
+        var s EvalScenario
+        if err := json.Unmarshal(raw, &s); err == nil && s.Scenario != "" {
+            out = append(out, s)
         }
     }
     return out
 }

Note: This approach would require adjusting the JSON tags in EvalScenario to handle the scenario_type field as a string during unmarshaling, potentially with a custom UnmarshalJSON method.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between de3af0b and 1e5c5a0.

📒 Files selected for processing (1)
  • packages/tui/internal/tui/eval_ui.go (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
packages/tui/internal/tui/eval_ui.go (1)
packages/tui/internal/tui/evaluation.go (3)
  • EvalScenario (44-48)
  • StructuredSummary (96-105)
  • ScenarioType (27-27)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: codestyle
🔇 Additional comments (2)
packages/tui/internal/tui/eval_ui.go (2)

17-17: LGTM! Clean transition to structured scenarios.

The change from []string to []EvalScenario improves type safety and enables richer scenario metadata with expected outcomes.


77-77: Resolved — StartEvaluation signature and callers are up to date.
StartEvaluation (packages/tui/internal/tui/evaluation.go) accepts []EvalScenario and the TUI caller in packages/tui/internal/tui/eval_ui.go passes st.Scenarios; no action required.

Comment on lines +39 to 63
 func loadScenariosFromWorkdir() []EvalScenario {
 	wd, _ := os.Getwd()
 	dir := wd
 	for {
 		p := filepath.Join(dir, ".rogue", "scenarios.json")
 		if b, err := os.ReadFile(p); err == nil {
 			var v struct {
 				Scenarios []struct {
-					Scenario string `json:"scenario"`
+					Scenario        string `json:"scenario"`
+					ScenarioType    string `json:"scenario_type"`
+					ExpectedOutcome string `json:"expected_outcome"`
 				} `json:"scenarios"`
 			}
 			if json.Unmarshal(b, &v) == nil {
-				out := make([]string, 0, len(v.Scenarios))
+				out := make([]EvalScenario, 0, len(v.Scenarios))
 				for _, s := range v.Scenarios {
 					if s.Scenario != "" {
-						out = append(out, s.Scenario)
+						out = append(out, EvalScenario{
+							Scenario:        s.Scenario,
+							ScenarioType:    ScenarioType(s.ScenarioType),
+							ExpectedOutcome: s.ExpectedOutcome,
+						})
 					}
 				}
 				return out

🛠️ Refactor suggestion

Add validation for ScenarioType conversion.

The code directly casts s.ScenarioType string to ScenarioType without validation. If the JSON contains an invalid scenario type, this could lead to unexpected behavior.

Add validation when converting the scenario type:

 for _, s := range v.Scenarios {
     if s.Scenario != "" {
+        // Validate scenario type
+        scenarioType := ScenarioType(s.ScenarioType)
+        // Consider adding a validation method like scenarioType.IsValid()
+        // or at least document valid values
         out = append(out, EvalScenario{
             Scenario:        s.Scenario,
-            ScenarioType:    ScenarioType(s.ScenarioType),
+            ScenarioType:    scenarioType,
             ExpectedOutcome: s.ExpectedOutcome,
         })
     }
 }

Consider implementing a validation method for ScenarioType in the evaluation.go file to ensure only valid scenario types are accepted.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

func loadScenariosFromWorkdir() []EvalScenario {
	wd, _ := os.Getwd()
	dir := wd
	for {
		p := filepath.Join(dir, ".rogue", "scenarios.json")
		if b, err := os.ReadFile(p); err == nil {
			var v struct {
				Scenarios []struct {
					Scenario        string `json:"scenario"`
					ScenarioType    string `json:"scenario_type"`
					ExpectedOutcome string `json:"expected_outcome"`
				} `json:"scenarios"`
			}
			if json.Unmarshal(b, &v) == nil {
				out := make([]EvalScenario, 0, len(v.Scenarios))
				for _, s := range v.Scenarios {
					if s.Scenario != "" {
						// Validate scenario type
						scenarioType := ScenarioType(s.ScenarioType)
						// Consider adding a validation method like scenarioType.IsValid()
						// or at least document valid values
						out = append(out, EvalScenario{
							Scenario:        s.Scenario,
							ScenarioType:    scenarioType,
							ExpectedOutcome: s.ExpectedOutcome,
						})
					}
				}
				return out
🤖 Prompt for AI Agents
In packages/tui/internal/tui/eval_ui.go around lines 39 to 63, the code blindly
casts s.ScenarioType string to ScenarioType which can create invalid values if
JSON contains unexpected strings; update the code to validate the conversion by
adding/using a parser/validator (e.g., ParseScenarioType or
ScenarioType.IsValid) in evaluation.go that returns an error or bool and then
call it here, only append the EvalScenario when the parsed ScenarioType is valid
(otherwise skip the entry or handle/log the error) so that only recognized
scenario types are accepted.

@drorIvry drorIvry merged commit d060023 into main Sep 11, 2025
7 checks passed
@drorIvry drorIvry deleted the feature/FIRE-751-rogue-report-to-bff branch September 11, 2025 13:02
@coderabbitai coderabbitai bot mentioned this pull request Nov 10, 2025
21 tasks
This was referenced Jan 1, 2026
3 participants