feature/cli report results FIRE 778 #47

Merged
drorIvry merged 14 commits into main from feature/cli-report-results-FIRE-778 on Sep 22, 2025

Conversation

drorIvry (Contributor) commented Sep 18, 2025

  • rename rogue
  • wip

Summary by CodeRabbit

  • New Features

    • CLI now supports a Qualifire API key option, with a warning shown if omitted to indicate results won’t be persisted.
    • Default judge model updated to openai/gpt-5.
    • Python SDK exposes additional summary-related types for easier integration.
  • Documentation

    • Expanded README with hero image, status badge, reorganized Quick Start, detailed mode arguments and usage examples, config file guidance, and a comprehensive example workflow.
  • Chores

    • Python SDK version bumped to 0.1.3.
    • CI updated to force-reinstall SDK and streamline installation steps.

coderabbitai bot (Contributor) commented Sep 18, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds Qualifire API key support and job_id propagation to CLI reporting flow, updates CLI input model, exports new SDK summary types, bumps Python SDK version, adjusts CI install commands, and expands README documentation with detailed mode arguments, config, and examples.

Changes

Cohort / File(s): Summary

• Documentation: README.md
  Adds header image and tests badge; expands Quick Start; introduces detailed Mode Arguments with code blocks; documents new CLI flags; adds Config File section and Notes; enriches Examples and CLI workflows; formatting tweaks.
• CLI pipeline updates: rogue/run_cli.py
  Adds --qualifire-api-key; threads qualifire_api_key_secret and an optional job_id through run_scenarios, _run_scenarios_with_sdk, and create_report; updates return types to Tuple[EvaluationResults | None, str | None]; changes the default judge model to openai/gpt-5; warns when the API key is missing; refactors summary generation to use an early-derived key.
• CLI input model: rogue/models/cli_input.py
  Adds optional field qualifire_api_key: SecretStr | None to PartialCLIInput.
• Python SDK public exports: sdks/python/rogue_sdk/__init__.py
  Exposes new summary-related symbols in __all__: StructuredSummary, SummaryGenerationRequest, SummaryGenerationResponse, ReportSummaryResponse, ReportSummaryRequest.
• Python SDK packaging: sdks/python/pyproject.toml
  Bumps version from 0.1.0 to 0.1.3.
• CI workflows: .github/workflows/rogue.yml, .github/workflows/test.yml
  Ensures the SDK install uses --force-reinstall; updates the server install to uv sync --dev; whitespace tweak before the run step.
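
To make the new plumbing concrete, here is a minimal sketch of the reporting flow (names mirror the walkthrough and the call site reviewed below; the run_scenarios parameters are assumptions, not verified against the repo):

# Hypothetical glue code illustrating the new (results, job_id) tuple return
# and the qualifire_api_key_secret threading described above.
async def run_and_report(args, cli_input) -> int:
    results, job_id = await run_scenarios(args, cli_input)  # assumed signature
    if results is None:
        return 1
    await create_report(
        rogue_server_url=args.rogue_server_url,
        judge_llm=cli_input.judge_llm,
        results=results,
        output_report_file=cli_input.output_report_file,
        judge_llm_api_key_secret=cli_input.judge_llm_api_key,
        qualifire_api_key_secret=cli_input.qualifire_api_key,  # new optional key
        deep_test_mode=cli_input.deep_test_mode,
        judge_model=cli_input.judge_llm,
    )
    return 0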

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor U as User
  participant CLI as rogue run_cli
  participant SDK as Python SDK
  participant S as Rogue Server
  participant Q as Qualifire Service

  U->>CLI: Run CLI with options (may include --qualifire-api-key)
  CLI->>SDK: run_scenarios(...)
  SDK->>S: Execute scenarios
  S-->>SDK: EvaluationResults + job_id?
  SDK-->>CLI: Results, job_id (optional)

  alt qualifire_api_key provided
    CLI->>SDK: generate_summary(results, job_id, api_key)
    SDK->>Q: Persist/report summary
    Q-->>SDK: Summary/ACK
    SDK-->>CLI: Summary created
  else no API key
    CLI-->>U: Warn: persistence will not occur
    CLI->>SDK: generate_summary(results, job_id, api_key=None)
    SDK-->>CLI: Local summary only
  end

  CLI-->>U: Output report path / summary info

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • yuval-qf
  • osher-qualifire

Poem

I thump my paws on terminal ground,
A key of Qualifire now newly found.
Job IDs hop along the trail,
Reports persist without a fail.
Docs grown tall like carrot rows—
CI burrows cleanly flows.
Happy hops—the summary glows! 🥕🐇

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
• Description Check: ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
• Title Check: ✅ Passed. The title "feature/cli report results FIRE 778" directly references CLI report results and matches the PR's primary changes around CLI reporting and related flags, so it is relevant and informative to reviewers. It does include branch-prefix noise ("feature/") and a ticket identifier ("FIRE 778"), which are acceptable but slightly reduce clarity when reading the changelog or history.
• Docstring Coverage: ✅ Passed. No functions found in the changes; docstring coverage check skipped.
✨ Finishing touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/cli-report-results-FIRE-778

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 054dbd9 and f03c277.

📒 Files selected for processing (1)
  • .github/workflows/test.yml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/test.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: codestyle

Comment @coderabbitai help to get the list of available commands and usage tips.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
rogue/models/cli_input.py (1)

12-12: Defaults capture workdir at class definition time (won’t follow a custom workdir).

input_scenarios_file = workdir / ... and output_report_file = workdir / ... are bound once at import, not per-instance. If a caller sets workdir, these paths won’t update.

Apply this diff to make them instance-derived:

-    input_scenarios_file: Path = workdir / "scenarios.json"
-    output_report_file: Path = workdir / "report.md"
+    input_scenarios_file: Path | None = None
+    output_report_file: Path | None = None

Add a per-instance initializer to CLIInput (outside the selected lines):

from typing import Any  # add to imports

def model_post_init(self, __context: Any) -> None:
    if self.input_scenarios_file is None:
        self.input_scenarios_file = self.workdir / "scenarios.json"
    if self.output_report_file is None:
        self.output_report_file = self.workdir / "report.md"

Also applies to: 18-19
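
For illustration, a minimal repro of the pitfall (a sketch; the model shape mirrors the description above rather than the exact file):

from pathlib import Path

from pydantic import BaseModel

workdir = Path(".rogue")

class CLIInput(BaseModel):
    workdir: Path = workdir
    # Computed once when the class body executes, not per instance:
    input_scenarios_file: Path = workdir / "scenarios.json"

inp = CLIInput(workdir=Path("/tmp/custom"))
print(inp.input_scenarios_file)  # .rogue/scenarios.json (the custom workdir is ignored)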

rogue/run_cli.py (1)

377-385: Pass qualifire_api_key_secret to create_report

The create_report function signature was modified to accept qualifire_api_key_secret, but the call site doesn't pass this parameter. This means the Qualifire API key will always be None even if the user provides it via CLI.

Apply this diff to pass the qualifire_api_key_secret parameter:

     logger.info("Creating report")
     report_summary = await create_report(
         rogue_server_url=args.rogue_server_url,
         judge_llm=cli_input.judge_llm,
         results=results,
         output_report_file=cli_input.output_report_file,
         judge_llm_api_key_secret=cli_input.judge_llm_api_key,
+        qualifire_api_key_secret=cli_input.qualifire_api_key,
         deep_test_mode=cli_input.deep_test_mode,
         judge_model=cli_input.judge_llm,
     )
README.md (1)

362-370: Documentation lists non-existent OpenAI models

The README lists several OpenAI models that don't exist (gpt-5, gpt-5-mini, gpt-5-nano, gpt-4.1, etc.). This could confuse users trying to configure the tool.

Update the model list to reflect actual available OpenAI models. The currently available models should be:

  • gpt-4o
  • gpt-4o-mini
  • gpt-4-turbo
  • gpt-3.5-turbo

Remove the non-existent model names to avoid user confusion.

🧹 Nitpick comments (5)
rogue/models/cli_input.py (2)

24-35: Harden file I/O and validation error handling.

read_text() and JSON validation can fail (permissions, encoding, invalid JSON). Wrap with try/except and return actionable errors.

Apply this diff:

-        return Scenarios.model_validate_json(self.input_scenarios_file.read_text())
+        try:
+            content = self.input_scenarios_file.read_text(encoding="utf-8")
+        except OSError as e:
+            raise ValueError(
+                f"Failed to read input scenarios file: {self.input_scenarios_file}"
+            ) from e
+        try:
+            return Scenarios.model_validate_json(content)
+        except Exception as e:
+            raise ValueError(
+                f"Invalid or un-parseable JSON in scenarios file: {self.input_scenarios_file}"
+            ) from e

61-67: Type-hint cleanup: remove type: ignore and annotate model_post_init.

Keep mypy happy and align with guidelines.

Apply this diff:

-    business_context_file: Path = None  # type: ignore # fixed in model_post_init
-    input_scenarios_file: Path = None  # type: ignore # fixed in model_post_init
-    output_report_file: Path = None  # type: ignore # fixed in model_post_init
+    business_context_file: Path | None = None  # fixed in model_post_init
+    input_scenarios_file: Path | None = None  # fixed in model_post_init
+    output_report_file: Path | None = None  # fixed in model_post_init
@@
-    def model_post_init(self, __context):
+    def model_post_init(self, __context: Any) -> None:

Note: ensure from typing import Any is imported once at the top (module-level).

rogue/run_cli.py (1)

209-213: Remove duplicate qualifire_api_key extraction

The qualifire_api_key is extracted twice from qualifire_api_key_secret: once at lines 209-213 and again at lines 223-227. The second extraction inside the try block is redundant since the value was already extracted before the try block.

Apply this diff to remove the duplicate extraction:

 async def create_report(
     rogue_server_url: str,
     judge_llm: str,
     results: EvaluationResults,
     output_report_file: Path,
     judge_llm_api_key_secret: SecretStr | None = None,
     qualifire_api_key_secret: SecretStr | None = None,
     deep_test_mode: bool = False,
     judge_model: str | None = None,
 ) -> str:
     judge_llm_api_key = (
         judge_llm_api_key_secret.get_secret_value()
         if judge_llm_api_key_secret
         else None
     )
 
     qualifire_api_key = (
         qualifire_api_key_secret.get_secret_value()
         if qualifire_api_key_secret
         else None
     )
 
     # Use SDK for summary generation (server-based)
     sdk_config = RogueClientConfig(
         base_url=rogue_server_url,
         timeout=600.0,
     )
     sdk = RogueSDK(sdk_config)
 
     try:
-        qualifire_api_key = (
-            qualifire_api_key_secret.get_secret_value()
-            if qualifire_api_key_secret
-            else None
-        )
         summary, _ = await sdk.generate_summary(
             results=results,
             model=judge_llm,
             api_key=judge_llm_api_key,
             qualifire_api_key=qualifire_api_key,
             deep_test=deep_test_mode,
             judge_model=judge_model,
         )

Also applies to: 223-227

README.md (2)

3-3: Add alt text to the tracking pixel image

The image should have alternate text for accessibility, even if it's a tracking pixel.

Apply this diff:

-![](https://pixel.qualifire.ai/api/record/rogue)
+![Qualifire Analytics](https://pixel.qualifire.ai/api/record/rogue)

137-137: Wrap bare URL in angle brackets

Bare URLs should be wrapped in angle brackets for better Markdown compatibility.

Apply this diff:

-- `--rogue-server-url URL` - Rogue server URL (default: http://localhost:8000)
+- `--rogue-server-url URL` - Rogue server URL (default: <http://localhost:8000>)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc4beb9 and 0d34ed0.

📒 Files selected for processing (3)
  • README.md (4 hunks)
  • rogue/models/cli_input.py (1 hunks)
  • rogue/run_cli.py (4 hunks)
🔥 Files not summarized due to errors (1)
  • rogue/models/cli_input.py: Error: Server error: no LLM provider could handle the message
🧰 Additional context used
📓 Path-based instructions (2)
{rogue,tests,examples}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

{rogue,tests,examples}/**/*.py: Format Python code with Black
Organize imports according to isort conventions
Use type hints for all function signatures
Follow PEP 8 naming (snake_case for variables and functions, PascalCase for classes)
Use try/except blocks for code that may raise exceptions
Ensure code passes flake8 linting
Ensure code passes mypy type checks per .mypy.ini
Address Bandit-reported security issues in Python code

Files:

  • rogue/models/cli_input.py
  • rogue/run_cli.py
rogue/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Place application/package code under the rogue/ directory

Files:

  • rogue/models/cli_input.py
  • rogue/run_cli.py
🪛 markdownlint-cli2 (0.17.2)
README.md

3-3: Images should have alternate text (alt text)

(MD045, no-alt-text)


137-137: Bare URL used

(MD034, no-bare-urls)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: codestyle

     # Set defaults for required fields that are missing
     if data.get("judge_llm") is None:
-        data["judge_llm"] = "openai/o4-mini"
+        data["judge_llm"] = "openai/gpt-5"
coderabbitai bot (Contributor) commented

⚠️ Potential issue

🧩 Analysis chain

Verify that the openai/gpt-5 model exists

The default judge LLM model has been changed to openai/gpt-5. This model doesn't appear to be a valid OpenAI model name (as of my knowledge cutoff). This could cause runtime failures for users who don't explicitly specify a judge LLM model.

Consider using a valid model like openai/gpt-4o or openai/gpt-4o-mini as the default:

-        data["judge_llm"] = "openai/gpt-5"
+        data["judge_llm"] = "openai/gpt-4o-mini"

🌐 Web query:

What are the current valid OpenAI GPT model names available through their API?

💡 Result:

Short answer — common GPT models currently offered in the OpenAI API (families and typical variants):

  • GPT-4 family: gpt-4, gpt-4-turbo. [1]
  • GPT-4o family: gpt-4o, gpt-4o-mini (and realtime/voice/vision variants where available). [1][2]
  • GPT-4.1 family: gpt-4.1, gpt-4.1-mini (newer 4.1 releases). [4]
  • GPT-3.5 family: gpt-3.5-turbo and its context-window variants (e.g., 16k/4k variants). [1]

Notes: OpenAI frequently adds/removes model snapshots and introduces variant names (mini, turbo, realtime, nano, etc.), so check the official API models list or the OpenAI docs/models endpoint for the authoritative, up‑to‑date names before integrating. [1][4]

Sources:
[1] OpenAI Help Center — How can I access GPT-4, GPT-4 Turbo, GPT-4o, and GPT-4o mini?
[2] OpenAI blog — GPT‑4o mini announcement.
[4] OpenAI Help Center / Model release notes — GPT‑4.1 & related releases.


Replace invalid default judge LLM "openai/gpt-5" with a supported OpenAI model

openai/gpt-5 is not a recognized OpenAI model and will cause runtime failures for users who don't override the default.

File: rogue/run_cli.py, line 290

-        data["judge_llm"] = "openai/gpt-5"
+        data["judge_llm"] = "openai/gpt-4o-mini"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-        data["judge_llm"] = "openai/gpt-5"
+        data["judge_llm"] = "openai/gpt-4o-mini"
🤖 Prompt for AI Agents
In rogue/run_cli.py around line 290, the default judge LLM is set to the invalid
model string "openai/gpt-5"; replace it with a supported OpenAI model identifier
(for example "openai/gpt-4") so the default will work at runtime, and ensure any
related tests or documentation that reference the old default are updated
accordingly.

Comment on lines 392 to 397

    if cli_input.qualifire_api_key is None:
        console.print(
            """[red]Qualifire API key is not set. This report will not persist, it if highly recommended set it using the --qualifire-api-key flag or the QUALIFIRE_API_KEY environment variable.[/red] """,  # noqa: E501
        )
        return 1

coderabbitai bot (Contributor) commented

⚠️ Potential issue

Consider making the exit code configurable or warning-only

Returning exit code 1 when the Qualifire API key is missing might break existing CI/CD pipelines where users don't require report persistence. While the warning is helpful, forcing a non-zero exit code for a non-critical feature might be disruptive.

Consider one of these alternatives:

  1. Keep the warning but return 0 (success) if the evaluation itself passed
  2. Add a CLI flag to control whether missing Qualifire key should fail the process
  3. Only fail if explicitly required via configuration

Example of option 1:

     if cli_input.qualifire_api_key is None:
         console.print(
             """[red]Qualifire API key is not set. This report will not persist, it if highly recommended set it using the --qualifire-api-key flag or the QUALIFIRE_API_KEY environment variable.[/red] """,  # noqa: E501
         )
-        return 1
 
     return get_exit_code(results)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    if cli_input.qualifire_api_key is None:
-        console.print(
-            """[red]Qualifire API key is not set. This report will not persist, it if highly recommended set it using the --qualifire-api-key flag or the QUALIFIRE_API_KEY environment variable.[/red] """,  # noqa: E501
-        )
-        return 1
+    if cli_input.qualifire_api_key is None:
+        console.print(
+            """[red]Qualifire API key is not set. This report will not persist, it if highly recommended set it using the --qualifire-api-key flag or the QUALIFIRE_API_KEY environment variable.[/red] """,  # noqa: E501
+        )

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
rogue/run_cli.py (4)

34-39: Fix argparse Enum type/choices mismatch (runtime parse bug)

type=AuthType returns an Enum but choices=[e.value for e in AuthType] are strings, so valid inputs will be rejected.

-    parser.add_argument(
-        "--evaluated-agent-auth-type",
-        required=False,
-        type=AuthType,
-        choices=[e.value for e in AuthType],
+    parser.add_argument(
+        "--evaluated-agent-auth-type",
+        required=False,
+        type=str,
+        choices=[e.value for e in AuthType],
         help="How to authenticate with the evaluated agent (if needed)."
         f"Valid options are: {[e.value for e in AuthType]}",
     )
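
To see why this fails at parse time: argparse converts the raw string with type first, then tests the converted value against choices, so an Enum member gets compared with plain strings. A minimal repro (this AuthType is a stand-in for the real enum):

import argparse
from enum import Enum

class AuthType(Enum):  # stand-in for the real enum in the rogue codebase
    NO_AUTH = "no_auth"
    API_KEY = "api_key"

parser = argparse.ArgumentParser()
parser.add_argument("--auth", type=AuthType, choices=[e.value for e in AuthType])

# argparse turns "api_key" into AuthType.API_KEY, then checks membership in
# ["no_auth", "api_key"], which fails and exits with:
# error: argument --auth: invalid choice: <AuthType.API_KEY: 'api_key'>
parser.parse_args(["--auth", "api_key"])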

17-28: Add missing --workdir flag (prevents AttributeError later)

get_cli_input and run_cli use args.workdir but the parser doesn’t define it.

 def set_cli_args(parser: ArgumentParser) -> None:
+    parser.add_argument(
+        "--workdir",
+        type=Path,
+        default=Path(".") / ".rogue",
+        help="Working directory for CLI artifacts (scenarios.json, report.md, etc.)",
+    )
     parser.add_argument(
         "--config-file",
         type=Path,
         help="Path to config file",
     )

347-360: Use str() for HttpUrl rather than .encoded_string()

HttpUrl/Url in Pydantic doesn’t guarantee an encoded_string() method; this will raise at runtime. Use str(url).

-    get_agent_card(cli_input.evaluated_agent_url.encoded_string())
+    get_agent_card(str(cli_input.evaluated_agent_url))

-        evaluated_agent_url=cli_input.evaluated_agent_url.encoded_string(),
+        evaluated_agent_url=str(cli_input.evaluated_agent_url),
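
As a quick reference, str() on a pydantic v2 HttpUrl serializes the URL reliably (note that v2 normalizes a bare host with a trailing slash); a small sketch:

from pydantic import BaseModel, HttpUrl

class Endpoint(BaseModel):
    url: HttpUrl

ep = Endpoint(url="http://localhost:8000")
# str() is stable across pydantic releases; encoded_string() is not guaranteed:
print(str(ep.url))  # http://localhost:8000/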

343-345: Avoid logging secrets in debug output

model_dump() may serialize secret fields; don’t include API keys/credentials in logs.

-    logger.debug("Running CLI", extra=cli_input.model_dump())
+    logger.debug(
+        "Running CLI",
+        extra=cli_input.model_dump(exclude={"judge_llm_api_key", "evaluated_agent_credentials", "qualifire_api_key"}),
+    )
♻️ Duplicate comments (2)
rogue/models/cli_input.py (2)

22-22: Resolved: Qualifire API key now accepted in CLIInput

Adding qualifire_api_key here unblocks CLI usage. Looks good.


65-65: Resolved: Mirrored field in PartialCLIInput

Including qualifire_api_key in the partial model fixes prior validation/merge issues.

🧹 Nitpick comments (5)
rogue/models/cli_input.py (2)

61-66: Tighten typing for file-path optionals (drop type: ignore)

Use explicit optional types to satisfy mypy without ignores.

-    business_context_file: Path = None  # type: ignore # fixed in model_post_init
-    input_scenarios_file: Path = None  # type: ignore # fixed in model_post_init
-    output_report_file: Path = None  # type: ignore # fixed in model_post_init
+    business_context_file: Path | None = None  # set in model_post_init
+    input_scenarios_file: Path | None = None  # set in model_post_init
+    output_report_file: Path | None = None  # set in model_post_init

67-67: Annotate model_post_init signature

Add the missing type for __context to satisfy type-checkers.

-    def model_post_init(self, __context):
+    def model_post_init(self, __context: object) -> None:
rogue/run_cli.py (3)

3-3: Prefer built-in generics over typing.Tuple

Modern Python supports tuple[...]. Drop the import and update hints.

-from typing import Tuple
...
-) -> Tuple[EvaluationResults | None, str | None]:
+) -> tuple[EvaluationResults | None, str | None]:
...
-) -> Tuple[EvaluationResults | None, str | None]:
+) -> tuple[EvaluationResults | None, str | None]:

Also applies to: 108-108, 146-146


234-236: Preserve traceback when re-raising

Use bare raise after logging to keep the original traceback.

-    except Exception as e:
+    except Exception:
         logger.exception("Failed to generate summary")
-        raise e
+        raise

392-396: Fix grammar in user-facing warning

Tweak the message.

-            """[red]Qualifire API key is not set. This report will not persist, it if highly recommended set it using the --qualifire-api-key flag or the QUALIFIRE_API_KEY environment variable.[/red] """,
+            """[red]Qualifire API key is not set. This report will not persist. It is highly recommended to set it using the --qualifire-api-key flag or the QUALIFIRE_API_KEY environment variable.[/red] """,
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0d34ed0 and 74bf5b1.

📒 Files selected for processing (3)
  • rogue/models/cli_input.py (2 hunks)
  • rogue/run_cli.py (10 hunks)
  • rogue/server/api/llm.py (0 hunks)
💤 Files with no reviewable changes (1)
  • rogue/server/api/llm.py
🧰 Additional context used
📓 Path-based instructions (2)
{rogue,tests,examples}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

{rogue,tests,examples}/**/*.py: Format Python code with Black
Organize imports according to isort conventions
Use type hints for all function signatures
Follow PEP 8 naming (snake_case for variables and functions, PascalCase for classes)
Use try/except blocks for code that may raise exceptions
Ensure code passes flake8 linting
Ensure code passes mypy type checks per .mypy.ini
Address Bandit-reported security issues in Python code

Files:

  • rogue/models/cli_input.py
  • rogue/run_cli.py
rogue/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Place application/package code under the rogue/ directory

Files:

  • rogue/models/cli_input.py
  • rogue/run_cli.py
🧬 Code graph analysis (1)
rogue/run_cli.py (2)
sdks/python/rogue_sdk/types.py (1)
  • EvaluationResults (192-208)
rogue/server/services/qualifire_service.py (1)
  • report_summary (10-40)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: codestyle
🔇 Additional comments (2)
rogue/run_cli.py (2)

371-385: Plumbing looks correct: key + job_id threaded to summary generation

Passing qualifire_api_key_secret and job_id into create_report and down to the SDK is correct.

Please confirm the SDK’s generate_summary signature matches (results, model, api_key, qualifire_api_key, deep_test, judge_model, job_id) in the version pinned by this repo.
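
One quick way to check this (a sketch; the parameter names come from this comment, and the import path is assumed):

import inspect

from rogue_sdk import RogueSDK  # assumed import path

expected = {"results", "model", "api_key", "qualifire_api_key",
            "deep_test", "judge_model", "job_id"}
params = set(inspect.signature(RogueSDK.generate_summary).parameters)
missing = expected - params
assert not missing, f"generate_summary is missing parameters: {missing}"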


160-185: SDK eval flow and results write-out LGTM

Health check, job run, wait, serialize results, and return (results, job_id) look good.

Comment on lines 286 to 289

     # Set defaults for required fields that are missing
     if data.get("judge_llm") is None:
-        data["judge_llm"] = "openai/o4-mini"
+        data["judge_llm"] = "openai/gpt-5"

coderabbitai bot (Contributor) commented

⚠️ Potential issue

🧩 Analysis chain

Validate default judge LLM model name

"openai/gpt-5" is likely invalid and will break defaults for users. Please verify against official docs and choose a supported default (e.g., "openai/gpt-4o-mini").

-    if data.get("judge_llm") is None:
-        data["judge_llm"] = "openai/gpt-5"
+    if data.get("judge_llm") is None:
+        data["judge_llm"] = "openai/gpt-4o-mini"

To confirm the current valid model names:


🌐 Web query:

What OpenAI GPT model IDs are currently valid for the API, and is "openai/gpt-5" available?

💡 Result:

Short answer: Yes — GPT‑5 is available in the API, but the model ID is gpt-5 (and related IDs), not openai/gpt-5.

Key valid OpenAI GPT model IDs (examples from OpenAI docs/announcements):

  • GPT‑5 family: gpt-5, gpt-5-mini, gpt-5-nano (API release described by OpenAI). [1][2]
  • GPT‑4.1 family: gpt-4.1, gpt-4.1-mini, gpt-4.1-nano (API). [3]
  • GPT‑4o / GPT‑4 families and earlier: gpt-4o, gpt-4, gpt-4o-mini, gpt-3.5-turbo, etc., remain in use per OpenAI docs/notes. [3][4]

Notes:

  • OpenAI publishes exact, canonical model IDs in their product pages and release notes — use those (e.g., "gpt-5"), not a prefixed ID like "openai/gpt-5". [1][3]
  • For the most up-to-date complete list and any availability/region or rate-limit details, check OpenAI’s model release notes and the GPT‑5 announcement pages. [3][4]

Sources:
[1] OpenAI — Introducing GPT‑5 for developers (API release details).
[2] OpenAI — GPT‑5 System Card / Model notes.
[3] OpenAI — Introducing GPT‑4.1 in the API (model IDs/pricing).
[4] OpenAI — Model release notes / Help Center.


Replace invalid default model ID "openai/gpt-5" with canonical "gpt-5".
rogue/run_cli.py (lines 286–289): data["judge_llm"] currently defaults to "openai/gpt-5" — change it to "gpt-5" because OpenAI model IDs use the canonical form (no "openai/" prefix).

🤖 Prompt for AI Agents
In rogue/run_cli.py around lines 286 to 289, the default judge_llm value is set
to the invalid model ID "openai/gpt-5"; update the assignment so the default
uses the canonical OpenAI model identifier "gpt-5" instead (replace
"openai/gpt-5" with "gpt-5") to ensure the configured default is valid.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (5)
rogue/run_cli.py (5)

3-3: Prefer builtin generics; drop unused Tuple import.

Use tuple[...] (PEP 585) and remove from typing import Tuple.

Apply:

-from typing import Tuple

Then update the two annotations at Lines 108 and 146 (see comments there).


78-84: CLI flag addition looks good; clarify help text.

Minor copy tweak to explicitly name the env var.

-        help="API key to use when reporting summary to Qualifire. "
-        "Can be left unset if env is used.",
+        help="API key used when reporting summary to Qualifire. "
+        "Can be left unset when using the QUALIFIRE_API_KEY environment variable.",

146-146: Same: use builtin tuple[...].

-) -> Tuple[EvaluationResults | None, str | None]:
+) -> tuple[EvaluationResults | None, str | None]:

392-396: Good: no longer forcing non‑zero exit without key; fix warning grammar.

Message has a grammatical error; tighten copy.

-            """[red]Qualifire API key is not set. This report will not persist, it if highly recommended set it using the --qualifire-api-key flag or the QUALIFIRE_API_KEY environment variable.[/red] """,  # noqa: E501
+            """[red]Qualifire API key is not set. This report will not persist. It is highly recommended to set it using the --qualifire-api-key flag or the QUALIFIRE_API_KEY environment variable.[/red]""",  # noqa: E501

108-108: Return type hint: use builtin tuple[...]

Modernize typing per PEP 585 — callers unpack two values (results, job_id) at rogue/run_cli.py:357, so change is safe.

-) -> Tuple[EvaluationResults | None, str | None]:
+) -> tuple[EvaluationResults | None, str | None]:
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 74bf5b1 and 1546e28.

📒 Files selected for processing (2)
  • rogue/models/cli_input.py (1 hunks)
  • rogue/run_cli.py (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • rogue/models/cli_input.py
🧰 Additional context used
📓 Path-based instructions (2)
{rogue,tests,examples}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

{rogue,tests,examples}/**/*.py: Format Python code with Black
Organize imports according to isort conventions
Use type hints for all function signatures
Follow PEP 8 naming (snake_case for variables and functions, PascalCase for classes)
Use try/except blocks for code that may raise exceptions
Ensure code passes flake8 linting
Ensure code passes mypy type checks per .mypy.ini
Address Bandit-reported security issues in Python code

Files:

  • rogue/run_cli.py
rogue/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Place application/package code under the rogue/ directory

Files:

  • rogue/run_cli.py
🧬 Code graph analysis (1)
rogue/run_cli.py (2)
sdks/python/rogue_sdk/types.py (1)
  • EvaluationResults (192-208)
rogue/server/services/qualifire_service.py (1)
  • report_summary (10-40)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: codestyle
🔇 Additional comments (2)
rogue/run_cli.py (2)

211-216: Resolved prior duplication; single derivation of Qualifire key.

This addresses the earlier “redundant API key assignment” report. Good.


384-385: API key now plumbed into create_report — fixes prior bug.

This resolves the “API key not passed to report function” issue.

Comment on lines 286 to +288

     # Set defaults for required fields that are missing
     if data.get("judge_llm") is None:
-        data["judge_llm"] = "openai/o4-mini"
+        data["judge_llm"] = "openai/gpt-5"
coderabbitai bot (Contributor) commented

⚠️ Potential issue

🧩 Analysis chain

Default judge model likely invalid; avoid “openai/gpt-5” as a default.

High risk of runtime failure if this ID isn’t recognized by your router/SDK. Prefer a widely available default (e.g., openai/gpt-4o-mini) or validate against the server’s supported models.

-    if data.get("judge_llm") is None:
-        data["judge_llm"] = "openai/gpt-5"
+    if data.get("judge_llm") is None:
+        data["judge_llm"] = "openai/gpt-4o-mini"

🌐 Web query:

What are the current, valid OpenAI Chat/Omni model IDs for API usage in September 2025, and is "openai/gpt-5" a supported identifier? Please cite official OpenAI docs.

💡 Result:

Short answer

  • Current official API model IDs for the GPT‑5 family (Sept 2025): gpt-5, gpt-5-mini, gpt-5-nano, gpt-5-chat-latest, and related “thinking” variants (e.g., gpt-5-thinking, gpt-5-thinking-nano / GPT‑5 Pro variants per OpenAI docs). [1][2][3]

  • "openai/gpt-5" is not the documented/supported model identifier in OpenAI’s official docs; use the IDs above (e.g., "gpt-5" or "gpt-5-chat-latest") as shown in OpenAI’s API / release pages. [1][2]

Sources

  1. OpenAI — “Introducing GPT‑5 for developers” (model names & API availability). [link].
  2. OpenAI — “GPT‑5 System Card” (model labels including gpt-5-thinking, gpt-5-thinking-nano). [link].
  3. OpenAI Help Center — “API Model Availability by Usage Tier and Verification Status” / Model Release Notes (lists gpt-5, gpt-5-mini and access). [link].

Do not default judge_llm to "openai/gpt-5"; use a documented model ID or validate at runtime.

rogue/run_cli.py lines 286–288 — remove the default "openai/gpt-5" (not a documented API model ID). Set a documented ID (e.g., "gpt-5-chat-latest" or "gpt-5") or add startup validation/fallback to a server-supported model.

🤖 Prompt for AI Agents
In rogue/run_cli.py around lines 286 to 288, the code sets a hard default
judge_llm to the undocumented "openai/gpt-5"; remove that literal default and
either set a documented model ID (for example "gpt-5-chat-latest" or a
project-specific approved ID) or implement runtime validation with a fallback:
check the configured judge_llm and if absent/invalid call a validator (or query
the server-supported models) and set a safe documented default or raise a clear
startup error; update the message/error to guide users to choose a supported
model.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
sdks/python/rogue_sdk/__init__.py (1)

76-81: Exports resolve; optional: derive all from types to avoid drift

Confirmed: sdks/python/rogue_sdk/types.py defines StructuredSummary, SummaryGenerationRequest, SummaryGenerationResponse, ReportSummaryRequest, and ReportSummaryResponse, and sdks/python/rogue_sdk/__init__.py imports them via from .types import * and lists them in __all__. Optional: replace the static __all__ with the suggested pattern (keep explicit base_exports, then extend with the names from types.__all__) to prevent future drift.
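
A sketch of that pattern (the base_exports entries are illustrative; keep the package's real explicit exports):

# sdks/python/rogue_sdk/__init__.py (sketch)
from . import types
from .types import *  # noqa: F401,F403

base_exports = [
    "RogueSDK",  # illustrative names; keep the existing explicit exports
    "RogueClientConfig",
]

# Extend with everything types declares so new symbols cannot drift out of sync:
__all__ = base_exports + list(getattr(types, "__all__", []))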

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dbdaadf and 863f21b.

📒 Files selected for processing (1)
  • sdks/python/rogue_sdk/__init__.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Cursor Bugbot
🔇 Additional comments (1)
sdks/python/rogue_sdk/__init__.py (1)

51-52: Expose types submodule at package root — LGTM

This enables rogue_sdk.types for namespace usage and IDE discovery. No functional concerns.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (7)
.github/workflows/rogue.yml (5)

29-31: Avoid shell activation; target the venv explicitly with uv for reliability.

Shell activation is unnecessary and can be brittle in CI. Point uv at the venv’s python instead.

-        run: source .venv/bin/activate && uv pip install -e sdks/python --force-reinstall
+        run: uv pip --python .venv/bin/python install -e sdks/python --force-reinstall

44-55: Second EXIT trap overrides the first; only one process gets cleaned up.

Define a single EXIT trap that handles both PIDs to prevent orphaned processes and reduce flakiness.

-          trap 'echo "🛑 Stopping agent..."; kill $AGENT_PID' EXIT
+          trap 'echo "🛑 Stopping processes..."; [[ -n "${AGENT_PID:-}" ]] && kill "$AGENT_PID"; [[ -n "${ROGUE_PID:-}" ]] && kill "$ROGUE_PID"' EXIT
@@
-          trap 'echo "🛑 Stopping rogue server..."; kill $ROGUE_PID' EXIT
+          # (single EXIT trap above handles both AGENT_PID and ROGUE_PID)

35-66: Gate secret-dependent steps for forked PRs to avoid failures.

OPENAI_API_KEY isn’t available on forked PRs; the agent/server run will fail. Skip this section for forks.

-      - name: Run rogue
+      - name: Run rogue
+        if: ${{ github.event_name != 'pull_request' || github.event.pull_request.head.repo.fork == false }}
         shell: bash
         env:
           OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

If you prefer to skip the entire job for forks, put the same if: at the job level.


1-9: Optional: add a concurrency group to cancel in-progress runs on new commits.

Prevents queued/duplicate “Rogue” jobs from stepping on each other.

 name: Rogue
+
+concurrency:
+  group: rogue-${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true

60-60: Nit: extraneous blank line.

Harmless, but you can drop it to keep the script tight.

.github/workflows/test.yml (2)

36-38: Avoid shell activation; point uv at the venv directly.

Keeps steps shell-agnostic and a bit faster.

-        run: source .venv/bin/activate && uv pip install -e sdks/python --force-reinstall
+        run: uv pip --python .venv/bin/python install -e sdks/python --force-reinstall

10-21: Optional: add a concurrency group for matrix runs.

Cancels older in-flight matrices on new pushes to the same ref.

 jobs:
   test:
     runs-on: ubuntu-latest
+    concurrency:
+      group: test-${{ github.workflow }}-${{ github.ref }}
+      cancel-in-progress: true
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 863f21b and 1ac39c1.

📒 Files selected for processing (2)
  • .github/workflows/rogue.yml (2 hunks)
  • .github/workflows/test.yml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
.github/**

📄 CodeRabbit inference engine (AGENTS.md)

Keep CI/CD workflow files under .github/

Files:

  • .github/workflows/test.yml
  • .github/workflows/rogue.yml

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
.github/actions/run-tests/action.yml (2)

14-14: Make test runner resilient to missing/cross‑platform venvs.

Sourcing .venv/bin/activate assumes a POSIX venv exists. The one‑liner below picks the venv Python on Linux/macOS or Windows, and falls back to system Python.

Apply this diff (also includes the coverage fix above):

-      run: source .venv/bin/activate && pytest --junitxml=test-results.xml --cov=rogue --cov-report=xml
+      run: V=python; if [ -x .venv/bin/python ]; then V=.venv/bin/python; elif [ -x .venv/Scripts/python.exe ]; then V=.venv/Scripts/python.exe; fi; "$V" -m pytest --junitxml=test-results.xml --cov=. --cov-report=xml

16-26: Always upload reports even on test failures.

Add if: ${{ always() }} so artifacts are uploaded when pytest fails.

Suggested snippet:

    - name: "Upload Test Results"
      if: ${{ always() }}
      uses: actions/upload-artifact@v4
      with:
        name: junit-results-${{ inputs.python-version }}
        path: test-results.xml

    - name: "Upload Coverage Results"
      if: ${{ always() }}
      uses: actions/upload-artifact@v4
      with:
        name: coverage-report-${{ inputs.python-version }}
        path: coverage.xml
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1ac39c1 and 054dbd9.

📒 Files selected for processing (1)
  • .github/actions/run-tests/action.yml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
.github/**

📄 CodeRabbit inference engine (AGENTS.md)

Keep CI/CD workflow files under .github/

Files:

  • .github/actions/run-tests/action.yml
🔇 Additional comments (1)
.github/actions/run-tests/action.yml (1)

14-14: Confirmed — pytest and pytest-cov are installed upstream; no change required.
.github/workflows/test.yml creates a venv and runs uv pip install -e . before calling ./.github/actions/run-tests; pyproject.toml's dev group includes pytest and pytest-cov (uv.lock contains pytest-cov).

     - name: "Run tests"
       shell: bash
-      run: source .venv/bin/activate && uv run --dev pytest --junitxml=test-results.xml --cov=rogue --cov-report=xml
+      run: source .venv/bin/activate && pytest --junitxml=test-results.xml --cov=rogue --cov-report=xml
coderabbitai bot (Contributor) commented

⚠️ Potential issue

Decouple coverage target from package rename (use project-wide coverage).

Hardcoding --cov=rogue will break or silently zero out coverage after the rename. Run coverage on the repo instead.

Apply this diff:

-      run: source .venv/bin/activate && pytest --junitxml=test-results.xml --cov=rogue --cov-report=xml
+      run: V=python; if [ -x .venv/bin/python ]; then V=.venv/bin/python; elif [ -x .venv/Scripts/python.exe ]; then V=.venv/Scripts/python.exe; fi; "$V" -m pytest --junitxml=test-results.xml --cov=. --cov-report=xml
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-      run: source .venv/bin/activate && pytest --junitxml=test-results.xml --cov=rogue --cov-report=xml
+      run: V=python; if [ -x .venv/bin/python ]; then V=.venv/bin/python; elif [ -x .venv/Scripts/python.exe ]; then V=.venv/Scripts/python.exe; fi; "$V" -m pytest --junitxml=test-results.xml --cov=. --cov-report=xml
🤖 Prompt for AI Agents
In .github/actions/run-tests/action.yml around line 14, the pytest invocation
hardcodes --cov=rogue which will break after a package rename; change the
coverage target to run over the whole repository (for example replace
--cov=rogue with --cov=. or remove the --cov option and use --cov=. so coverage
is collected project-wide) so coverage metrics don't depend on a specific
package name.

drorIvry merged commit 682cb5c into main on Sep 22, 2025
4 of 9 checks passed
drorIvry deleted the feature/cli-report-results-FIRE-778 branch on September 22, 2025 at 08:41