fix(run_helpers): reduce false positives in edit intent inference #13

Closed
RyderFreeman4Logos wants to merge 3 commits into main from fix/infer-edit-false-positives-clean

Conversation

@RyderFreeman4Logos
Owner

Summary

  • Rewrote infer_task_edit_requirement() from substring matching to token-based matching
  • Fixes false positives: "prefix the" no longer matches "fix the", "explain the implementation" no longer matches "implement" (see the sketch after this list)
  • Filters empty tokens from emoji/bullet prefixes (e.g. "✅ Fix the bug")
  • Adds polite prefix skipping (please, can you, could you) for ambiguous verbs
  • 19 tests covering false positives, polite prefixes, emoji/bullet prefixes, and all existing behavior
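
A minimal sketch of the difference in matching strategy (simplified helpers, not the actual run_helpers code):

// Old behaviour: substring check — "prefix the issue" contains "fix the", a false positive.
fn old_match(prompt: &str) -> bool {
    prompt.to_lowercase().contains("fix the")
}

// New behaviour: tokenize first, then require "fix" and a determiner as adjacent tokens,
// so "prefix the issue" no longer matches while "fix the bug" still does.
fn new_match(prompt: &str) -> bool {
    let lower = prompt.to_lowercase();
    let tokens: Vec<&str> = lower
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty()) // drops emoji/bullet-only fragments like "✅"
        .collect();
    tokens
        .windows(2)
        .any(|w| w[0] == "fix" && ["the", "a", "this", "that"].contains(&w[1]))
}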

Background

Clean resubmission of #12. The original PR went through 1 round of iterative review with @codex (P2: empty token bug). Fix commits have been consolidated into a single logical commit here.

See #12 for the full review discussion.

Test plan

  • cargo clippy -p cli-sub-agent — clean
  • cargo test -p cli-sub-agent -- infer_edit — 19/19 passed
  • just test — all passed
  • csa review --branch main — reviewed
  • @codex review

🤖 Generated with Claude Code

[MOTIVATION]
infer_task_edit_requirement() used broad substring matching that caused
false positives: "prefix the" matched "fix the", "explain the implementation"
matched "implement", and "check for updates" matched "update". Prompts
starting with emoji/bullets produced empty tokens that broke verb detection.

[IMPLEMENTATION DETAILS]
- Rewrote to token-based matching: tokenize once, all rules use tokens
- Filter empty tokens from punctuation-only input (emoji/bullet prefixes)
- Phase 0: explicit read-only substring markers (unchanged)
- Phase 1: unambiguous multi-word token sequences (write+code, create+file)
- Phase 2: "fix" + determiner (the/a/this/that/my/our) as token pair
- Phase 3: standalone word markers (refactor, rename, modify, patch)
- Phase 4: ambiguous verbs as first meaningful word, skipping polite
  prefixes (please, can, could, would, you)
- 19 tests covering false positives, polite prefixes, emoji/bullet
  prefixes, and all existing behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
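
A condensed sketch of the phased structure described above (marker lists abbreviated, Phase 0 markers hypothetical; not the actual function body):

fn infer_task_edit_requirement_sketch(prompt: &str) -> Option<bool> {
    let lower = prompt.to_lowercase();

    // Phase 0: explicit read-only markers (examples only), still substring-based.
    if lower.contains("read-only") || lower.contains("do not modify") {
        return Some(false);
    }

    // Tokenize once; every later phase works on these tokens.
    let tokens: Vec<&str> = lower
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty()) // drops emoji/bullet-only fragments
        .collect();

    // Phase 1: unambiguous multi-word sequences.
    if tokens
        .windows(2)
        .any(|w| (w[0] == "write" && w[1] == "code") || (w[0] == "create" && w[1] == "file"))
    {
        return Some(true);
    }

    // Phase 2: "fix" followed by a determiner.
    let determiners = ["the", "a", "this", "that", "my", "our"];
    if tokens
        .windows(2)
        .any(|w| w[0] == "fix" && determiners.contains(&w[1]))
    {
        return Some(true);
    }

    // Phase 3: standalone markers.
    if tokens
        .iter()
        .any(|t| ["refactor", "rename", "modify", "patch"].contains(t))
    {
        return Some(true);
    }

    // Phase 4: ambiguous verbs, only as the first meaningful word.
    let skip_prefixes = ["please", "can", "could", "would", "you"];
    if let Some(first) = tokens.iter().find(|t| !skip_prefixes.contains(t)) {
        if ["edit", "update", "implement", "fix"].contains(first) {
            return Some(true);
        }
    }

    None // ambiguous: the caller decides
}
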
@RyderFreeman4Logos
Owner Author

@codex review

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 342f057523

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 344 to 348
// Phase 4: Ambiguous verbs — only match as the first meaningful word.
// Skips polite prefixes like "please", "can you", "could you".
let skip_prefixes: &[&str] = &["please", "can", "could", "would", "you"];
let first_verb = tokens.iter().find(|t| !skip_prefixes.contains(t)).copied();
let verb_markers = ["edit", "update", "implement", "fix"];

P2: Handle filler words before verb detection

Because Phase 4 only inspects the first non-skip token, prompts like “Please also update the config” or “Please just implement X” will return None (the first non-skip token is “also/just”), even though the edit intent is clear. This is a regression from the previous substring matching and can mislead failover decisions in main.rs (where infer_task_edit_requirement feeds decide_failover), potentially choosing a read-only tool for an edit request. Consider skipping common filler adverbs or scanning for the first verb marker after polite prefixes rather than a single token.

[MOTIVATION]
Codex bot review (PR #13) found that filler adverbs like "also", "just"
blocked first-verb detection: "Please also update the config" returned
None instead of Some(true).

[IMPLEMENTATION DETAILS]
- Added filler adverbs (also, just, now, then, quickly) to skip list
- Added 2 regression tests for filler word scenarios
- 21 tests pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
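
Illustrative shape of the filler-word regression tests (test names invented; assumes infer_task_edit_requirement from run_helpers is in scope and takes &str):

#[cfg(test)]
mod filler_prefix_tests {
    use super::*;

    #[test]
    fn please_also_update_is_edit() {
        // "also" is a filler adverb and must be skipped before verb detection.
        assert_eq!(infer_task_edit_requirement("Please also update the config"), Some(true));
    }

    #[test]
    fn please_just_implement_is_edit() {
        assert_eq!(infer_task_edit_requirement("Please just implement X"), Some(true));
    }
}
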
@RyderFreeman4Logos
Owner Author

@codex review

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e88f4fe923

Comment on lines 346 to 350
let skip_prefixes: &[&str] = &[
"please", "can", "could", "would", "you", // polite prefixes
"also", "just", "now", "then", "quickly", // filler adverbs
];
let first_verb = tokens.iter().find(|t| !skip_prefixes.contains(t)).copied();

P2: Treat edit intent as ambiguous for common lead-ins

Because phase 4 only checks the first non-skipped token, prompts like “We need to update the config” or “Should we implement X?” now return None: first_verb becomes we/should (not in skip_prefixes), so update/implement are never examined. This is a regression from the prior substring matcher and can misclassify edit tasks as ambiguous; in main.rs that None falls back to can_tool_edit_existing for the current tool, which can be false for read-only tools and drive failover decisions that keep non-edit tools for an edit-required task.

[MOTIVATION]
Prompts like "We need to update the config" or "Should we implement X?"
were returning None because pronouns and modal verbs (we, should, need,
to) were not in the skip_prefixes list, causing the first non-skipped
token to be a non-verb word instead of the actual edit-intent verb.

[IMPLEMENTATION DETAILS]
- Added pronouns (we, i), modal (should, shall), lead-ins (need, to,
  go, let, lets, us) to skip_prefixes
- Added 4 regression tests (25 total): we need to update, should we
  implement, let's fix, I need to edit

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@RyderFreeman4Logos
Owner Author

@codex review

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 76c84073e3

Comment on lines +344 to +355
// Phase 4: Ambiguous verbs — only match as the first meaningful word.
// Skips polite prefixes and filler adverbs before the verb.
let skip_prefixes: &[&str] = &[
"please", "can", "could", "would", "should", "shall", "you", // modals & polite
"we", "i", "lets", "let", "us", "need", "to", "go", // pronouns & lead-ins
"also", "just", "now", "then", "quickly", // filler adverbs
];
let first_verb = tokens.iter().find(|t| !skip_prefixes.contains(t)).copied();
let verb_markers = ["edit", "update", "implement", "fix"];
if let Some(verb) = first_verb {
if verb_markers.contains(&verb) {
return Some(true);

P2: Handle edit verbs that appear after a subject

Phase 4 only checks the first non‑prefix token for edit|update|implement|fix, so sentences where the verb appears later (e.g., "The task is to update the config" or "Review the logs and update the config") now return None. That’s a regression from the previous substring check and can cause failover to treat an edit request as read‑only when the current tool can’t edit, since infer_task_edit_requirement is used to decide edit capability. Consider scanning for these verbs beyond just the first meaningful token, or expanding the prefix list to cover common subjects so explicit edit intent isn’t missed.

Owner Author

By design, not a bug. The function returns Option<bool>; None means "ambiguous, let the caller decide".

Scanning all tokens for verb markers would reintroduce the original false-positive problem (e.g., "check for updates" → matches "update", "explain the implementation" → matches "implement"). Phase 4 intentionally only checks the first meaningful verb to avoid this.

For "Review the logs and update the config" — mixed-intent prompts are explicitly tested as None (infer_edit_review_and_fix_is_ambiguous). The caller in main.rs uses can_tool_edit_existing as a secondary signal when None is returned.

For "The task is to update the config" — this phrasing is an extreme edge case. Adding the, task, is etc. to skip_prefixes would effectively turn this into a full NLP parser, which is outside the scope of this lightweight heuristic.

@RyderFreeman4Logos
Owner Author

Superseded by #14 (clean squash resubmission after iterative review)

RyderFreeman4Logos added a commit that referenced this pull request Feb 13, 2026
…pability

[MOTIVATION]
Two related issues: (1) spawn_tool() didn't set stdin, causing claude-code to
hang waiting for stdin input (Issue #13); (2) all prompts passed via argv,
risking ARG_MAX overflow for large prompts (Issue #12).

[IMPLEMENTATION DETAILS]
- Add PromptTransport enum (Argv/Stdin) to csa-core with per-tool capability
  declarations: codex/gemini/claude support both, opencode argv-only
- Modify build_command() to return (Command, Option<Vec<u8>>) — selects
  stdin transport when prompt > 100KB and tool supports it
- Modify spawn_tool() to accept stdin_data: Option<Vec<u8>> — pipes data
  with 30s timeout when Some, uses Stdio::null() when None
- Update pipeline.rs to thread stdin_data through to spawn_tool()
- Add comprehensive tests: transport capability matrix, stdin null/piped
  behavior, long prompt routing, opencode argv-only warning

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
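
A minimal sketch of the capability model described in this commit (type shape and helper names are assumptions; only the per-tool lists and the 100KB threshold come from the commit message):

// Sketch only, not the actual csa-core types.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum PromptTransport {
    Argv,
    Stdin,
}

// Hypothetical per-tool capability lookup.
fn supported_transports(tool: &str) -> &'static [PromptTransport] {
    match tool {
        "codex" | "gemini" | "claude" => &[PromptTransport::Argv, PromptTransport::Stdin],
        // opencode (and anything unknown) stays argv-only.
        _ => &[PromptTransport::Argv],
    }
}

// Choose stdin only when the prompt is large and the tool supports it.
fn choose_transport(tool: &str, prompt_len: usize) -> PromptTransport {
    if prompt_len > 100 * 1024 && supported_transports(tool).contains(&PromptTransport::Stdin) {
        PromptTransport::Stdin
    } else {
        PromptTransport::Argv
    }
}
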
RyderFreeman4Logos added a commit that referenced this pull request Feb 13, 2026
…AUTO fallback (#4,#12,#13) (#64)

* fix(process,executor): unified stdin strategy with PromptTransport capability

[MOTIVATION]
Two related issues: (1) spawn_tool() didn't set stdin, causing claude-code to
hang waiting for stdin input (Issue #13); (2) all prompts passed via argv,
risking ARG_MAX overflow for large prompts (Issue #12).

[IMPLEMENTATION DETAILS]
- Add PromptTransport enum (Argv/Stdin) to csa-core with per-tool capability
  declarations: codex/gemini/claude support both, opencode argv-only
- Modify build_command() to return (Command, Option<Vec<u8>>) — selects
  stdin transport when prompt > 100KB and tool supports it
- Modify spawn_tool() to accept stdin_data: Option<Vec<u8>> — pipes data
  with 30s timeout when Some, uses Stdio::null() when None
- Update pipeline.rs to thread stdin_data through to spawn_tool()
- Add comprehensive tests: transport capability matrix, stdin null/piped
  behavior, long prompt routing, opencode argv-only warning

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(config): add defaults.tool with resolve_tool() for AUTO fallback (#4)

[MOTIVATION]
When csa review/debate is called outside a CSA session (e.g., directly
from Claude Code), auto tool detection fails because CSA_TOOL and
CSA_PARENT_TOOL env vars are absent. This causes "AUTO review tool
selection failed" with no graceful fallback path.

[IMPLEMENTATION DETAILS]
- Add `tool: Option<String>` to `DefaultsConfig` in GlobalConfig
- Add `resolve_tool()` in run_helpers.rs: detection → config → None
- Update review_cmd, debate_cmd, main.rs to use resolve_tool()
- Two-level fallback: detect_parent_tool() → config.defaults.tool → any-available
- detect_parent_tool() stays unchanged (pure detection only)
- Add 5 tests covering resolve_tool logic and config deserialization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
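
A rough sketch of the fallback chain (stub types and bodies are assumptions; the names mirror the commit message):

// Sketch with stub types; the real GlobalConfig / DefaultsConfig live in the config crate.
struct DefaultsConfig { tool: Option<String> }
struct GlobalConfig { defaults: DefaultsConfig }

// Stub only — the real detect_parent_tool() is unchanged and does pure detection.
fn detect_parent_tool() -> Option<String> {
    std::env::var("CSA_TOOL").ok().or_else(|| std::env::var("CSA_PARENT_TOOL").ok())
}

// Detection first, then the config default; None tells the caller to fall back
// to any available tool.
fn resolve_tool(config: &GlobalConfig) -> Option<String> {
    detect_parent_tool().or_else(|| config.defaults.tool.clone())
}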

* docs(skills): clarify mktd approve option with mktsk reference

[MOTIVATION]
The Phase 4 APPROVE option in the mktd skill did not clearly indicate
that the next step is to use the mktsk skill for execution, causing
potential confusion about the post-approval workflow.

[IMPLEMENTATION DETAILS]
- Changed "Approve" to "Approve, 使用 mktsk" in Phase 4 User Options

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(review): address P1 doc language + P2 test gap from local review

[MOTIVATION]
Pre-PR local review (main...HEAD) identified two issues:
- P1: Chinese text in mktd SKILL.md violates Meta 003 communication rule
- P2: Missing long-prompt stdin transport tests for Gemini CLI and Claude Code

[IMPLEMENTATION DETAILS]
- Changed mktd approve option label to English
- Added long-prompt stdin tests for Gemini CLI and Claude Code executors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(executor): retain -p flag for gemini/claude in stdin transport mode

[MOTIVATION]
Local review P1: when long prompts go through stdin transport, the -p
flag was entirely skipped for Gemini CLI and Claude Code. Without -p,
these tools enter interactive mode, causing automation to hang.

[IMPLEMENTATION DETAILS]
- In stdin mode, still emit -p (without prompt arg) for gemini/claude
- Codex/opencode don't need extra flags for stdin reading
- Updated tests to assert -p presence in stdin transport mode

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
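
A small sketch of the flag decision (helper name is an assumption; the -p behaviour is as described in the commit):

// Extra flags emitted when the prompt travels over stdin. Per the commit, gemini/claude
// still get -p (with no value) so they stay non-interactive, while codex/opencode need
// nothing extra. Argv-mode argument layout is omitted here.
fn stdin_mode_flags(tool: &str) -> Vec<String> {
    match tool {
        "gemini" | "claude" => vec!["-p".to_string()],
        _ => Vec::new(),
    }
}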

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>