fix: improve smoke test prompt for reliable tool calling #6281
Merged
The previous prompt 'please list files in the current directory' was ambiguous and didn't explicitly require tool usage. Models like qwen/qwen3-coder and z-ai/glm-4.6 would sometimes respond with text describing what they would do instead of actually calling the tool. The new prompt explicitly instructs the model to immediately call the shell tool without asking for confirmation, which should improve reliability for models with weaker tool-calling capabilities.
Pull request overview
This PR improves the smoke test prompt to explicitly require immediate tool usage, addressing flakiness issues with models that have weaker tool-calling capabilities (particularly qwen/qwen3-coder and z-ai/glm-4.6).
Key Changes
- Modified the test prompt from an ambiguous request to an explicit command requiring immediate tool execution
- Changed "please list files in the current directory" to "Immediately call the shell tool to run 'ls -la'. Do not ask for confirmation."
* main:
  fix: adding more open models (#6300)
  docs: add goose for vs code extension (#6262)
  feat(code-mode): use server names for MCP extensions (#6284)
  docs: agent skills compatibility note (#6299)
  docs: clarify GOOSE_TERMINAL requires ~/.zshenv for zsh users (#6297)
  feat: add OpenAI Codex CLI provider (#6263)
  docs: fix Resources menu (#6292)
  Remove Advent of AI announcement banner (#6291)
  Add blog post: How We Use goose to Maintain goose (#6289)
grok was not liking this
michaelneale (Collaborator) approved these changes on Dec 31, 2025 and left a comment:
waiting for semgrep
Summary
Improves the smoke test prompt to be more explicit about requiring immediate tool usage, which should reduce flakiness for models like qwen/qwen3-coder and z-ai/glm-4.6.
Problem
The previous prompt "please list files in the current directory" was ambiguous. Models with weaker tool-calling capabilities would sometimes respond with text describing what they would do instead of actually calling the tool.
This caused a ~50% failure rate for the qwen and GLM models in CI.
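To give a rough sense of what that failure rate means for a CI run, here is a small arithmetic sketch. It assumes the two models fail independently, each at exactly 50%, and that a run is green only when both pass. None of these assumptions is stated in this PR, and `all_pass_probability` is a hypothetical helper.

```python
def all_pass_probability(failure_rates):
    """Probability that every check passes, assuming independent failures."""
    probability = 1.0
    for rate in failure_rates:
        probability *= 1.0 - rate
    return probability


# Assumed figures: two models, each failing about half the time.
both_pass = all_pass_probability([0.5, 0.5])
print(f"chance that both models pass in one run: {both_pass:.2f}")
```

Under those assumptions only about one run in four would pass both checks, which is why a per-model flake rate of this size is so disruptive.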
Solution
Changed the prompt to explicitly instruct immediate tool usage: "Immediately call the shell tool to run 'ls -la'. Do not ask for confirmation."
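The difference the smoke test cares about can be sketched as follows. This is a minimal illustration, not the actual test code: `ModelReply`, its fields, and `used_shell_tool` are hypothetical names, and the reply shape is an assumption made for the example. Only the two prompt strings come from this PR.

```python
from dataclasses import dataclass, field

# Prompt strings as quoted in this PR.
OLD_PROMPT = "please list files in the current directory"
NEW_PROMPT = "Immediately call the shell tool to run 'ls -la'. Do not ask for confirmation."


@dataclass
class ModelReply:
    """Hypothetical, simplified shape of a model reply."""
    text: str = ""
    tool_calls: list = field(default_factory=list)


def used_shell_tool(reply, tool_name="shell"):
    """True only if the reply contains an actual call to the named tool."""
    return any(call.get("name") == tool_name for call in reply.tool_calls)


# Failure mode from the Problem section: the model only describes the action.
described_only = ModelReply(text="I would run `ls` to list the files.")

# Desired behavior: the model emits a real tool call.
called_tool = ModelReply(
    tool_calls=[{"name": "shell", "arguments": {"command": "ls -la"}}]
)

print(used_shell_tool(described_only))
print(used_shell_tool(called_tool))
```

A check of this kind passes only when a tool call is present, so a reply that merely talks about listing files counts as a failure. The more explicit prompt is meant to push weaker models toward the second kind of reply.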
Testing
The smoke tests will validate this change on the PR itself. Looking for improved pass rates on:
- openrouter: qwen/qwen3-coder
- openrouter: z-ai/glm-4.6
Related
TSK-710