Conversation
There was a problem hiding this comment.
Pull request overview
Refactors provider smoke-testing scripts into a shared library and splits “normal” vs “code_execution” test modes into separate entrypoints, updating the PR smoke-test workflow to use the new code-exec script.
Changes:
- Introduces
scripts/test_providers_lib.shto centralize provider/model config, env/CLI availability checks, parallel execution, and result reporting. - Updates
scripts/test_providers.shto use the shared library and changes the normal-mode test prompt/verification logic. - Adds
scripts/test_providers_code_exec.shand updates.github/workflows/pr-smoke-test.ymlto call it (and remove agentic/Node setup from the code-exec job).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| scripts/test_providers_lib.sh | New shared library for provider smoke tests (config, filtering, parallel runner, reporting). |
| scripts/test_providers_code_exec.sh | New code-execution-mode smoke test runner using the shared library. |
| scripts/test_providers.sh | Refactored normal-mode provider tests to use the shared library and a new prompt/assertion strategy. |
| .github/workflows/pr-smoke-test.yml | Adjusts the code-exec smoke-test job to run the new script and removes unnecessary Node/agentic setup there. |
scripts/test_providers_lib.sh
Outdated
| [[ "$line" =~ ^#.*$ || -z "$line" ]] && continue | ||
| local provider="${line%% -> *}" | ||
| if is_provider_available "$provider"; then | ||
| providers+=("$(eval echo "\"$line\"")") |
There was a problem hiding this comment.
Using eval to expand the provider config line allows command substitution via environment variables (e.g., a malicious AZURE_OPENAI_DEPLOYMENT_NAME), so this should be replaced with a non-eval expansion (handle azure_openai explicitly or do safe ${VAR} substitution).
| providers+=("$(eval echo "\"$line\"")") | |
| local expanded_line="${line//\$\{AZURE_OPENAI_DEPLOYMENT_NAME\}/${AZURE_OPENAI_DEPLOYMENT_NAME}}" | |
| providers+=("$expanded_line") |
scripts/test_providers_lib.sh
Outdated
| local test_fn="$1" | ||
|
|
||
| RESULTS_DIR=$(mktemp -d) | ||
| trap "rm -rf $RESULTS_DIR ${CLEANUP_DIR:-}" EXIT |
There was a problem hiding this comment.
The EXIT trap runs rm -rf $RESULTS_DIR ${CLEANUP_DIR:-} without quoting, so a value containing spaces/globs (or an empty/unset var edge case) can delete the wrong paths; quote variables and guard against empty values before rm -rf.
| trap "rm -rf $RESULTS_DIR ${CLEANUP_DIR:-}" EXIT | |
| trap 'if [ -n "${RESULTS_DIR:-}" ]; then rm -rf -- "$RESULTS_DIR"; fi; if [ -n "${CLEANUP_DIR:-}" ]; then rm -rf -- "$CLEANUP_DIR"; fi' EXIT |
scripts/test_providers.sh
Outdated
| if grep -q "TEST-CONTENT-ABC123" "$output_file"; then | ||
| echo "success|model read and uppercased file content" > "$result_file" | ||
| else | ||
| echo "✗ FAILED: Test failed - $FAILURE_MSG" | ||
| RESULTS+=("✗ ${provider}: ${model}") | ||
| HARD_FAILURES+=("${provider}: ${model}") | ||
| echo "failure|model did not return uppercased file content" > "$result_file" | ||
| fi |
There was a problem hiding this comment.
This non-agentic check can false-pass if the model simply prints the known constant (without using tools), since the expected output is deterministic; consider using per-run random content and/or asserting the text_editor | developer tool-call marker in the output.
| export GOOSE_PROVIDER="$provider" | ||
| export GOOSE_MODEL="$model" | ||
| cd "$testdir" && "$SCRIPT_DIR/target/debug/goose" run --text "$prompt" --with-builtin "$BUILTINS" 2>&1 | ||
| export PATH="" | ||
| cd "$testdir" && "$GOOSE_BIN" run --text "$prompt" --with-builtin "$BUILTINS" 2>&1 |
There was a problem hiding this comment.
Setting PATH to empty here can make agentic providers fail to resolve their CLI executables (since resolution depends on PATH + extra search paths), which can cause environment-dependent flakes; keep PATH or set a minimal PATH like /usr/local/bin:/usr/bin:/bin.
There was a problem hiding this comment.
the PATH="" was a mistake
* main: fix text editor view broken (#7167) docs: White label guide (#6857) Add PATH detection back to developer extension (#7161) docs: pin version in ci/cd (#7168) Desktop: - No Custom Headers field for custom OpenAI-compatible providers (#6681) feat: edit model and extensions of a recipe from GUI (#6804) feat: MCP support for agentic CLI providers (#6972)
| PROVIDER_CONFIG=" | ||
| openrouter -> google/gemini-2.5-pro|anthropic/claude-sonnet-4.5|qwen/qwen3-coder:exacto|z-ai/glm-4.6:exacto|nvidia/nemotron-3-nano-30b-a3b | ||
| xai -> grok-3 | ||
| openai -> gpt-4o|gpt-4o-mini|gpt-3.5-turbo|gpt-5 | ||
| anthropic -> claude-sonnet-4-5-20250929|claude-opus-4-5-20251101 | ||
| google -> gemini-2.5-pro|gemini-2.5-flash|gemini-3-pro-preview|gemini-3-flash-preview | ||
| tetrate -> claude-sonnet-4-20250514 | ||
| databricks -> databricks-claude-sonnet-4|gemini-2-5-flash|gpt-4o | ||
| azure_openai -> ${AZURE_OPENAI_DEPLOYMENT_NAME} | ||
| aws_bedrock -> us.anthropic.claude-sonnet-4-5-20250929-v1:0 |
There was a problem hiding this comment.
PROVIDER_CONFIG expands ${AZURE_OPENAI_DEPLOYMENT_NAME} before .env is loaded, so running locally with .env will leave the Azure model empty (and may produce no Azure test cases even when the env vars are set); load .env before building PROVIDER_CONFIG, or defer expansion (e.g., build the Azure line after env load).
| echo "hello" > "$testdir/hello.txt" | ||
| local prompt="Run 'ls' to list files in the current directory." | ||
|
|
||
| # Run goose | ||
| ( | ||
| export GOOSE_PROVIDER="$provider" | ||
| export GOOSE_MODEL="$model" | ||
| cd "$testdir" && "$GOOSE_BIN" run --text "$prompt" --with-builtin "$BUILTINS" 2>&1 | ||
| ) > "$output_file" 2>&1 | ||
|
|
||
| # Verify: code_execution tool must be called | ||
| # Matches: "execute | code_execution", "get_function_details | code_execution", | ||
| # "tool call | execute", "tool calls | execute" | ||
| if grep -qE "(execute \| code_execution)|(get_function_details \| code_execution)|(tool calls? \| execute)" "$output_file"; then |
There was a problem hiding this comment.
The code-exec smoke test prompt doesn't explicitly require using the code_execution/execute tool, so models may respond with a textual explanation (or use a different tool) and cause flaky CI failures; make the prompt force a tool call (similar to the normal-mode script’s “Do not ask for confirmation” style) and/or assert on the ls output as well as the tool-call log.
There was a problem hiding this comment.
the session only has code_execution tool, so it is fine
DOsinga
left a comment
There was a problem hiding this comment.
this is a good refactor!
I do wonder though whether at some point we should split the provider test into something we want to run for all providers and then a longer one that does specific and more complicated things but only for one reliable provider. the more we ask for slightly flaky providers, the more they will be flaky, plus of course it will take longer.
* origin/main: (33 commits) fix: replace panic with proper error handling in get_tokenizer (#7175) Lifei/smoke test for developer (#7174) fix text editor view broken (#7167) docs: White label guide (#6857) Add PATH detection back to developer extension (#7161) docs: pin version in ci/cd (#7168) Desktop: - No Custom Headers field for custom OpenAI-compatible providers (#6681) feat: edit model and extensions of a recipe from GUI (#6804) feat: MCP support for agentic CLI providers (#6972) docs: keyring fallback to secrets.yaml (#7165) feat: load provider/model specified inside the recipe config (#6884) fix ask-ai bot hitting tool call limits (#7162) fix flatpak icon (#7154) [docs] Skills Marketplace UI Improvements (#7158) More no-window flags (#7122) feat: Allow overriding default bat themes using environment variables (#7140) Make the system prompt smaller (#6991) Pre release script (#7145) Spelling (#7137) feat(mcp): upgrade rmcp to 0.15.0 and advertise MCP Apps UI extension capability (#6927) ...
…provenance * origin/main: (68 commits) Upgraded npm packages for latest security updates (#7183) docs: reasoning effort levels for Codex provider (#6798) Fix speech local (#7181) chore: add .gooseignore to .gitignore (#6826) Improve error message logging from electron (#7130) chore(deps): bump jsonwebtoken from 9.3.1 to 10.3.0 (#6924) docs: standalone mcp apps and apps extension (#6791) workflow: auto-update cli-commands on release (#6755) feat(apps): Integrate AppRenderer from @mcp-ui/client SDK (#7013) fix(MCP): decode resource content (#7155) feat: reasoning_content in API for reasoning models (#6322) Fix/configure add provider custom headers (#7157) fix: handle keyring fallback as success (#7177) Update process-wrap to 9.0.3 (9.0.2 is yanked) (#7176) feat: support extra field in chatcompletion tool_calls for gemini openai compat (#6184) fix: replace panic with proper error handling in get_tokenizer (#7175) Lifei/smoke test for developer (#7174) fix text editor view broken (#7167) docs: White label guide (#6857) Add PATH detection back to developer extension (#7161) ... # Conflicts: # .github/workflows/nightly.yml
Summary
Type of Change
AI Assistance
Testing
Related Issues
Relates to #ISSUE_ID
Discussion: LINK (if any)
Screenshots/Demos (for UX changes)
Before:
After: