More providers for testing #6849
Merged: +248 −40
```diff
@@ -41,6 +41,7 @@ fi
SCRIPT_DIR=$(pwd)

# Format: "provider -> model1|model2|model3"
# Base providers that are always tested (with appropriate env vars)
PROVIDERS=(
  "openrouter -> google/gemini-2.5-pro|anthropic/claude-sonnet-4.5|qwen/qwen3-coder:exacto|z-ai/glm-4.6:exacto|nvidia/nemotron-3-nano-30b-a3b"
  "xai -> grok-3"
```
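The `provider -> model1|model2|model3` entries are later split on the arrow and on `|`. A minimal standalone sketch of that parsing, using made-up model names:

```shell
# Sketch of how PROVIDERS entries are decoded: strip the " -> " suffix to get
# the provider name, strip the prefix to get the model list, then split on "|".
PROVIDERS=(
  "openrouter -> model-a|model-b"
  "xai -> grok-3"
)

for entry in "${PROVIDERS[@]}"; do
  provider="${entry%% -> *}"     # longest-suffix strip: keep text before " -> "
  models_str="${entry#* -> }"    # prefix strip: keep text after " -> "
  IFS='|' read -ra models <<< "$models_str"
  echo "$provider has ${#models[@]} model(s): ${models[*]}"
done
```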
```diff
@@ -50,20 +51,132 @@ PROVIDERS=(
  "tetrate -> claude-sonnet-4-20250514"
)
```
```diff
# In CI, only run Databricks tests if DATABRICKS_HOST and DATABRICKS_TOKEN are set
# Locally, always run Databricks tests
if [ -n "$CI" ]; then
  if [ -n "$DATABRICKS_HOST" ] && [ -n "$DATABRICKS_TOKEN" ]; then
    echo "✓ Including Databricks tests"
    PROVIDERS+=("databricks -> databricks-claude-sonnet-4|gemini-2-5-flash|gpt-4o")
  else
    echo "⚠️ Skipping Databricks tests (DATABRICKS_HOST and DATABRICKS_TOKEN required in CI)"
  fi
else
  # Conditionally add providers based on environment variables

  # Databricks: requires DATABRICKS_HOST and DATABRICKS_TOKEN
  if [ -n "$DATABRICKS_HOST" ] && [ -n "$DATABRICKS_TOKEN" ]; then
    echo "✓ Including Databricks tests"
    PROVIDERS+=("databricks -> databricks-claude-sonnet-4|gemini-2-5-flash|gpt-4o")
  else
    echo "⚠️ Skipping Databricks tests (DATABRICKS_HOST and DATABRICKS_TOKEN required)"
  fi
```
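The per-provider checks follow one shape: require every env var to be non-empty, else print a skip notice. A standalone sketch of that gating, refactored into a helper purely for illustration (the FAKE values are stand-ins, not variables the real script reads):

```shell
# Sketch of the env-var gating pattern: append a provider only when all of its
# required values are non-empty, otherwise announce the skip.
PROVIDERS=()

add_if_configured() {
  local entry="$1"; shift
  local v
  for v in "$@"; do
    [ -n "$v" ] || { echo "⚠️ Skipping ${entry%% -> *}"; return 1; }
  done
  echo "✓ Including ${entry%% -> *}"
  PROVIDERS+=("$entry")
}

add_if_configured "fake -> some-model" "https://example.test" ""      # token missing, skipped
add_if_configured "other -> other-model" "https://example.test" "tok" # fully configured, added
echo "${#PROVIDERS[@]} provider(s) configured"
```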
```diff
  # Azure OpenAI: requires AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_DEPLOYMENT_NAME
  if [ -n "$AZURE_OPENAI_ENDPOINT" ] && [ -n "$AZURE_OPENAI_DEPLOYMENT_NAME" ]; then
    echo "✓ Including Azure OpenAI tests"
    PROVIDERS+=("azure_openai -> ${AZURE_OPENAI_DEPLOYMENT_NAME}")
  else
    echo "⚠️ Skipping Azure OpenAI tests (AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_DEPLOYMENT_NAME required)"
  fi

  # AWS Bedrock: requires AWS credentials (profile or keys) and AWS_REGION
  if [ -n "$AWS_REGION" ] && { [ -n "$AWS_PROFILE" ] || [ -n "$AWS_ACCESS_KEY_ID" ]; }; then
    echo "✓ Including AWS Bedrock tests"
    PROVIDERS+=("aws_bedrock -> us.anthropic.claude-sonnet-4-5-20250929-v1:0")
  else
    echo "⚠️ Skipping AWS Bedrock tests (AWS_REGION and AWS_PROFILE or AWS credentials required)"
  fi

  # GCP Vertex AI: requires GCP_PROJECT_ID
  if [ -n "$GCP_PROJECT_ID" ]; then
    echo "✓ Including GCP Vertex AI tests"
    PROVIDERS+=("gcp_vertex_ai -> gemini-2.5-pro")
  else
    echo "⚠️ Skipping GCP Vertex AI tests (GCP_PROJECT_ID required)"
  fi

  # Snowflake: requires SNOWFLAKE_HOST and SNOWFLAKE_TOKEN
  if [ -n "$SNOWFLAKE_HOST" ] && [ -n "$SNOWFLAKE_TOKEN" ]; then
    echo "✓ Including Snowflake tests"
    PROVIDERS+=("snowflake -> claude-sonnet-4-5")
  else
    echo "⚠️ Skipping Snowflake tests (SNOWFLAKE_HOST and SNOWFLAKE_TOKEN required)"
  fi

  # Venice: requires VENICE_API_KEY
  if [ -n "$VENICE_API_KEY" ]; then
    echo "✓ Including Venice tests"
    PROVIDERS+=("venice -> llama-3.3-70b")
  else
    echo "⚠️ Skipping Venice tests (VENICE_API_KEY required)"
  fi

  # LiteLLM: requires LITELLM_API_KEY (and optionally LITELLM_HOST)
  if [ -n "$LITELLM_API_KEY" ]; then
    echo "✓ Including LiteLLM tests"
    PROVIDERS+=("litellm -> gpt-4o-mini")
  else
    echo "⚠️ Skipping LiteLLM tests (LITELLM_API_KEY required)"
  fi

  # Ollama: requires OLLAMA_HOST (or uses default localhost:11434)
  if [ -n "$OLLAMA_HOST" ] || command -v ollama &> /dev/null; then
    echo "✓ Including Ollama tests"
    PROVIDERS+=("ollama -> qwen3")
  else
    echo "⚠️ Skipping Ollama tests (OLLAMA_HOST required or ollama must be installed)"
  fi
```
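The Ollama branch also probes for the binary itself. The detection idiom, shared with the CLI providers further down the script, can be sketched on its own:

```shell
# `command -v NAME` exits 0 only when NAME resolves to something runnable
# (a PATH entry, builtin, or function), so it works as a portable "is this
# tool installed?" test.
have_tool() {
  command -v "$1" > /dev/null 2>&1
}

if have_tool sh; then
  echo "sh is available"
fi
if ! have_tool definitely-not-a-real-tool-12345; then
  echo "missing tools are skipped"
fi
```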
```diff
  # SageMaker TGI: requires AWS credentials and SAGEMAKER_ENDPOINT_NAME
  if [ -n "$SAGEMAKER_ENDPOINT_NAME" ] && [ -n "$AWS_REGION" ]; then
    echo "✓ Including SageMaker TGI tests"
    PROVIDERS+=("sagemaker_tgi -> sagemaker-tgi-endpoint")
  else
    echo "⚠️ Skipping SageMaker TGI tests (SAGEMAKER_ENDPOINT_NAME and AWS_REGION required)"
  fi

  # GitHub Copilot: requires OAuth setup (check for cached token)
  if [ -n "$GITHUB_COPILOT_TOKEN" ] || [ -f "$HOME/.config/goose/github_copilot_token.json" ]; then
    echo "✓ Including GitHub Copilot tests"
    PROVIDERS+=("github_copilot -> gpt-4.1")
  else
    echo "⚠️ Skipping GitHub Copilot tests (OAuth setup required - run 'goose configure' first)"
  fi

  # ChatGPT Codex: requires OAuth setup
  if [ -n "$CHATGPT_CODEX_TOKEN" ] || [ -f "$HOME/.config/goose/chatgpt_codex_token.json" ]; then
    echo "✓ Including ChatGPT Codex tests"
    PROVIDERS+=("chatgpt_codex -> gpt-5.1-codex")
  else
    echo "⚠️ Skipping ChatGPT Codex tests (OAuth setup required - run 'goose configure' first)"
  fi
fi
```
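The OAuth-backed providers are gated on either an env var or a cached token file. A sketch of that either/or check, with a temp file standing in for the real cached-token path under `~/.config/goose/` (FAKE_OAUTH_TOKEN is a hypothetical name, not the script's variable):

```shell
# Provider is enabled when EITHER the token env var is set OR a cached token
# file exists on disk; here the env var is empty but the file exists.
token_file=$(mktemp)     # stand-in for a cached OAuth token JSON file
FAKE_OAUTH_TOKEN=""

enabled=no
if [ -n "$FAKE_OAUTH_TOKEN" ] || [ -f "$token_file" ]; then
  enabled=yes
fi
echo "provider enabled: $enabled"
rm -f "$token_file"
```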
```diff
# CLI-based providers (require the CLI tool to be installed)

# Claude Code CLI: requires 'claude' CLI tool
if command -v claude &> /dev/null; then
  echo "✓ Including Claude Code CLI tests"
  PROVIDERS+=("claude-code -> claude-sonnet-4-20250514")
else
  echo "⚠️ Skipping Claude Code CLI tests ('claude' CLI tool required)"
fi

# Codex CLI: requires 'codex' CLI tool
if command -v codex &> /dev/null; then
  echo "✓ Including Codex CLI tests"
  PROVIDERS+=("codex -> gpt-5.2-codex")
else
  echo "⚠️ Skipping Codex CLI tests ('codex' CLI tool required)"
fi

# Gemini CLI: requires 'gemini' CLI tool
if command -v gemini &> /dev/null; then
  echo "✓ Including Gemini CLI tests"
  PROVIDERS+=("gemini-cli -> gemini-2.5-pro")
else
  echo "⚠️ Skipping Gemini CLI tests ('gemini' CLI tool required)"
fi

# Cursor Agent: requires 'cursor-agent' CLI tool
if command -v cursor-agent &> /dev/null; then
  echo "✓ Including Cursor Agent tests"
  PROVIDERS+=("cursor-agent -> auto")
else
  echo "⚠️ Skipping Cursor Agent tests ('cursor-agent' CLI tool required)"
fi

echo ""
```
```diff
# Configure mode-specific settings
if [ "$CODE_EXEC_MODE" = true ]; then
  echo "Mode: code_execution (JS batching)"
```
```diff
@@ -111,52 +224,147 @@ should_skip_provider() {
  return 1
}

RESULTS=()
HARD_FAILURES=()
# Create temp directory for results
RESULTS_DIR=$(mktemp -d)
trap "rm -rf $RESULTS_DIR" EXIT

# Maximum parallel jobs (default: number of CPU cores, or override with MAX_PARALLEL)
MAX_PARALLEL=${MAX_PARALLEL:-$(sysctl -n hw.ncpu 2>/dev/null || nproc 2>/dev/null || echo 8)}
echo "Running tests with up to $MAX_PARALLEL parallel jobs"
echo ""
```
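The MAX_PARALLEL default chains two platform-specific probes before falling back to a constant; extracted on its own:

```shell
# Portable core-count fallback: sysctl works on macOS, nproc on Linux, and the
# final `echo 8` guarantees a value even when neither tool exists. A caller's
# exported MAX_PARALLEL overrides the computed default via ${VAR:-default}.
cores=$(sysctl -n hw.ncpu 2>/dev/null || nproc 2>/dev/null || echo 8)
MAX_PARALLEL=${MAX_PARALLEL:-$cores}
echo "up to $MAX_PARALLEL parallel jobs"
```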
```diff
# Function to run a single test
run_test() {
  local provider="$1"
  local model="$2"
  local result_file="$3"
  local output_file="$4"

  local testdir=$(mktemp -d)
  echo "hello" > "$testdir/hello.txt"

  # Run the test and capture output
  (
    export GOOSE_PROVIDER="$provider"
    export GOOSE_MODEL="$model"
    cd "$testdir" && "$SCRIPT_DIR/target/release/goose" run --text "Immediately use the shell tool to run 'ls'. Do not ask for confirmation." --with-builtin "$BUILTINS" 2>&1
  ) > "$output_file" 2>&1

  # Check result
  if grep -qE "$SUCCESS_PATTERN" "$output_file"; then
    echo "success" > "$result_file"
  else
    echo "failure" > "$result_file"
  fi

  rm -rf "$testdir"
}
```
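run_test communicates with the parent only through its result file, which makes it safe to run in a background subshell. A sketch of that file-based protocol plus the aggregation pass it implies (the aggregation loop here is illustrative, not the script's exact code):

```shell
# Each job writes "success" or "failure" to its own result file; the parent
# later reads every file back and tallies the outcomes.
RESULTS_DIR=$(mktemp -d)
echo "success" > "$RESULTS_DIR/result_0"
echo "failure" > "$RESULTS_DIR/result_1"

passed=0
failed=0
for f in "$RESULTS_DIR"/result_*; do
  if [ "$(cat "$f")" = "success" ]; then
    passed=$((passed + 1))
  else
    failed=$((failed + 1))
  fi
done
echo "passed=$passed failed=$failed"
rm -rf "$RESULTS_DIR"
```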
```diff
# Build list of all provider/model combinations
JOBS=()
job_index=0
for provider_config in "${PROVIDERS[@]}"; do
  # Split on " -> " to get provider and models
  PROVIDER="${provider_config%% -> *}"
  MODELS_STR="${provider_config#* -> }"

  # Skip provider if it's in SKIP_PROVIDERS
  if should_skip_provider "$PROVIDER"; then
    echo "⊘ Skipping provider: ${PROVIDER} (SKIP_PROVIDERS)"
    echo "---"
    continue
  fi

  # Split models on "|"
  IFS='|' read -ra MODELS <<< "$MODELS_STR"
  for MODEL in "${MODELS[@]}"; do
-    export GOOSE_PROVIDER="$PROVIDER"
-    export GOOSE_MODEL="$MODEL"
-    TESTDIR=$(mktemp -d)
-    echo "hello" > "$TESTDIR/hello.txt"
-    echo "Provider: ${PROVIDER}"
-    echo "Model: ${MODEL}"
-    echo ""
-    TMPFILE=$(mktemp)
-    (cd "$TESTDIR" && "$SCRIPT_DIR/target/release/goose" run --text "Immediately use the shell tool to run 'ls'. Do not ask for confirmation." --with-builtin "$BUILTINS" 2>&1) | tee "$TMPFILE"
-    echo ""
-    if grep -qE "$SUCCESS_PATTERN" "$TMPFILE"; then
-      echo "✓ SUCCESS: Test passed - $SUCCESS_MSG"
-      RESULTS+=("✓ ${PROVIDER}: ${MODEL}")
-    else
-      if is_allowed_failure "$PROVIDER" "$MODEL"; then
-        echo "⚠ FLAKY: Test failed but model is in allowed failures list - $FAILURE_MSG"
-        RESULTS+=("⚠ ${PROVIDER}: ${MODEL} (flaky)")
-      else
-        echo "✗ FAILED: Test failed - $FAILURE_MSG"
-        RESULTS+=("✗ ${PROVIDER}: ${MODEL}")
-        HARD_FAILURES+=("${PROVIDER}: ${MODEL}")
-      fi
-    fi
+    JOBS+=("$PROVIDER|$MODEL|$job_index")
+    ((job_index++))
  done
done
```
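Each job is packed into a single `provider|model|index` string; a sketch of the round trip between the encode above and the decode the runner performs:

```shell
# Pack three fields into one pipe-delimited string, then unpack them with a
# scoped IFS on `read` (IFS is only changed for that one command).
job="openrouter|gpt-4o|3"
IFS='|' read -r provider model idx <<< "$job"
echo "$provider / $model / $idx"
```

This encoding works because provider names and indices never contain `|`; model names in PROVIDERS are already delimited by it.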
```diff
total_jobs=${#JOBS[@]}
echo "Starting $total_jobs tests..."
echo ""

# Run first test sequentially if any jobs exist
if [ $total_jobs -gt 0 ]; then
  echo "Running first test sequentially..."
  first_job="${JOBS[0]}"
  IFS='|' read -r provider model idx <<< "$first_job"

  result_file="$RESULTS_DIR/result_$idx"
  output_file="$RESULTS_DIR/output_$idx"
  meta_file="$RESULTS_DIR/meta_$idx"
  echo "$provider|$model" > "$meta_file"
```
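Putting it together: the first job runs alone, then the rest fan out under a MAX_PARALLEL throttle. A simplified sketch of that scheduling shape, with `sleep` standing in for run_test:

```shell
# Warm-up-then-fan-out: job 0 runs sequentially, then the remaining jobs run
# in the background, throttled by counting running jobs with `jobs -r`.
MAX_PARALLEL=2
outdir=$(mktemp -d)
work() { sleep 0.1; echo ok > "$outdir/done_$1"; }

work 0                  # first job runs alone, surfacing global breakage early
for i in 1 2 3 4; do
  while [ "$(jobs -r | wc -l)" -ge "$MAX_PARALLEL" ]; do
    sleep 0.05          # wait for a slot to free up
  done
  work "$i" &
done
wait                    # block until every background job has finished
echo "all jobs finished"
```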
|
Comment on lines +295 to +296:

```bash
meta_file="$RESULTS_DIR/meta_$idx"
echo "$provider|$model" > "$meta_file"
```
Running one alone before the rest concurrently is kind of silly, but solves two small pain points. If you immediately start N concurrent sessions, then:
It's also useful since, if you break something, it's most likely broken for all providers; this way it tells you that immediately.
Yeah, the faster inner loop, I think, is reasonable.