Conversation

@gundermanc
Member

Summary

Details

Related Issues

How to Validate

Pre-Merge Checklist

  • Updated relevant documentation and README (if needed)
  • Added/updated tests (if needed)
  • Noted breaking changes (if any)
  • Validated on required platforms/methods:
    • MacOS
      • npm run
      • npx
      • Docker
      • Podman
      • Seatbelt
    • Windows
      • npm run
      • npx
      • Docker
    • Linux
      • npm run
      • npx
      • Docker

@gemini-code-assist
Contributor

Summary of Changes

Hello @gundermanc, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust behavioral evaluation framework for the Gemini CLI. It provides a structured and efficient way to define, run, and validate the expected behavior of the CLI, particularly focusing on its interactions with language models and integrated tools. The framework includes helper functions for test creation, a dedicated script for execution, and clear documentation to guide developers in extending the evaluation suite. This enhancement aims to significantly improve the reliability and maintainability of the CLI by ensuring its behavior consistently aligns with design expectations.

Highlights

  • New Behavioral Evals Framework: A new system for defining and running behavioral evaluations for the Gemini CLI has been introduced, allowing for more robust testing of model and tool interactions.
  • Core Test Helper Utilities: A central test-helper.ts file provides essential utilities like runEval and conditionalDescribe to streamline the creation of new evaluation tests.
  • runEval Function: This new function simplifies running individual evaluation cases, enabling developers to easily specify prompts, parameters, and custom assertion logic.
  • conditionalDescribe Function: Evaluations can now be run conditionally using conditionalDescribe, which executes tests only when the RUN_EVALS environment variable is set to true.
  • Comprehensive Documentation: New documentation (evals.md) has been added, detailing how to create and execute behavioral evaluations, and is accessible via the sidebar.
  • New test:evals Script: A dedicated npm run test:evals script has been added to package.json for easily executing the new behavioral evaluations.
  • Example Evaluation: An example evaluation (save_memory.eval.ts) is included, demonstrating how to test specific tool calls and validate model output within the new framework (a sketch in the same spirit follows this list).
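
A minimal sketch of what an eval file built on these helpers might look like is shown below. It is illustrative only: the import path, test names, prompt text, and the save_memory tool name are assumptions, not the exact contents of save_memory.eval.ts.

/**
 * Hypothetical eval sketch using runEval and conditionalDescribe.
 * Names and paths are assumptions; see save_memory.eval.ts for the real example.
 */
import { it } from 'vitest';
import {
  conditionalDescribe,
  runEval,
  validateModelOutput,
} from './test-helper.js';

// Skipped entirely unless RUN_EVALS is set (see conditionalDescribe).
conditionalDescribe('save_memory behavior', () => {
  it('stores a fact via the save_memory tool', async () => {
    await runEval({
      name: 'save-memory-basic',
      prompt: 'Remember that my favorite color is blue.',
      log: true, // write tool logs to evals/logs/ for later inspection
      assert: async (rig, result) => {
        // The tool call is the hard requirement of this eval.
        const called = await rig.waitForToolCall('save_memory');
        if (!called) {
          throw new Error('Expected a save_memory tool call');
        }
        // Missing wording only warns; it does not fail the eval.
        validateModelOutput(result, 'blue', 'save_memory behavior');
      },
    });
  });
});

Because of the conditionalDescribe wrapper, the suite is skipped unless RUN_EVALS=true is set in the environment; the new npm run test:evals script is the intended entry point for running it.
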
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions

github-actions bot commented Jan 7, 2026

Size Change: -2 B (0%)

Total Size: 22.3 MB

Filename Size Change
./bundle/gemini.js 22.2 MB -2 B (0%)
./bundle/sandbox-macos-permissive-closed.sb 1.03 kB 0 B
./bundle/sandbox-macos-permissive-open.sb 890 B 0 B
./bundle/sandbox-macos-permissive-proxied.sb 1.31 kB 0 B
./bundle/sandbox-macos-restrictive-closed.sb 3.29 kB 0 B
./bundle/sandbox-macos-restrictive-open.sb 3.36 kB 0 B
./bundle/sandbox-macos-restrictive-proxied.sb 3.56 kB 0 B

compressed-size-action

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a new framework for behavioral evaluations, including a test runner, a sample evaluation, and documentation. The core change is the addition of evals/test-helper.ts, which provides the infrastructure for these evaluations. However, this new helper file almost entirely duplicates the existing integration-tests/test-helper.ts. This introduces a critical maintainability issue that should be addressed by refactoring to reuse the existing test infrastructure code instead of copying it.

Comment on lines 1 to 945
/**
* @license
* Copyright 2025 Google LLC
* SPDX-License-Identifier: Apache-2.0
*/

import { expect, describe } from 'vitest';
import { execSync, spawn } from 'node:child_process';
import { mkdirSync, writeFileSync, readFileSync } from 'node:fs';
import { join, dirname } from 'node:path';
import { fileURLToPath } from 'node:url';
import { env } from 'node:process';
import { DEFAULT_GEMINI_MODEL } from '../packages/core/src/config/models.js';
import fs from 'node:fs';
import * as os from 'node:os';
import { GEMINI_DIR } from '../packages/core/src/utils/paths.js';

export const conditionalDescribe = process.env.RUN_EVALS
  ? describe
  : describe.skip;

export interface EvalCase {
  name: string;
  params?: Record<string, any>;
  prompt: string;
  assert: (rig: TestRig, result: string) => Promise<void>;
  log?: boolean;
}

export async function runEval(evalCase: EvalCase) {
  const rig = new TestRig();
  try {
    await rig.setup(evalCase.name, evalCase.params);
    const result = await rig.run({ args: evalCase.prompt });
    await evalCase.assert(rig, result);
  } finally {
    if (evalCase.log) {
      await logToFile(evalCase.name, JSON.stringify(rig.readToolLogs(), null, 2));
    }
    await rig.cleanup();
  }
}

async function logToFile(name: string, content: string) {
  const logDir = 'evals/logs';
  await fs.promises.mkdir(logDir, { recursive: true });
  const sanitizedName = name.replace(/[^a-z0-9]/gi, '_').toLowerCase();
  const logFile = `${logDir}/${sanitizedName}.log`;
  await fs.promises.writeFile(logFile, content);
}

const __dirname = dirname(fileURLToPath(import.meta.url));
const BUNDLE_PATH = join(__dirname, '..', 'bundle/gemini.js');


// Get timeout based on environment
function getDefaultTimeout() {
if (env['CI']) return 60000; // 1 minute in CI
return 15000; // 15s locally
}

export async function poll(
predicate: () => boolean,
timeout: number,
interval: number,
): Promise<boolean> {
const startTime = Date.now();
let attempts = 0;
while (Date.now() - startTime < timeout) {
attempts++;
const result = predicate();
if (env['VERBOSE'] === 'true' && attempts % 5 === 0) {
console.log(
`Poll attempt ${attempts}: ${result ? 'success' : 'waiting...'}`,
);
}
if (result) {
return true;
}
await new Promise((resolve) => setTimeout(resolve, interval));
}
if (env['VERBOSE'] === 'true') {
console.log(`Poll timed out after ${attempts} attempts`);
}
return false;
}

function sanitizeTestName(name: string) {
return name
.toLowerCase()
.replace(/[^a-z0-9]/g, '-')
.replace(/-+/g, '-');
}

// Helper to create detailed error messages
export function createToolCallErrorMessage(
expectedTools: string | string[],
foundTools: string[],
result: string,
) {
const expectedStr = Array.isArray(expectedTools)
? expectedTools.join(' or ')
: expectedTools;
return (
`Expected to find ${expectedStr} tool call(s). ` +
`Found: ${foundTools.length > 0 ? foundTools.join(', ') : 'none'}. ` +
`Output preview: ${result ? result.substring(0, 200) + '...' : 'no output'}`
);
}

// Helper to print debug information when tests fail
export function printDebugInfo(
rig: TestRig,
result: string,
context: Record<string, unknown> = {},
) {
console.error('Test failed - Debug info:');
console.error('Result length:', result.length);
console.error('Result (first 500 chars):', result.substring(0, 500));
console.error(
'Result (last 500 chars):',
result.substring(result.length - 500),
);

// Print any additional context provided
Object.entries(context).forEach(([key, value]) => {
console.error(`${key}:`, value);
});

// Check what tools were actually called
const allTools = rig.readToolLogs();
console.error(
'All tool calls found:',
allTools.map((t) => t.toolRequest.name),
);

return allTools;
}

// Helper to validate model output and warn about unexpected content
export function validateModelOutput(
result: string,
expectedContent: string | (string | RegExp)[] | null = null,
testName = '',
) {
// First, check if there's any output at all (this should fail the test if missing)
if (!result || result.trim().length === 0) {
throw new Error('Expected LLM to return some output');
}

// If expectedContent is provided, check for it and warn if missing
if (expectedContent) {
const contents = Array.isArray(expectedContent)
? expectedContent
: [expectedContent];
const missingContent = contents.filter((content) => {
if (typeof content === 'string') {
return !result.toLowerCase().includes(content.toLowerCase());
} else if (content instanceof RegExp) {
return !content.test(result);
}
return false;
});

if (missingContent.length > 0) {
console.warn(
`Warning: LLM did not include expected content in response: ${missingContent.join(
', ',
)}.`,
'This is not ideal but not a test failure.',
);
console.warn(
'The tool was called successfully, which is the main requirement.',
);
console.warn('Expected content:', expectedContent);
console.warn('Actual output:', result);
return false;
} else if (env['VERBOSE'] === 'true') {
console.log(`${testName}: Model output validated successfully.`);
}
return true;
}

return true;
}

interface ParsedLog {
attributes?: {
'event.name'?: string;
function_name?: string;
function_args?: string;
success?: boolean;
duration_ms?: number;
request_text?: string;
hook_event_name?: string;
hook_name?: string;
hook_input?: Record<string, unknown>;
hook_output?: Record<string, unknown>;
exit_code?: number;
stdout?: string;
stderr?: string;
error?: string;
};
scopeMetrics?: {
metrics: {
descriptor: {
name: string;
};
}[];
}[];
}

export class TestRig {
testDir: string | null = null;
testName?: string;
_lastRunStdout?: string;
// Path to the copied fake responses file for this test.
fakeResponsesPath?: string;
// Original fake responses file path for rewriting goldens in record mode.
originalFakeResponsesPath?: string;

setup(
testName: string,
options: {
settings?: Record<string, unknown>;
fakeResponsesPath?: string;
} = {},
) {
this.testName = testName;
const sanitizedName = sanitizeTestName(testName);
const testFileDir =
env['INTEGRATION_TEST_FILE_DIR'] || join(os.tmpdir(), 'gemini-cli-tests');
this.testDir = join(testFileDir, sanitizedName);
mkdirSync(this.testDir, { recursive: true });
if (options.fakeResponsesPath) {
this.fakeResponsesPath = join(this.testDir, 'fake-responses.json');
this.originalFakeResponsesPath = options.fakeResponsesPath;
if (process.env['REGENERATE_MODEL_GOLDENS'] !== 'true') {
fs.copyFileSync(options.fakeResponsesPath, this.fakeResponsesPath);
}
}

// Create a settings file to point the CLI to the local collector
const geminiDir = join(this.testDir, GEMINI_DIR);
mkdirSync(geminiDir, { recursive: true });
// In sandbox mode, use an absolute path for telemetry inside the container
// The container mounts the test directory at the same path as the host
const telemetryPath = join(this.testDir, 'telemetry.log'); // Always use test directory for telemetry

const settings = {
general: {
// Nightly releases sometimes becomes out of sync with local code and
// triggers auto-update, which causes tests to fail.
disableAutoUpdate: true,
previewFeatures: false,
},
telemetry: {
enabled: true,
target: 'local',
otlpEndpoint: '',
outfile: telemetryPath,
},
security: {
auth: {
selectedType: 'gemini-api-key',
},
},
ui: {
useAlternateBuffer: true,
},
model: DEFAULT_GEMINI_MODEL,
// Don't show the IDE connection dialog when running from VsCode
ide: { enabled: false, hasSeenNudge: true },
...options.settings, // Allow tests to override/add settings
};
writeFileSync(
join(geminiDir, 'settings.json'),
JSON.stringify(settings, null, 2),
);
}

createFile(fileName: string, content: string) {
const filePath = join(this.testDir!, fileName);
writeFileSync(filePath, content);
return filePath;
}

mkdir(dir: string) {
mkdirSync(join(this.testDir!, dir), { recursive: true });
}

sync() {
// ensure file system is done before spawning
execSync('sync', { cwd: this.testDir! });
}

/**
* The command and args to use to invoke Gemini CLI. Allows us to switch
* between using the bundled gemini.js (the default) and using the installed
* 'gemini' (used to verify npm bundles).
*/
private _getCommandAndArgs(extraInitialArgs: string[] = []): {
command: string;
initialArgs: string[];
} {
const isNpmReleaseTest =
env['INTEGRATION_TEST_USE_INSTALLED_GEMINI'] === 'true';
const command = isNpmReleaseTest ? 'gemini' : 'node';
const initialArgs = isNpmReleaseTest
? extraInitialArgs
: [BUNDLE_PATH, ...extraInitialArgs];
if (this.fakeResponsesPath) {
if (process.env['REGENERATE_MODEL_GOLDENS'] === 'true') {
initialArgs.push('--record-responses', this.fakeResponsesPath);
} else {
initialArgs.push('--fake-responses', this.fakeResponsesPath);
}
}
return { command, initialArgs };
}

run(options: {
args?: string | string[];
stdin?: string;
stdinDoesNotEnd?: boolean;
yolo?: boolean;
}): Promise<string> {
const yolo = options.yolo !== false;
const { command, initialArgs } = this._getCommandAndArgs(
yolo ? ['--yolo'] : [],
);
const commandArgs = [...initialArgs];
const execOptions: {
cwd: string;
encoding: 'utf-8';
input?: string;
} = {
cwd: this.testDir!,
encoding: 'utf-8',
};

if (options.args) {
if (Array.isArray(options.args)) {
commandArgs.push(...options.args);
} else {
commandArgs.push(options.args);
}
}

if (options.stdin) {
execOptions.input = options.stdin;
}

const child = spawn(command, commandArgs, {
cwd: this.testDir!,
stdio: 'pipe',
env: env,
});

let stdout = '';
let stderr = '';

// Handle stdin if provided
if (execOptions.input) {
child.stdin!.write(execOptions.input);
}

if (!options.stdinDoesNotEnd) {
child.stdin!.end();
}

child.stdout!.on('data', (data: Buffer) => {
stdout += data;
if (env['KEEP_OUTPUT'] === 'true' || env['VERBOSE'] === 'true') {
process.stdout.write(data);
}
});

child.stderr!.on('data', (data: Buffer) => {
stderr += data;
if (env['KEEP_OUTPUT'] === 'true' || env['VERBOSE'] === 'true') {
process.stderr.write(data);
}
});

const promise = new Promise<string>((resolve, reject) => {
child.on('close', (code: number) => {
if (code === 0) {
// Store the raw stdout for Podman telemetry parsing
this._lastRunStdout = stdout;

let result = stdout;

// Check if this is a JSON output test - if so, don't include stderr
// as it would corrupt the JSON
const isJsonOutput =
commandArgs.includes('--output-format') &&
commandArgs.includes('json');

// If we have stderr output and it's not a JSON test, include that also
if (stderr && !isJsonOutput) {
result += `\n\nStdErr:\n${stderr}`;
}

resolve(result);
} else {
reject(new Error(`Process exited with code ${code}:\n${stderr}`));
}
});
});

return promise;
}

runCommand(
args: string[],
options: { stdin?: string } = {},
): Promise<string> {
const { command, initialArgs } = this._getCommandAndArgs();
const commandArgs = [...initialArgs, ...args];

const child = spawn(command, commandArgs, {
cwd: this.testDir!,
stdio: 'pipe',
});

let stdout = '';
let stderr = '';

if (options.stdin) {
child.stdin!.write(options.stdin);
child.stdin!.end();
}

child.stdout!.on('data', (data: Buffer) => {
stdout += data;
if (env['KEEP_OUTPUT'] === 'true' || env['VERBOSE'] === 'true') {
process.stdout.write(data);
}
});

child.stderr!.on('data', (data: Buffer) => {
stderr += data;
if (env['KEEP_OUTPUT'] === 'true' || env['VERBOSE'] === 'true') {
process.stderr.write(data);
}
});

const promise = new Promise<string>((resolve, reject) => {
child.on('close', (code: number) => {
if (code === 0) {
this._lastRunStdout = stdout;
let result = stdout;
if (stderr) {
result += `\n\nStdErr:\n${stderr}`;
}
resolve(result);
} else {
reject(new Error(`Process exited with code ${code}:\n${stderr}`));
}
});
});

return promise;
}

readFile(fileName: string) {
const filePath = join(this.testDir!, fileName);
const content = readFileSync(filePath, 'utf-8');
if (env['KEEP_OUTPUT'] === 'true' || env['VERBOSE'] === 'true') {
console.log(`--- FILE: ${filePath} ---`);
console.log(content);
console.log(`--- END FILE: ${filePath} ---`);
}
return content;
}

async cleanup() {
if (
process.env['REGENERATE_MODEL_GOLDENS'] === 'true' &&
this.fakeResponsesPath
) {
fs.copyFileSync(this.fakeResponsesPath, this.originalFakeResponsesPath!);
}
// Clean up test directory
if (this.testDir && !env['KEEP_OUTPUT']) {
try {
fs.rmSync(this.testDir, { recursive: true, force: true });
} catch (error) {
// Ignore cleanup errors
if (env['VERBOSE'] === 'true') {
console.warn('Cleanup warning:', (error as Error).message);
}
}
}
}

async waitForTelemetryReady() {
// Telemetry is always written to the test directory
const logFilePath = join(this.testDir!, 'telemetry.log');

if (!logFilePath) return;

// Wait for telemetry file to exist and have content
await poll(
() => {
if (!fs.existsSync(logFilePath)) return false;
try {
const content = readFileSync(logFilePath, 'utf-8');
// Check if file has meaningful content (at least one complete JSON object)
return content.includes('"scopeMetrics"');
} catch {
return false;
}
},
2000, // 2 seconds max - reduced since telemetry should flush on exit now
100, // check every 100ms
);
}

async waitForTelemetryEvent(eventName: string, timeout?: number) {
if (!timeout) {
timeout = getDefaultTimeout();
}

await this.waitForTelemetryReady();

return poll(
() => {
const logs = this._readAndParseTelemetryLog();
return logs.some(
(logData) =>
logData.attributes &&
logData.attributes['event.name'] === `gemini_cli.${eventName}`,
);
},
timeout,
100,
);
}

async waitForToolCall(
toolName: string,
timeout?: number,
matchArgs?: (args: string) => boolean,
) {
// Use environment-specific timeout
if (!timeout) {
timeout = getDefaultTimeout();
}

// Wait for telemetry to be ready before polling for tool calls
await this.waitForTelemetryReady();

return poll(
() => {
const toolLogs = this.readToolLogs();
return toolLogs.some(
(log) =>
log.toolRequest.name === toolName &&
(matchArgs?.call(this, log.toolRequest.args) ?? true),
);
},
timeout,
100,
);
}

async expectToolCallSuccess(
toolNames: string[],
timeout?: number,
matchArgs?: (args: string) => boolean,
) {
// Use environment-specific timeout
if (!timeout) {
timeout = getDefaultTimeout();
}

// Wait for telemetry to be ready before polling for tool calls
await this.waitForTelemetryReady();

const success = await poll(
() => {
const toolLogs = this.readToolLogs();
return toolNames.some((name) =>
toolLogs.some(
(log) =>
log.toolRequest.name === name &&
log.toolRequest.success &&
(matchArgs?.call(this, log.toolRequest.args) ?? true),
),
);
},
timeout,
100,
);

expect(
success,
`Expected to find successful toolCalls for ${JSON.stringify(toolNames)}`,
).toBe(true);
}

async waitForAnyToolCall(toolNames: string[], timeout?: number) {
// Use environment-specific timeout
if (!timeout) {
timeout = getDefaultTimeout();
}

// Wait for telemetry to be ready before polling for tool calls
await this.waitForTelemetryReady();

return poll(
() => {
const toolLogs = this.readToolLogs();
return toolNames.some((name) =>
toolLogs.some((log) => log.toolRequest.name === name),
);
},
timeout,
100,
);
}

_parseToolLogsFromStdout(stdout: string) {
const logs: {
timestamp: number;
toolRequest: {
name: string;
args: string;
success: boolean;
duration_ms: number;
};
}[] = [];

// The console output from Podman is JavaScript object notation, not JSON
// Look for tool call events in the output
// Updated regex to handle tool names with hyphens and underscores
const toolCallPattern =
/body:\s*'Tool call:\s*([\w-]+)\..*?Success:\s*(\w+)\..*?Duration:\s*(\d+)ms\.'/g;
const matches = [...stdout.matchAll(toolCallPattern)];

for (const match of matches) {
const toolName = match[1];
const success = match[2] === 'true';
const duration = parseInt(match[3], 10);

// Try to find function_args nearby
const matchIndex = match.index || 0;
const contextStart = Math.max(0, matchIndex - 500);
const contextEnd = Math.min(stdout.length, matchIndex + 500);
const context = stdout.substring(contextStart, contextEnd);

// Look for function_args in the context
let args = '{}';
const argsMatch = context.match(/function_args:\s*'([^']+)'/);
if (argsMatch) {
args = argsMatch[1];
}

// Also try to find function_name to double-check
// Updated regex to handle tool names with hyphens and underscores
const nameMatch = context.match(/function_name:\s*'([\w-]+)'/);
const actualToolName = nameMatch ? nameMatch[1] : toolName;

logs.push({
timestamp: Date.now(),
toolRequest: {
name: actualToolName,
args: args,
success: success,
duration_ms: duration,
},
});
}

// If no matches found with the simple pattern, try the JSON parsing approach
// in case the format changes
if (logs.length === 0) {
const lines = stdout.split(os.EOL);
let currentObject = '';
let inObject = false;
let braceDepth = 0;

for (const line of lines) {
if (!inObject && line.trim() === '{') {
inObject = true;
braceDepth = 1;
currentObject = line + '\n';
} else if (inObject) {
currentObject += line + '\n';

// Count braces
for (const char of line) {
if (char === '{') braceDepth++;
else if (char === '}') braceDepth--;
}

// If we've closed all braces, try to parse the object
if (braceDepth === 0) {
inObject = false;
try {
const obj = JSON.parse(currentObject);

// Check for tool call in different formats
if (
obj.body &&
obj.body.includes('Tool call:') &&
obj.attributes
) {
const bodyMatch = obj.body.match(/Tool call: (\w+)\./);
if (bodyMatch) {
logs.push({
timestamp: obj.timestamp || Date.now(),
toolRequest: {
name: bodyMatch[1],
args: obj.attributes.function_args || '{}',
success: obj.attributes.success !== false,
duration_ms: obj.attributes.duration_ms || 0,
},
});
}
} else if (
obj.attributes &&
obj.attributes['event.name'] === 'gemini_cli.tool_call'
) {
logs.push({
timestamp: obj.attributes['event.timestamp'],
toolRequest: {
name: obj.attributes.function_name,
args: obj.attributes.function_args,
success: obj.attributes.success,
duration_ms: obj.attributes.duration_ms,
},
});
}
} catch {
// Not valid JSON
}
currentObject = '';
}
}
}
}

return logs;
}

private _readAndParseTelemetryLog(): ParsedLog[] {
// Telemetry is always written to the test directory
const logFilePath = join(this.testDir!, 'telemetry.log');

if (!logFilePath || !fs.existsSync(logFilePath)) {
return [];
}

const content = readFileSync(logFilePath, 'utf-8');

// Split the content into individual JSON objects
// They are separated by "}\n{"
const jsonObjects = content
.split(/}\n{/)
.map((obj, index, array) => {
// Add back the braces we removed during split
if (index > 0) obj = '{' + obj;
if (index < array.length - 1) obj = obj + '}';
return obj.trim();
})
.filter((obj) => obj);

const logs: ParsedLog[] = [];

for (const jsonStr of jsonObjects) {
try {
const logData = JSON.parse(jsonStr);
logs.push(logData);
} catch (e) {
// Skip objects that aren't valid JSON
if (env['VERBOSE'] === 'true') {
console.error('Failed to parse telemetry object:', e);
}
}
}

return logs;
}

readToolLogs() {
const parsedLogs = this._readAndParseTelemetryLog();
const logs: {
toolRequest: {
name: string;
args: string;
success: boolean;
duration_ms: number;
};
}[] = [];

for (const logData of parsedLogs) {
// Look for tool call logs
if (
logData.attributes &&
logData.attributes['event.name'] === 'gemini_cli.tool_call'
) {
const toolName = logData.attributes.function_name!;
logs.push({
toolRequest: {
name: toolName,
args: logData.attributes.function_args ?? '{}',
success: logData.attributes.success ?? false,
duration_ms: logData.attributes.duration_ms ?? 0,
},
});
}
}

return logs;
}

readAllApiRequest(): ParsedLog[] {
const logs = this._readAndParseTelemetryLog();
const apiRequests = logs.filter(
(logData) =>
logData.attributes &&
logData.attributes['event.name'] === 'gemini_cli.api_request',
);
return apiRequests;
}

readLastApiRequest(): ParsedLog | null {
const logs = this._readAndParseTelemetryLog();
const apiRequests = logs.filter(
(logData) =>
logData.attributes &&
logData.attributes['event.name'] === 'gemini_cli.api_request',
);
return apiRequests.pop() || null;
}

async waitForMetric(metricName: string, timeout?: number) {
await this.waitForTelemetryReady();

const fullName = metricName.startsWith('gemini_cli.')
? metricName
: `gemini_cli.${metricName}`;

return poll(
() => {
const logs = this._readAndParseTelemetryLog();
for (const logData of logs) {
if (logData.scopeMetrics) {
for (const scopeMetric of logData.scopeMetrics) {
for (const metric of scopeMetric.metrics) {
if (metric.descriptor.name === fullName) {
return true;
}
}
}
}
}
return false;
},
timeout ?? getDefaultTimeout(),
100,
);
}

readMetric(metricName: string): Record<string, unknown> | null {
const logs = this._readAndParseTelemetryLog();
for (const logData of logs) {
if (logData.scopeMetrics) {
for (const scopeMetric of logData.scopeMetrics) {
for (const metric of scopeMetric.metrics) {
if (metric.descriptor.name === `gemini_cli.${metricName}`) {
return metric;
}
}
}
}
}
return null;
}

readHookLogs() {
const parsedLogs = this._readAndParseTelemetryLog();
const logs: {
hookCall: {
hook_event_name: string;
hook_name: string;
hook_input: Record<string, unknown>;
hook_output: Record<string, unknown>;
exit_code: number;
stdout: string;
stderr: string;
duration_ms: number;
success: boolean;
error: string;
};
}[] = [];

for (const logData of parsedLogs) {
// Look for tool call logs
if (
logData.attributes &&
logData.attributes['event.name'] === 'gemini_cli.hook_call'
) {
logs.push({
hookCall: {
hook_event_name: logData.attributes.hook_event_name ?? '',
hook_name: logData.attributes.hook_name ?? '',
hook_input: logData.attributes.hook_input ?? {},
hook_output: logData.attributes.hook_output ?? {},
exit_code: logData.attributes.exit_code ?? 0,
stdout: logData.attributes.stdout ?? '',
stderr: logData.attributes.stderr ?? '',
duration_ms: logData.attributes.duration_ms ?? 0,
success: logData.attributes.success ?? false,
error: logData.attributes.error ?? '',
},
});
}
}

return logs;
}

async pollCommand(
commandFn: () => Promise<void>,
predicateFn: () => boolean,
timeout: number = 30000,
interval: number = 1000,
) {
const startTime = Date.now();
while (Date.now() - startTime < timeout) {
await commandFn();
// Give it a moment to process
await new Promise((resolve) => setTimeout(resolve, 500));
if (predicateFn()) {
return;
}
await new Promise((resolve) => setTimeout(resolve, interval));
}
throw new Error(`pollCommand timed out after ${timeout}ms`);
}
} No newline at end of file

critical

This file introduces a significant amount of duplicated code from integration-tests/test-helper.ts. The TestRig class and numerous utility functions (poll, sanitizeTestName, validateModelOutput, etc.) are almost identical copies.

This duplication creates a major maintenance challenge. Any bug fixes or enhancements to the test infrastructure will need to be manually synchronized across both files, which is error-prone and inefficient.

To address this, please refactor this file to import the shared components like TestRig and its helper functions directly from integration-tests/test-helper.ts. The new evaluation-specific logic (runEval, conditionalDescribe) can then be built using these imported components. This will promote code reuse, reduce maintenance overhead, and ensure consistency between your integration tests and behavioral evaluations.
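
To make the recommendation concrete, a minimal sketch of the reduced evals/test-helper.ts follows. It assumes integration-tests/test-helper.ts exports TestRig with the same API and that the relative import path shown resolves; both are assumptions, not verified details of this repository.

import fs from 'node:fs';
import { describe } from 'vitest';
// Reuse the existing rig rather than copying it (import path assumed).
import { TestRig } from '../integration-tests/test-helper.js';

export { TestRig };

// Evals only run when RUN_EVALS is set; otherwise the suite is skipped.
export const conditionalDescribe = process.env.RUN_EVALS
  ? describe
  : describe.skip;

export interface EvalCase {
  name: string;
  params?: Record<string, any>;
  prompt: string;
  assert: (rig: TestRig, result: string) => Promise<void>;
  log?: boolean;
}

export async function runEval(evalCase: EvalCase) {
  const rig = new TestRig();
  try {
    await rig.setup(evalCase.name, evalCase.params);
    const result = await rig.run({ args: evalCase.prompt });
    await evalCase.assert(rig, result);
  } finally {
    if (evalCase.log) {
      // Persist tool logs for offline inspection of the eval run.
      const logDir = 'evals/logs';
      await fs.promises.mkdir(logDir, { recursive: true });
      const sanitizedName = evalCase.name.replace(/[^a-z0-9]/gi, '_').toLowerCase();
      await fs.promises.writeFile(
        `${logDir}/${sanitizedName}.log`,
        JSON.stringify(rig.readToolLogs(), null, 2),
      );
    }
    await rig.cleanup();
  }
}

Everything else (poll, validateModelOutput, the telemetry parsing, and so on) would then come from the shared module, so fixes land in one place.
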

@jacob314 added the status/need-issue label Jan 7, 2026
@gemini-cli bot added the priority/p1 label and removed the status/need-issue label Jan 7, 2026
Labels

priority/p1 Important and should be addressed in the near term.
