Conversation

@gundermanc
Member

Summary

Details

Related Issues

How to Validate

Pre-Merge Checklist

  • Updated relevant documentation and README (if needed)
  • Added/updated tests (if needed)
  • Noted breaking changes (if any)
  • Validated on required platforms/methods:
    • MacOS
      • npm run
      • npx
      • Docker
      • Podman
      • Seatbelt
    • Windows
      • npm run
      • npx
      • Docker
    • Linux
      • npm run
      • npx
      • Docker

@gemini-code-assist
Contributor

Summary of Changes

Hello @gundermanc, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust behavioral evaluation framework for the Gemini CLI. It provides a structured and efficient way to define, run, and validate the expected behavior of the CLI, particularly focusing on its interactions with language models and integrated tools. The framework includes helper functions for test creation, a dedicated script for execution, and clear documentation to guide developers in extending the evaluation suite. This enhancement aims to significantly improve the reliability and maintainability of the CLI by ensuring its behavior consistently aligns with design expectations.

Highlights

  • New Behavioral Evals Framework: A new system for defining and running behavioral evaluations for the Gemini CLI has been introduced, allowing for more robust testing of model and tool interactions.
  • Core Test Helper Utilities: A central test-helper.ts file provides essential utilities like runEval and conditionalDescribe to streamline the creation of new evaluation tests.
  • runEval Function: This new function simplifies running individual evaluation cases, enabling developers to easily specify prompts, parameters, and custom assertion logic.
  • conditionalDescribe Function: Evaluations can now be run conditionally using conditionalDescribe, which executes tests only when the RUN_EVALS environment variable is set to true.
  • Comprehensive Documentation: New documentation (evals.md) has been added, detailing how to create and execute behavioral evaluations, and is accessible via the sidebar.
  • New test:evals Script: A dedicated npm run test:evals script has been added to package.json for easily executing the new behavioral evaluations.
  • Example Evaluation: An example evaluation (save_memory.eval.ts) is included, demonstrating how to test specific tool calls and validate model output within the new framework (a sketch in the same spirit follows this list).
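
A minimal sketch of what an eval file built on these helpers might look like is shown below. It is illustrative only: the import path, test names, prompt text, and the save_memory tool name are assumptions, not the exact contents of save_memory.eval.ts.

/**
 * Hypothetical eval sketch using runEval and conditionalDescribe.
 * Names and paths are assumptions; see save_memory.eval.ts for the real example.
 */
import { it } from 'vitest';
import {
  conditionalDescribe,
  runEval,
  validateModelOutput,
} from './test-helper.js';

// Skipped entirely unless RUN_EVALS is set (see conditionalDescribe).
conditionalDescribe('save_memory behavior', () => {
  it('stores a fact via the save_memory tool', async () => {
    await runEval({
      name: 'save-memory-basic',
      prompt: 'Remember that my favorite color is blue.',
      log: true, // write tool logs to evals/logs/ for later inspection
      assert: async (rig, result) => {
        // The tool call is the hard requirement of this eval.
        const called = await rig.waitForToolCall('save_memory');
        if (!called) {
          throw new Error('Expected a save_memory tool call');
        }
        // Missing wording only warns; it does not fail the eval.
        validateModelOutput(result, 'blue', 'save_memory behavior');
      },
    });
  });
});

Because of the conditionalDescribe wrapper, the suite is skipped unless RUN_EVALS=true is set in the environment; the new npm run test:evals script is the intended entry point for running it.
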
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions

github-actions bot commented Jan 7, 2026

Size Change: -2 B (0%)

Total Size: 22.3 MB

Filename Size Change
./bundle/gemini.js 22.2 MB -2 B (0%)
./bundle/sandbox-macos-permissive-closed.sb 1.03 kB 0 B
./bundle/sandbox-macos-permissive-open.sb 890 B 0 B
./bundle/sandbox-macos-permissive-proxied.sb 1.31 kB 0 B
./bundle/sandbox-macos-restrictive-closed.sb 3.29 kB 0 B
./bundle/sandbox-macos-restrictive-open.sb 3.36 kB 0 B
./bundle/sandbox-macos-restrictive-proxied.sb 3.56 kB 0 B

compressed-size-action

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a new framework for behavioral evaluations, including a test runner, a sample evaluation, and documentation. The core change is the addition of evals/test-helper.ts, which provides the infrastructure for these evaluations. However, this new helper file almost entirely duplicates the existing integration-tests/test-helper.ts. This introduces a critical maintainability issue that should be addressed by refactoring to reuse the existing test infrastructure code instead of copying it.

Comment on lines 1 to 945
/**
* @license
* Copyright 2025 Google LLC
* SPDX-License-Identifier: Apache-2.0
*/

import { expect, describe } from 'vitest';
import { execSync, spawn } from 'node:child_process';
import { mkdirSync, writeFileSync, readFileSync } from 'node:fs';
import { join, dirname } from 'node:path';
import { fileURLToPath } from 'node:url';
import { env } from 'node:process';
import { DEFAULT_GEMINI_MODEL } from '../packages/core/src/config/models.js';
import fs from 'node:fs';
import * as os from 'node:os';
import { GEMINI_DIR } from '../packages/core/src/utils/paths.js';

export const conditionalDescribe = process.env.RUN_EVALS
  ? describe
  : describe.skip;

export interface EvalCase {
  name: string;
  params?: Record<string, any>;
  prompt: string;
  assert: (rig: TestRig, result: string) => Promise<void>;
  log?: boolean;
}

export async function runEval(evalCase: EvalCase) {
  const rig = new TestRig();
  try {
    await rig.setup(evalCase.name, evalCase.params);
    const result = await rig.run({ args: evalCase.prompt });
    await evalCase.assert(rig, result);
  } finally {
    if (evalCase.log) {
      await logToFile(evalCase.name, JSON.stringify(rig.readToolLogs(), null, 2));
    }
    await rig.cleanup();
  }
}

async function logToFile(name: string, content: string) {
  const logDir = 'evals/logs';
  await fs.promises.mkdir(logDir, { recursive: true });
  const sanitizedName = name.replace(/[^a-z0-9]/gi, '_').toLowerCase();
  const logFile = `${logDir}/${sanitizedName}.log`;
  await fs.promises.writeFile(logFile, content);
}

const __dirname = dirname(fileURLToPath(import.meta.url));
const BUNDLE_PATH = join(__dirname, '..', 'bundle/gemini.js');


// Get timeout based on environment
function getDefaultTimeout() {
if (env['CI']) return 60000; // 1 minute in CI
return 15000; // 15s locally
}

export async function poll(
predicate: () => boolean,
timeout: number,
interval: number,
): Promise<boolean> {
const startTime = Date.now();
let attempts = 0;
while (Date.now() - startTime < timeout) {
attempts++;
const result = predicate();
if (env['VERBOSE'] === 'true' && attempts % 5 === 0) {
console.log(
`Poll attempt ${attempts}: ${result ? 'success' : 'waiting...'}`,
);
}
if (result) {
return true;
}
await new Promise((resolve) => setTimeout(resolve, interval));
}
if (env['VERBOSE'] === 'true') {
console.log(`Poll timed out after ${attempts} attempts`);
}
return false;
}

function sanitizeTestName(name: string) {
return name
.toLowerCase()
.replace(/[^a-z0-9]/g, '-')
.replace(/-+/g, '-');
}

// Helper to create detailed error messages
export function createToolCallErrorMessage(
expectedTools: string | string[],
foundTools: string[],
result: string,
) {
const expectedStr = Array.isArray(expectedTools)
? expectedTools.join(' or ')
: expectedTools;
return (
`Expected to find ${expectedStr} tool call(s). ` +
`Found: ${foundTools.length > 0 ? foundTools.join(', ') : 'none'}. ` +
`Output preview: ${result ? result.substring(0, 200) + '...' : 'no output'}`
);
}

// Helper to print debug information when tests fail
export function printDebugInfo(
rig: TestRig,
result: string,
context: Record<string, unknown> = {},
) {
console.error('Test failed - Debug info:');
console.error('Result length:', result.length);
console.error('Result (first 500 chars):', result.substring(0, 500));
console.error(
'Result (last 500 chars):',
result.substring(result.length - 500),
);

// Print any additional context provided
Object.entries(context).forEach(([key, value]) => {
console.error(`${key}:`, value);
});

// Check what tools were actually called
const allTools = rig.readToolLogs();
console.error(
'All tool calls found:',
allTools.map((t) => t.toolRequest.name),
);

return allTools;
}

// Helper to validate model output and warn about unexpected content
export function validateModelOutput(
result: string,
expectedContent: string | (string | RegExp)[] | null = null,
testName = '',
) {
// First, check if there's any output at all (this should fail the test if missing)
if (!result || result.trim().length === 0) {
throw new Error('Expected LLM to return some output');
}

// If expectedContent is provided, check for it and warn if missing
if (expectedContent) {
const contents = Array.isArray(expectedContent)
? expectedContent
: [expectedContent];
const missingContent = contents.filter((content) => {
if (typeof content === 'string') {
return !result.toLowerCase().includes(content.toLowerCase());
} else if (content instanceof RegExp) {
return !content.test(result);
}
return false;
});

if (missingContent.length > 0) {
console.warn(
`Warning: LLM did not include expected content in response: ${missingContent.join(
', ',
)}.`,
'This is not ideal but not a test failure.',
);
console.warn(
'The tool was called successfully, which is the main requirement.',
);
console.warn('Expected content:', expectedContent);
console.warn('Actual output:', result);
return false;
} else if (env['VERBOSE'] === 'true') {
console.log(`${testName}: Model output validated successfully.`);
}
return true;
}

return true;
}

interface ParsedLog {
attributes?: {
'event.name'?: string;
function_name?: string;
function_args?: string;
success?: boolean;
duration_ms?: number;
request_text?: string;
hook_event_name?: string;
hook_name?: string;
hook_input?: Record<string, unknown>;
hook_output?: Record<string, unknown>;
exit_code?: number;
stdout?: string;
stderr?: string;
error?: string;
};
scopeMetrics?: {
metrics: {
descriptor: {
name: string;
};
}[];
}[];
}

export class TestRig {
testDir: string | null = null;
testName?: string;
_lastRunStdout?: string;
// Path to the copied fake responses file for this test.
fakeResponsesPath?: string;
// Original fake responses file path for rewriting goldens in record mode.
originalFakeResponsesPath?: string;

setup(
testName: string,
options: {
settings?: Record<string, unknown>;
fakeResponsesPath?: string;
} = {},
) {
this.testName = testName;
const sanitizedName = sanitizeTestName(testName);
const testFileDir =
env['INTEGRATION_TEST_FILE_DIR'] || join(os.tmpdir(), 'gemini-cli-tests');
this.testDir = join(testFileDir, sanitizedName);
mkdirSync(this.testDir, { recursive: true });
if (options.fakeResponsesPath) {
this.fakeResponsesPath = join(this.testDir, 'fake-responses.json');
this.originalFakeResponsesPath = options.fakeResponsesPath;
if (process.env['REGENERATE_MODEL_GOLDENS'] !== 'true') {
fs.copyFileSync(options.fakeResponsesPath, this.fakeResponsesPath);
}
}

// Create a settings file to point the CLI to the local collector
const geminiDir = join(this.testDir, GEMINI_DIR);
mkdirSync(geminiDir, { recursive: true });
// In sandbox mode, use an absolute path for telemetry inside the container
// The container mounts the test directory at the same path as the host
const telemetryPath = join(this.testDir, 'telemetry.log'); // Always use test directory for telemetry

const settings = {
general: {
// Nightly releases sometimes becomes out of sync with local code and
// triggers auto-update, which causes tests to fail.
disableAutoUpdate: true,
previewFeatures: false,
},
telemetry: {
enabled: true,
target: 'local',
otlpEndpoint: '',
outfile: telemetryPath,
},
security: {
auth: {
selectedType: 'gemini-api-key',
},
},
ui: {
useAlternateBuffer: true,
},
model: DEFAULT_GEMINI_MODEL,
// Don't show the IDE connection dialog when running from VsCode
ide: { enabled: false, hasSeenNudge: true },
...options.settings, // Allow tests to override/add settings
};
writeFileSync(
join(geminiDir, 'settings.json'),
JSON.stringify(settings, null, 2),
);
}

createFile(fileName: string, content: string) {
const filePath = join(this.testDir!, fileName);
writeFileSync(filePath, content);
return filePath;
}

mkdir(dir: string) {
mkdirSync(join(this.testDir!, dir), { recursive: true });
}

sync() {
// ensure file system is done before spawning
execSync('sync', { cwd: this.testDir! });
}

/**
* The command and args to use to invoke Gemini CLI. Allows us to switch
* between using the bundled gemini.js (the default) and using the installed
* 'gemini' (used to verify npm bundles).
*/
private _getCommandAndArgs(extraInitialArgs: string[] = []): {
command: string;
initialArgs: string[];
} {
const isNpmReleaseTest =
env['INTEGRATION_TEST_USE_INSTALLED_GEMINI'] === 'true';
const command = isNpmReleaseTest ? 'gemini' : 'node';
const initialArgs = isNpmReleaseTest
? extraInitialArgs
: [BUNDLE_PATH, ...extraInitialArgs];
if (this.fakeResponsesPath) {
if (process.env['REGENERATE_MODEL_GOLDENS'] === 'true') {
initialArgs.push('--record-responses', this.fakeResponsesPath);
} else {
initialArgs.push('--fake-responses', this.fakeResponsesPath);
}
}
return { command, initialArgs };
}

run(options: {
args?: string | string[];
stdin?: string;
stdinDoesNotEnd?: boolean;
yolo?: boolean;
}): Promise<string> {
const yolo = options.yolo !== false;
const { command, initialArgs } = this._getCommandAndArgs(
yolo ? ['--yolo'] : [],
);
const commandArgs = [...initialArgs];
const execOptions: {
cwd: string;
encoding: 'utf-8';
input?: string;
} = {
cwd: this.testDir!,
encoding: 'utf-8',
};

if (options.args) {
if (Array.isArray(options.args)) {
commandArgs.push(...options.args);
} else {
commandArgs.push(options.args);
}
}

if (options.stdin) {
execOptions.input = options.stdin;
}

const child = spawn(command, commandArgs, {
cwd: this.testDir!,
stdio: 'pipe',
env: env,
});

let stdout = '';
let stderr = '';

// Handle stdin if provided
if (execOptions.input) {
child.stdin!.write(execOptions.input);
}

if (!options.stdinDoesNotEnd) {
child.stdin!.end();
}

child.stdout!.on('data', (data: Buffer) => {
stdout += data;
if (env['KEEP_OUTPUT'] === 'true' || env['VERBOSE'] === 'true') {
process.stdout.write(data);
}
});

child.stderr!.on('data', (data: Buffer) => {
stderr += data;
if (env['KEEP_OUTPUT'] === 'true' || env['VERBOSE'] === 'true') {
process.stderr.write(data);
}
});

const promise = new Promise<string>((resolve, reject) => {
child.on('close', (code: number) => {
if (code === 0) {
// Store the raw stdout for Podman telemetry parsing
this._lastRunStdout = stdout;

let result = stdout;

// Check if this is a JSON output test - if so, don't include stderr
// as it would corrupt the JSON
const isJsonOutput =
commandArgs.includes('--output-format') &&
commandArgs.includes('json');

// If we have stderr output and it's not a JSON test, include that also
if (stderr && !isJsonOutput) {
result += `\n\nStdErr:\n${stderr}`;
}

resolve(result);
} else {
reject(new Error(`Process exited with code ${code}:\n${stderr}`));
}
});
});

return promise;
}

runCommand(
args: string[],
options: { stdin?: string } = {},
): Promise<string> {
const { command, initialArgs } = this._getCommandAndArgs();
const commandArgs = [...initialArgs, ...args];

const child = spawn(command, commandArgs, {
cwd: this.testDir!,
stdio: 'pipe',
});

let stdout = '';
let stderr = '';

if (options.stdin) {
child.stdin!.write(options.stdin);
child.stdin!.end();
}

child.stdout!.on('data', (data: Buffer) => {
stdout += data;
if (env['KEEP_OUTPUT'] === 'true' || env['VERBOSE'] === 'true') {
process.stdout.write(data);
}
});

child.stderr!.on('data', (data: Buffer) => {
stderr += data;
if (env['KEEP_OUTPUT'] === 'true' || env['VERBOSE'] === 'true') {
process.stderr.write(data);
}
});

const promise = new Promise<string>((resolve, reject) => {
child.on('close', (code: number) => {
if (code === 0) {
this._lastRunStdout = stdout;
let result = stdout;
if (stderr) {
result += `\n\nStdErr:\n${stderr}`;
}
resolve(result);
} else {
reject(new Error(`Process exited with code ${code}:\n${stderr}`));
}
});
});

return promise;
}

readFile(fileName: string) {
const filePath = join(this.testDir!, fileName);
const content = readFileSync(filePath, 'utf-8');
if (env['KEEP_OUTPUT'] === 'true' || env['VERBOSE'] === 'true') {
console.log(`--- FILE: ${filePath} ---`);
console.log(content);
console.log(`--- END FILE: ${filePath} ---`);
}
return content;
}

async cleanup() {
if (
process.env['REGENERATE_MODEL_GOLDENS'] === 'true' &&
this.fakeResponsesPath
) {
fs.copyFileSync(this.fakeResponsesPath, this.originalFakeResponsesPath!);
}
// Clean up test directory
if (this.testDir && !env['KEEP_OUTPUT']) {
try {
fs.rmSync(this.testDir, { recursive: true, force: true });
} catch (error) {
// Ignore cleanup errors
if (env['VERBOSE'] === 'true') {
console.warn('Cleanup warning:', (error as Error).message);
}
}
}
}

async waitForTelemetryReady() {
// Telemetry is always written to the test directory
const logFilePath = join(this.testDir!, 'telemetry.log');

if (!logFilePath) return;

// Wait for telemetry file to exist and have content
await poll(
() => {
if (!fs.existsSync(logFilePath)) return false;
try {
const content = readFileSync(logFilePath, 'utf-8');
// Check if file has meaningful content (at least one complete JSON object)
return content.includes('"scopeMetrics"');
} catch {
return false;
}
},
2000, // 2 seconds max - reduced since telemetry should flush on exit now
100, // check every 100ms
);
}

async waitForTelemetryEvent(eventName: string, timeout?: number) {
if (!timeout) {
timeout = getDefaultTimeout();
}

await this.waitForTelemetryReady();

return poll(
() => {
const logs = this._readAndParseTelemetryLog();
return logs.some(
(logData) =>
logData.attributes &&
logData.attributes['event.name'] === `gemini_cli.${eventName}`,
);
},
timeout,
100,
);
}

async waitForToolCall(
toolName: string,
timeout?: number,
matchArgs?: (args: string) => boolean,
) {
// Use environment-specific timeout
if (!timeout) {
timeout = getDefaultTimeout();
}

// Wait for telemetry to be ready before polling for tool calls
await this.waitForTelemetryReady();

return poll(
() => {
const toolLogs = this.readToolLogs();
return toolLogs.some(
(log) =>
log.toolRequest.name === toolName &&
(matchArgs?.call(this, log.toolRequest.args) ?? true),
);
},
timeout,
100,
);
}

async expectToolCallSuccess(
toolNames: string[],
timeout?: number,
matchArgs?: (args: string) => boolean,
) {
// Use environment-specific timeout
if (!timeout) {
timeout = getDefaultTimeout();
}

// Wait for telemetry to be ready before polling for tool calls
await this.waitForTelemetryReady();

const success = await poll(
() => {
const toolLogs = this.readToolLogs();
return toolNames.some((name) =>
toolLogs.some(
(log) =>
log.toolRequest.name === name &&
log.toolRequest.success &&
(matchArgs?.call(this, log.toolRequest.args) ?? true),
),
);
},
timeout,
100,
);

expect(
success,
`Expected to find successful toolCalls for ${JSON.stringify(toolNames)}`,
).toBe(true);
}

async waitForAnyToolCall(toolNames: string[], timeout?: number) {
// Use environment-specific timeout
if (!timeout) {
timeout = getDefaultTimeout();
}

// Wait for telemetry to be ready before polling for tool calls
await this.waitForTelemetryReady();

return poll(
() => {
const toolLogs = this.readToolLogs();
return toolNames.some((name) =>
toolLogs.some((log) => log.toolRequest.name === name),
);
},
timeout,
100,
);
}

_parseToolLogsFromStdout(stdout: string) {
const logs: {
timestamp: number;
toolRequest: {
name: string;
args: string;
success: boolean;
duration_ms: number;
};
}[] = [];

// The console output from Podman is JavaScript object notation, not JSON
// Look for tool call events in the output
// Updated regex to handle tool names with hyphens and underscores
const toolCallPattern =
/body:\s*'Tool call:\s*([\w-]+)\..*?Success:\s*(\w+)\..*?Duration:\s*(\d+)ms\.'/g;
const matches = [...stdout.matchAll(toolCallPattern)];

for (const match of matches) {
const toolName = match[1];
const success = match[2] === 'true';
const duration = parseInt(match[3], 10);

// Try to find function_args nearby
const matchIndex = match.index || 0;
const contextStart = Math.max(0, matchIndex - 500);
const contextEnd = Math.min(stdout.length, matchIndex + 500);
const context = stdout.substring(contextStart, contextEnd);

// Look for function_args in the context
let args = '{}';
const argsMatch = context.match(/function_args:\s*'([^']+)'/);
if (argsMatch) {
args = argsMatch[1];
}

// Also try to find function_name to double-check
// Updated regex to handle tool names with hyphens and underscores
const nameMatch = context.match(/function_name:\s*'([\w-]+)'/);
const actualToolName = nameMatch ? nameMatch[1] : toolName;

logs.push({
timestamp: Date.now(),
toolRequest: {
name: actualToolName,
args: args,
success: success,
duration_ms: duration,
},
});
}

// If no matches found with the simple pattern, try the JSON parsing approach
// in case the format changes
if (logs.length === 0) {
const lines = stdout.split(os.EOL);
let currentObject = '';
let inObject = false;
let braceDepth = 0;

for (const line of lines) {
if (!inObject && line.trim() === '{') {
inObject = true;
braceDepth = 1;
currentObject = line + '\n';
} else if (inObject) {
currentObject += line + '\n';

// Count braces
for (const char of line) {
if (char === '{') braceDepth++;
else if (char === '}') braceDepth--;
}

// If we've closed all braces, try to parse the object
if (braceDepth === 0) {
inObject = false;
try {
const obj = JSON.parse(currentObject);

// Check for tool call in different formats
if (
obj.body &&
obj.body.includes('Tool call:') &&
obj.attributes
) {
const bodyMatch = obj.body.match(/Tool call: (\w+)\./);
if (bodyMatch) {
logs.push({
timestamp: obj.timestamp || Date.now(),
toolRequest: {
name: bodyMatch[1],
args: obj.attributes.function_args || '{}',
success: obj.attributes.success !== false,
duration_ms: obj.attributes.duration_ms || 0,
},
});
}
} else if (
obj.attributes &&
obj.attributes['event.name'] === 'gemini_cli.tool_call'
) {
logs.push({
timestamp: obj.attributes['event.timestamp'],
toolRequest: {
name: obj.attributes.function_name,
args: obj.attributes.function_args,
success: obj.attributes.success,
duration_ms: obj.attributes.duration_ms,
},
});
}
} catch {
// Not valid JSON
}
currentObject = '';
}
}
}
}

return logs;
}

private _readAndParseTelemetryLog(): ParsedLog[] {
// Telemetry is always written to the test directory
const logFilePath = join(this.testDir!, 'telemetry.log');

if (!logFilePath || !fs.existsSync(logFilePath)) {
return [];
}

const content = readFileSync(logFilePath, 'utf-8');

// Split the content into individual JSON objects
// They are separated by "}\n{"
const jsonObjects = content
.split(/}\n{/)
.map((obj, index, array) => {
// Add back the braces we removed during split
if (index > 0) obj = '{' + obj;
if (index < array.length - 1) obj = obj + '}';
return obj.trim();
})
.filter((obj) => obj);

const logs: ParsedLog[] = [];

for (const jsonStr of jsonObjects) {
try {
const logData = JSON.parse(jsonStr);
logs.push(logData);
} catch (e) {
// Skip objects that aren't valid JSON
if (env['VERBOSE'] === 'true') {
console.error('Failed to parse telemetry object:', e);
}
}
}

return logs;
}

readToolLogs() {
const parsedLogs = this._readAndParseTelemetryLog();
const logs: {
toolRequest: {
name: string;
args: string;
success: boolean;
duration_ms: number;
};
}[] = [];

for (const logData of parsedLogs) {
// Look for tool call logs
if (
logData.attributes &&
logData.attributes['event.name'] === 'gemini_cli.tool_call'
) {
const toolName = logData.attributes.function_name!;
logs.push({
toolRequest: {
name: toolName,
args: logData.attributes.function_args ?? '{}',
success: logData.attributes.success ?? false,
duration_ms: logData.attributes.duration_ms ?? 0,
},
});
}
}

return logs;
}

readAllApiRequest(): ParsedLog[] {
const logs = this._readAndParseTelemetryLog();
const apiRequests = logs.filter(
(logData) =>
logData.attributes &&
logData.attributes['event.name'] === 'gemini_cli.api_request',
);
return apiRequests;
}

readLastApiRequest(): ParsedLog | null {
const logs = this._readAndParseTelemetryLog();
const apiRequests = logs.filter(
(logData) =>
logData.attributes &&
logData.attributes['event.name'] === 'gemini_cli.api_request',
);
return apiRequests.pop() || null;
}

async waitForMetric(metricName: string, timeout?: number) {
await this.waitForTelemetryReady();

const fullName = metricName.startsWith('gemini_cli.')
? metricName
: `gemini_cli.${metricName}`;

return poll(
() => {
const logs = this._readAndParseTelemetryLog();
for (const logData of logs) {
if (logData.scopeMetrics) {
for (const scopeMetric of logData.scopeMetrics) {
for (const metric of scopeMetric.metrics) {
if (metric.descriptor.name === fullName) {
return true;
}
}
}
}
}
return false;
},
timeout ?? getDefaultTimeout(),
100,
);
}

readMetric(metricName: string): Record<string, unknown> | null {
const logs = this._readAndParseTelemetryLog();
for (const logData of logs) {
if (logData.scopeMetrics) {
for (const scopeMetric of logData.scopeMetrics) {
for (const metric of scopeMetric.metrics) {
if (metric.descriptor.name === `gemini_cli.${metricName}`) {
return metric;
}
}
}
}
}
return null;
}

readHookLogs() {
const parsedLogs = this._readAndParseTelemetryLog();
const logs: {
hookCall: {
hook_event_name: string;
hook_name: string;
hook_input: Record<string, unknown>;
hook_output: Record<string, unknown>;
exit_code: number;
stdout: string;
stderr: string;
duration_ms: number;
success: boolean;
error: string;
};
}[] = [];

for (const logData of parsedLogs) {
// Look for tool call logs
if (
logData.attributes &&
logData.attributes['event.name'] === 'gemini_cli.hook_call'
) {
logs.push({
hookCall: {
hook_event_name: logData.attributes.hook_event_name ?? '',
hook_name: logData.attributes.hook_name ?? '',
hook_input: logData.attributes.hook_input ?? {},
hook_output: logData.attributes.hook_output ?? {},
exit_code: logData.attributes.exit_code ?? 0,
stdout: logData.attributes.stdout ?? '',
stderr: logData.attributes.stderr ?? '',
duration_ms: logData.attributes.duration_ms ?? 0,
success: logData.attributes.success ?? false,
error: logData.attributes.error ?? '',
},
});
}
}

return logs;
}

async pollCommand(
commandFn: () => Promise<void>,
predicateFn: () => boolean,
timeout: number = 30000,
interval: number = 1000,
) {
const startTime = Date.now();
while (Date.now() - startTime < timeout) {
await commandFn();
// Give it a moment to process
await new Promise((resolve) => setTimeout(resolve, 500));
if (predicateFn()) {
return;
}
await new Promise((resolve) => setTimeout(resolve, interval));
}
throw new Error(`pollCommand timed out after ${timeout}ms`);
}
} No newline at end of file

critical

This file introduces a significant amount of duplicated code from integration-tests/test-helper.ts. The TestRig class and numerous utility functions (poll, sanitizeTestName, validateModelOutput, etc.) are almost identical copies.

This duplication creates a major maintenance challenge. Any bug fixes or enhancements to the test infrastructure will need to be manually synchronized across both files, which is error-prone and inefficient.

To address this, please refactor this file to import the shared components like TestRig and its helper functions directly from integration-tests/test-helper.ts. The new evaluation-specific logic (runEval, conditionalDescribe) can then be built using these imported components. This will promote code reuse, reduce maintenance overhead, and ensure consistency between your integration tests and behavioral evaluations.
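
To make the recommendation concrete, a minimal sketch of the reduced evals/test-helper.ts follows. It assumes integration-tests/test-helper.ts exports TestRig with the same API and that the relative import path shown resolves; both are assumptions, not verified details of this repository.

import fs from 'node:fs';
import { describe } from 'vitest';
// Reuse the existing rig rather than copying it (import path assumed).
import { TestRig } from '../integration-tests/test-helper.js';

export { TestRig };

// Evals only run when RUN_EVALS is set; otherwise the suite is skipped.
export const conditionalDescribe = process.env.RUN_EVALS
  ? describe
  : describe.skip;

export interface EvalCase {
  name: string;
  params?: Record<string, any>;
  prompt: string;
  assert: (rig: TestRig, result: string) => Promise<void>;
  log?: boolean;
}

export async function runEval(evalCase: EvalCase) {
  const rig = new TestRig();
  try {
    await rig.setup(evalCase.name, evalCase.params);
    const result = await rig.run({ args: evalCase.prompt });
    await evalCase.assert(rig, result);
  } finally {
    if (evalCase.log) {
      // Persist tool logs for offline inspection of the eval run.
      const logDir = 'evals/logs';
      await fs.promises.mkdir(logDir, { recursive: true });
      const sanitizedName = evalCase.name.replace(/[^a-z0-9]/gi, '_').toLowerCase();
      await fs.promises.writeFile(
        `${logDir}/${sanitizedName}.log`,
        JSON.stringify(rig.readToolLogs(), null, 2),
      );
    }
    await rig.cleanup();
  }
}

Everything else (poll, validateModelOutput, the telemetry parsing, and so on) would then come from the shared module, so fixes land in one place.
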

@jacob314 added the status/need-issue label Jan 7, 2026
@gemini-cli bot added the priority/p1 label and removed the status/need-issue label Jan 7, 2026
Labels

priority/p1 Important and should be addressed in the near term.
