Merged
121 changes: 121 additions & 0 deletions agents-docs/content/typescript-sdk/evaluations.mdx
---
title: Evaluations
sidebarTitle: Evaluations
description: Manage evaluators programmatically with the TypeScript SDK
icon: LuFlaskConical
keywords: evaluations, evaluators, batch evaluation
---

The TypeScript SDK provides an **EvaluationClient** that talks to the Evaluations API, so you can manage evaluators and evaluation suite configs, trigger batch evaluations, and read results—all from code.

For full endpoint details and request/response shapes, see the [Evaluations API reference](/api-reference/evaluations).

## Setup: create a client

Create an evaluation client with your tenant ID, project ID, API base URL, and optional API key.

```typescript
import { EvaluationClient } from "@inkeep/agents-sdk";

const client = new EvaluationClient({
tenantId: process.env.INKEEP_TENANT_ID!,
projectId: process.env.INKEEP_PROJECT_ID!,
apiUrl: "https://api.inkeep.com",
apiKey: process.env.INKEEP_API_KEY,
});
```

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `tenantId` | string | Yes | Your tenant (organization) ID |
| `projectId` | string | Yes | Your project ID |
| `apiUrl` | string | Yes | API base URL (e.g. `https://api.inkeep.com` or your self-hosted URL) |
| `apiKey` | string | No | Bearer token for authenticated requests. Omit for unauthenticated or custom auth. |

Use `client` in the examples below (e.g. `client.createEvaluator(...)`).

## Evaluators

Evaluators define how to score agent outputs (e.g. with a prompt and model, optional pass criteria).

### Creating an evaluator

Pass an object with `name`, `description`, `prompt`, `schema` (JSON schema for the evaluator output), and `model` (model identifier and optional provider options). Optionally include `passCriteria` to define pass/fail conditions on the schema fields.

```typescript
const evaluator = await client.createEvaluator({
name: "Helpfulness",
description: "Scores how helpful the agent response is (0-1)",
prompt: `You are an expert evaluator. Score how helpful the assistant's response is to the user on a scale of 0.0 to 1.0.
Consider clarity, relevance, and completeness. Respond with a JSON object with a "score" field.`,
schema: {
type: "object",
properties: {
score: { type: "number", description: "Helpfulness score from 0 to 1" },
},
required: ["score"],
},
model: {
model: "gpt-4o-mini",
providerOptions: {},
},
passCriteria: {
operator: "and",
conditions: [{ field: "score", operator: ">=", value: 0.8 }],
},
});
```
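
The exact semantics of `passCriteria` are defined by the API, but as a mental model the conditions compare fields of the evaluator's output against thresholds and combine with the `and`/`or` operator. The sketch below is illustrative only (the `evaluatePassCriteria` helper is hypothetical, not part of the SDK):

```typescript
// Hypothetical illustration of how pass criteria conditions combine.
type Condition = { field: string; operator: ">=" | "<=" | ">" | "<" | "=="; value: number };
type PassCriteria = { operator: "and" | "or"; conditions: Condition[] };

function evaluatePassCriteria(criteria: PassCriteria, output: Record<string, number>): boolean {
  const results = criteria.conditions.map(({ field, operator, value }) => {
    const actual = output[field];
    switch (operator) {
      case ">=": return actual >= value;
      case "<=": return actual <= value;
      case ">": return actual > value;
      case "<": return actual < value;
      case "==": return actual === value;
    }
  });
  // "and" requires every condition to hold; "or" requires at least one.
  return criteria.operator === "and" ? results.every(Boolean) : results.some(Boolean);
}

// A score of 0.9 satisfies `score >= 0.8`, so this evaluation passes.
evaluatePassCriteria(
  { operator: "and", conditions: [{ field: "score", operator: ">=", value: 0.8 }] },
  { score: 0.9 }
); // → true under these assumed semantics
```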

## Evaluation suite configs

Suite configs group evaluators together with an optional sample rate and agent filters. They are used by **continuous tests** (evaluation run configs) to decide which conversations to evaluate automatically.

### Creating an evaluation suite config

Pass **evaluatorIds** (required, at least one) and optionally **sampleRate** (0–1) and **filters** (e.g. `agentIds` to restrict which agents’ conversations are evaluated). The suite can then be attached to a continuous test (evaluation run config).

```typescript
const suiteConfig = await client.createEvaluationSuiteConfig({
evaluatorIds: ["eval-helpfulness", "eval-accuracy"],
sampleRate: 0.1,
filters: {
agentIds: ["agent-support-bot"],
},
});
```

| Option | Type | Required | Description |
|--------|------|----------|-------------|
| `evaluatorIds` | string[] | Yes | At least one evaluator ID to run in this suite |
| `sampleRate` | number | No | Fraction of matching conversations to evaluate (0–1). Omit to evaluate all. |
| `filters` | object | No | Restrict which conversations are in scope, e.g. `{ agentIds: ["agent-id"] }` |
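
To build intuition for `sampleRate`: a rate of `0.1` means roughly one in ten matching conversations is evaluated. The sketch below (illustrative only; the actual sampling strategy is server-side and may differ) models each conversation being selected independently with that probability:

```typescript
// Illustrative only: keep each conversation with probability `sampleRate`.
// A `random` source is injectable so the behavior is easy to demonstrate.
function sampleConversations<T>(
  conversations: T[],
  sampleRate: number,
  random: () => number = Math.random
): T[] {
  return conversations.filter(() => random() < sampleRate);
}

// With a fixed "random" draw of 0.4, a rate of 0.5 keeps every conversation:
const ids = ["c1", "c2", "c3", "c4"];
sampleConversations(ids, 0.5, () => 0.4); // every draw (0.4) < 0.5, so all ids are kept
```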

## Batch evaluation

Trigger a one-off batch evaluation over conversations, optionally filtered by conversation IDs or date range:

```typescript
const result = await client.triggerBatchEvaluation({
evaluatorIds: ["eval-1", "eval-2"],
name: "Weekly quality check",
dateRange: {
startDate: "2025-02-01",
endDate: "2025-02-07",
},
});
// result: { message, evaluationJobConfigId, evaluatorIds }
```

| Option | Type | Required | Description |
|--------|------|----------|-------------|
| `evaluatorIds` | string[] | Yes | IDs of evaluators to run |
| `name` | string | No | Name for the job (defaults to a timestamped name) |
| `conversationIds` | string[] | No | Limit to these conversations |
| `dateRange` | Object with `startDate` and `endDate` (YYYY-MM-DD) | No | Limit to conversations in this date range |
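
For recurring jobs like the weekly check above, you can compute the `dateRange` payload from the current date. This helper is a sketch, not part of the SDK; it only assumes the `YYYY-MM-DD` string format from the table:

```typescript
// Builds a { startDate, endDate } pair covering the `days` days ending at `end`.
function lastNDays(days: number, end: Date = new Date()): { startDate: string; endDate: string } {
  const start = new Date(end);
  start.setUTCDate(start.getUTCDate() - (days - 1)); // range is inclusive of both endpoints
  const fmt = (d: Date) => d.toISOString().slice(0, 10); // YYYY-MM-DD
  return { startDate: fmt(start), endDate: fmt(end) };
}

lastNDays(7, new Date("2025-02-07T00:00:00Z"));
// → { startDate: "2025-02-01", endDate: "2025-02-07" }
```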

To list results by job or run config, use the [Evaluations API](/api-reference/evaluations) (e.g. get evaluation results by job config ID or by run config ID).

## Related

- [Evaluations API reference](/api-reference/evaluations) — Full list of evaluation endpoints and schemas
- [Visual Builder: Evaluations](/visual-builder/evaluations) — Configure evaluators, batch evaluations, and continuous tests in the UI
1 change: 1 addition & 0 deletions agents-docs/content/typescript-sdk/meta.json
"context-fetchers",
"structured-outputs",
"data-operations",
"evaluations",
"(observability)",
"external-agents",
"workspace-configuration",
91 changes: 91 additions & 0 deletions agents-docs/content/visual-builder/evaluations.mdx
---
title: Evaluations
sidebarTitle: Evaluations
description: Configure evaluators, batch evaluations, and continuous tests in the Visual Builder
icon: LuFlaskConical
keywords: evaluations, evaluators, batch evaluations, continuous tests, datasets
---

The Visual Builder lets you define, manage, and run evaluations. You define evaluators (how to score agents), then run them in two ways: **batch evaluations** (one-time jobs over selected conversations) and **continuous tests** (automatic evaluation on a sample of live conversations).

## Where to find evaluations

1. Open your project in the Visual Builder.
2. In the project sidebar, go to **Evaluations** for evaluators, batch jobs, and continuous tests.

<Note>
You need **Edit** permission on the project to create or change evaluators, batch evaluations, and continuous tests. See [Access control](/visual-builder/access-control) for roles and permissions.
</Note>

## Evaluators

Evaluators define how agent responses are scored. Each evaluator has a **prompt**, a **schema** that defines the score or structured output it produces, and optional **pass criteria**.

### Creating an evaluator

1. Go to **Evaluations** and open the **Evaluators** tab.
2. Click **New evaluator**.
3. Fill in:
- **Name** and optional **Description**
- **Prompt** — instructions for the model (e.g. what to score and how)
- **Schema** — JSON schema for structured output (e.g. numeric score, categories)
- **Model** — the model used to run the evaluator
- **Pass criteria** (optional) — conditions on numeric schema fields that define pass/fail (e.g. `score >= 0.8`)

4. Save. The evaluator is then available for batch evaluations and continuous tests.
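
Concretely, the form fields above might be filled in as follows. This is a plain-object sketch with illustrative values (the field names mirror the form labels, not an exact API shape):

```typescript
// Illustrative values for the evaluator form fields described above.
const helpfulnessEvaluator = {
  name: "Helpfulness",
  description: "Scores how helpful the agent response is (0-1)",
  prompt: 'Score how helpful the response is from 0.0 to 1.0. Reply as JSON with a "score" field.',
  schema: {
    type: "object",
    properties: { score: { type: "number", description: "Helpfulness score from 0 to 1" } },
    required: ["score"],
  },
  // Pass when the model's reported score is at least 0.8.
  passCriteria: {
    operator: "and",
    conditions: [{ field: "score", operator: ">=", value: 0.8 }],
  },
};
```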

### Example evaluator

<Image
src="/images/evaluator-example.png"
alt="Evaluator form in the Visual Builder showing name, prompt, schema, model, and pass criteria fields"
/>

### Editing or deleting

From the Evaluators list, open an evaluator to view or edit it, or use the delete action.

## Batch evaluations

Batch evaluations run selected evaluators over a set of conversations once. You choose which evaluators to run and over what date range.

### Creating a batch evaluation

1. Go to **Evaluations** and open the **Batch Evaluations** tab.
2. Click **New batch evaluation**.
3. Select one or more **Evaluators**.
4. Narrow the scope with a **Date range**: only conversations within that range are included.
5. Start the job. A new batch evaluation job is created and runs asynchronously.

### Viewing results

From the Batch Evaluations list, open a job to see its **results**: per-conversation evaluation outputs, pass/fail if pass criteria are set, and status. You can filter and inspect individual results.

## Continuous tests

Continuous tests evaluate a sample of **live** conversations automatically. You specify which evaluators to run, which agents (optional), and a **sample rate** (e.g. 10% of conversations).

### Creating a continuous test

1. Go to **Evaluations** and open the **Continuous Tests** tab.
2. Click **New continuous test**.
3. Set **Name** and optional **Description**.
4. Leave the config **Active** so it runs on new conversations.
5. Choose **Evaluators** to run.
6. Optionally restrict by **Agents** (only evaluate conversations for those agents).
7. Set **Sample rate** (0–1) to evaluate a fraction of matching conversations.
8. Save. Once active, matching conversations will be evaluated according to the sample rate.

### Viewing results

From the Continuous Tests list, open a config to see **Results** for that run config: all evaluation results triggered by that continuous test, with filters.

## Summary

| Area | Purpose |
|------|---------|
| **Evaluators** | Define how to score agent outputs (prompt, model, schema, pass criteria). |
| **Batch evaluations** | Run evaluators once over a scoped set of conversations (date range). |
| **Continuous tests** | Automatically run evaluators on a sample of live conversations. |

For programmatic access to the same concepts, see [TypeScript SDK: Evaluations](/typescript-sdk/evaluations) and the [Evaluations API reference](/api-reference/evaluations).
1 change: 1 addition & 0 deletions agents-docs/content/visual-builder/meta.json
"context-fetchers",
"access-control",
"skills",
"evaluations",
"..."
]
}
Binary file added agents-docs/public/images/evaluator-example.png