evaluation docs #2123 (Merged, +214 −0)

Commits (6, all by shagun-singh-inkeep):

- ff25add: eval docs
- 65f3b1c: Update agents-docs/content/typescript-sdk/evaluations.mdx
- 4d88e8a: Update agents-docs/content/visual-builder/evaluations.mdx
- 5d466d2: Update agents-docs/content/visual-builder/evaluations.mdx
- 1c55dca: claude
- f398b46: claude
File: agents-docs/content/typescript-sdk/evaluations.mdx

---
title: Evaluations
sidebarTitle: Evaluations
description: Manage evaluators programmatically with the TypeScript SDK
icon: LuFlaskConical
keywords: evaluations, evaluators, batch evaluation
---
The TypeScript SDK provides an **EvaluationClient** that talks to the Evaluations API, so you can manage evaluators and evaluation suite configs, trigger batch evaluations, and read results, all from code.

For full endpoint details and request/response shapes, see the [Evaluations API reference](/api-reference/evaluations).
## Setup: create a client

Create an evaluation client with your tenant ID, project ID, API base URL, and an optional API key.
```typescript
import { EvaluationClient } from "@inkeep/agents-sdk";

const client = new EvaluationClient({
  tenantId: process.env.INKEEP_TENANT_ID!,
  projectId: process.env.INKEEP_PROJECT_ID!,
  apiUrl: "https://api.inkeep.com",
  apiKey: process.env.INKEEP_API_KEY,
});
```
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `tenantId` | string | Yes | Your tenant (organization) ID |
| `projectId` | string | Yes | Your project ID |
| `apiUrl` | string | Yes | API base URL (e.g. `https://api.inkeep.com` or your self-hosted URL) |
| `apiKey` | string | No | Bearer token for authenticated requests. Omit for unauthenticated or custom auth. |

Use `client` in the examples below (e.g. `client.createEvaluator(...)`).
## Evaluators

Evaluators define how to score agent outputs, for example with a prompt, a model, and optional pass criteria.

### Creating an evaluator

Pass an object with `name`, `description`, `prompt`, `schema` (a JSON schema for the evaluator output), and `model` (a model identifier plus optional provider options). Optionally include `passCriteria` to define pass/fail conditions on the schema fields.
```typescript
const evaluator = await client.createEvaluator({
  name: "Helpfulness",
  description: "Scores how helpful the agent response is (0-1)",
  prompt: `You are an expert evaluator. Score how helpful the assistant's response is to the user on a scale of 0.0 to 1.0.
Consider clarity, relevance, and completeness. Respond with a JSON object with a "score" field.`,
  schema: {
    type: "object",
    properties: {
      score: { type: "number", description: "Helpfulness score from 0 to 1" },
    },
    required: ["score"],
  },
  model: {
    model: "gpt-4o-mini",
    providerOptions: {},
  },
  passCriteria: {
    operator: "and",
    conditions: [{ field: "score", operator: ">=", value: 0.8 }],
  },
});
```
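The `passCriteria` object combines one or more conditions under an `and`/`or` operator. The evaluation service applies these server-side; the sketch below is only a conceptual illustration of the semantics, and the full set of comparison operators beyond the `>=` shown above is an assumption, not confirmed SDK behavior:

```typescript
// Conceptual sketch of passCriteria semantics. The type and helper names here
// are illustrative; the real evaluation happens server-side.
type Op = ">=" | "<=" | ">" | "<" | "==";
type Condition = { field: string; operator: Op; value: number };
type PassCriteria = { operator: "and" | "or"; conditions: Condition[] };

const compare: Record<Op, (a: number, b: number) => boolean> = {
  ">=": (a, b) => a >= b,
  "<=": (a, b) => a <= b,
  ">": (a, b) => a > b,
  "<": (a, b) => a < b,
  "==": (a, b) => a === b,
};

// Evaluate each condition against the evaluator's structured output,
// then combine with "and" (all must hold) or "or" (any may hold).
function passes(output: Record<string, number>, criteria: PassCriteria): boolean {
  const results = criteria.conditions.map(
    (c) => compare[c.operator](output[c.field], c.value)
  );
  return criteria.operator === "and" ? results.every(Boolean) : results.some(Boolean);
}
```

Under the example criteria above, an output of `{ score: 0.9 }` would pass and `{ score: 0.5 }` would not.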
## Evaluation suite configs

Suite configs group evaluators with optional agent filters and a sample rate. They are used by **continuous tests** (evaluation run configs) to decide which conversations to evaluate automatically.

### Creating an evaluation suite config

Pass **evaluatorIds** (required, at least one) and optionally **sampleRate** (0–1) and **filters** (e.g. `agentIds` to restrict which agents' conversations are evaluated). The suite can then be attached to a continuous test (evaluation run config).
```typescript
const suiteConfig = await client.createEvaluationSuiteConfig({
  evaluatorIds: ["eval-helpfulness", "eval-accuracy"],
  sampleRate: 0.1,
  filters: {
    agentIds: ["agent-support-bot"],
  },
});
```
| Option | Type | Required | Description |
|--------|------|----------|-------------|
| `evaluatorIds` | string[] | Yes | At least one evaluator ID to run in this suite |
| `sampleRate` | number | No | Fraction of matching conversations to evaluate (0–1). Omit to evaluate all. |
| `filters` | object | No | Restrict which conversations are in scope, e.g. `{ agentIds: ["agent-id"] }` |
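`sampleRate` is a fraction: `0.1` means roughly one in ten matching conversations is evaluated. The sampling itself happens server-side; conceptually it behaves something like this sketch (the function name is illustrative, not part of the SDK):

```typescript
// Illustrative only: per-conversation sampling at a given rate.
// sampleRate = 0.1 evaluates roughly 10% of matching conversations;
// 0 never samples, 1 always samples.
function shouldEvaluate(sampleRate: number): boolean {
  return Math.random() < sampleRate;
}
```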
## Batch evaluation

Trigger a one-off batch evaluation over conversations, optionally filtered by conversation IDs or date range:
```typescript
const result = await client.triggerBatchEvaluation({
  evaluatorIds: ["eval-1", "eval-2"],
  name: "Weekly quality check",
  dateRange: {
    startDate: "2025-02-01",
    endDate: "2025-02-07",
  },
});
// result: { message, evaluationJobConfigId, evaluatorIds }
```
| Option | Type | Required | Description |
|--------|------|----------|-------------|
| `evaluatorIds` | string[] | Yes | IDs of evaluators to run |
| `name` | string | No | Name for the job (defaults to a timestamped name) |
| `conversationIds` | string[] | No | Limit to these conversations |
| `dateRange` | object | No | `startDate` and `endDate` strings (`YYYY-MM-DD`); limit to conversations in this range |
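Since `dateRange` takes `YYYY-MM-DD` strings, a small helper (illustrative, not part of the SDK) can build a rolling window such as "last 7 days" for a recurring quality check:

```typescript
// Illustrative helper (not part of the SDK): build a YYYY-MM-DD date range
// covering the last `days` days, suitable for triggerBatchEvaluation's dateRange.
// Note: dates are formatted in UTC via toISOString.
function lastNDays(days: number, now: Date = new Date()): { startDate: string; endDate: string } {
  const toYmd = (d: Date) => d.toISOString().slice(0, 10);
  const start = new Date(now.getTime() - days * 24 * 60 * 60 * 1000);
  return { startDate: toYmd(start), endDate: toYmd(now) };
}
```

For example, `lastNDays(6)` evaluated on 2025-02-07 yields the same window as the hard-coded range in the example above.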
To list results by job or by run config, use the [Evaluations API](/api-reference/evaluations), e.g. get evaluation results by job config ID or by run config ID.
## Related

- [Evaluations API reference](/api-reference/evaluations) — Full list of evaluation endpoints and schemas
- [Visual Builder: Evaluations](/visual-builder/evaluations) — Configure evaluators, batch evaluations, and continuous tests in the UI
File: agents-docs/content/visual-builder/evaluations.mdx

---
title: Evaluations
sidebarTitle: Evaluations
description: Configure evaluators, batch evaluations, and continuous tests in the Visual Builder
icon: LuFlaskConical
keywords: evaluations, evaluators, batch evaluations, continuous tests, datasets
---
The Visual Builder lets you define, manage, and run evaluations. You define evaluators (how to score agents), then run them in two ways: **batch evaluations** (one-time jobs over selected conversations) and **continuous tests** (automatic evaluation of a sample of live conversations).

## Where to find evaluations

1. Open your project in the Visual Builder.
2. In the project sidebar, go to **Evaluations** for evaluators, batch jobs, and continuous tests.

<Note>
You need **Edit** permission on the project to create or change evaluators, batch evaluations, and continuous tests. See [Access control](/visual-builder/access-control) for roles and permissions.
</Note>
## Evaluators

Evaluators define how agent responses are scored. Each evaluator has a **prompt**, a **schema**, and optional **pass criteria**, which together produce a score or structured output.

### Creating an evaluator

1. Go to **Evaluations** and open the **Evaluators** tab.
2. Click **New evaluator**.
3. Fill in:
   - **Name** and optional **Description**
   - **Prompt** — instructions for the model (e.g. what to score and how)
   - **Schema** — JSON schema for structured output (e.g. numeric score, categories)
   - **Model** — the model used to run the evaluator
   - **Pass criteria** (optional) — conditions on numeric schema fields that define pass/fail (e.g. `score >= 0.8`)
4. Save. The evaluator is then available for batch evaluations and continuous tests.
### Example evaluator

<Image
  src="/images/evaluator-example.png"
  alt="Evaluator form in the Visual Builder showing name, prompt, schema, model, and pass criteria fields"
/>

### Editing or deleting

From the Evaluators list, open an evaluator to view or edit it, or use the delete action.
## Batch evaluations

Batch evaluations run selected evaluators once over a set of conversations. You choose which evaluators to run and over what date range.

### Creating a batch evaluation

1. Go to **Evaluations** and open the **Batch Evaluations** tab.
2. Click **New batch evaluation**.
3. Select one or more **Evaluators**.
4. Narrow the scope with a **Date range**; only conversations within that range are evaluated.
5. Start the job. A new batch evaluation job is created and runs asynchronously.
### Viewing results

From the Batch Evaluations list, open a job to see its **results**: per-conversation evaluation outputs, pass/fail where pass criteria are set, and job status. You can filter and inspect individual results.
## Continuous tests

Continuous tests evaluate a sample of **live** conversations automatically. You specify which evaluators to run, which agents to include (optional), and a **sample rate** (e.g. 10% of conversations).

### Creating a continuous test

1. Go to **Evaluations** and open the **Continuous Tests** tab.
2. Click **New continuous test**.
3. Set a **Name** and optional **Description**.
4. Make sure the config is **Active** so it runs on new conversations.
5. Choose the **Evaluators** to run.
6. Optionally restrict by **Agents** (only evaluate conversations for those agents).
7. Set a **Sample rate** (0–1) to evaluate a fraction of matching conversations.
8. Save. Once active, matching conversations are evaluated according to the sample rate.
### Viewing results

From the Continuous Tests list, open a config to see its **Results**: all evaluation results triggered by that continuous test, with filters.
## Summary

| Area | Purpose |
|------|---------|
| **Evaluators** | Define how to score agent outputs (prompt, model, schema, pass criteria). |
| **Batch evaluations** | Run evaluators once over a scoped set of conversations (date range). |
| **Continuous tests** | Automatically run evaluators on a sample of live conversations. |

For programmatic access to the same concepts, see [TypeScript SDK: Evaluations](/typescript-sdk/evaluations) and the [Evaluations API reference](/api-reference/evaluations).
The PR also adds an `"evaluations"` entry to a docs navigation list (diff fragment, context lines shown):

```json
  "context-fetchers",
  "access-control",
  "skills",
  "evaluations",
  "..."
]
}
```