You can create and manage custom evaluations from the [Evaluations page][1] in LLM Observability.
### Configure the prompt
1. In Datadog, navigate to the LLM Observability [Evaluations page][1]. Select **Create Evaluation**, then select **Create your own**.
   {{< img src="llm_observability/evaluations/custom_llm_judge_1.png" alt="The LLM Observability Evaluations page with the Create Evaluation side panel opened. The first item, 'Create your own,' is selected." style="width:100%;" >}}

2. Provide a clear, descriptive **evaluation name** (for example, `factuality-check` or `tone-eval`). You can use this name when querying evaluation results. The name must be unique within your application.

3. Use the **Account** drop-down menu to select the LLM provider and corresponding account to use for your LLM judge. To connect a new account, see [connect an LLM provider][2].

4. Use the **Model** drop-down menu to select a model to use for your LLM judge.

5. Under the **Evaluation Prompt** section, use the **Prompt Template** drop-down menu:

   - **Create from scratch**: Use your own custom prompt (defined in the next step).
   - **Failure to Answer**, **Prompt Injection**, **Sentiment**, etc.: Populate a pre-existing prompt template. You can use these templates as-is, or modify them to match your specific evaluation logic.
6. In the **System Prompt** field, enter your custom prompt or modify a prompt template.

   For custom prompts, provide clear instructions describing what the evaluator should assess.

   - Focus on a single evaluation goal

7. In the **User** field, provide your user prompt. Explicitly specify what parts of the span to evaluate: Span Input (`{{span_input}}`), Output (`{{span_output}}`), or both.
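
   For example, a minimal user prompt that passes both values to the judge might look like the following sketch (the exact wording is up to you):

   ```
   Evaluate the exchange below.

   Span Input: {{span_input}}
   Span Output: {{span_output}}
   ```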
### Define the evaluation output

For OpenAI or Azure OpenAI models, configure [Structured Output](#structured-output).

For Anthropic or Amazon Bedrock models, configure [Keyword Search Output](#keyword-search-output).

#### Structured Output

1. Select an evaluation output type:
   - **Boolean**: True/false results (for example, "Did the model follow instructions?")
   - **Score**: Numeric ratings (for example, a 1–5 scale for helpfulness)
   - **Categorical**: Discrete labels (for example, "Good", "Bad", "Neutral")

2. Optionally, select **Enable Reasoning**. This configures the LLM judge to provide a short justification for its decision (for example, why a score of 8 was given). Reasoning helps you understand how and why evaluations are made, and is particularly useful for auditing subjective metrics like tone, empathy, or helpfulness. Adding reasoning can also [make the LLM judge more accurate](https://arxiv.org/abs/2504.00050).

3. Edit a JSON schema that defines your evaluation's output type:

{{< tabs >}}
{{% tab "Boolean" %}}
For the **Boolean** output type, edit the `description` field to further explain what true and false mean in your use case.
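
For reference, the Boolean schema follows the same overall shape as the categorical example shown in the **Categorical** tab. The sketch below is only illustrative; the `boolean_eval` property name and its description are assumptions, not exact defaults:

```
{
  "name": "boolean_eval",
  "schema": {
    "type": "object",
    "required": [
      "boolean_eval",
      "reasoning"
    ],
    "properties": {
      "boolean_eval": {
        "type": "boolean",
        "description": "True if the response follows the user's instructions, false otherwise."
      },
      "reasoning": {
        "type": "string",
        "description": "Describe how you decided the value"
      }
    },
    "additionalProperties": false
  },
  "strict": true
}
```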
{{% /tab %}}
{{% tab "Score" %}}
For the **Score** output type:
- Set a `min` and `max` score for your evaluation.
- Edit the `description` field to further explain the scale of your evaluation.
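
As a reference only, the sketch below assumes a 1–5 helpfulness scale and expresses the `min` and `max` values with standard JSON Schema `minimum` and `maximum` keywords; your actual schema may name these differently:

```
{
  "name": "score_eval",
  "schema": {
    "type": "object",
    "required": [
      "score_eval",
      "reasoning"
    ],
    "properties": {
      "score_eval": {
        "type": "integer",
        "minimum": 1,
        "maximum": 5,
        "description": "Helpfulness of the response, from 1 (not helpful) to 5 (very helpful)."
      },
      "reasoning": {
        "type": "string",
        "description": "Describe how you decided the score"
      }
    },
    "additionalProperties": false
  },
  "strict": true
}
```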
{{% /tab %}}
{{% tab "Categorical" %}}
For the **Categorical** output type:
- Add or remove categories by editing the JSON schema.
- Edit category names.
- Edit the `description` field of categories to further explain what they mean in the context of your evaluation.

An example schema for a categorical evaluation:
```
{
  "name": "categorical_eval",
  "schema": {
    "type": "object",
    "required": [
      "categorical_eval",
      "reasoning"
    ],
    "properties": {
      "categorical_eval": {
        "type": "string",
        "anyOf": [
          {
            "const": "budgeting_question",
            "description": "The user is asking a question about their budget. The answer can be directly determined by looking at their budget and spending."
          },
          {
            "const": "budgeting_request",
            "description": "The user is asking to change something about their budget. This should involve an action that changes their budget."
          },
          {
            "const": "budgeting_advice",
            "description": "The user is asking for advice on their budget. This should not require a change to their budget, but it should require an analysis of their budget and spending."
          },
          {
            "const": "general_financial_advice",
            "description": "The user is asking for general financial advice which is not directly related to their specific budget. However, this can include advice about budgeting in general."
          },
          {
            "const": "unrelated",
            "description": "This is a catch-all category for things not related to budgeting or financial advice."
          }
        ]
      },
      "reasoning": {
        "type": "string",
        "description": "Describe how you decided the category"
      }
    },
    "additionalProperties": false
  },
  "strict": true
}
```
{{% /tab %}}
{{< /tabs >}}

4. Configure **Assessment Criteria**.
   This flexibility allows you to align evaluation outcomes with your team’s quality bar. Pass/fail mapping also powers automation across Datadog LLM Observability, enabling monitors and dashboards to flag regressions or track overall health.

{{< tabs >}}
{{% tab "Boolean" %}}
Select **True** to mark a result as "Pass", or **False** to mark a result as "Fail".
{{% /tab %}}
{{% tab "Score" %}}
Define numerical thresholds to determine passing performance.
{{% /tab %}}
{{% tab "Categorical" %}}
Select the categories that should map to a passing state. For example, if you have the categories `Excellent`, `Good`, and `Poor`, where only `Poor` should correspond to a failing state, select `Excellent` and `Good`.
{{% /tab %}}
{{< /tabs >}}

#### Keyword Search Output

<div class="alert alert-info">For Anthropic and Amazon Bedrock models, only the <strong>Boolean</strong> output type is available.</div>

2. Provide **True keywords** and **False keywords** that define when the evaluation result is true or false, respectively.

   Datadog searches the LLM-as-a-judge's response text for your defined keywords and provides the appropriate results for the evaluation. For this reason, you should instruct the LLM to respond with your chosen keywords.

   For example, if you set:

   - **True keywords**: Yes, yes
   - **False keywords**: No, no

   Then your system prompt should include something like `Respond with "yes" or "no"`.

3. For **Assessment Criteria**:
   - Select **True** to mark a result as "Pass"
   - Select **False** to mark a result as "Fail"

   This flexibility allows you to align evaluation outcomes with your team’s quality bar. Pass/fail mapping also powers automation across Datadog LLM Observability, enabling monitors and dashboards to flag regressions or track overall health.
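
   To tie this together with the keyword guidance in the previous step, a judge system prompt for this output type might end with an explicit response instruction. The wording below is only a sketch, not a template from the product:

   ```
   You are judging whether the assistant answered the user's question.
   Read the span input and output, then decide.
   Respond with "yes" if the question was answered, or "no" if it was not.
   ```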
{{< img src="llm_observability/evaluations/custom_llm_judge_5.png" alt="Configuring the custom evaluation output under Structured Output, including reasoning and assessment criteria." style="width:100%;" >}}
### Define the evaluation scope: Filtering and sampling

Under **Evaluation Scope**, define where and how your evaluation runs. This helps control coverage (which spans are included) and cost (how many spans are sampled).

- **Application**: Select the application you want to evaluate.
- **Evaluate On**: Choose one of the following:
  - **Traces**: Evaluate only root spans
- **Tags**: (Optional) Limit evaluation to spans with certain tags.
- **Sampling Rate**: (Optional) Apply sampling (for example, 10%) to control evaluation cost.
### Test and preview

Use the **Test Evaluation** panel on the right to preview how your evaluator performs. Enter sample `{{span_input}}` and `{{span_output}}` values, then click **Run Evaluation** to see the result returned by your LLM judge, the reasoning explanation, and whether it passed or failed.

Refine your prompt and schema until outputs are consistent and interpretable.

{{< img src="llm_observability/evaluations/custom_llm_judge_2-2.png" alt="Creation flow for a custom LLM-as-a-judge evaluation. On the right, under Test Evaluation, sample span_input and span_output have been provided. An Evaluation Result textbox below displays a sample result." style="width:100%;" >}}
## Viewing and using results
After you save your evaluation, Datadog automatically runs it on targeted spans. Results are available across LLM Observability in near real-time. You can find your custom LLM-as-a-judge results for a specific span in the **Evaluations** tab, alongside other evaluations.

{{< img src="llm_observability/evaluations/custom_llm_judge_3-2.png" alt="The Evaluations tab of a trace, displaying custom evaluation results alongside managed evaluations." style="width:100%;" >}}
Each evaluation result includes:

- The evaluated value (for example, `True`, `9`, or `Neutral`)
- The reasoning (when enabled)
- The pass/fail indicator (based on your assessment criteria)

Use the syntax `@evaluations.custom.<evaluation_name>` to query or visualize results.
You can:
- Filter traces by evaluation results (for example, `@evaluations.custom.helpfulness-check`)
- Filter by pass/fail assessment status (for example, `@evaluations.assessment.custom.helpfulness-check:fail`)
- Use evaluation results as [facets][3]
- View aggregate results in the LLM Observability Overview page's Evaluation section
- Create [monitors][4] to alert on performance changes or regressions
## Best practices for reliable custom evaluations
- **Start small**: Target a single, well-defined failure mode before scaling.
- **Enable reasoning**: Use it when you need explainable decisions or to improve accuracy on complex reasoning tasks.
- **Iterate**: Run, inspect outputs, and refine your prompt.
- **Validate**: Periodically check evaluator accuracy using sampled traces.
- **Document your rubric**: Clearly define what "Pass" and "Fail" mean to avoid drift over time.