
Commit 37c6a10

gsvigruha and cswatt authored
Add Reasoning and Assessment criteria to the custom_llm_as_a_judge_evaluation section (#32376)
* add more byop steps
* fixes
* nit
* add reasoning to json
* update photos
* bullets
* nits
* nits
* moving information around
* image swap
* replace old images

Signed-off-by: Greg Svigruha <gergely.svigruha@datadoghq.com>
Co-authored-by: cecilia saixue wat-kim <cecilia.watt@datadoghq.com>
1 parent 3060259 commit 37c6a10

File tree

6 files changed: +146 -94 lines changed

content/en/llm_observability/evaluations/custom_llm_as_a_judge_evaluations.md

Lines changed: 146 additions & 28 deletions

@@ -25,20 +25,22 @@ Custom LLM-as-a-judge evaluations use an LLM to judge the performance of another

You can create and manage custom evaluations from the [Evaluations page][1] in LLM Observability.

### Configure the prompt

1. In Datadog, navigate to the LLM Observability [Evaluations page][1]. Select **Create Evaluation**, then select **Create your own**.

   {{< img src="llm_observability/evaluations/custom_llm_judge_1.png" alt="The LLM Observability Evaluations page with the Create Evaluation side panel opened. The first item, 'Create your own,' is selected." style="width:100%;" >}}

2. Provide a clear, descriptive **evaluation name** (for example, `factuality-check` or `tone-eval`). You can use this name when querying evaluation results. The name must be unique within your application.

3. Use the **Account** drop-down menu to select the LLM provider and corresponding account to use for your LLM judge. To connect a new account, see [connect an LLM provider][2].

4. Use the **Model** drop-down menu to select a model to use for your LLM judge.

5. In the **Evaluation Prompt** section, use the **Prompt Template** drop-down menu:
   - **Create from scratch**: Use your own custom prompt (defined in the next step).
   - **Failure to Answer**, **Prompt Injection**, **Sentiment**, etc.: Populate a pre-existing prompt template. You can use these templates as-is, or modify them to match your specific evaluation logic.

6. In the **System Prompt** field, enter your custom prompt or modify a prompt template.
   For custom prompts, provide clear instructions describing what the evaluator should assess.

   - Focus on a single evaluation goal

@@ -79,34 +81,136 @@ Span Input: {{span_input}}

7. In the **User** field, provide your user prompt. Explicitly specify what parts of the span to evaluate: Span Input (`{{span_input}}`), Output (`{{span_output}}`), or both.
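
   For instance, a user prompt that evaluates both sides of the span might look like the following sketch (illustrative wording only, not a template shipped with Datadog):

   ```
   Span Input: {{span_input}}
   Span Output: {{span_output}}

   Evaluate the Span Output against the Span Input using the criteria described in the system prompt.
   ```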

### Define the evaluation output

For OpenAI or Azure OpenAI models, configure [Structured Output](#structured-output).

For Anthropic or Amazon Bedrock models, configure [Keyword Search Output](#keyword-search-output).

{{% collapse-content title="Structured Output (OpenAI, Azure OpenAI)" level="h4" expanded="true" id="structured-output" %}}

1. Select an evaluation output type:

   - **Boolean**: True/false results (for example, "Did the model follow instructions?")
   - **Score**: Numeric ratings (for example, a 1–5 scale for helpfulness)
   - **Categorical**: Discrete labels (for example, "Good", "Bad", "Neutral")

2. Optionally, select **Enable Reasoning**. This configures the LLM judge to provide a short justification for its decision (for example, why a score of 8 was given). Reasoning helps you understand how and why evaluations are made, and is particularly useful for auditing subjective metrics like tone, empathy, or helpfulness. Adding reasoning can also [make the LLM judge more accurate](https://arxiv.org/abs/2504.00050).

3. Edit the JSON schema that defines your evaluation's output type:

{{< tabs >}}
{{% tab "Boolean" %}}
For the **Boolean** output type, edit the `description` field to further explain what true and false mean in your use case.
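
For illustration, a Boolean schema might look like the following sketch, with hypothetical field names modeled on the categorical example in the **Categorical** tab and reasoning enabled (the schema Datadog pre-fills for you may differ):

```
{
  "name": "boolean_eval",
  "schema": {
    "type": "object",
    "required": [
      "boolean_eval",
      "reasoning"
    ],
    "properties": {
      "boolean_eval": {
        "type": "boolean",
        "description": "True if the response follows the user's instructions; false otherwise."
      },
      "reasoning": {
        "type": "string",
        "description": "Describe how you decided the value"
      }
    },
    "additionalProperties": false
  },
  "strict": true
}
```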
{{% /tab %}}

{{% tab "Score" %}}
For the **Score** output type:

- Set a `min` and `max` score for your evaluation.
- Edit the `description` field to further explain the scale of your evaluation.
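
As a sketch, a Score schema could express the range with the JSON Schema `minimum` and `maximum` keywords (hypothetical field names, shown with reasoning enabled; the schema Datadog pre-fills for you may differ):

```
{
  "name": "score_eval",
  "schema": {
    "type": "object",
    "required": [
      "score_eval",
      "reasoning"
    ],
    "properties": {
      "score_eval": {
        "type": "integer",
        "minimum": 1,
        "maximum": 5,
        "description": "1 means the response is not helpful at all; 5 means it fully resolves the user's request."
      },
      "reasoning": {
        "type": "string",
        "description": "Describe how you decided the score"
      }
    },
    "additionalProperties": false
  },
  "strict": true
}
```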
{{% /tab %}}

{{% tab "Categorical" %}}
For the **Categorical** output type:

- Add or remove categories by editing the JSON schema.
- Edit category names.
- Edit the `description` field of categories to further explain what they mean in the context of your evaluation.

An example schema for a categorical evaluation:

```
{
  "name": "categorical_eval",
  "schema": {
    "type": "object",
    "required": [
      "categorical_eval",
      "reasoning"
    ],
    "properties": {
      "categorical_eval": {
        "type": "string",
        "anyOf": [
          {
            "const": "budgeting_question",
            "description": "The user is asking a question about their budget. The answer can be directly determined by looking at their budget and spending."
          },
          {
            "const": "budgeting_request",
            "description": "The user is asking to change something about their budget. This should involve an action that changes their budget."
          },
          {
            "const": "budgeting_advice",
            "description": "The user is asking for advice on their budget. This should not require a change to their budget, but it should require an analysis of their budget and spending."
          },
          {
            "const": "general_financial_advice",
            "description": "The user is asking for general financial advice which is not directly related to their specific budget. However, this can include advice about budgeting in general."
          },
          {
            "const": "unrelated",
            "description": "This is a catch-all category for things not related to budgeting or financial advice."
          }
        ]
      },
      "reasoning": {
        "type": "string",
        "description": "Describe how you decided the category"
      }
    },
    "additionalProperties": false
  },
  "strict": true
}
```
{{% /tab %}}
{{< /tabs >}}

4. Configure **Assessment Criteria**.

   This flexibility allows you to align evaluation outcomes with your team’s quality bar. Pass/fail mapping also powers automation across Datadog LLM Observability, enabling monitors and dashboards to flag regressions or track overall health.

{{< tabs >}}
{{% tab "Boolean" %}}
Select **True** to mark a result as "Pass", or **False** to mark a result as "Fail".
{{% /tab %}}

{{% tab "Score" %}}
Define numerical thresholds to determine passing performance.
{{% /tab %}}

{{% tab "Categorical" %}}
Select the categories that should map to a passing state. For example, if you have the categories `Excellent`, `Good`, and `Poor`, where only `Poor` should correspond to a failing state, select `Excellent` and `Good`.
{{% /tab %}}
{{< /tabs >}}

{{% /collapse-content %}}

{{% collapse-content title="Keyword Search Output (Anthropic, Amazon Bedrock)" level="h4" expanded="true" id="keyword-search-output" %}}

1. Select the **Boolean** output type.

   <div class="alert alert-info">For Anthropic and Amazon Bedrock models, only the <strong>Boolean</strong> output type is available.</div>

2. Provide **True keywords** and **False keywords** that define when the evaluation result is true or false, respectively.

   Datadog searches the LLM-as-a-judge's response text for your defined keywords and provides the appropriate results for the evaluation. For this reason, you should instruct the LLM to respond with your chosen keywords.

   For example, if you set:

   - **True keywords**: Yes, yes
   - **False keywords**: No, no

   Then your system prompt should include something like `Respond with "yes" or "no"`.
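
   As an illustration, a keyword-friendly system prompt could end with an instruction like the following (hypothetical wording, not a template shipped with Datadog):

   ```
   Read the Span Input and Span Output. If the output resolves the user's request,
   respond with "yes". Otherwise, respond with "no". Respond with only "yes" or "no".
   ```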

3. For **Assessment Criteria**:

   - Select **True** to mark a result as "Pass"
   - Select **False** to mark a result as "Fail"

   This flexibility allows you to align evaluation outcomes with your team’s quality bar. Pass/fail mapping also powers automation across Datadog LLM Observability, enabling monitors and dashboards to flag regressions or track overall health.

{{% /collapse-content %}}

{{< img src="llm_observability/evaluations/custom_llm_judge_5.png" alt="Configuring the custom evaluation output under Structured Output, including reasoning and assessment criteria." style="width:100%;" >}}

### Define the evaluation scope: Filtering and sampling

Under **Evaluation Scope**, define where and how your evaluation runs. This helps control coverage (which spans are included) and cost (how many spans are sampled).

- **Application**: Select the application you want to evaluate.
- **Evaluate On**: Choose one of the following:
  - **Traces**: Evaluate only root spans

@@ -115,16 +219,28 @@ Span Input: {{span_input}}

- **Tags**: (Optional) Limit evaluation to spans with certain tags.
- **Sampling Rate**: (Optional) Apply sampling (for example, 10%) to control evaluation cost.

### Test and preview

Use the **Test Evaluation** panel on the right to preview results.
You can enter sample `{{span_input}}` and `{{span_output}}` values and click **Run Evaluation** to see the result, the reasoning explanation, and the pass/fail status returned by your LLM judge.

Refine your prompt and schema until outputs are consistent and interpretable.

{{< img src="llm_observability/evaluations/custom_llm_judge_2-2.png" alt="Creation flow for a custom LLM-as-a-judge evaluation. On the right, under Test Evaluation, sample span_input and span_output have been provided. An Evaluation Result textbox below displays a sample result." style="width:100%;" >}}

## Viewing and using results

After you save your evaluation, Datadog automatically runs your evaluation on targeted spans. Results are available across LLM Observability in near-real-time. You can find your custom LLM-as-a-judge results for a specific span in the **Evaluations** tab, alongside other evaluations.

{{< img src="llm_observability/evaluations/custom_llm_judge_3-2.png" alt="The Evaluations tab of a trace, displaying custom evaluation results alongside managed evaluations." style="width:100%;" >}}

Each evaluation result includes:

- The evaluated value (for example, `True`, `9`, or `Neutral`)
- The reasoning (when enabled)
- The pass/fail indicator (based on your assessment criteria)

Use the syntax `@evaluations.custom.<evaluation_name>` to query or visualize results.

@@ -137,14 +253,16 @@ For example:

You can:
- Filter traces by evaluation results (for example, `@evaluations.custom.helpfulness-check`)
- Filter by pass/fail assessment status (for example, `@evaluations.assessment.custom.helpfulness-check:fail`), as shown in the sample queries after this list
- Use evaluation results as [facets][3]
- View aggregate results in the LLM Observability Overview page's Evaluation section
- Create [monitors][4] to alert on performance changes or regression
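
For instance, with a hypothetical Boolean evaluation named `helpfulness-check`, queries along these lines filter the Trace Explorer by the raw result and by assessment status (the exact value syntax depends on your output type):

```
@evaluations.custom.helpfulness-check:false
@evaluations.assessment.custom.helpfulness-check:fail
```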

## Best practices for reliable custom evaluations

- **Start small**: Target a single, well-defined failure mode before scaling.
- **Enable reasoning**: Turn on reasoning when you need explainable decisions or want to improve accuracy on complex reasoning tasks.
- **Iterate**: Run, inspect outputs, and refine your prompt.
- **Validate**: Periodically check evaluator accuracy using sampled traces.
- **Document your rubric**: Clearly define what "Pass" and "Fail" mean to avoid drift over time.

@@ -158,4 +276,4 @@ You can:

[2]: /llm_observability/evaluations/managed_evaluations#connect-your-llm-provider-account
[3]: /service_management/events/explorer/facets/
[4]: /monitors/
[5]: https://arxiv.org/abs/2504.00050

layouts/shortcodes/llm-eval-output-json.en.md

Lines changed: 0 additions & 56 deletions
This file was deleted.

layouts/shortcodes/llm-eval-output-keyword.en.md

Lines changed: 0 additions & 10 deletions
This file was deleted.

3 image files changed (430 KB, 577 KB, 308 KB)
