
Commit

updated confident docs
penguine-ip committed Jan 23, 2025
1 parent 20ff484 commit 714ad12
Showing 6 changed files with 234 additions and 53 deletions.
38 changes: 38 additions & 0 deletions docs/confident-ai/confident-ai-advanced-evaluation-model.mdx
@@ -0,0 +1,38 @@
---
id: confident-ai-advanced-evaluation-model
title: Setup Evaluation Model
sidebar_label: Setup Evaluation Model
---

If you choose to run evaluations on Confident AI, you'll need to specify which evaluation model to use when running LLM-as-a-judge metrics. You'll need to configure an evaluation model if you're using any of these features:

- [Online evaluations](/confident-ai/confident-ai-llm-monitoring-evaluations)
- [Experiments](/confident-ai/confident-ai-testing-n-evaluation-experiments)

To set up an evaluation model, you can either:

1. Use Confident AI's default models (no additional setup required).
2. Use OpenAI by supplying your own OpenAI API key.
3. Use a custom endpoint as your evaluation model.

### Using Custom LLMs

To use a custom LLM, go to **Project Settings** > **Evaluation Model**, and select the "Custom" option under the model provider dropdown. You'll have the opportunity to supply a model "Name" as well as an "Endpoint". The name is merely for identification purposes, but the model endpoint must follow these rules:

1. It **MUST** accept a `POST` request over HTTPS, and must be reachable over the internet.
2. It **MUST** return a `200` status where **the response body is a simple string** (not JSON, nothing else, just a plain string) that represents the generated text of your evaluation model based on the given request body (which contains the `input`).
3. It **MUST** correctly parse the request body to extract the `input` and use your evaluation model to generate a response based on this `input`.

Here is what the structure of the request body looks like:

```json
{
"input": "Given the metric score, tell me why..."
}
```
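
Below is a minimal sketch of what such an endpoint might look like, assuming a FastAPI server; the framework, route name, and the `call_evaluation_model` helper are placeholders for your own stack:

```python
from fastapi import FastAPI, Request
from fastapi.responses import PlainTextResponse

app = FastAPI()


def call_evaluation_model(prompt: str) -> str:
    # Placeholder: swap in a call to your actual evaluation model
    # (e.g. an OpenAI/Anthropic client or a self-hosted LLM).
    return f"Illustrative response for: {prompt}"


@app.post("/evaluate", response_class=PlainTextResponse)
async def evaluate(request: Request) -> str:
    body = await request.json()
    # 1. Parse the request body to extract the `input`
    prompt = body["input"]
    # 2. Generate text with your evaluation model
    generated_text = call_evaluation_model(prompt)
    # 3. Return a plain string (not JSON) with a 200 status
    return generated_text
```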

Please test that your endpoint meets all of the above requirements before continuing. This can be harder to debug than other steps, so aim to log any errors and reach out to support@confident-ai.com if you run into issues during setup.
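
As a quick sanity check, you could hit your endpoint with a small script like the one below (the URL is hypothetical; replace it with your own):

```python
import requests

# Hypothetical URL; replace with your own endpoint
resp = requests.post(
    "https://your-domain.com/evaluate",
    json={"input": "Given the metric score, tell me why..."},
    timeout=30,
)
print(resp.status_code)  # expect 200
print(resp.text)         # expect the raw generated text, not a JSON object
```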

:::tip
If you find a model provider that is not supported, most of the time it is because no one has asked for it yet. Simply email support@confident-ai.com if that is the case.
:::
73 changes: 73 additions & 0 deletions docs/confident-ai/confident-ai-advanced-llm-connection.mdx
@@ -0,0 +1,73 @@
---
id: confident-ai-advanced-llm-connection
title: Setup LLM Endpoint Connection
sidebar_label: Setup LLM Endpoint Connection
---

:::tip
This is particularly helpful if you wish to enable a no-code evaluation workflow for non-technical users, or to simulate conversations to be evaluated in one click of a button.
:::

You can also set up an LLM endpoint that accepts a `POST` request over HTTPS to **enable users to run evaluations directly on the platform without having to code**, and start an evaluation with a click of a button instead. At a high level, you provide Confident AI with the mappings to test case parameters such as the `actual_output`, `retrieval_context`, etc., and at evaluation time Confident AI will use the dataset and metric settings you've specified for your experiment to unit test your LLM application.

### Create an LLM Endpoint

In order for Confident AI to reach your LLM application, you'll need to expose your LLM application as a RESTful API endpoint that is accessible over the internet. These are the hard rules you **MUST** follow when setting up your endpoint:

1. It accepts a `POST` request over HTTPS.
2. It returns a JSON response that **MUST** contain the `actual_output` value somewhere in the returned JSON object. Whether to supply a `retrieval_context` or `tools_called` value in your returned JSON is optional, and depends on whether the metrics you have enabled for your experiment require these parameters.

:::caution
When Confident AI calls your LLM endpoint, it does a POST request with a data structure of this type:

```json
{
"input": "..."
}
```

This input will be used to unit test your LLM application, and any JSON response returned will be parsed and used to deduce the values of the remaining test case parameters (e.g. `actual_output`).

So, it is **imperative that your LLM endpoint**:

1. Parses the incoming data to extract this `input` value to carry out generations.
2. Returns the `actual_output` and any other `LLMTestCase` parameters in the JSON response with their **correct respective types**.

For a recap of the type of each test case parameter, visit the [test cases section](/docs/evaluation-test-cases).

:::
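
Here's a minimal sketch of such an endpoint, again assuming FastAPI; `retrieve_context` and `generate_answer` are hypothetical stand-ins for your own RAG pipeline:

```python
from fastapi import FastAPI, Request

app = FastAPI()


def retrieve_context(query: str) -> list[str]:
    # Placeholder: swap in your actual retriever
    return ["Example retrieved chunk for: " + query]


def generate_answer(query: str, context: list[str]) -> str:
    # Placeholder: swap in your actual LLM call
    return f"Illustrative answer to '{query}' using {len(context)} chunk(s)"


@app.post("/generate")
async def generate(request: Request) -> dict:
    body = await request.json()
    # 1. Extract the `input` Confident AI sends
    user_input = body["input"]
    # 2. Run your LLM application
    retrieval_context = retrieve_context(user_input)
    actual_output = generate_answer(user_input, retrieval_context)
    # 3. Return `actual_output` (a string) and, optionally, other
    #    `LLMTestCase` parameters with their correct types
    return {
        "actual_output": actual_output,
        "retrieval_context": retrieval_context,
    }
```

The exact JSON shape is up to you; you'll map it to test case parameters via JSON key paths in the next step.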

### Connect Your LLM Endpoint

Now that you have your endpoint up and running, all that's left is to tell Confident AI how to reach it.

:::tip
You can set up your LLM connection in **Project Settings** > **LLM Connection**. There is also a button for you to **ping your LLM endpoint** to sanity check that you've set everything up correctly.
:::

You'll have to provide:

1. The HTTPS endpoint you've set up.
2. The JSON key path to the mandatory `actual_output` parameter, and to the optional `retrieval_context` and `tools_called` parameters. A JSON key path is a list of strings.

In order for evaluation to work, you **MUST** set the JSON key path for the `actual_output` parameter. Remember, the `actual_output` of your LLM application is always required for evaluation, while the `retrieval_context` and `tools_called` parameters are optional depending on the metrics you've enabled.

:::note
The JSON key path tells Confident AI where to look in your JSON response for the respective test case parameter values.
:::

For instance, if you set the key path of the `actual_output` parameter to `["response", "actual output"]`, the correct JSON response to return from your LLM endpoint is as follows:

```json
{
"response": {
"actual output": "must be a string!"
}
}
```

That's not to say you can't include other things in your JSON response, but the key paths determine the values Confident AI will use to populate `LLMTestCase`s at evaluation time.

:::info
If you're wondering why `expected_output`, `context`, and `expected_tools` are not required when setting up your JSON key paths, it is because these variables are expected to be static, just like the `input`, and should therefore come from your dataset instead.
:::
143 changes: 92 additions & 51 deletions docs/confident-ai/confident-ai-introduction.mdx
@@ -1,30 +1,105 @@
---
id: confident-ai-introduction
title: Confident AI Quickstart
sidebar_label: Confident AI Quickstart
title: Confident AI Introduction
sidebar_label: Confident AI Introduction
---

## Introduction
import Equation from "@site/src/components/equation";

Confident AI was designed for LLM teams to quality assure LLM applications from development to production. It is an all-in-one platform that unlocks `deepeval`'s full potential by allowing you to:
:::caution
Without best LLM evaluation practices in place, your testing results aren't really valid, and you might be iterating back and forth on the wrong things, which means your LLM application isn't nearly as performant as it should be.
:::

**Confident AI is the LLM evaluation platform for DeepEval**. It is native to DeepEval, and was designed for teams building LLM applications to maximize their performance and to safeguard against unsatisfactory LLM outputs. Whilst DeepEval's open-source metrics are great for running evaluations, there is so much more to building a robust LLM evaluation workflow than collecting metric scores.

If you're _serious_ about LLM evaluation, Confident AI is for you.

<div
style={{
width: "100%",
display: "flex",
alignItems: "center",
justifyContent: "center",
position: "relative",
marginBottom: "2rem",
}}
>
<img
id="light-invertable-img"
src="https://confident-docs.s3.us-east-1.amazonaws.com/confident-flywheel.svg"
alt="Confident AI"
style={{
marginTop: "2rem",
marginBottom: "2rem",
height: "auto",
}}
/>
<img
src="/icons/logo.svg"
alt="Confident AI"
className="glowing"
style={{
width: "80px",
height: "80px",
transform: "translateY(-50%) translateX(-50%)",
position: "absolute",
left: "50%",
top: "50%",
filter: "drop-shadow(0px 0px 16px #7c3aed)",
}}
/>
</div>

Apart from running the actual evaluations, you'll need a way to:

- Curate a robust testing dataset
- Perform LLM benchmark analysis
- Tailor evaluation metrics to your opinions
- Improve your testing dataset over time

Confident AI enforces this by offering an opinionated, centralized platform to manage everything mentioned above, which means more accurate, informative, and faster insights, and helps you identify performance gaps and figure out how to improve your LLM system.

- Evaluate LLM applications on Confident AI's infrastructure with proprietary models
- Keep track of the evaluation history of your LLM application
- Centralize and standardize evaluation datasets on the cloud
- Trace and debug LLM applications during evaluation
- Online evaluation of LLM applications in production
- Generate evaluation-based summary reports for relevant stakeholders
:::tip DID YOU KNOW?

<Equation formula="\textbf{Great LLM Evaluation} == \textbf{Quality of Dataset} \times \textbf{Quality of Metrics}" />

:::

## Why Confident AI?

As you try to evaluate and monitor LLM applications in both development and production environments, you might face several challenges:
If your team has ever tried building its own LLM evaluation pipeline, here is the list of problems it has likely encountered (and it's a long list):

- **Dataset Curation Is Fragmented And Annoying**

  - Your team often juggles tools like Google Sheets or Notion to curate and update datasets, leading to constant back-and-forth between engineers and domain-expert annotators.
  - There is no "source of truth" since datasets aren't in sync with your codebase for evaluations.

- **Evaluation Results Are (Still) More Vibe Checks Rather Than Experimentation**

- **Evaluation and Testing Quality:** Running evaluations locally on `deepeval` is great but often times you will find flaky metric scores when using your own model of choice for evaluation. By running evaluations on Confident AI's infrastructure you get the latest metrics implementation with the best evaluation models available for each particular metric.
- **Dataset Quality Assurance:** Keeping track of which test cases are ready for evaluation can become cumbersome, and miscommunication between expert data annotators and engineers regarding test case specifics can lead to inefficiencies.
- **Experimentation Difficulties:** Finding an easy way to experiment with the best LLM system implementations is essential but often challenging and unintuitive.
- **Identifying Issues at Scale:** Spotting unsatisfactory responses in production at scale can be daunting, especially for a complex LLM system architecture.
  - You basically just look at failing test cases, but they don't provide actionable insights, and sharing them across your team is hard.
- It’s impossible to compare benchmarks side-by-side to understand how changes impact performance for each unit test, making it more guesswork than experimentation.

Here's a diagram outlining how Confident AI works:
- **Testing Data Are Static With No Easy Way To Keep Them Updated**

  - Your LLM application's needs and priorities evolve in production, but your datasets don't.
- Figuring out how to query and incorporate real-world interactions into evaluation datasets is tedious and error-prone.

- **Building A/B Testing Infrastructure Is Hard And Current Tools Don't Cut It**

  - Setting up A/B testing for prompts/models to route traffic between versions is easy, but figuring out which version performed better, and in which areas, is hard.
- Tools like PostHog or Mixpanel give user-level analytics, while other LLM observability tools focus too much on cost and latency, none of which tell you anything about the end output quality.

- **Human Feedback Doesn't Lead to Improvements**

- Teams spend time collecting feedback from end-users or internal reviewers, but there’s no clear path to integrate it back into datasets.
  - A lot of manual effort is needed to make good use of feedback, which unfortunately wastes everyone's time.

- **There's No End To Manual Human Intervention**

  - Teams rely on human reviewers to gatekeep LLM outputs before they reach users in production, but the process is random, unstructured, and never-ending.
  - There's no automation to focus reviewers on high-risk areas or take over repetitive tasks.

Confident AI solves all of your LLM evaluation problems so you can stop going around in circles. Here's a diagram outlining how Confident AI works:

<div
style={{
@@ -42,7 +117,7 @@

## Login to Confident AI

Everything in `deepeval` is already automatically integrated with Confident AI, including `deepeval`'s [custom metrics](/docs/metrics-custom). To start using Confident AI with `deepeval`, simply login in the CLI:
Everything in `deepeval` is already automatically integrated with Confident AI, including any [custom metrics](/docs/metrics-custom) you've built on `deepeval`. To start using Confident AI with `deepeval`, simply log in via the CLI:

```
deepeval login
@@ -64,37 +139,3 @@ deepeval login --confident-api-key "your-confident-api-key"
```

:::

## Setting Up Your Evaluation Model

If you choose to run evaluations on Confident AI, you'll need to specify which evaluation model to use when running LLM-as-a-judge metrics. You'll need to configure an evaluation model if you're using any of these features:

- [Online evaluations](/confident-ai/confident-ai-llm-monitoring-evaluations)
- [Experiments](/confident-ai/confident-ai-testing-n-evaluation-experiments)

To set up an evaluation model, you can either:

1. Use OpenAI by supplying an OpenAI API key.
2. Use a custom endpoint as your evaluation model.

### Using Custom LLMs

To use a custom LLM, go to **Project Settings** > **Evaluation Model**, and select the "Custom" option under the model provider dropdown. You'll have the opportunity to supply a model "Name" as well as an "Endpoint". The name is merely for identification purposes, but the model endpoint must follow these rules:

1. It **MUST** accept a `POST` request over HTTPS, and must be reachable over the internet.
2. It **MUST** return a `200` status where **the response body is a simple string** (not JSON, nothing else, just a plain string) that represents the generated text of your evaluation model based on the given request body (which contains the `input`).
3. It **MUST** correctly parse the request body to extract the `input` and use your evaluation model to generate a response based on this `input`.

Here is what the structure of the request body looks like:

```python
{
"input": "Given the metric score, tell me why..."
}
```

Please test that your endpoint meets all of the above requirements before continuing. This can be harder to debug than other steps, so aim to log any errors and reach out to support@confident-ai.com if you run into issues during setup.

:::tip
If you find a model provider that is not supported, most of the time it is because no one has asked for it yet. Simply email support@confident-ai.com if that is the case.
:::
@@ -55,7 +55,7 @@ Once an experiment has completed running on Confident AI's infrastructure, a test
This is particularly helpful if you wish to enable a no-code evaluation workflow for non-technical users.
:::

You can also setup an LLM endpoint that accepts a `POST` request over HTTPS to **enable users to run evaluations directly on the platform without having to code**, and start an evaluation through a click of a button instead. At a high level, you would have to provide Confident AI with the mappings to test case parameters such as the `actual_output`, `retrieval_contexr`, etc., and at evaluation time Confident AI will use the dataset and metrics settings you've specified for your experiment to unit test your LLM application.
You can also setup an LLM endpoint that accepts a `POST` request over HTTPS to **enable users to run evaluations directly on the platform without having to code**, and start an evaluation through a click of a button instead. At a high level, you would have to provide Confident AI with the mappings to test case parameters such as the `actual_output`, `retrieval_context`, etc., and at evaluation time Confident AI will use the dataset and metrics settings you've specified for your experiment to unit test your LLM application.

### Create an LLM Endpoint

9 changes: 9 additions & 0 deletions docs/sidebarConfidentAI.js
@@ -83,6 +83,15 @@ module.exports = {
],
collapsed: true,
},
{
type: "category",
label: "Advanced",
items: [
"confident-ai-advanced-evaluation-model",
"confident-ai-advanced-llm-connection"
],
collapsed: true,
},
],
};

22 changes: 21 additions & 1 deletion docs/src/css/custom.scss
@@ -150,6 +150,10 @@ html[data-theme="light"] #invertable-img {
filter: invert(100%);
}

html[data-theme="dark"] #light-invertable-img {
filter: invert(100%);
}

#confident-workflow {
width: 70%;
}
@@ -195,4 +199,20 @@

img, video {
border-radius: 6px;
}
}

.glowing {
animation: glow 2s infinite alternate;
}

@keyframes glow {
0% {
filter: drop-shadow(0 0 5px #c4b5fd);
}
30% {
filter: drop-shadow(0 0 15px #7c3aed);
}
100% {
filter: drop-shadow(0 0 25px #6d28d9);
}
}
