diff --git a/docs/.gitbook/assets/Benchmark Performance.png b/docs/.gitbook/assets/Benchmark Performance.png
new file mode 100644
index 0000000000..df58741977
Binary files /dev/null and b/docs/.gitbook/assets/Benchmark Performance.png differ
diff --git a/docs/.gitbook/assets/Choose LLM.png b/docs/.gitbook/assets/Choose LLM.png
new file mode 100644
index 0000000000..b55a133b31
Binary files /dev/null and b/docs/.gitbook/assets/Choose LLM.png differ
diff --git a/docs/.gitbook/assets/Create Template.png b/docs/.gitbook/assets/Create Template.png
new file mode 100644
index 0000000000..fa1e0703a8
Binary files /dev/null and b/docs/.gitbook/assets/Create Template.png differ
diff --git a/docs/.gitbook/assets/Golden Dataset.png b/docs/.gitbook/assets/Golden Dataset.png
new file mode 100644
index 0000000000..c40b7f268e
Binary files /dev/null and b/docs/.gitbook/assets/Golden Dataset.png differ
diff --git a/docs/evaluation/concepts-evals/building-your-own-evals.md b/docs/evaluation/concepts-evals/building-your-own-evals.md
index 00da4a5ce7..dbe81387e8 100644
--- a/docs/evaluation/concepts-evals/building-your-own-evals.md
+++ b/docs/evaluation/concepts-evals/building-your-own-evals.md
@@ -20,7 +20,7 @@ Then, you need the **golden dataset**. This should be representative of the type

 Building such a dataset is laborious, but you can often find a standardized one for the most common use cases (as we did in the code above)

-[figure: Golden Dataset]
+[figure: Build a golden dataset (Golden Dataset.png)]
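Pulling down one of these standardized datasets is a one-liner. A minimal sketch, assuming the `download_benchmark_dataset` helper from `phoenix.evals` (older releases expose it under `phoenix.experimental.evals`); the task and dataset names below are the binary-relevance benchmark used as an example and should be swapped for the one that matches your use case:

```python
from phoenix.evals import download_benchmark_dataset  # assumption: top-level export

# Fetch a pre-built golden dataset instead of hand-labelling one from scratch.
# Task and dataset names are illustrative; pick the benchmark that matches your eval.
df = download_benchmark_dataset(
    task="binary-relevance-classification",
    dataset_name="wiki_qa-train",
)

# Inspect the columns (query, reference text, golden label, ...) before building the eval.
print(df.head())
```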
 The Eval inferences are designed for easy benchmarking and come as pre-set, downloadable test inferences. The inferences are pre-tested; many are hand-crafted and designed for testing specific Eval tasks.
@@ -37,7 +37,7 @@ df.head()

 Then you need to decide **which LLM** you want to use for evaluation. This could be a different LLM from the one you are using for your application. For example, you may be using Llama for your application and GPT-4 for your eval. Often this choice is influenced by questions of cost and accuracy.

-[figure: Decide your LLM for evaluation]
+[figure: Decide on LLM for evaluation (Choose LLM.png)]
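In code, that choice usually comes down to instantiating a judge model separately from your application model. A minimal sketch using Phoenix's `OpenAIModel` wrapper; the constructor keyword is `model` in recent releases and `model_name` in some older ones, so check the version you have installed:

```python
from phoenix.evals import OpenAIModel

# The eval LLM does not have to be the LLM serving your application:
# the app might run Llama while the judge runs GPT-4.
# Temperature 0 keeps the judge's verdicts as deterministic as possible.
eval_model = OpenAIModel(
    model="gpt-4",  # some older Phoenix versions use model_name="gpt-4"
    temperature=0.0,
)
```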

 ### 4. Build the Eval Template
@@ -51,7 +51,7 @@ Be explicit about the following:
 * **What are we asking?** In our example, we’re asking the LLM to tell us if the document was relevant to the query
 * **What are the possible output formats?** In our example, it is binary relevant/irrelevant, but it can also be multi-class (e.g., fully relevant, partially relevant, not relevant).

-[figure: Building the eval template]
+[figure: Build eval template (Create Template.png)]
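To make those bullet points concrete, here is a hedged sketch of a relevance template that states the input data, the question being asked, and the allowed output labels, then runs it with `llm_classify`. The `query` and `reference` column names are assumptions about the golden dataset, and `df` and `eval_model` refer to the dataset and judge model from the earlier steps:

```python
from phoenix.evals import llm_classify

# The template spells out the input data, the question, and the output format.
# Placeholders ({query}, {reference}) must match column names in the dataframe.
MY_RELEVANCE_TEMPLATE = """You are comparing a reference text to a question.
[BEGIN DATA]
[Question]: {query}
[Reference text]: {reference}
[END DATA]
Is the reference text relevant to the question?
Answer with a single word, either "relevant" or "irrelevant"."""

# Rails constrain the judge's output to the labels the template allows.
rails = ["relevant", "irrelevant"]

relevance_eval = llm_classify(
    dataframe=df,
    template=MY_RELEVANCE_TEMPLATE,
    model=eval_model,
    rails=rails,
)
```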
 To create a new template, all you need to do is set the input string that you pass to the Eval function.
@@ -94,4 +94,4 @@ MY_CUSTOM_TEMPLATE = PromptTemplate("This is a test {prompt}")
 You now need to run the eval across your golden dataset. Then you can **generate metrics** (overall accuracy, precision, recall, F1, etc.) to determine the benchmark. It is important to look at more than just overall accuracy; we discuss this in more detail below.

-[figure: Benchmark performance]
+[figure: Benchmark performance (Benchmark Performance.png)]
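One way to turn the eval output into those metrics, sketched with scikit-learn. The `true_relevant` golden-label column and the `label` column on the `llm_classify` output are assumptions about how your dataframes are named:

```python
from sklearn.metrics import classification_report

# Compare the judge's labels against the golden labels from the benchmark dataset.
y_true = df["true_relevant"]      # hypothetical golden-label column
y_pred = relevance_eval["label"]  # one predicted label per row

# Per-class precision, recall, and F1 surface failure modes that overall accuracy hides.
print(classification_report(y_true, y_pred, labels=["relevant", "irrelevant"]))
```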