diff --git a/docs/.gitbook/assets/Benchmark Performance.png b/docs/.gitbook/assets/Benchmark Performance.png
new file mode 100644
index 0000000000..df58741977
Binary files /dev/null and b/docs/.gitbook/assets/Benchmark Performance.png differ
diff --git a/docs/.gitbook/assets/Choose LLM.png b/docs/.gitbook/assets/Choose LLM.png
new file mode 100644
index 0000000000..b55a133b31
Binary files /dev/null and b/docs/.gitbook/assets/Choose LLM.png differ
diff --git a/docs/.gitbook/assets/Create Template.png b/docs/.gitbook/assets/Create Template.png
new file mode 100644
index 0000000000..fa1e0703a8
Binary files /dev/null and b/docs/.gitbook/assets/Create Template.png differ
diff --git a/docs/.gitbook/assets/Golden Dataset.png b/docs/.gitbook/assets/Golden Dataset.png
new file mode 100644
index 0000000000..c40b7f268e
Binary files /dev/null and b/docs/.gitbook/assets/Golden Dataset.png differ
diff --git a/docs/evaluation/concepts-evals/building-your-own-evals.md b/docs/evaluation/concepts-evals/building-your-own-evals.md
index 00da4a5ce7..dbe81387e8 100644
--- a/docs/evaluation/concepts-evals/building-your-own-evals.md
+++ b/docs/evaluation/concepts-evals/building-your-own-evals.md
@@ -20,7 +20,7 @@ Then, you need the **golden dataset**. This should be representative of the type
Building such a dataset is laborious, but you can often find a standardized one for the most common use cases (as we did in the code above).
-
+<figure><img src="../../.gitbook/assets/Golden Dataset.png" alt=""><figcaption></figcaption></figure>
The Eval inferences are designed for easy benchmarking and come as pre-set, downloadable test inferences. The inferences are pre-tested, and many are hand-crafted for testing specific Eval tasks.
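+If you do not use a pre-set benchmark, a hand-built golden dataset can be as simple as a table of inputs, retrieved documents, and human-verified labels. The snippet below is a minimal sketch with illustrative column names (`query`, `reference`, and `label` are assumptions, not a required schema):
+
+```python
+import pandas as pd
+
+# A hypothetical, minimal golden dataset for a retrieval-relevance eval.
+# Each row pairs a user query with a retrieved document and a human-verified label.
+golden_df = pd.DataFrame(
+    {
+        "query": [
+            "What is the capital of France?",
+            "How do I reset my password?",
+        ],
+        "reference": [
+            "Paris is the capital and largest city of France.",
+            "Our refund policy allows returns within 30 days of purchase.",
+        ],
+        # Ground-truth labels supplied by human annotators.
+        "label": ["relevant", "irrelevant"],
+    }
+)
+
+golden_df.head()
+```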
@@ -37,7 +37,7 @@ df.head()
Then you need to decide **which LLM** you want to use for evaluation. This could be a different LLM from the one you are using for your application. For example, you may be using Llama for your application and GPT-4 for your eval. Often this choice is influenced by questions of cost and accuracy.
-
+<figure><img src="../../.gitbook/assets/Choose LLM.png" alt=""><figcaption></figcaption></figure>
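+As a sketch, the judge model can be configured separately from your application model. The import path and parameters below assume `phoenix.experimental.evals.OpenAIModel`; check your installed version, as names may differ:
+
+```python
+from phoenix.experimental.evals import OpenAIModel
+
+# The judge model used only for evaluation; it does not need to match the
+# model powering the application (e.g., Llama in the app, GPT-4 as the judge).
+eval_model = OpenAIModel(model_name="gpt-4", temperature=0.0)
+```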
### 4. Build the Eval Template
@@ -51,7 +51,7 @@ Be explicit about the following:
* **What are we asking?** In our example, we’re asking the LLM to tell us if the document is relevant to the query.
* **What are the possible output formats?** In our example, it is binary relevant/irrelevant, but it can also be multi-class (e.g., fully relevant, partially relevant, not relevant).
-
+<figure><img src="../../.gitbook/assets/Create Template.png" alt=""><figcaption></figcaption></figure>
To create a new template, all that is needed is to set the input string of the Eval function.
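+For instance, a custom relevance template can be run over the golden dataset with the `llm_classify` helper. The sketch below assumes `phoenix.experimental.evals.llm_classify` and reuses the hypothetical `golden_df` and `eval_model` from the earlier sketches:
+
+```python
+from phoenix.experimental.evals import PromptTemplate, llm_classify
+
+# Placeholders in braces must match column names in the dataframe below.
+MY_RELEVANCE_TEMPLATE = PromptTemplate(
+    "You are comparing a document to a question.\n"
+    "Question: {query}\n"
+    "Document: {reference}\n"
+    "Respond with a single word, 'relevant' or 'irrelevant', indicating whether "
+    "the document contains information that can answer the question."
+)
+
+eval_results = llm_classify(
+    dataframe=golden_df,               # the golden dataset
+    template=MY_RELEVANCE_TEMPLATE,    # the custom eval template
+    model=eval_model,                  # the judge LLM chosen above
+    rails=["relevant", "irrelevant"],  # constrain output to the allowed labels
+)
+```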
@@ -94,4 +94,4 @@ MY_CUSTOM_TEMPLATE = PromptTemplate("This is a test {prompt}")
You now need to run the eval across your golden dataset. Then you can **generate metrics** (overall accuracy, precision, recall, F1, etc.) to determine the benchmark. It is important to look at more than just overall accuracy. We’ll discuss that below in more detail.
-
+<figure><img src="../../.gitbook/assets/Benchmark Performance.png" alt=""><figcaption></figcaption></figure>
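+One way to generate these metrics is to compare the judge LLM's labels against the golden labels with scikit-learn. The sketch below assumes the hypothetical `golden_df` and `eval_results` frames from the earlier sketches:
+
+```python
+from sklearn.metrics import classification_report
+
+# Precision, recall, F1, and overall accuracy of the eval
+# against the human-labeled golden dataset.
+print(
+    classification_report(
+        y_true=golden_df["label"],
+        y_pred=eval_results["label"],
+        labels=["relevant", "irrelevant"],
+    )
+)
+```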