25 changes: 19 additions & 6 deletions docs/evaluation.md
@@ -1,6 +1,4 @@
# Evaluation (Preview)

Note: **Evaluation in Firebase Genkit is currently in early preview** with a limited set of available evaluation metrics. You can try out the current experience by following the documentation below. If you run into any issues or have suggestions for improvements, please [file an issue](http://github.com/google/genkit/issues). We would love to see your feedback as we refine the evaluation experience!
# Evaluation

Evaluations are a form of testing that helps you validate your LLM’s responses and ensure they meet your quality bar.

@@ -9,7 +7,7 @@ of your LLM-powered applications. Genkit tooling helps you automatically extract

For example, if you have a RAG flow, Genkit will extract the set
of documents that was returned by the retriever so that you can evaluate the
quality of your retriever while it runs in the context of the flow as shown below with the RAGAS faithfulness and answer relevancy metrics:
quality of your retriever while it runs in the context of the flow as shown below with the Genkit faithfulness and answer relevancy metrics:

```js
import { GenkitMetric, genkitEval } from '@genkit-ai/evaluator';
@@ -25,8 +23,6 @@ export default configureGenkit({
});
```
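The body of this configuration is collapsed in the diff above. Purely as a sketch, assuming the judge model and embedder come from the Vertex AI plugin (`geminiPro` and `textEmbeddingGecko` are assumptions here, not taken from the original snippet), a complete setup might look like this:

```js
import { configureGenkit } from '@genkit-ai/core';
import { GenkitMetric, genkitEval } from '@genkit-ai/evaluator';
// Assumed provider: any plugin that supplies a judge model and an embedder works.
import { geminiPro, textEmbeddingGecko, vertexAI } from '@genkit-ai/vertexai';

export default configureGenkit({
  plugins: [
    vertexAI(),
    genkitEval({
      judge: geminiPro, // LLM used to score the flow outputs
      metrics: [GenkitMetric.FAITHFULNESS, GenkitMetric.ANSWER_RELEVANCY],
      embedder: textEmbeddingGecko, // answer relevancy needs an embedder
    }),
  ],
  // ... other Genkit options (model providers, flow state store, etc.)
});
```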

We only support a small number of evaluators to help developers get started that are inspired by [RAGAS](https://docs.ragas.io/en/latest/index.html) metrics including: Faithfulness, Answer Relevancy, and Maliciousness.

Start by defining the set of inputs you want to use as an input dataset called `testQuestions.json`. This dataset represents the test cases you will use to generate output for evaluation.

```json
@@ -57,6 +53,23 @@ genkit eval:flow bobQA --input testQuestions.json --output eval-result.json
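The contents of the dataset are collapsed in the diff above. As a hypothetical sketch, assuming `eval:flow` accepts a plain JSON array of flow inputs (the questions below are placeholders, not the ones from the original file), `testQuestions.json` could look like:

```json
[
  "Who is Bob?",
  "Where does Bob work?",
  "What does Bob like to do on weekends?"
]
```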
Note: Below you can see an example of how an LLM can help you generate the test
cases.
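That example is collapsed in this diff. Purely as an illustration of the idea, not the original example, a script along these lines could ask a model to draft the questions; `geminiPro`, the import paths, and the prompt wording are assumptions, and it presumes Genkit has already been configured with the Vertex AI plugin as shown earlier:

```js
import { generate } from '@genkit-ai/ai';
import { geminiPro } from '@genkit-ai/vertexai';
import { writeFileSync } from 'fs';

async function generateTestQuestions() {
  const response = await generate({
    model: geminiPro,
    prompt:
      'Generate 10 short questions a user might ask about Bob. ' +
      'Respond with only a JSON array of strings.',
  });
  // Parse the model's JSON answer and save it as the eval input dataset.
  const questions = JSON.parse(response.text());
  writeFileSync('testQuestions.json', JSON.stringify(questions, null, 2));
}

generateTestQuestions();
```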

## Supported evaluators

### Genkit evaluators

Genkit includes a small number of native evaluators, inspired by RAGAS, to help you get started:

- Faithfulness
- Answer Relevancy
- Maliciousness

### Evaluator plugins

Genkit supports additional evaluators through plugins:

- VertexAI Rapid Evaluators via the [VertexAI Plugin](plugins/vertex-ai#evaluation).
- [LangChain Criteria Evaluation](https://python.langchain.com/docs/guides/productionization/evaluation/string/criteria_eval_chain/) via the [LangChain plugin](plugins/langchain.md).

## Advanced use

`eval:flow` is a convenient way to quickly evaluate the flow, but sometimes you