Delete Old Evals Examples (#8252)
Still retain:
- Comparison Examples
- Data + QA walkthrough
- QA (but really minimize it)
hinthornw authored Jul 27, 2023
1 parent db9d5b2 commit 9eb7e6e
Showing 30 changed files with 825 additions and 4,069 deletions.
2 changes: 1 addition & 1 deletion docs/api_reference/guide_imports.json

Large diffs are not rendered by default.

18 changes: 17 additions & 1 deletion docs/docs_skeleton/docs/guides/evaluation/comparison/index.mdx
@@ -3,6 +3,22 @@ sidebar_position: 3
---
# Comparison Evaluators

Comparison evaluators in LangChain help measure the relative quality of two different chain or LLM outputs. These evaluators are helpful for comparative analyses, such as A/B testing between two language models or comparing different versions of the same model. They can also be useful for tasks like generating preference scores for AI-assisted reinforcement learning.

These evaluators inherit from the `PairwiseStringEvaluator` class, providing a comparison interface for two strings - typically, the outputs from two different prompts or models, or two versions of the same model. In essence, a comparison evaluator performs an evaluation on a pair of strings and returns a dictionary containing the evaluation score and other relevant details.
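
For example, here's a minimal sketch of the interface in use, assuming the built-in `pairwise_string_distance` evaluator (which needs no LLM, though it does rely on the `rapidfuzz` package):

```python
from langchain.evaluation import load_evaluator

# Load a built-in pairwise evaluator that scores a pair of outputs
# by string distance (no LLM required).
evaluator = load_evaluator("pairwise_string_distance")

result = evaluator.evaluate_string_pairs(
    prediction="The weather is sunny today.",
    prediction_b="It is sunny outside today.",
)
print(result)  # A dict containing the score, e.g. {"score": 0.17}
```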

To create a custom comparison evaluator, inherit from the `PairwiseStringEvaluator` class and override the `_evaluate_string_pairs` method. If you require asynchronous evaluation, also override the `_aevaluate_string_pairs` method. A toy example follows the summary below.

Here's a summary of the key methods and properties of a comparison evaluator:

- `evaluate_string_pairs`: Evaluate the output string pairs. This function should be overridden when creating custom evaluators.
- `aevaluate_string_pairs`: Asynchronously evaluate the output string pairs. This function should be overridden for asynchronous evaluation.
- `requires_input`: This property indicates whether this evaluator requires an input string.
- `requires_reference`: This property specifies whether this evaluator requires a reference label.
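
As a toy illustration (a hypothetical example, not a built-in evaluator), a custom comparison evaluator that simply prefers the shorter of two outputs could look like this:

```python
from typing import Any, Optional

from langchain.evaluation import PairwiseStringEvaluator


class PreferShorterEvaluator(PairwiseStringEvaluator):
    """Toy comparison evaluator that prefers the shorter output."""

    def _evaluate_string_pairs(
        self,
        *,
        prediction: str,
        prediction_b: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        # Score 1 if the first prediction is shorter, 0 otherwise.
        score = 1 if len(prediction) < len(prediction_b) else 0
        return {
            "score": score,
            "reasoning": f"Lengths were {len(prediction)} and {len(prediction_b)}.",
        }


evaluator = PreferShorterEvaluator()
print(
    evaluator.evaluate_string_pairs(
        prediction="Paris.",
        prediction_b="The capital of France is Paris, of course.",
    )
)
```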

Detailed information about creating custom evaluators and the available built-in comparison evaluators is provided in the following sections.

import DocCardList from "@theme/DocCardList";

<DocCardList />

21 changes: 12 additions & 9 deletions docs/docs_skeleton/docs/guides/evaluation/index.mdx
@@ -6,23 +6,26 @@ import DocCardList from "@theme/DocCardList";

# Evaluation

Building applications with language models involves many moving parts. One of the most critical components is ensuring that the outcomes produced by your models are reliable and useful across a broad array of inputs, and that they work well with your application's other software components. Ensuring reliability usually boils down to some combination of application design, testing & evaluation, and runtime checks.

The guides in this section review the APIs and functionality LangChain provides to help you better evaluate your applications. Evaluation and testing are both critical when thinking about deploying LLM applications, since production environments require repeatable and useful outcomes.

LangChain offers various types of evaluators to help you measure performance and integrity on diverse data, and we hope to encourage the community to create and share other useful evaluators so everyone can improve. These docs will introduce the evaluator types, how to use them, and provide some examples of their use in real-world scenarios.

Each evaluator type in LangChain comes with ready-to-use implementations and an extensible API that allows for customization according to your unique requirements. Here are some of the types of evaluators we offer:

- [String Evaluators](/docs/guides/evaluation/string/): These evaluators assess the predicted string for a given input, usually comparing it against a reference string.
- [Trajectory Evaluators](/docs/guides/evaluation/trajectory/): These are used to evaluate the entire trajectory of agent actions.
- [Comparison Evaluators](/docs/guides/evaluation/comparison/): These evaluators are designed to compare predictions from two runs on a common input.

These evaluators can be used across various scenarios and can be applied to different chain and LLM implementations in the LangChain library.
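
For example, every evaluator type is loaded through the same `load_evaluator` entry point; here's a minimal sketch using the built-in `string_distance` string evaluator, which needs no LLM:

```python
from langchain.evaluation import load_evaluator

# The string name selects the implementation; trajectory and comparison
# evaluators are loaded the same way under their own names.
evaluator = load_evaluator("string_distance")

print(
    evaluator.evaluate_strings(
        prediction="LangChain helps you evaluate LLM applications.",
        reference="LangChain provides evaluation tools for LLM apps.",
    )
)
```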

We are also working to share guides and cookbooks that demonstrate how to use these evaluators in real-world scenarios, such as:

- [Chain Comparisons](/docs/guides/evaluation/examples/comparisons): This example uses a comparison evaluator to predict the preferred output. It reviews ways to measure confidence intervals to select statistically significant differences in aggregate preference scores across different models or prompts.

## Reference Docs

For detailed information on the available evaluators, including how to instantiate, configure, and customize them, check out the [reference documentation](https://api.python.langchain.com/en/latest/api_reference.html#module-langchain.evaluation) directly.

<DocCardList />
21 changes: 20 additions & 1 deletion docs/docs_skeleton/docs/guides/evaluation/string/index.mdx
@@ -3,6 +3,25 @@ sidebar_position: 2
---
# String Evaluators

A string evaluator is a component within LangChain designed to assess the performance of a language model by comparing its generated outputs (predictions) to a reference string or an input. This comparison is a crucial step in the evaluation of language models, providing a measure of the accuracy or quality of the generated text.

In practice, string evaluators are typically used to evaluate a predicted string against a given input, such as a question or a prompt. Often, a reference label or context string is provided to define what a correct or ideal response would look like. These evaluators can be customized to tailor the evaluation process to fit your application's specific requirements.
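
For instance, here's a sketch of grading a prediction against an input with the built-in `criteria` evaluator; this one is LLM-backed, so it assumes an OpenAI API key is available in the environment:

```python
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import load_evaluator

# Grade a prediction for conciseness with an LLM-backed evaluator.
# Assumes OPENAI_API_KEY is set in the environment.
evaluator = load_evaluator(
    "criteria",
    llm=ChatOpenAI(temperature=0),
    criteria="conciseness",
)

result = evaluator.evaluate_strings(
    prediction="Paris is the capital of France, as everyone surely knows.",
    input="What is the capital of France?",
)
print(result)  # Typically includes "score", "value", and "reasoning" keys
```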

To create a custom string evaluator, inherit from the `StringEvaluator` class and implement the `_evaluate_strings` method. If you require asynchronous support, also implement the `_aevaluate_strings` method.

Here's a summary of the key attributes and methods associated with a string evaluator:

- `evaluation_name`: Specifies the name of the evaluation.
- `requires_input`: Boolean attribute that indicates whether the evaluator requires an input string. If True, the evaluator will raise an error when the input isn't provided. If False, a warning will be logged if an input _is_ provided, indicating that it will not be considered in the evaluation.
- `requires_reference`: Boolean attribute specifying whether the evaluator requires a reference label. If True, the evaluator will raise an error when the reference isn't provided. If False, a warning will be logged if a reference _is_ provided, indicating that it will not be considered in the evaluation.

String evaluators also implement the following methods:

- `aevaluate_strings`: Asynchronously evaluates the output of the Chain or Language Model, with support for optional input and label.
- `evaluate_strings`: Synchronously evaluates the output of the Chain or Language Model, with support for optional input and label.

The following sections provide detailed information on available string evaluator implementations as well as how to create a custom string evaluator.
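
As a toy illustration (a hypothetical example, not part of the library), a custom evaluator that checks whether the prediction matches a regular expression supplied as the reference could look like this:

```python
import re
from typing import Any, Optional

from langchain.evaluation import StringEvaluator


class RegexMatchEvaluator(StringEvaluator):
    """Toy evaluator: does the prediction match the reference pattern?"""

    @property
    def requires_reference(self) -> bool:
        return True

    @property
    def evaluation_name(self) -> str:
        return "regex_match"

    def _evaluate_strings(
        self,
        *,
        prediction: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        # Score 1 if the reference pattern occurs in the prediction.
        match = re.search(reference, prediction) if reference else None
        return {"score": int(match is not None)}


evaluator = RegexMatchEvaluator()
print(evaluator.evaluate_strings(prediction="The answer is 42.", reference=r"\b42\b"))
```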

import DocCardList from "@theme/DocCardList";

<DocCardList />
22 changes: 21 additions & 1 deletion docs/docs_skeleton/docs/guides/evaluation/trajectory/index.mdx
@@ -3,6 +3,26 @@ sidebar_position: 4
---
# Trajectory Evaluators

Trajectory Evaluators in LangChain provide a more holistic approach to evaluating an agent. These evaluators assess the full sequence of actions taken by an agent and their corresponding responses, which we refer to as the "trajectory". This allows you to better measure an agent's effectiveness and capabilities.

A Trajectory Evaluator implements the `AgentTrajectoryEvaluator` interface, which requires two main methods:

- `evaluate_agent_trajectory`: This method synchronously evaluates an agent's trajectory.
- `aevaluate_agent_trajectory`: This asynchronous counterpart allows evaluations to be run in parallel for efficiency.

Both methods accept three main parameters:

- `input`: The initial input given to the agent.
- `prediction`: The final predicted response from the agent.
- `agent_trajectory`: The intermediate steps taken by the agent, given as a list of tuples.

These methods return a dictionary. It is recommended that custom implementations return a `score` (a float indicating the effectiveness of the agent) and `reasoning` (a string explaining the reasoning behind the score).
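
As an illustration, a toy evaluator (hypothetical, not built-in) that simply penalizes long trajectories could look like this:

```python
from typing import Any, Optional, Sequence, Tuple

from langchain.evaluation import AgentTrajectoryEvaluator
from langchain.schema import AgentAction


class StepCountEvaluator(AgentTrajectoryEvaluator):
    """Toy evaluator that rewards shorter trajectories."""

    def _evaluate_agent_trajectory(
        self,
        *,
        input: str,
        prediction: str,
        agent_trajectory: Sequence[Tuple[AgentAction, str]],
        reference: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        steps = len(agent_trajectory)
        # Fewer intermediate steps -> higher score, on a crude 0-1 scale.
        return {
            "score": 1.0 / (1 + steps),
            "reasoning": f"Agent took {steps} intermediate steps.",
        }
```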

You can capture an agent's trajectory by initializing the agent with the `return_intermediate_steps=True` parameter. This lets you collect all intermediate steps without relying on special callbacks.
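
A sketch of how the pieces fit together, using the toy `StepCountEvaluator` from above and assuming `agent` is an `AgentExecutor` initialized with `return_intermediate_steps=True`:

```python
# Calling the agent returns the intermediate steps alongside the output.
result = agent("What is 3 raised to the 5th power?")

evaluation = StepCountEvaluator().evaluate_agent_trajectory(
    input=result["input"],
    prediction=result["output"],
    agent_trajectory=result["intermediate_steps"],
)
print(evaluation)  # e.g. {"score": 0.25, "reasoning": "..."}
```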

For a deeper dive into the implementation and use of Trajectory Evaluators, refer to the sections below.

import DocCardList from "@theme/DocCardList";

<DocCardList />

