docs: tool calling and improvement to models section #217

Merged · 3 commits · May 16, 2024

docs/examples.mdx — 35 changes: 21 additions & 14 deletions

@@ -4,12 +4,31 @@
 description: 'Examples for common scenarios'
 ---
 
 <CardGroup cols={2}>
+  <Card
+    title="Basic"
+    href="https://github.com/empirical-run/empirical/tree/main/examples/basic"
+  >
+    Uses an entity extraction use-case to check for valid JSON outputs.
+  </Card>
+  <Card
+    title="Tool calling"
+    href="https://github.com/empirical-run/empirical/tree/main/examples/tool_calls"
+  >
+    Uses an LLM to grade the output responses and ensure that they do not
+    contain "as a AI language model" in them.
+  </Card>
   <Card
     title="RAG"
    href="https://github.com/empirical-run/empirical/tree/main/examples/rag"
   >
     Tests a Retrieval-augmented Generation application built with LlamaIndex, scored on
-    metrics from RAGAS.
+    metrics from Ragas.
   </Card>
+  <Card
+    title="OpenAI Assistants"
+    href="https://github.com/empirical-run/empirical/tree/main/examples/assistants"
+  >
+    Runs Empirical on an OpenAI Assistant.
+  </Card>
   <Card
     title="HumanEval"
@@ -25,22 +44,10 @@
     Runs the Spider dataset to demo text-to-SQL and relevant scorer functions.
   </Card>
   <Card
-    title="Chat bot"
+    title="Chat bot with LLM scorer"
     href="https://github.com/empirical-run/empirical/tree/main/examples/chatbot"
   >
     Uses an LLM to grade the output responses and ensure that they do not
     contain "as a AI language model" in them.
   </Card>
-  <Card
-    title="OpenAI Assistants"
-    href="https://github.com/empirical-run/empirical/tree/main/examples/assistants"
-  >
-    Runs Empirical on an OpenAI Assistant.
-  </Card>
-  <Card
-    title="Basic"
-    href="https://github.com/empirical-run/empirical/tree/main/examples/basic"
-  >
-    Uses an entity extraction use-case to check for valid JSON outputs.
-  </Card>
 </CardGroup>

docs/mint.json — 17 changes: 10 additions & 7 deletions

@@ -44,15 +44,24 @@
       "examples",
       "configuration",
       "running-in-ci",
+      "reporter",
       "telemetry"
     ]
   },
   {
     "group": "Model providers",
     "pages": [
       "models/basics",
+      {
+        "group": "Hosted models",
+        "pages": [
+          "models/model",
+          "models/providers"
+        ]
+      },
+      "models/assistants",
       "models/custom",
-      "models/assistants"
+      "models/output"
     ]
   },
   {
@@ -68,12 +77,6 @@
       "scoring/llm",
       "scoring/python"
     ]
-  },
-  {
-    "group": "Reporter",
-    "pages": [
-      "reporter/basics"
-    ]
   }
 ],
 "footerSocials": {

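Read together, the two hunks above add a "Hosted models" sub-group and a `reporter` page while dropping the standalone "Reporter" group. Reconstructed from the diff itself (not copied from the repo), the resulting "Model providers" navigation group would be:

```json
{
  "group": "Model providers",
  "pages": [
    "models/basics",
    {
      "group": "Hosted models",
      "pages": [
        "models/model",
        "models/providers"
      ]
    },
    "models/assistants",
    "models/custom",
    "models/output"
  ]
}
```
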
docs/models/assistants.mdx — 2 changes: 1 addition & 1 deletion

@@ -39,7 +39,7 @@ below, we refer to the `user_query` input from the test [dataset](../dataset/bas
   JSON object of parameters to customize the Assistant (see more below)
 </ParamField>
 <ParamField body="name" type="string">
-  A a custom name or label for this run
+  A custom name or label for this run
 </ParamField>
 
 ## Example

docs/models/basics.mdx — 227 changes: 14 additions & 213 deletions

@@ -7,231 +7,32 @@ Empirical can test how different models and model configurations work for your
 application. You can define which models and configurations to test in the
 [configuration file](../configuration).
 
-Empirical supports three types of model providers:
+Empirical supports a few types of model providers:
 
-- `model`: API calls to LLMs that are hosted by inference platforms, like OpenAI's GPT4
-- `py-script`: Custom models or applications defined as a Python module. See [the Python guide](./custom) to configure this.
-- `assistant`: API calls to OpenAI Assistants. See [the Assistants guide](./assistants) to configure this.
-
-The rest of this guide focuses on the `model` type.
-
-## Run configuration for LLMs
-
-To test an LLM, specify the following properties in the configuration:
-
-- `provider`: Name of the inference provider (e.g. `openai`, or other [supported providers](#supported-providers))
-- `model`: Name of the model (e.g. `gpt-3.5-turbo` or `claude-3-haiku`)
-- `prompt`: [Prompt](#prompt) sent to the model, with optional [placeholders](#placeholders)
-- `name` [optional]: A name or label for this run (auto-generated if not specified)
-
-You can configure as many model providers as you like. These models will be shown in a
-side-by-side comparison view in the web reporter.
-
-```json empiricalrc.json
-"runs": [
-  {
-    "type": "model",
-    "provider": "openai",
-    "model": "gpt-3.5-turbo",
-    "prompt": "Hey I'm {{user_name}}"
-  }
-]
-```
-
-### Prompt
-The prompt serves as the initial input provided to the model to generate a response.
-This property accepts either a string or a JSON chat format.
-
-The JSON chat format allows for a sequence of messages comprising the conversation so far.
-Each message object has two required fields:
-- `role`: Role of the messenger (either `system`, `user` or `assistant`)
-- `content`: The content of the message
-
-```json empiricalrc.json
-{
-  "runs": [
-    {
-      "prompt": [{
-        "role": "system",
-        "content": "You are an SQLite expert who can convert natural language questions to SQL queries...."
-      }, {
-        "role": "user",
-        "content": "How many singers do we have?"
-      }]
-    }
-  ]
-}
-```
-The [Text-to-SQL example](https://github.com/empirical-run/empirical/tree/main/examples/spider)
-uses this prompt format to test conversion of natural language questions to SQL queries.
-
-String based prompts are wrapped in `user` role message before sending to the model.
-```json empiricalrc.json
-{
-  "runs": [
-    {
-      "prompt": "Extract the name, age and location from the message, and respond with a JSON object ..."
-    }
-  ]
-}
-```
-The [basic example](https://github.com/empirical-run/empirical/tree/main/examples/basic) uses this prompt
-format to test extraction of named entities from natural language text.
-
-
-### Placeholders
-
-Define placeholders in the prompt with Handlebars syntax (like `{{user_name}}`) to inject values
-from the dataset sample. These placeholders will be replaced with the corresponding input value
-during execution.
-
-See [dataset](../dataset/basics) to learn more about sample inputs.
-
-## Supported providers
-
-| Provider | Description |
-|----------|-------------|
-| `openai` | All chat models are supported. Requires `OPENAI_API_KEY` environment variable. |
-| `azure-openai` | All chat models from OpenAI that are hosted on Azure are supported. Requires `AZURE_OPENAI_API_KEY` and either of `AZURE_OPENAI_RESOURCE_NAME` or `AZURE_OPENAI_BASE_URL` environment variables. |
-| `anthropic` | Claude 3 models are supported. Requires `ANTHROPIC_API_KEY` environment variable. |
-| `mistral` | All chat models are supported. Requires `MISTRAL_API_KEY` environment variable. |
-| `google` | Gemini Pro models are supported. Requires `GOOGLE_API_KEY` environment variable. |
-| `fireworks` | Models hosted on Fireworks (e.g. `dbrx-instruct`) are supported. Requires `FIREWORKS_API_KEY` environment variable. |
-
-<AccordionGroup>
-<Accordion title="Using models from Azure OpenAI">
-
-#### Get API key
-
-- `AZURE_OPENAI_API_KEY`: This is the API key to authenticate with Azure. See [their docs](https://learn.microsoft.com/en-us/javascript/api/overview/azure/openai-readme?view=azure-node-preview#using-an-api-key-from-azure) to get the API key.
-
-#### Specify base url
-You can specify the base URL of the Azure OpenAI endpoint by setting **either** one of the following environment variables:
-- `AZURE_OPENAI_RESOURCE_NAME`: This the resource name which is used to create the endpoint base URL with the format `https://$AZURE_OPENAI_RESOURCE_NAME.openai.azure.com`
-- `AZURE_OPENAI_BASE_URL`: This is if you want to specify the entire base URL used to access the chat completions API with the format `$AZURE_OPENAI_BASE_URL/openai/deployments/<model>/chat/completions`. For example - `https://some-custom-url.com`
-
-#### Model configuration
-
-In the configuration file,
-- Set the `provider` to `azure-openai`
-- Set `model` to the name of your model deployment
-
-#### Additional parameters
-
-- By default, the `api-version` parameter is set to "2024-02-15-preview". If you need to override this, set the `apiVersion` parameter
+
+## Hosted models
+
+Popular models hosted by inference platforms (e.g. GPT-4o by OpenAI) can be
+directly specified in the Empirical run configuration with type as `model`.
+
+```json
 "runs": [
   {
     "type": "model",
-    "provider": "azure-openai",
-    "model": "gpt-35-deployment",
-    "prompt": "Hey I'm {{user_name}}",
-    "parameters": {
-      "apiVersion": "2024-02-15-preview"
-    }
-  }
-]
-```
-
-</Accordion>
-<Accordion title="Using models from Google">
-
-#### Get API key
-
-The [Google AI studio](https://aistudio.google.com/) is the easiest way to get API keys. Once you have the key,
-set it as the `GOOGLE_API_KEY` environment variable.
-
-#### Supported models
-
-We support the Gemini model codes, as defined in the [official docs](https://ai.google.dev/models/gemini).
-
-- Gemini 1.5 Pro: set `model` to `gemini-1.5-pro-latest`
-- Gemini 1 Pro: set `model` to `gemini-pro` or `gemini-1.0-pro`
-
-</Accordion>
-</AccordionGroup>
-
-### Environment variables
-
-API calls to model providers require API keys, which are stored as environment variables. The CLI can work with:
-
-- Existing environment variables (using `process.env`)
-- Environment variables defined in `.env` or `.env.local` files, in the current working directory
-- For .env files that are located elsewhere, you can pass the `--env-file` flag
-
-```sh
-npx @empiricalrun/cli --env-file <PATH_TO_ENV_FILE>
-```
-
-### Model parameters
-
-To override parameters like `temperature` or `max_tokens`, you can pass `parameters` alongwith the provider
-configuration. All OpenAI parameters (see their [API reference](https://platform.openai.com/docs/api-reference/chat/create))
-are supported, except for a few [limitations](#limitations).
-
-For non-OpenAI models, we coerce these parameters to the most appropriate target parameter (e.g. `stop` in OpenAI
-becomes `stop_sequences` for Anthropic.)
-
-You can add other parameters or override this behavior with [passthrough](#passthrough).
-
-```json empiricalrc.json
-"runs": [
-  {
-    "type": "model",
     "provider": "openai",
-    "model": "gpt-3.5-turbo",
-    "prompt": "Hey I'm {{user_name}}",
-    "parameters": {
-      "temperature": 0.1
-    }
-  }
-]
-```
-
-#### Passthrough
-
-If your models rely on other parameters, you can still specify them in the configuration. These
-parameters will be passed as-is to the model.
-
-For example, Mistral models support a `safePrompt` parameter for [guardrailing](https://docs.mistral.ai/platform/guardrailing/).
-
-```json empiricalrc.json
-"runs": [
-  {
-    "type": "model",
-    "provider": "mistral",
-    "model": "mistral-tiny",
-    "prompt": "Hey I'm {{user_name}}",
-    "parameters": {
-      "temperature": 0.1,
-      "safePrompt": true
-    }
+    "model": "gpt-4o",
+    "prompt": "Hey I'm {{user_name}}"
   }
 ]
 ```
-
-#### Configuring request timeout
-
-You can set the timeout duration in milliseconds under model parameters in the `empiricalrc.json` file. This might be required for prompt completions that are expected to take more time, for example while running models like Claude Opus. If no specific value is assigned, the default timeout duration of 30 seconds will be applied.
-
-```json empiricalrc.json
-"runs": [
-  {
-    "type": "model",
-    "provider": "anthropic",
-    "model": "claude-3-opus",
-    "prompt": "Hey I'm {{user_name}}",
-    "parameters": {
-      "timeout": 10000
-    }
-  }
-]
-```
+
+- See how to [configure these models](./model)
+- For OpenAI Assistants, see [the Assistant guide](./assistants)
 
-#### Limitations
+## Custom scripts
 
-- These parameters are not supported today: `logit_bias`, `user`, `stream`
+For mature applications, or for those that require pre or post-processing around
+the model API call, it is recommended to write a custom script provider. That way,
+you can reference/import parts of your application and sharing code between your
+app and tests.
 
-If this limitation is blocking your use of Empirical, please file a [feature request](https://github.com/empirical-run/empirical/issues/new).
+- See [the Python guide](./custom) to configure models or apps defined as a Python module, with type `py-script`
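For context on what the slimmed-down page links out to: the removed text noted that multiple runs can be configured and shown in a side-by-side comparison. A minimal sketch of such a configuration, assembled only from examples elsewhere in this diff — the `anthropic` provider, the `claude-3-haiku` model name, and the `temperature` override all come from the removed provider table and parameters docs, and are illustrative rather than part of this PR:

```json empiricalrc.json
{
  "runs": [
    {
      "type": "model",
      "provider": "openai",
      "model": "gpt-4o",
      "prompt": "Hey I'm {{user_name}}"
    },
    {
      "type": "model",
      "provider": "anthropic",
      "model": "claude-3-haiku",
      "prompt": "Hey I'm {{user_name}}",
      "parameters": {
        "temperature": 0.1
      }
    }
  ]
}
```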