v0.3.0
Released by arjunattam on 04 Apr 09:28
Empirical is a CLI and web UI for developers to test different LLMs, prompts, and other model configurations across all the scenarios that matter. Try it out with the quick start →
This is our first open-source release, and it lays out the core primitives and capabilities of the product.
Capabilities
Configuration
- Empirical has a declarative configuration file for your tests in `empiricalrc.json` (see an example)
- Each config has three parts: model providers (what to test), datasets (scenarios to test), and scorers (measure output quality); a sketch of such a file follows this list
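To make the three parts concrete, here's a rough sketch of an `empiricalrc.json`. The key names and values below (`runs`, `dataset`, `scorers`, the `is-json` scorer) are illustrative assumptions, so refer to the linked example for the actual schema:

```json
{
  "runs": [
    {
      "type": "model",
      "provider": "openai",
      "model": "gpt-3.5-turbo",
      "prompt": "Extract the name mentioned in: {{user_message}}"
    }
  ],
  "dataset": {
    "samples": [
      { "inputs": { "user_message": "Hi there, this is Alice!" } }
    ]
  },
  "scorers": [{ "type": "is-json" }]
}
```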
Model providers
- Empirical can run tests against off-the-shelf LLMs and custom models or apps (e.g. RAG)
- Off-the-shelf LLMs: Models hosted by OpenAI, Anthropic, Mistral, Google, and Fireworks are supported today
- Custom models or apps: You can write a Python script that behaves as an entry point to run tests against custom models or RAG applications (see the sketch after this list)
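As a rough sketch of what such an entry point could look like, assuming a hypothetical contract where Empirical imports the script and calls an `execute` function with the sample inputs (the function name, signature, and return shape are assumptions here, not the documented interface):

```python
# custom_model.py: a hypothetical entry point for a custom model or RAG app.
# The function name, signature, and return shape are assumptions; check the
# Empirical docs for the contract the Python script provider actually expects.

def execute(inputs: dict, parameters: dict) -> dict:
    """Run one dataset sample against a custom pipeline and return its output."""
    question = inputs["user_message"]
    answer = my_rag_pipeline(question)
    return {"value": answer}

def my_rag_pipeline(question: str) -> str:
    # Placeholder for your retrieval + generation logic.
    return f"Answer to: {question}"
```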
Datasets
- Specify scenarios to test as samples in the configuration, or import them from a file (a sketch of a file-based dataset follows)
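If you import from a file, one plausible shape (an assumption on our part, not a documented format) is a JSONL file where each line carries the `inputs` for one sample, referenced from the config by its path:

```jsonl
{"inputs": {"user_message": "Hi there, this is Alice!"}}
{"inputs": {"user_message": "Hello, Bob here."}}
```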
Scorers
- Measure the quality of your output with built-in scoring functions, or write your own scorers as LLM prompts or Python functions (a Python scorer is sketched below)
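Here's a minimal sketch of a custom Python scorer, assuming a hypothetical `evaluate` function that receives the model output and the sample inputs and returns a list of score objects (the signature and return shape are assumptions; the docs define the real contract):

```python
# length_scorer.py: a hypothetical custom scorer that checks answer length.
# The evaluate signature and the score object's fields are assumptions.

def evaluate(output: dict, inputs: dict) -> list[dict]:
    """Score 1 if the model's answer stays under 280 characters, else 0."""
    answer = output.get("value", "")
    passed = len(answer) <= 280
    return [{
        "name": "under-280-chars",
        "score": 1 if passed else 0,
        "message": "" if passed else f"Answer is {len(answer)} characters long",
    }]
```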
Web UI
- Review results and compare model configurations side-by-side in our web UI
- This brings the Empirical playground and comparison pages to your local environment
Continuous evaluation in CI
- Run your tests in GitHub Actions and get results reported as a PR comment
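A workflow for this could look like the sketch below; the package name (`@empiricalrun/cli`), the `run` command, and the secret wiring are assumptions, so check the docs for the exact invocation and how results get posted as a PR comment:

```yaml
# .github/workflows/empirical.yml: sketch of running Empirical tests in CI.
name: empirical
on: [pull_request]
jobs:
  run-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # Run the test suite defined in empiricalrc.json (command is an assumption).
      - run: npx @empiricalrun/cli run
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```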
Get in touch
File an issue or join our Discord; we look forward to hearing from you ^_^