A comprehensive tool for cataloging, comparing, and analyzing experiment results. The Experiment Catalog enables teams to track evaluation runs across projects, compare metrics against baselines, and identify performance regressions or improvements in AI/ML experimentation workflows.
The Experiment Catalog is designed for teams running iterative experiments. It is particularly useful for AI evaluation pipelines where you need to:
- Track results across multiple evaluation runs
- Compare experiment metrics against established baselines
- Analyze performance trends and identify regressions
- Filter and drill down into specific ground-truth results
- Annotate experiments with links to commits, configurations, or documentation
Video walkthroughs are available:
- Installation (6:08)
- Usage (30:56)
- Configuration (16:36)
The application consists of several main components:
| Component | Description |
|---|---|
| catalog | C# .NET 8 backend that stores experiment data in Azure Blob Storage |
| ui | Svelte-based frontend for visualizing and comparing experiments |
| evaluator | An evaluation runner that executes inference and evaluation, then sends results to the catalog |
| evaluation | An example evaluation script |
- Project: A collection of experiments sharing the same baseline, grounding data, and evaluation configuration. Typically this aligns with a sprint. This is described in more detail in the experimentation process.
- Experiment: A hypothesis-driven collection of evaluation runs within a project.
- Set: A group of results from a single evaluation run - also commonly called a permutation (e.g., 3 iterations × 12 ground truths).
- Ref: A reference to a specific ground-truth entity being evaluated, allowing aggregation across iterations.
- Baseline: A reference point for comparison. This can be set at both project and experiment levels.
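To make the hierarchy concrete, the sketch below models these concepts as TypeScript types. It is illustrative only; the type and field names are assumptions made for this example, not the catalog's actual storage schema.

```typescript
// Illustrative sketch of how the catalog's concepts relate.
// Type and field names are assumptions for this example,
// not the catalog's actual storage schema.

interface Result {
  ref: string;                     // ground-truth entity being evaluated, e.g. "TQ10"
  metrics: Record<string, number>; // arbitrary metrics, no pre-definition required
}

interface EvalSet {                // a "Set" in catalog terms
  name: string;
  annotations?: Record<string, string>; // commit hash, config link, notes, ...
  results: Result[];                    // e.g. 3 iterations x 12 ground truths
}

interface Experiment {
  name: string;
  hypothesis: string;
  baseline?: string; // experiment-level comparison reference (assumption)
  sets: EvalSet[];
}

interface Project {
  name: string;
  baseline?: string; // project-level comparison reference (assumption)
  experiments: Experiment[];
}
```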
- Create projects and experiments with hypotheses
- Set project-level and experiment-level baselines
- Record arbitrary metrics without pre-definition
- Annotate sets with commit hashes, configuration links, or notes
- Compare experiment results against baselines
- View aggregate statistics across sets
- Drill down into individual ground-truth results
- Compare metrics across multiple evaluation runs
- Metrics Filter: Show/hide specific metrics in comparison views
- Tags Filter: Filter ground truths by tags extracted from source data
- Free Filter: Write custom filter expressions to find specific results
```
# Find poor performers
[generation_correctness] < 0.8

# Find regressions compared to baseline
[generation_correctness] < [baseline.generation_correctness]

# Find significant improvements (>20% better)
[generation_correctness] > [baseline.generation_correctness] * 1.2

# Complex analysis - retrieval got worse but generation improved
[retrieval_recall] < [baseline.retrieval_recall] AND [generation_correctness] > [baseline.generation_correctness]

# Find specific ground truths
ref == "TQ10" OR ref == "TQ25"
```
You can find out more about the Free Filter syntax and use cases in the UI README.
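To show what these expressions mean, the sketch below reduces a baseline comparison to a plain predicate over the types sketched earlier. It is not the UI's actual filter implementation; the function name and the surrounding usage are assumptions.

```typescript
// Illustrative only: the kind of check an expression like
//   [generation_correctness] < [baseline.generation_correctness]
// reduces to. This is not the UI's actual filter engine.
function isRegression(result: Result, baseline: Result, metric: string): boolean {
  const current = result.metrics[metric];
  const reference = baseline.metrics[metric];
  // Only flag a regression when both sides recorded the metric.
  return current !== undefined && reference !== undefined && current < reference;
}

// e.g. flag ground truths whose generation correctness dropped below the baseline:
// results.filter(r => isRegression(r, baselineByRef[r.ref], "generation_correctness"));
```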
- .NET 10 SDK
- Node.js 20+
- Python 3.9+ (for tags utility)
- Docker (for containerized deployment)
- Azure Storage Account
- Navigate to the API directory:

  ```bash
  cd api
  ```

- Create a `.env` file with required configuration:

  ```bash
  # if using az-cli for login
  INCLUDE_CREDENTIAL_TYPES=azcli
  AZURE_STORAGE_ACCOUNT_NAME=<your-storage-account>

  # or if using a connection string
  AZURE_STORAGE_ACCOUNT_CONNSTRING=<your-connection-string>
  ```

  Full configuration for the API can be found in the API README.

- Run the API:

  ```bash
  dotnet run
  ```

  The API will be available at http://localhost:6010, with Swagger documentation at /swagger.
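As a quick sanity check, you can hit the Swagger endpoint from any HTTP client; the tiny TypeScript snippet below (run with a tool such as tsx, and assuming the default port above) simply confirms the API is responding:

```typescript
// Minimal smoke test against the locally running catalog API (default port from this README).
const res = await fetch("http://localhost:6010/swagger");
console.log(res.ok ? "Experiment Catalog API is up" : `Unexpected status: ${res.status}`);
```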
- Navigate to the UI directory:

  ```bash
  cd ui
  ```

- Install dependencies:

  ```bash
  npm install
  ```

- Start the development server:

  ```bash
  npm run dev
  ```

  The UI will be available at http://localhost:6020.
Build the complete application (UI + API) as a Docker container:

```bash
docker build --rm -t exp-catalog:latest -f catalog.Dockerfile .
```

Run the container:

```bash
docker run -p 6010:6010 \
  -e AZURE_STORAGE_ACCOUNT_NAME=<your-storage-account> \
  exp-catalog:latest
```

All examples for using the API can be found in catalog.http.
The evaluator is a .NET console application that can run inference and evaluation, then send results to the Experiment Catalog. You can find the evaluator in the evaluator directory with full instructions in the evaluator README.
You can find an example evaluation script in the evaluation directory.