
Experiment Catalog

A comprehensive tool for cataloging, comparing, and analyzing experiment results. The Experiment Catalog enables teams to track evaluation runs across projects, compare metrics against baselines, and identify performance regressions or improvements in AI/ML experimentation workflows.

Overview

The Experiment Catalog is designed for teams running iterative experiments. It is particularly useful for AI evaluation pipelines where you need to:

  • Track results across multiple evaluation runs
  • Compare experiment metrics against established baselines
  • Analyze performance trends and identify regressions
  • Filter and drill down into specific ground-truth results
  • Annotate experiments with links to commits, configurations, or documentation

Architecture

The application consists of four main components:

  • catalog: A C# .NET 8 backend that stores experiment data in Azure Blob Storage
  • ui: A Svelte-based frontend for visualizing and comparing experiments
  • evaluator: An evaluation runner that executes inference and evaluation, then sends results to the catalog
  • evaluation: An example evaluation script

Key Concepts

  • Project: A collection of experiments sharing the same baseline, grounding data, and evaluation configuration. Typically this aligns to a sprint. This is described in more detail in the experimentation process.
  • Experiment: A hypothesis-driven collection of evaluation runs within a project.
  • Set: A group of results from a single evaluation run, also commonly called a permutation (e.g., 3 iterations × 12 ground truths).
  • Ref: A reference to a specific ground-truth entity being evaluated, allowing aggregation across iterations.
  • Baseline: A reference point for comparison. This can be set at both project and experiment levels.
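
To make the nesting concrete, the sketch below shows one way a Set of results for two Refs might be recorded under a project and experiment. The endpoint path and JSON field names are illustrative assumptions, not the catalog's actual schema; see catalog.http (under API Usage below) for the real calls.

# hypothetical route and payload shape, for illustration only
curl -X POST http://localhost:6010/api/projects/sprint-42/experiments/prompt-v2/sets \
  -H "Content-Type: application/json" \
  -d '{
        "name": "run-001",
        "annotations": { "commit": "abc1234" },
        "results": [
          { "ref": "TQ10", "metrics": { "generation_correctness": 0.91, "retrieval_recall": 0.83 } },
          { "ref": "TQ25", "metrics": { "generation_correctness": 0.78, "retrieval_recall": 0.80 } }
        ]
      }'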

Features

Experiment Management

  • Create projects and experiments with hypotheses
  • Set project-level and experiment-level baselines
  • Record arbitrary metrics without defining them in advance
  • Annotate sets with commit hashes, configuration links, or notes

Comparison & Analysis

  • Compare experiment results against baselines
  • View aggregate statistics across sets
  • Drill down into individual ground-truth results
  • Compare metrics across multiple evaluation runs

Filtering Capabilities

  • Metrics Filter: Show/hide specific metrics in comparison views
  • Tags Filter: Filter ground truths by tags extracted from source data
  • Free Filter: Write custom filter expressions to find specific results

Free Filter Examples

# Find poor performers
[generation_correctness] < 0.8

# Find regressions compared to baseline
[generation_correctness] < [baseline.generation_correctness]

# Find significant improvements (>20% better)
[generation_correctness] > [baseline.generation_correctness] * 1.2

# Complex analysis - retrieval got worse but generation improved
[retrieval_recall] < [baseline.retrieval_recall] AND [generation_correctness] > [baseline.generation_correctness]

# Find specific ground truths
ref == "TQ10" OR ref == "TQ25"

You can find out more about the Free Filter syntax and use cases in the UI README.

Getting Started

Prerequisites

At a minimum you will need:

  • .NET 8 SDK (for the catalog API and the evaluator)
  • Node.js and npm (for the Svelte UI)
  • An Azure Storage account for experiment data, with either the Azure CLI available for login or a storage connection string
  • Docker (optional, for container deployment)

Running Locally

Backend API

  1. Navigate to the API directory:

    cd api
  2. Create a .env file with required configuration:

    # if using az-cli for login
    INCLUDE_CREDENTIAL_TYPES=azcli
    AZURE_STORAGE_ACCOUNT_NAME=<your-storage-account>
    
    # or if using a connection string
    AZURE_STORAGE_ACCOUNT_CONNSTRING=<your-connection-string>

    Full configuration for the API can be found in the API README.

  3. Run the API:

    dotnet run

The API will be available at http://localhost:6010 with Swagger documentation at /swagger.
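
To verify the API is up, you can request the Swagger page from another terminal; any successful or redirect response means the server is listening:

curl -i http://localhost:6010/swagger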

Frontend UI

  1. Navigate to the UI directory:

    cd ui
  2. Install dependencies:

    npm install
  3. Start the development server:

    npm run dev

The UI will be available at http://localhost:6020.

Docker Deployment

Build the complete application (UI + API) as a Docker container:

docker build --rm -t exp-catalog:latest -f catalog.Dockerfile .

Run the container:

docker run -p 6010:6010 \
  -e AZURE_STORAGE_ACCOUNT_NAME=<your-storage-account> \
  exp-catalog:latest
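
If you are authenticating with a connection string instead of an Azure identity, a sketch assuming the container reads the same variables as the local .env example above:

docker run -p 6010:6010 \
  -e AZURE_STORAGE_ACCOUNT_CONNSTRING=<your-connection-string> \
  exp-catalog:latest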

API Usage

Examples of API requests can be found in catalog.http.
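
For quick orientation, a request might look like the sketch below; the route shown is an assumption, so treat catalog.http as the source of truth for endpoints and payloads:

# hypothetical route; see catalog.http for the actual endpoints
curl http://localhost:6010/api/projects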

Evaluator Usage

The evaluator is a .NET console application that can run inference and evaluation, then send results to the Experiment Catalog. You can find the evaluator in the evaluator directory with full instructions in the evaluator README.
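
A minimal way to launch it, assuming the required configuration described in the evaluator README has been provided:

# run from the repository root
cd evaluator
dotnet run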

Evaluation Example

You can find an example evaluation script in the evaluation directory.
