A comprehensive tool for cataloging, comparing, and analyzing experiment results. The Experiment Catalog enables teams to track evaluation runs across projects, compare metrics against baselines, and identify performance regressions or improvements in AI/ML experimentation workflows.
The Experiment Catalog is designed for teams running iterative experiments. It is particularly useful for AI evaluation pipelines where you need to:
- Track results across multiple evaluation runs
- Compare experiment metrics against established baselines
- Analyze performance trends and identify regressions
- Filter and drill down into specific ground-truth results
- Annotate experiments with links to commits, configurations, or documentation
Video walkthroughs are available:
- Installation (6:08)
- Usage (30:56)
- Configuration (16:36)
The application consists of several main components:
| Component | Description |
|---|---|
| catalog | C# .NET 8 backend that stores experiment data in Azure Blob Storage |
| ui | Svelte-based frontend for visualizing and comparing experiments |
| evaluator | An evaluation runner that executes inference and evaluation, then sends results to the catalog |
| evaluation | An example evaluation script |
- Project: A collection of experiments sharing the same baseline, grounding data, and evaluation configuration. Typically this aligns with a sprint. This is described in more detail in the experimentation process.
- Experiment: A hypothesis-driven collection of evaluation runs within a project.
- Set: A group of results from a single evaluation run - also commonly called a permutation (e.g., 3 iterations × 12 ground truths).
- Ref: A reference to a specific ground-truth entity being evaluated, allowing aggregation across iterations.
- Baseline: A reference point for comparison. This can be set at both project and experiment levels.
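To make the hierarchy concrete, the sketch below models these concepts as TypeScript types. It is illustrative only; the type and field names are assumptions made for this example, not the catalog's actual storage schema.

```typescript
// Illustrative sketch of how the catalog's concepts relate.
// Type and field names are assumptions for this example,
// not the catalog's actual storage schema.

interface Result {
  ref: string;                     // ground-truth entity being evaluated, e.g. "TQ10"
  metrics: Record<string, number>; // arbitrary metrics, no pre-definition required
}

interface EvalSet {                // a "Set" in catalog terms
  name: string;
  annotations?: Record<string, string>; // commit hash, config link, notes, ...
  results: Result[];                    // e.g. 3 iterations x 12 ground truths
}

interface Experiment {
  name: string;
  hypothesis: string;
  baseline?: string; // experiment-level comparison reference (assumption)
  sets: EvalSet[];
}

interface Project {
  name: string;
  baseline?: string; // project-level comparison reference (assumption)
  experiments: Experiment[];
}
```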
- Create projects and experiments with hypotheses
- Set project-level and experiment-level baselines
- Record arbitrary metrics without pre-definition
- Annotate sets with commit hashes, configuration links, or notes
- Compare experiment results against baselines
- View aggregate statistics across sets
- Drill down into individual ground-truth results
- Compare metrics across multiple evaluation runs
- Metrics Filter: Show/hide specific metrics in comparison views
- Tags Filter: Filter ground truths by tags extracted from source data
- Free Filter: Write custom filter expressions to find specific results
```
# Find poor performers
[generation_correctness] < 0.8

# Find regressions compared to baseline
[generation_correctness] < [baseline.generation_correctness]

# Find significant improvements (>20% better)
[generation_correctness] > [baseline.generation_correctness] * 1.2

# Complex analysis - retrieval got worse but generation improved
[retrieval_recall] < [baseline.retrieval_recall] AND [generation_correctness] > [baseline.generation_correctness]

# Find specific ground truths
ref == "TQ10" OR ref == "TQ25"
```
You can find out more about the Free Filter syntax and use cases in the UI README.
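To show what these expressions mean, the sketch below reduces a baseline comparison to a plain predicate over the types sketched earlier. It is not the UI's actual filter implementation; the function name and the surrounding usage are assumptions.

```typescript
// Illustrative only: the kind of check an expression like
//   [generation_correctness] < [baseline.generation_correctness]
// reduces to. This is not the UI's actual filter engine.
function isRegression(result: Result, baseline: Result, metric: string): boolean {
  const current = result.metrics[metric];
  const reference = baseline.metrics[metric];
  // Only flag a regression when both sides recorded the metric.
  return current !== undefined && reference !== undefined && current < reference;
}

// e.g. flag ground truths whose generation correctness dropped below the baseline:
// results.filter(r => isRegression(r, baselineByRef[r.ref], "generation_correctness"));
```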
- .NET 10 SDK
- Node.js 20+
- Python 3.9+ (for tags utility)
- Docker (for containerized deployment)
- Azure Storage Account
- Navigate to the API directory:

  ```bash
  cd api
  ```

- Create a `.env` file with required configuration:

  ```bash
  # if using az-cli for login
  INCLUDE_CREDENTIAL_TYPES=azcli
  AZURE_STORAGE_ACCOUNT_NAME=<your-storage-account>

  # or if using a connection string
  AZURE_STORAGE_ACCOUNT_CONNSTRING=<your-connection-string>
  ```

  Full configuration for the API can be found in the API README.

- Run the API:

  ```bash
  dotnet run
  ```

  The API will be available at http://localhost:6010, with Swagger documentation at /swagger.
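As a quick sanity check, you can hit the Swagger endpoint from any HTTP client; the tiny TypeScript snippet below (run with a tool such as tsx, and assuming the default port above) simply confirms the API is responding:

```typescript
// Minimal smoke test against the locally running catalog API (default port from this README).
const res = await fetch("http://localhost:6010/swagger");
console.log(res.ok ? "Experiment Catalog API is up" : `Unexpected status: ${res.status}`);
```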
- Navigate to the UI directory:

  ```bash
  cd ui
  ```

- Install dependencies:

  ```bash
  npm install
  ```

- Start the development server:

  ```bash
  npm run dev
  ```

  The UI will be available at http://localhost:6020.
Build the complete application (UI + API) as a Docker container:

```bash
docker build --rm -t exp-catalog:latest -f catalog.Dockerfile .
```

Run the container:

```bash
docker run -p 6010:6010 \
  -e AZURE_STORAGE_ACCOUNT_NAME=<your-storage-account> \
  exp-catalog:latest
```

All examples for using the API can be found in catalog.http.
The evaluator is a .NET console application that can run inference and evaluation, then send results to the Experiment Catalog. You can find the evaluator in the evaluator directory with full instructions in the evaluator README.
You can find an example evaluation script in the evaluation directory.