
KohakuBoard

High-performance ML experiment tracking with zero training overhead.


Part of KohakuHub - Self-hosted AI Infrastructure


Quick Start

pip install -e .
from kohakuboard.client import Board

board = Board(name="my-experiment", config={"lr": 0.001, "batch_size": 32})

# Training loop
for epoch in range(10):
    for data, target in train_loader:
        loss = train_step(data, target)

        board.step()  # Once per optimizer step
        board.log(loss=loss.item())  # Non-blocking, <0.1ms
        # Alternative: move board.step() after board.log() for 0-indexed steps

# logs are stored under ./kohakuboard using KohakuVault column stores + SQLite metadata


Join our community: https://discord.gg/xWYrkyvJ2s


Why KohakuBoard?

KohakuBoard's Advantages

  • Zero Training Overhead - Non-blocking logging returns in <0.1ms
  • Local-First - No server required during training, view results instantly
  • High Throughput - 20,000+ metrics/second sustained
  • Rich Data Types - Scalars, images, videos, tables, histograms
  • WebGL Visualization - Handle 100K+ datapoints smoothly
  • Self-Hosted - Your data stays on your infrastructure

Features

Non-Blocking Architecture

Background Writer Process ensures training never waits:

Training Script          Background Process
     │                          │
     ├─ board.log(loss=0.5)     │
     │  └─> Queue.put()         │
     │      (<0.1ms return!)    │
     │                          ├─ Queue.get()
     ├─ Continue training...    ├─ Batch write
     │                          └─ Flush to disk

Performance:

  • Log call latency: <0.1ms
  • Throughput: 20,000+ metrics/sec
  • Queue capacity: 50,000 messages
  • Memory overhead: ~100-200 MB
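
The diagram and numbers above describe a queue-plus-writer-process pattern. A conceptual sketch of that pattern follows (illustrative only, not KohakuBoard's actual internals; persist() is a hypothetical stand-in for the real storage write):

import multiprocessing as mp

def persist(batch: list[dict]) -> None:
    """Stand-in for the real storage write (KohakuVault / SQLite)."""
    ...

def writer_loop(queue: mp.Queue) -> None:
    batch = []
    while True:
        msg = queue.get()
        if msg is None:            # sentinel -> flush remaining work and exit
            break
        batch.append(msg)
        if len(batch) >= 256:      # batch writes to amortize I/O cost
            persist(batch)
            batch.clear()
    if batch:
        persist(batch)

if __name__ == "__main__":
    queue = mp.Queue(maxsize=50_000)   # matches the documented queue capacity
    writer = mp.Process(target=writer_loop, args=(queue,), daemon=True)
    writer.start()

    queue.put({"loss": 0.5})           # a "log" call only enqueues and returns
    queue.put(None)                    # shutdown sentinel
    writer.join()

The training process only pays for Queue.put(); all batching and disk I/O happen in the writer process.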

Rich Data Types

Unified API for all data types - no step inflation:

board.log(
    loss=0.5,                           # Scalar
    sample_img=Media(image),            # Image
    predictions=Table(results),         # Table
    gradients=Histogram(grads)          # Histogram
)
# All logged at SAME step with 1 queue message!

Supported Types:

  • Scalars - Metrics, learning rates, accuracies
  • Media - Images (PNG/JPG), videos (MP4), audio (WAV)
  • Tables - Structured data with embedded images
  • Histograms - Weight/gradient distributions with compression (99.8% size reduction)
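
The histogram size reduction follows from what gets stored: bin edges and counts instead of every tensor element. A rough numpy illustration (exact figures depend on tensor size and bin count; the actual on-disk encoding is KohakuBoard's, not shown here):

import numpy as np

grads = np.random.randn(1_000_000).astype(np.float32)     # ~4 MB of raw values
counts, edges = np.histogram(grads, bins=64)

raw_bytes = grads.nbytes
hist_bytes = counts.astype(np.int32).nbytes + edges.astype(np.float32).nbytes
print(f"raw: {raw_bytes} B, histogram: {hist_bytes} B "
      f"({100 * (1 - hist_bytes / raw_bytes):.1f}% smaller)")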

Three-Tier SQLite Storage Architecture

Powered by KohakuVault - A high-performance storage library with dual interfaces over SQLite:

Three Specialized SQLite Implementations:

1. KohakuVault KVault        2. KohakuVault ColumnVault     3. Standard SQLite
   (K-V Store)                   (Columnar Storage)             (Relational)
   ├─ Media blobs                ├─ Metrics                     ├─ Media metadata
   ├─ B+Tree index on K          ├─ Histograms                  ├─ Tables
   ├─ Content-addressable        ├─ Blob-based columnar         └─ Step info
   └─ .cache() for bulk ops      └─ Dynamic chunk growth

Why KohakuVault?

  • Zero dependencies - Single SQLite file, no external services
  • Simple deployment - Just .db files, no infrastructure
  • Dual-interface design - Dict-like for blobs, list-like for sequences
  • High performance - Native speed with Pythonic API
  • Memory efficient - Streaming support, dynamic chunk growth
  • True SWMR - Multiple readers, single writer via SQLite WAL

Why Three Tiers?

  • KVault: Optimized for blob storage with B+Tree index, content-addressable
  • ColumnVault: Optimized for append-heavy time-series with columnar layout
  • Standard SQLite: Optimized for structured metadata with ACID guarantees
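
The SWMR claim can be illustrated with the standard library alone: a board database can be opened read-only while training is still writing, because SQLite WAL allows one writer and many readers. Table names and layout depend on KohakuBoard's schema, so this sketch only lists whatever is there (the path follows the layout shown under Data Model):

import sqlite3

db_path = "./kohakuboard/<board_id>/data/metadata.db"
conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
tables = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
print([name for (name,) in tables])
conn.close()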

Advanced Visualization

WebGL-Based Charts powered by Plotly.js:

  • Handle 100K+ datapoints smoothly
  • Configurable smoothing (EMA, MA, Gaussian)
  • X-axis selection (step, global_step, any metric)
  • Multi-metric overlays
  • Dark/light mode
  • Responsive design

Rich Viewers:

  • Histogram Navigator - Step-by-step distribution exploration
  • Media Viewer - Image grids, video playback
  • Table Viewer - Structured data with embedded images
  • Dashboard - Customizable metric layouts
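
For reference, the EMA option in the smoothing list above applies the standard exponential moving average to the plotted series. A generic sketch of that formula (not the frontend's actual code):

def ema(values: list[float], alpha: float = 0.1) -> list[float]:
    """Exponential moving average: y_t = alpha * x_t + (1 - alpha) * y_{t-1}."""
    smoothed, prev = [], None
    for v in values:
        prev = v if prev is None else alpha * v + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed

print(ema([1.0, 0.8, 0.9, 0.5, 0.4]))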

Local-First Workflow

# Train locally
python train.py              # Logs to ./kohakuboard/

# View results (no server required!)
kobo open ./kohakuboard --browser

# Optional server for team sharing (requires kohakuboard_server)
kobo-serve --port 48889

No server setup, no configuration, no hassle.


Quick Start

Installation

pip install -e .

Basic Usage

from kohakuboard.client import Board

# Create board - automatically saves on program exit
board = Board(name="my-experiment", config={"lr": 0.001, "batch_size": 32})

# Training loop
for epoch in range(10):
    for batch_idx, (data, target) in enumerate(train_loader):
        loss = train_step(data, target)

        # Increment step once per optimizer step (not per epoch!)
        board.step()

        # Log metrics (non-blocking, returns in <0.1ms)
        board.log(loss=loss.item(), lr=optimizer.param_groups[0]['lr'])

    # Log validation at end of epoch
    val_loss = validate(model, val_loader)
    board.log(**{"val/loss": val_loss})

# That's it! No .finish() needed - auto-cleanup via atexit

View Results

# Local viewer (no server)
kobo open ./kohakuboard --browser

# Or launch the authenticated server (requires kohakuboard_server)
kobo-serve --port 48889
# Drop/copy board folders into the configured data dir to share runs

Complete Example

from kohakuboard.client import Board, Histogram, Table, Media
import torch

# Create board with hyperparameters
board = Board(
    name="cifar10-resnet18",
    config={"lr": 0.001, "batch_size": 128, "epochs": 100, "optimizer": "AdamW"}
)

# Training loop
for epoch in range(100):
    model.train()
    for data, target in train_loader:
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        # Step once per optimizer step
        board.step()

        # Log scalars (non-blocking, <0.1ms)
        board.log(loss=loss.item(), lr=optimizer.param_groups[0]['lr'])

    # Validation
    model.eval()
    val_loss, correct, predictions_table = 0, 0, []

    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(val_loader):
            output = model(data)
            val_loss += criterion(output, target).item()
            pred = output.argmax(dim=1)
            correct += (pred == target).sum().item()

            # Sample predictions for table (first batch only)
            if batch_idx == 0:
                for i in range(min(8, len(data))):
                    predictions_table.append({
                        "image": Media(data[i].cpu().numpy()),
                        "true": class_names[target[i]],
                        "pred": class_names[pred[i]],
                        "correct": "✓" if pred[i] == target[i] else "✗"
                    })

    # Log validation (scalars + table + histograms - all at same step!)
    hist_data = {f"grad/{n}": Histogram(p.grad) for n, p in model.named_parameters() if p.grad is not None}
    board.log(**{
        "val/loss": val_loss / len(val_loader),
        "val/accuracy": correct / len(val_loader.dataset),
        "val/predictions": Table(predictions_table),
        **hist_data
    })

# No .finish() needed - automatic cleanup when script exits

Architecture

Client (Training Script)

Main Process (Training)          Background Writer Process
       │                                   │
       ├─ board.log(loss=0.5)              │
       │  └─> Queue.put()                  │
       │      (returns instantly!)         │
       │                                   ├─ Queue.get()
       │                                   ├─ Process batch
       ├─ Continue training...             ├─ Write to storage
       │                                   └─ Flush to disk

Key Features:

  • Non-blocking: log() returns in <0.1ms
  • Message Queue: 50,000 message capacity
  • Writer Process: Background process drains queue
  • Storage Layer: Three-tier SQLite architecture (KohakuVault KVault + ColumnVault + Standard SQLite)
  • Graceful Shutdown: atexit hooks + signal handlers ensure no data loss
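
A generic sketch of the graceful-shutdown pattern named in the last bullet (not KohakuBoard's exact code): atexit and signal handlers push a sentinel so the background writer drains its queue before the interpreter exits.

import atexit
import signal
import sys

def install_shutdown_hooks(queue, writer):
    state = {"done": False}

    def _shutdown():
        if state["done"]:
            return
        state["done"] = True
        queue.put(None)            # sentinel: writer flushes remaining batches
        writer.join(timeout=30)    # wait for pending writes to reach disk

    def _on_signal(signum, frame):
        _shutdown()
        sys.exit(128 + signum)

    atexit.register(_shutdown)
    signal.signal(signal.SIGINT, _on_signal)
    signal.signal(signal.SIGTERM, _on_signal)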

Backend (Visualization Server)

FastAPI Backend (Port 48889)
    ↓ Read-only connections
Board Files (./kohakuboard/)
    ├── {board_id}/
    │   ├── metadata.json
    │   ├── data/           ← SQL/columnar queries here
    │   │   ├── metrics/    ← KohakuVault DB files
    │   │   └── metadata.db ← SQLite database
    │   └── media/
    │       └── *.png, *.mp4
        ↓ REST API
Vue 3 Frontend (WebGL Charts)

Key Features:

  • Zero-copy serving: Reads board files directly (no separate database server)
  • Concurrent reads: Multiple connections supported
  • Fast queries: Columnar storage for metrics
  • Static serving: Media files served directly
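
A minimal sketch of this file-backed serving model (illustrative only; the endpoint paths here are made up and are not the project's actual routes): read metadata.json straight from the board directory and mount the directory so media files are served statically.

import json
from pathlib import Path

from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles

BOARD_ROOT = Path("./kohakuboard")
app = FastAPI()

@app.get("/boards/{board_id}/metadata")
def board_metadata(board_id: str) -> dict:
    # Served directly from disk; no intermediate database or cache
    return json.loads((BOARD_ROOT / board_id / "metadata.json").read_text())

# Media under ./kohakuboard/{board_id}/media/ becomes /files/{board_id}/media/...
app.mount("/files", StaticFiles(directory=BOARD_ROOT), name="files")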

Data Model

Directory Structure

./kohakuboard/
└── {board_id_timestamp}/
    ├── metadata.json           # Board info, config, timestamps
    ├── data/                   # Storage backend files
    │   ├── metrics/            # (hybrid) KohakuVault columnar files
    │   │   ├── train__loss.db
    │   │   ├── val__accuracy.db
    │   │   └── ...
    │   ├── metadata.db         # (hybrid) SQLite metadata
    │   └── histograms/
    │       ├── gradients_i32.db  # int32 precision
    │       └── params_u8.db      # uint8 precision (compact)
    ├── media/                  # Content-addressed storage
    │   ├── {name}_{idx}_{step}_{sha256}.png
    │   ├── {name}_{idx}_{step}_{sha256}.mp4
    │   └── {name}_{idx}_{step}_{sha256}.wav
    └── logs/
        ├── output.log          # Captured stdout/stderr
        └── writer.log          # Writer process logs

Metadata Schema

{
  "board_id": "20250129_150423_abc123",
  "name": "cifar10-resnet18",
  "config": {
    "lr": 0.001,
    "batch_size": 128,
    "epochs": 100
  },
  "created_at": "2025-01-29T15:04:23",
  "finished_at": "2025-01-29T18:32:45",
  "status": "finished",
  "version": "0.0.1"
}
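
Because each run carries its own metadata.json, a data directory can be enumerated with nothing but the standard library. A small helper based on the layout and schema above (field names taken from the example JSON):

import json
from pathlib import Path

def list_boards(data_dir: str = "./kohakuboard"):
    # rglob also covers project sub-directories ({base_dir}/{project}/{board_id})
    for meta_path in sorted(Path(data_dir).rglob("metadata.json")):
        meta = json.loads(meta_path.read_text())
        yield meta["board_id"], meta["name"], meta.get("status", "unknown")

for board_id, name, status in list_boards():
    print(f"{board_id}  {name}  [{status}]")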

Manual Sync / Remote Sharing

Both the training-side package (kohakuboard) and the optional server (kohakuboard_server) read the exact same directory layout. To move a run between machines:

  1. Copy the entire board folder ({base_dir}/{project}/{board_id}) using cp, rsync, or any file transfer tool.
  2. Drop it into the destination data directory (the folder you pass to kobo open ... or the directory configured via KOHAKU_BOARD_DATA_DIR / --data-dir on kobo-serve).
  3. Restart the viewer or refresh the UI. The new run is immediately available.

No export/import step is required because metrics, metadata, tensors, and media already live in KohakuVault + SQLite files. The legacy kobo sync command still expects a DuckDB export and will fail on modern boards—use manual copy until the new sync API lands.
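
A stdlib equivalent of steps 1-2 above, with the same effect as cp or rsync (the paths here are examples, not required locations):

import shutil
from pathlib import Path

src = Path("./kohakuboard/my-project/20250129_150423_abc123")   # board folder
dst = Path("/var/kohakuboard/my-project") / src.name            # server data dir
dst.parent.mkdir(parents=True, exist_ok=True)
shutil.copytree(src, dst, dirs_exist_ok=True)
print(f"Copied {src} -> {dst}; refresh the viewer to pick up the run.")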


CLI Tool

# Open local viewer (no server)
kobo open ./kohakuboard --browser

# Start authenticated server (kohakuboard_server package)
kobo-serve --port 48889 --host 0.0.0.0

# Manual sync (recommended today): copy the entire board folder into the server's data dir
# (kobo sync is still wired to the legacy DuckDB exporter and will error on modern boards)

Configuration

Basic Usage

# All boards use KohakuVault + SQLite (no backend parameter needed)
board = Board(name="my-experiment", project="vision")

Advanced Options

board = Board(
    name="experiment",
    board_id="custom-id",           # Auto-generated if not provided
    config={"lr": 0.001},           # Hyperparameters
    project="custom-project",       # Sub-directory inside base_dir
    base_dir="./my-boards",         # Custom directory
    capture_output=True,            # Capture stdout/stderr
    remote_url="https://board.example.com",  # Optional future sync target
    remote_token="...",             # Token for remote sync (WIP)
    sync_enabled=False,             # Enable when remote endpoints are ready
    memory_mode=False,              # Keep data in RAM (requires sync to persist)
    annotation="debug-run",         # Suffix appended to run directory name
)

Storage Architecture:

  • KohakuVault KVault: Media blobs (K-V table with B+Tree index)
  • KohakuVault ColumnVault: Metrics/histograms (blob-based columnar)
  • Standard SQLite: Metadata (traditional relational tables)

Context Manager

with Board(name="experiment") as board:
    board.log(loss=0.5)
    # Automatic flush() and finish() on exit

API Reference

Board

Board(
    name: str | None = None,
    board_id: str | None = None,
    config: dict | None = None,
    project: str | None = None,
    base_dir: str | Path | None = None,
    capture_output: bool = True,
    remote_url: str | None = None,
    remote_token: str | None = None,
    remote_project: str | None = None,
    sync_enabled: bool = False,
    sync_interval: int = 10,
    memory_mode: bool = False,
    *,
    annotation: str | None = None,
)

Methods:

board.step() - Increment global_step

for batch_idx, batch in enumerate(train_loader):
    loss = train_step(batch)
    board.step()  # Increment ONCE per optimizer step
    board.log(**{"train/loss": loss, "train/lr": scheduler.get_last_lr()[0]})

board.log(**metrics) - Log data (non-blocking)

board.log(
    loss=0.5,
    accuracy=0.95,
    learning_rate=0.001,            # Scalars
    sample=Media(image_array),      # Images/video/audio
    predictions=Table(rows),        # Tables (optionally with Media)
    grad_norm=Histogram(values),    # Histograms
)

# Namespaces (creates tabs in UI)
board.log(**{
    "train/loss": 0.5,
    "val/accuracy": 0.95
})

# Tensor + KDE payloads (specialized viewers)
board.log(
    attention_tensor=TensorLog(tensor),
    density=KernelDensity(values, grid_size=256),
)

board.flush() - Force flush (blocks until complete)

board.flush()  # Wait for all pending writes

board.finish() - Manual cleanup (auto-called on exit)

board.finish()  # Flush buffers, close connections

Data Types

Media

from kohakuboard.client import Media

# Images
board.log(
    sample_img=Media(image_array),  # numpy, PIL, torch tensor
    prediction=Media(pred_img, caption="Predicted: cat")
)

# Video
board.log(
    training_video=Media("output.mp4", media_type="video")
)

# Audio (if supported)
board.log(
    audio_sample=Media("sample.wav", media_type="audio")
)

Table

from kohakuboard.client import Table

# From list of dicts
results = Table([
    {"name": "Alice", "score": 95, "pass": True},
    {"name": "Bob", "score": 87, "pass": True},
])
board.log(student_results=results)

# Tables with embedded images
predictions = Table([
    {"image": Media(img), "label": "cat", "confidence": 0.95},
    {"image": Media(img2), "label": "dog", "confidence": 0.87},
])
board.log(val_predictions=predictions)

Histogram

from kohakuboard.client import Histogram

# Log gradient distributions
board.log(
    gradients=Histogram(param.grad),
    weights=Histogram(param.data)
)

# Precompute for efficiency (optional)
hist = Histogram(gradients).compute_bins()
board.log(grad_distribution=hist)

# Compact precision (75% size reduction, ~1% accuracy loss)
hist = Histogram(gradients, precision="compact")
board.log(grad_distribution=hist)

Deployment

Local Mode (Recommended)

# Install
pip install -e .

# Train
python train.py

# View results
kobo open ./kohakuboard --browser

Remote Mode (WIP)

# Run the authenticated server (still stabilizing)
kobo-serve --data-dir /var/kohakuboard --db sqlite:///kohakuboard.db

# Share boards by copying folders into /var/kohakuboard/<project>/
# Restart/reload the server to pick up new runs

See docs/kohakuboard/ for complete deployment guides.


Comparison with Alternatives

Feature         | WandB   | TensorBoard | MLflow  | KohakuBoard
----------------|---------|-------------|---------|----------------------------------
Latency         | ~10ms   | ~1ms        | ~5ms    | <0.1ms
Throughput      | ~1K/sec | ~10K/sec    | ~5K/sec | 20K+/sec
Offline         | ❌ No   | ✅ Yes      | ✅ Yes  | ✅ Yes
File-Based      | ❌ No   | ✅ Yes      | ❌ No   | ✅ Yes
Non-Blocking    | ❌ No   | ❌ No       | ❌ No   | ✅ Yes
Columnar Reads  | ❌ No   | ❌ No       | ✅ Yes  | ✅ Yes (KohakuVault ColumnVault)
WebGL Charts    | ❌ No   | ❌ No       | ❌ No   | ✅ Yes
100K+ Points    | Slow    | Slow        | Slow    | Fast
Self-Hosted     | Limited | ✅ Yes      | ✅ Yes  | ✅ Yes
Setup           | Cloud   | Local       | Server  | None

Documentation


Examples

See examples/ directory:

  • kohakuboard_basic.py - Simple scalar logging
  • kohakuboard_all_media_types.py - Images, videos, tables
  • kohakuboard_cifar_training.py - Complete CIFAR-10 training example
  • kohakuboard_media_in_tables.py - Tables with embedded images
  • kohakuboard_histogram_logging.py - Gradient distribution tracking

Roadmap

✅ Complete

Client Library:

  • Non-blocking logging architecture
  • Rich data types (scalars, media, tables, histograms)
  • Three-tier SQLite architecture (KohakuVault KVault + ColumnVault + Standard SQLite)
  • Graceful shutdown with queue draining
  • Content-addressed media storage

Backend & UI:

  • FastAPI REST API
  • Vue 3 interface with dark/light mode
  • WebGL charts (100K+ points)
  • Histogram navigator
  • Media/table viewers
  • CLI tool (kobo)

🚧 In Progress

  • Remote server mode with authentication
  • Sync protocol for uploading local boards
  • Project management (group related boards)
  • Run comparison UI (side-by-side metrics)
  • Real-time streaming (live updates while training)

📋 Planned

Client Features:

  • PyTorch Lightning integration
  • Keras callback
  • Hugging Face Trainer integration
  • Custom callback system

Backend Features:

  • Multi-board comparison API
  • Advanced filtering (tags, date range)
  • Export to CSV/JSON
  • Aggregation queries

UI Features:

  • Diff viewer (compare runs)
  • Scatter plots (metric vs metric)
  • Custom dashboards
  • Annotations
  • Search and filter

Infrastructure:

  • Docker/Kubernetes deployment
  • Cloud storage backends (S3, GCS)
  • Multi-user authentication

License

KohakuBoard is a multi-component project with different licenses:

  • Client Library (kohakuboard): Apache License 2.0

    • Free for commercial and non-commercial use
    • Permissive license with minimal requirements
  • Web UI (kohaku-board-ui): AGPL-3.0

    • Free to use and modify
    • Source code disclosure required for network services
  • Server (kohakuboard_server): Kohaku Software License 1.0

    • Free for non-commercial use
    • Free for commercial use under revenue/duration limits
    • Commercial licenses available for larger deployments

Commercial Licensing: For commercial licenses or exemptions, contact kohaku@kblueleaf.net

See LICENSE for complete details.


Contributing

KohakuBoard is part of the KohakuHub ecosystem. We welcome contributions!

Before contributing:

Areas we need help:

  • 🎨 Frontend (chart improvements, UI/UX)
  • 🔧 Backend (storage backends, performance)
  • 📊 Client library (framework integrations)
  • 📚 Documentation (tutorials, guides)
  • 🧪 Testing (unit tests, benchmarks)

Support


Acknowledgments

  • KohakuVault - High-performance storage library with dual SQLite interfaces (KVault for blobs, ColumnVault for sequences)
  • Plotly.js - WebGL charts
  • Vue 3 - Modern UI framework
  • FastAPI - Backend framework

Production Ready! Core features are stable and performant. Use in real training workflows and help us improve.
