
Sparse Autoencoder for Mechanistic Interpretability

Extract interpretable features from neural network activations using sparse autoencoders — the exact technique Anthropic uses to understand Claude's internals.

Reference: Scaling Monosemanticity (Anthropic, 2024)

The Core Idea

Neural networks compress many concepts into each neuron (superposition). A sparse autoencoder decompresses them back into interpretable features:

Model Activations (d=64)  →  SAE  →  Sparse Features (d=512)
    [entangled mess]                  [clean, interpretable]

Each feature in the overcomplete representation fires for a specific, human-understandable concept.
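The superposition claim can be made concrete with a small NumPy sketch (illustrative only, not code from this repository): random unit vectors in a 64-dimensional space are nearly orthogonal, which is why a model can pack many more concept directions than neurons into its activation space.

```python
import numpy as np

# Superposition sketch: 512 concept directions packed into 64 dimensions.
# Random high-dimensional unit vectors are nearly orthogonal, so many
# concepts can share a small activation space with little interference.
rng = np.random.default_rng(0)
d_input, n_concepts = 64, 512
directions = rng.standard_normal((n_concepts, d_input))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Average |cosine similarity| between distinct concept directions is small.
sims = directions @ directions.T
off_diag = np.abs(sims[~np.eye(n_concepts, dtype=bool)])
print(off_diag.mean())  # roughly 0.1 for d=64
```

The SAE's job is to undo this packing: recover which of the many near-orthogonal directions are active in a given activation vector.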

What's Inside

  • Sparse autoencoder — overcomplete dictionary learning with L1 sparsity, tied weights, decoder normalization
  • Feature analyzer — density analysis, co-occurrence, clustering, top-activating examples
  • Training pipeline — L1 warmup, dead feature resampling, Adam optimizer
  • Complete backpropagation — hand-derived gradients, no autograd

Architecture

x → (x - b_dec) → W_enc → ReLU → sparse features
                                      ↓
                              W_dec + b_dec → reconstructed x

Loss = MSE(x, reconstructed) + λ * L1(features)

Key design choices:

  • Overcomplete: d_hidden >> d_input (e.g., 8x-16x expansion)
  • L1 sparsity: forces most features to be zero → interpretable
  • Decoder normalization: prevents feature magnitude collapse
  • Dead feature resampling: reinitializes unused features toward high-loss inputs
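The forward pass and loss above can be sketched in a few lines of NumPy (a minimal sketch with illustrative variable names; the repository's `SparseAutoencoder` additionally implements hand-derived backprop, tied weights, and feature statistics):

```python
import numpy as np

rng = np.random.default_rng(0)
d_input, d_hidden, l1_coeff = 64, 512, 1e-3   # 8x overcomplete

W_enc = rng.standard_normal((d_input, d_hidden)) * 0.01
W_dec = rng.standard_normal((d_hidden, d_input)) * 0.01
b_dec = np.zeros(d_input)

# Decoder normalization: unit-norm decoder rows prevent magnitude collapse,
# where the decoder grows to let L1 shrink feature activations for free.
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

x = rng.standard_normal((256, d_input))            # batch of activations
features = np.maximum(0.0, (x - b_dec) @ W_enc)    # ReLU encoder
x_hat = features @ W_dec + b_dec                   # linear decoder

mse = np.mean((x - x_hat) ** 2)
l1 = np.abs(features).sum(axis=1).mean()
loss = mse + l1_coeff * l1
```

The L1 term is what drives sparsity: each active feature pays a cost, so the encoder learns to explain each input with as few features as possible.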

Quick Start

git clone https://github.com/BabyChrist666/sparse-autoencoder.git
cd sparse-autoencoder
pip install -r requirements.txt

# Run tests (27 passing)
pytest tests/ -v

Usage

from sparse_ae import SparseAutoencoder, SAEConfig, SAETrainer, SAETrainingConfig
from sparse_ae.features import FeatureAnalyzer

# Configure SAE (8x overcomplete)
config = SAEConfig(d_input=64, d_hidden=512, l1_coeff=1e-3)
sae = SparseAutoencoder(config)

# Train on model activations
train_config = SAETrainingConfig(num_steps=5000, batch_size=256)
trainer = SAETrainer(sae, train_config)
trainer.train(activations)

# Analyze learned features
analyzer = FeatureAnalyzer(sae)
output = sae.forward(activations)
analyzer.collect(activations, output.features)

report = analyzer.generate_report()
print(f"Alive features: {report['reconstruction']['alive_features']}")
print(f"Explained variance: {report['reconstruction']['explained_variance']}")

Feature Analysis Tools

Tool                          What it does
feature_density()             How often each feature activates (frequency)
dead_features()               Features that never activate (wasted capacity)
feature_cosine_similarity()   Geometric relationships between features
feature_cooccurrence()        Which features activate together
cluster_features()            Group related features by direction similarity
logit_attribution()           Which features drive specific predictions
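The first two tools reduce to simple batch statistics. This sketch shows one plausible implementation (function names mirror the table, but the repository's `FeatureAnalyzer` signatures may differ):

```python
import numpy as np

def feature_density(features, eps=1e-8):
    """Fraction of inputs on which each feature fires (activation > eps)."""
    return (features > eps).mean(axis=0)

def dead_features(features, eps=1e-8):
    """Indices of features that never activate on this batch."""
    return np.where(feature_density(features, eps) == 0.0)[0]

# Toy batch: 3 inputs, 3 features; feature 2 never fires.
feats = np.array([[0.0, 1.2, 0.0],
                  [0.0, 0.3, 0.0],
                  [0.5, 0.0, 0.0]])
print(feature_density(feats))  # [0.333... 0.666... 0.]
print(dead_features(feats))    # [2]
```

Healthy SAEs have a long tail of low-density features; a large dead set signals wasted capacity and is exactly what dead feature resampling targets during training.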

Project Structure

sparse_ae/
├── autoencoder.py    # Core SAE with forward, backward, feature stats
├── features.py       # Feature analysis, clustering, co-occurrence
└── trainer.py        # Training loop with L1 warmup + dead resampling

tests/                # 27 tests
├── test_autoencoder.py
├── test_features.py
└── test_trainer.py

Why This Matters

Sparse autoencoders are the primary tool for mechanistic interpretability — understanding WHAT neural networks learn, not just THAT they learn. By decomposing activations into sparse features, we can:

  1. Find interpretable concepts — individual features that fire for "code", "legal text", "numbers", etc.
  2. Map circuits — trace how features compose to produce behavior
  3. Detect safety-relevant features — features that activate on harmful content
  4. Steer model behavior — amplify or suppress specific features
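Point 4 amounts to editing the sparse code before decoding. A hypothetical sketch (this `steer` helper is not part of the repository's API):

```python
import numpy as np

def steer(features, W_dec, b_dec, feature_idx, scale=5.0):
    """Amplify one feature's activation, then decode back to activation space."""
    steered = features.copy()
    steered[:, feature_idx] *= scale   # amplify (or suppress with scale < 1)
    return steered @ W_dec + b_dec     # re-decode to the model's activation space

# Tiny example: identity decoder makes the effect easy to see.
feats = np.array([[1.0, 2.0]])
W_dec = np.eye(2)
b_dec = np.zeros(2)
print(steer(feats, W_dec, b_dec, feature_idx=1, scale=3.0))  # [[1. 6.]]
```

Writing the steered reconstruction back into the model's residual stream is how feature steering influences downstream behavior.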

Tech Stack

  • Python 3.10+
  • NumPy — all matrix operations
  • pytest — 27 tests

License

MIT
