
Sparse Autoencoder for Mechanistic Interpretability

Extract interpretable features from neural network activations using sparse autoencoders — the exact technique Anthropic uses to understand Claude's internals.

Reference: Scaling Monosemanticity (Anthropic, 2024)

The Core Idea

Neural networks compress many concepts into each neuron (superposition). A sparse autoencoder decompresses them back into interpretable features:

Model Activations (d=64)  →  SAE  →  Sparse Features (d=512)
    [entangled mess]                  [clean, interpretable]

Each feature in the overcomplete representation fires for a specific, human-understandable concept.
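The superposition claim can be made concrete with a small NumPy sketch (illustrative only, not code from this repository): random unit vectors in a 64-dimensional space are nearly orthogonal, which is why a model can pack many more concept directions than neurons into its activation space.

```python
import numpy as np

# Superposition sketch: 512 concept directions packed into 64 dimensions.
# Random high-dimensional unit vectors are nearly orthogonal, so many
# concepts can share a small activation space with little interference.
rng = np.random.default_rng(0)
d_input, n_concepts = 64, 512
directions = rng.standard_normal((n_concepts, d_input))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Average |cosine similarity| between distinct concept directions is small.
sims = directions @ directions.T
off_diag = np.abs(sims[~np.eye(n_concepts, dtype=bool)])
print(off_diag.mean())  # roughly 0.1 for d=64
```

The SAE's job is to undo this packing: recover which of the many near-orthogonal directions are active in a given activation vector.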

What's Inside

  • Sparse autoencoder — overcomplete dictionary learning with L1 sparsity, tied weights, decoder normalization
  • Feature analyzer — density analysis, co-occurrence, clustering, top-activating examples
  • Training pipeline — L1 warmup, dead feature resampling, Adam optimizer
  • Complete backpropagation — hand-derived gradients, no autograd

Architecture

x → (x - b_dec) → W_enc → ReLU → sparse features
                                      ↓
                              W_dec + b_dec → reconstructed x

Loss = MSE(x, reconstructed) + λ * L1(features)

Key design choices:

  • Overcomplete: d_hidden >> d_input (e.g., 8x-16x expansion)
  • L1 sparsity: forces most features to be zero → interpretable
  • Decoder normalization: prevents feature magnitude collapse
  • Dead feature resampling: reinitializes unused features toward high-loss inputs
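The forward pass and loss above can be sketched in a few lines of NumPy (a minimal sketch with illustrative variable names; the repository's `SparseAutoencoder` additionally implements hand-derived backprop, tied weights, and feature statistics):

```python
import numpy as np

rng = np.random.default_rng(0)
d_input, d_hidden, l1_coeff = 64, 512, 1e-3   # 8x overcomplete

W_enc = rng.standard_normal((d_input, d_hidden)) * 0.01
W_dec = rng.standard_normal((d_hidden, d_input)) * 0.01
b_dec = np.zeros(d_input)

# Decoder normalization: unit-norm decoder rows prevent magnitude collapse,
# where the decoder grows to let L1 shrink feature activations for free.
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

x = rng.standard_normal((256, d_input))            # batch of activations
features = np.maximum(0.0, (x - b_dec) @ W_enc)    # ReLU encoder
x_hat = features @ W_dec + b_dec                   # linear decoder

mse = np.mean((x - x_hat) ** 2)
l1 = np.abs(features).sum(axis=1).mean()
loss = mse + l1_coeff * l1
```

The L1 term is what drives sparsity: each active feature pays a cost, so the encoder learns to explain each input with as few features as possible.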

Quick Start

git clone https://github.com/BabyChrist666/sparse-autoencoder.git
cd sparse-autoencoder
pip install -r requirements.txt

# Run tests (27 passing)
pytest tests/ -v

Usage

from sparse_ae import SparseAutoencoder, SAEConfig, SAETrainer, SAETrainingConfig
from sparse_ae.features import FeatureAnalyzer

# Configure SAE (8x overcomplete)
config = SAEConfig(d_input=64, d_hidden=512, l1_coeff=1e-3)
sae = SparseAutoencoder(config)

# Train on model activations
train_config = SAETrainingConfig(num_steps=5000, batch_size=256)
trainer = SAETrainer(sae, train_config)
trainer.train(activations)

# Analyze learned features
analyzer = FeatureAnalyzer(sae)
output = sae.forward(activations)
analyzer.collect(activations, output.features)

report = analyzer.generate_report()
print(f"Alive features: {report['reconstruction']['alive_features']}")
print(f"Explained variance: {report['reconstruction']['explained_variance']}")

Feature Analysis Tools

Tool                          What it does
feature_density()             How often each feature activates (frequency)
dead_features()               Features that never activate (wasted capacity)
feature_cosine_similarity()   Geometric relationships between features
feature_cooccurrence()        Which features activate together
cluster_features()            Group related features by direction similarity
logit_attribution()           Which features drive specific predictions
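The first two tools reduce to simple batch statistics. This sketch shows one plausible implementation (function names mirror the table, but the repository's `FeatureAnalyzer` signatures may differ):

```python
import numpy as np

def feature_density(features, eps=1e-8):
    """Fraction of inputs on which each feature fires (activation > eps)."""
    return (features > eps).mean(axis=0)

def dead_features(features, eps=1e-8):
    """Indices of features that never activate on this batch."""
    return np.where(feature_density(features, eps) == 0.0)[0]

# Toy batch: 3 inputs, 3 features; feature 2 never fires.
feats = np.array([[0.0, 1.2, 0.0],
                  [0.0, 0.3, 0.0],
                  [0.5, 0.0, 0.0]])
print(feature_density(feats))  # [0.333... 0.666... 0.]
print(dead_features(feats))    # [2]
```

Healthy SAEs have a long tail of low-density features; a large dead set signals wasted capacity and is exactly what dead feature resampling targets during training.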

Project Structure

sparse_ae/
├── autoencoder.py    # Core SAE with forward, backward, feature stats
├── features.py       # Feature analysis, clustering, co-occurrence
└── trainer.py        # Training loop with L1 warmup + dead resampling

tests/                # 27 tests
├── test_autoencoder.py
├── test_features.py
└── test_trainer.py

Why This Matters

Sparse autoencoders are the primary tool for mechanistic interpretability — understanding WHAT neural networks learn, not just THAT they learn. By decomposing activations into sparse features, we can:

  1. Find interpretable concepts — individual features that fire for "code", "legal text", "numbers", etc.
  2. Map circuits — trace how features compose to produce behavior
  3. Detect safety-relevant features — features that activate on harmful content
  4. Steer model behavior — amplify or suppress specific features
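Point 4 amounts to editing the sparse code before decoding. A hypothetical sketch (this `steer` helper is not part of the repository's API):

```python
import numpy as np

def steer(features, W_dec, b_dec, feature_idx, scale=5.0):
    """Amplify one feature's activation, then decode back to activation space."""
    steered = features.copy()
    steered[:, feature_idx] *= scale   # amplify (or suppress with scale < 1)
    return steered @ W_dec + b_dec     # re-decode to the model's activation space

# Tiny example: identity decoder makes the effect easy to see.
feats = np.array([[1.0, 2.0]])
W_dec = np.eye(2)
b_dec = np.zeros(2)
print(steer(feats, W_dec, b_dec, feature_idx=1, scale=3.0))  # [[1. 6.]]
```

Writing the steered reconstruction back into the model's residual stream is how feature steering influences downstream behavior.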

Tech Stack

  • Python 3.10+
  • NumPy — all matrix operations
  • pytest — 27 tests

License

MIT
