Extract interpretable features from neural network activations using sparse autoencoders — the exact technique Anthropic uses to understand Claude's internals.
Reference: Scaling Monosemanticity (Anthropic, 2024)
Neural networks compress many concepts into each neuron (superposition). A sparse autoencoder decompresses them back into interpretable features:
```
Model Activations (d=64)  →  SAE  →  Sparse Features (d=512)
   [entangled mess]                   [clean, interpretable]
```
Each feature in the overcomplete representation fires for a specific, human-understandable concept.
- Sparse autoencoder — overcomplete dictionary learning with L1 sparsity, tied weights, decoder normalization
- Feature analyzer — density analysis, co-occurrence, clustering, top-activating examples
- Training pipeline — L1 warmup, dead feature resampling, Adam optimizer
- Complete backpropagation — hand-derived gradients, no autograd
```
x → (x - b_dec) → W_enc → ReLU → sparse features
                                       ↓
                       W_dec + b_dec → reconstructed x

Loss = MSE(x, reconstructed) + λ * L1(features)
```
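The forward pass and loss above can be sketched in a few lines of NumPy. This is a minimal illustration of the architecture, not the repo's actual implementation; the variable names (`W_enc`, `b_enc`, `W_dec`, `b_dec`, `l1_coeff`) are chosen to mirror the diagram.

```python
import numpy as np

rng = np.random.default_rng(0)
d_input, d_hidden = 64, 512          # 8x overcomplete

# Illustrative parameter shapes (not the repo's API)
W_enc = rng.normal(0, 0.02, (d_input, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(0, 0.02, (d_hidden, d_input))
b_dec = np.zeros(d_input)
l1_coeff = 1e-3

x = rng.normal(size=(256, d_input))                    # batch of model activations
features = np.maximum(0, (x - b_dec) @ W_enc + b_enc)  # ReLU encoder → sparse features
x_hat = features @ W_dec + b_dec                       # linear decoder → reconstruction

mse = np.mean((x - x_hat) ** 2)                        # reconstruction term
l1 = np.mean(np.abs(features).sum(axis=1))             # sparsity term
loss = mse + l1_coeff * l1
```

The L1 term is what pushes most feature activations to exactly zero, since ReLU outputs are non-negative and the penalty grows linearly with any activation.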
Key design choices:
- Overcomplete: d_hidden >> d_input (e.g., 8x-16x expansion)
- L1 sparsity: forces most features to be zero → interpretable
- Decoder normalization: prevents feature magnitude collapse
- Dead feature resampling: reinitializes unused features toward high-loss inputs
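Decoder normalization (third bullet above) can be sketched as rescaling each feature's decoder direction to unit L2 norm after every optimizer step. Without it, the model can shrink feature activations (cheating the L1 penalty) while growing decoder weights to compensate. This is an illustrative sketch, not the repo's exact code.

```python
import numpy as np

def normalize_decoder(W_dec, eps=1e-8):
    """Rescale each decoder row (one row per feature) to unit L2 norm.

    Illustrative helper; the repo's internal name and layout may differ.
    """
    norms = np.linalg.norm(W_dec, axis=1, keepdims=True)  # one norm per feature
    return W_dec / np.maximum(norms, eps)                 # eps guards dead rows

W_dec = np.random.default_rng(0).normal(size=(512, 64))
W_dec = normalize_decoder(W_dec)
```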
```bash
git clone https://github.com/BabyChrist666/sparse-autoencoder.git
cd sparse-autoencoder
pip install -r requirements.txt

# Run tests (27 passing)
pytest tests/ -v
```

```python
from sparse_ae import SparseAutoencoder, SAEConfig, SAETrainer, SAETrainingConfig
from sparse_ae.features import FeatureAnalyzer

# Configure SAE (8x overcomplete)
config = SAEConfig(d_input=64, d_hidden=512, l1_coeff=1e-3)
sae = SparseAutoencoder(config)

# Train on model activations
train_config = SAETrainingConfig(num_steps=5000, batch_size=256)
trainer = SAETrainer(sae, train_config)
trainer.train(activations)

# Analyze learned features
analyzer = FeatureAnalyzer(sae)
output = sae.forward(activations)
analyzer.collect(activations, output.features)
report = analyzer.generate_report()
print(f"Alive features: {report['reconstruction']['alive_features']}")
print(f"Explained variance: {report['reconstruction']['explained_variance']}")
```

| Tool | What it does |
|---|---|
| `feature_density()` | How often each feature activates (frequency) |
| `dead_features()` | Features that never activate (wasted capacity) |
| `feature_cosine_similarity()` | Geometric relationships between features |
| `feature_cooccurrence()` | Which features activate together |
| `cluster_features()` | Group related features by direction similarity |
| `logit_attribution()` | Which features drive specific predictions |
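To make the first two tools concrete, here is what feature density and dead-feature detection compute, given a `(num_samples, d_hidden)` matrix of SAE feature activations. The function names mirror the table; the bodies are hedged sketches of the underlying statistics, not the repo's implementation.

```python
import numpy as np

def feature_density(features):
    """Fraction of samples on which each feature is nonzero (sketch)."""
    return (features > 0).mean(axis=0)

def dead_features(features):
    """Indices of features that never activate across the dataset (sketch)."""
    return np.flatnonzero((features > 0).sum(axis=0) == 0)

# Toy example: feature 0 always fires, the other 7 never do
feats = np.zeros((100, 8))
feats[:, 0] = 1.0
density = feature_density(feats)   # density[0] == 1.0, rest 0.0
dead = dead_features(feats)        # indices 1..7
```

In a healthy SAE, most densities are small (features fire rarely, hence interpretably) and the dead set is small (capacity is not wasted) — the trainer's resampling step exists to shrink the latter.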
```
sparse_ae/
├── autoencoder.py   # Core SAE with forward, backward, feature stats
├── features.py      # Feature analysis, clustering, co-occurrence
└── trainer.py       # Training loop with L1 warmup + dead resampling

tests/               # 27 tests
├── test_autoencoder.py
├── test_features.py
└── test_trainer.py
```
Sparse autoencoders are the primary tool for mechanistic interpretability — understanding WHAT neural networks learn, not just THAT they learn. By decomposing activations into sparse features, we can:
- Find interpretable concepts — individual features that fire for "code", "legal text", "numbers", etc.
- Map circuits — trace how features compose to produce behavior
- Detect safety-relevant features — features that activate on harmful content
- Steer model behavior — amplify or suppress specific features
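Steering (the last bullet) falls out of the decomposition directly: encode an activation, scale one feature, decode back. A hedged sketch, assuming the same encoder/decoder shapes as above — the parameter names and the chosen feature index (42) are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_input, d_hidden = 64, 512
W_enc = rng.normal(0, 0.02, (d_input, d_hidden))
W_dec = rng.normal(0, 0.02, (d_hidden, d_input))
b_dec = np.zeros(d_input)

x = rng.normal(size=d_input)              # one model activation vector
f = np.maximum(0, (x - b_dec) @ W_enc)    # encode → sparse features
f[42] += 3.0                              # inject/amplify an arbitrary feature
x_steered = f @ W_dec + b_dec             # decode → steered activation
```

The steered activation can then be written back into the model's residual stream in place of `x`, amplifying or suppressing whatever concept that feature represents.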
- Python 3.10+
- NumPy — all matrix operations
- pytest — 27 tests
MIT