This repository compares transformers built with multilayer perceptron (MLP) layers against transformers built with Kolmogorov-Arnold network (KAN) layers.
Key points:
- Uses Kolmogorov-Arnold Network layers, but with Chebyshev polynomials as the basis (inspired by this repo).
- The tanh function keeps the activation values within [-1, 1], rather than using grids that update during training (see the sketch after this list).
- Both models are trained on 134M tokens of TinyStories.
- Both use the standard GPT-2 architecture (apart from the KAN layers).
- The MLP version has 3.3M non-embedding weights and the KAN model has 2.5M non-embedding weights (~25% fewer).
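For reference, a minimal sketch of a Chebyshev-basis KAN layer with tanh input normalization is shown below. This is illustrative PyTorch code: the class name, initialization, and default polynomial degree are assumptions, not taken from this repo's actual implementation.

```python
import torch
import torch.nn as nn


class ChebyshevKANLayer(nn.Module):
    """Illustrative KAN layer using Chebyshev polynomials as the basis."""

    def __init__(self, in_dim: int, out_dim: int, degree: int = 4):
        super().__init__()
        self.degree = degree
        # One learnable coefficient per (input dim, output dim, polynomial order).
        self.coeffs = nn.Parameter(
            torch.randn(in_dim, out_dim, degree + 1) / (in_dim * (degree + 1)) ** 0.5
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # tanh maps activations into [-1, 1], the domain of the Chebyshev basis,
        # instead of maintaining a grid that updates during training.
        x = torch.tanh(x)
        # Build T_0(x), ..., T_degree(x) via T_n(x) = 2x * T_{n-1}(x) - T_{n-2}(x).
        cheb = [torch.ones_like(x), x]
        for _ in range(2, self.degree + 1):
            cheb.append(2 * x * cheb[-1] - cheb[-2])
        basis = torch.stack(cheb[: self.degree + 1], dim=-1)  # (..., in_dim, degree+1)
        # Sum the learned univariate functions over input dims for each output dim.
        return torch.einsum("...id,iod->...o", basis, self.coeffs)
```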
Results:
They both achieve a final loss of ~2.46 (despite the KAN model having 25% fewer parameters!).
Hyperparameters:
- d_model: 128
- d_mlp: 768 (when applicable)
- n_heads: 8
- n_layers: 16
- learning_rate: 1e-5
- batch_size: 16
- weight_decay: 0.001
- optimizer: adamw
- seq_len: 64
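The same settings expressed as a hypothetical Python config (field names are assumptions, not this repo's actual config schema):

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    # Hypothetical config mirroring the hyperparameters listed above.
    d_model: int = 128
    d_mlp: int = 768          # only used by the MLP variant
    n_heads: int = 8
    n_layers: int = 16
    learning_rate: float = 1e-5
    batch_size: int = 16
    weight_decay: float = 0.001
    optimizer: str = "adamw"
    seq_len: int = 64
```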
Hardware: Single 1080ti GPU
Wandb: link.