This repository contains the code to train and test autoregressive transformer models on chess games from scratch. I used it to train DimChess-0.3B, an open-source 0.3B-parameter chess model, on 14M chess games with my personal RTX 3090 GPU in ≈260 hours.
The model is based on the transformer architecture (only the decoder part) from the paper Attention is All You Need by Google Brain (2017), with a few improvements:
- I replaced the default normalization layer with Root Mean Square Layer Normalization (RMSNorm) from the paper Root Mean Square Layer Normalization by the University of Edinburgh (2019) (see the sketch after this list)
- I moved the normalization layers before the transformer blocks (instead of after) as in the paper On Layer Normalization in the Transformer Architecture by Microsoft Research (2020)
- I replaced the ReLU activation with the SwiGLU activation from the paper GLU Variants Improve Transformer by Google (2020)
- I implemented Grouped-Query Attention (GQA) from the paper GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints by Google Research (2023)
- I replaced the absolute positional embedding by the Rotary Position Embedding (RoPE) from the paper RoFormer: Enhanced Transformer with Rotary Position Embedding by Zhuiyi Technology (2023)
- I added multiple input and output embeddings to use different token vocabularies at the same time (board position, piece type, capture...)
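As an illustration, here is a minimal PyTorch sketch of two of these components, RMSNorm and a SwiGLU feed-forward block; the class and argument names are my own and may not match the repository's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization (Zhang & Sennrich, 2019)."""
    def __init__(self, dim: int, eps: float = 1e-8):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the inverse root mean square of the features, then apply the learned gain.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block (Shazeer, 2020): W2(Swish(W x) * V x)."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w = nn.Linear(dim, hidden_dim, bias=False)
        self.v = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w(x)) * self.v(x))
```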
Here are the main parameters of the architecture:
| Parameter | Value |
|---|---|
| Embedding dimension | 1,024 |
| Number of layers | 24 |
| Heads dimension | 64 |
| Feed forward hidden dimension | 2,730 |
| Number of heads | 16 |
| Number of grouped heads | 4 |
| Context length | 2,048 |
| Vocab sizes | 74, 13, 2, 2, 2, 2 |
The resulting model has 264,436,736 trainable parameters and fits on a single RTX 3090 GPU with a batch size of 4 for training using mixed precision. For inference only, the model will probably fit on any modern GPU.
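As a sanity check, a rough parameter-count estimate can be derived from the table above. The decomposition below is my own assumption (SwiGLU with three weight matrices, GQA with 4 key/value heads, separate input and output embeddings, two RMSNorm gains per layer plus a final one), but it happens to reproduce the reported count:

```python
emb_dim, n_layers, head_dim = 1024, 24, 64
n_heads, n_kv_heads, ffn_hidden = 16, 4, 2730
vocab_sizes = [74, 13, 2, 2, 2, 2]

attention = emb_dim * (n_heads * head_dim)           # query projection
attention += 2 * emb_dim * (n_kv_heads * head_dim)   # key and value projections (GQA)
attention += (n_heads * head_dim) * emb_dim          # output projection

feed_forward = 3 * emb_dim * ffn_hidden              # SwiGLU: W, V and W2 matrices
norms = 2 * emb_dim                                  # two RMSNorm gains per layer

per_layer = attention + feed_forward + norms
embeddings = 2 * sum(vocab_sizes) * emb_dim          # input + output embeddings

total = n_layers * per_layer + embeddings + emb_dim  # + final RMSNorm
print(f"{total:,}")                                  # 264,436,736
```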
The dataset I made to train this model is composed of 14M chess games from high-level players, for a total of 1.2B moves played between 1600 and 2024. You can download it on Hugging Face 🤗.
For the tokenization, I created a custom multi-layer move-to-token tokenizer with 6 different vocabularies:
- Board position: 74 tokens for the 64 squares of the chess board + 10 control tokens
- Piece type: 13 tokens for the 12 different pieces + 1 null token
- Capture: 2 tokens for the capture state
- En passant: 2 tokens for the en passant state
- Check: 2 tokens for the check state
- Checkmate: 2 tokens for the checkmate state
A move is usually composed of 3 tokens (each token containing 6 layers):
- The board position of the piece to move, with the piece type (`null` for the other layers)
- The board position of the destination square, with the piece type (can be different in case of promotion) and the different states depending on the move
- The `<m/>` token (`null` for the other layers)

If the move is a castle, 2 tokens are added before the `<m/>` token for the rook move.
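To make this concrete, here is a purely illustrative sketch of what the layered encoding of the opening move 1. e4 could look like; the field names and values are my own and not the repository's exact token IDs:

```python
# Each token carries one value per vocabulary layer:
# (board position, piece type, capture, en passant, check, checkmate).
move_e4 = [
    # 1) Source square of the moving piece, with its piece type (state layers are null).
    {"position": "e2",   "piece": "white_pawn", "capture": 0, "en_passant": 0, "check": 0, "checkmate": 0},
    # 2) Destination square, with the piece type and the move's states.
    {"position": "e4",   "piece": "white_pawn", "capture": 0, "en_passant": 0, "check": 0, "checkmate": 0},
    # 3) End-of-move marker, null for the other layers.
    {"position": "<m/>", "piece": "null",       "capture": 0, "en_passant": 0, "check": 0, "checkmate": 0},
]
```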
For the training, I used stochastic gradient descent with the AdamW optimizer and warmup plus cosine decay learning rate schedules. Here are the main hyperparameters (a sketch of this setup follows the table):
| Hyperparameter | Value |
|---|---|
| Batch size (tokens) | 524,288 |
| Optimizer | AdamW |
| Learning rate | 6.0 × 10⁻⁴ |
| Warmup steps | 2,000 |
| Decay steps | 28,000 |
| β₁ | 0.9 |
| β₂ | 0.95 |
| ε | 10⁻⁸ |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
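For reference, here is a minimal PyTorch sketch of this optimizer and learning-rate schedule; the function name, the placeholder model and the minimum learning rate are my own assumptions, not necessarily what the repository does:

```python
import math
import torch

def lr_at_step(step, max_lr=6e-4, warmup_steps=2_000, decay_steps=28_000, min_lr=6e-5):
    """Linear warmup followed by cosine decay (the min_lr value is an assumption)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= decay_steps:
        return min_lr
    progress = (step - warmup_steps) / (decay_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(1024, 95)  # placeholder standing in for the chess transformer
optimizer = torch.optim.AdamW(
    model.parameters(), lr=6e-4, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1
)

for step in range(30_000):
    for group in optimizer.param_groups:
        group["lr"] = lr_at_step(step)
    # ... forward pass, backward pass and optimizer step go here ...
```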
I trained the model on my personal RTX 3090 GPU for ≈4 epochs, using mixed precision and gradient accumulation to increase speed and reduce memory usage (a sketch of the training step follows the summary):
| Training summary | |
|---|---|
| Tokens | 15,728,640,000 |
| Steps | 30,000 |
| FLOPs | 2.5 × 10¹⁹ |
| Duration | 256 hours |
| Final loss | 0.63 |
| Final accuracy | 79.9 % |
| Final Elo | 1,741 ± 11 |
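Below is a minimal sketch of what one mixed-precision training step with gradient accumulation and gradient clipping can look like; the placeholder model, the dummy data and the accumulation factor are my own examples, not the repository's actual values:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(1024, 95).cuda()  # placeholder standing in for the chess transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1)
scaler = torch.cuda.amp.GradScaler()
accumulation_steps = 32                   # micro-batches accumulated per optimizer step (example value)

optimizer.zero_grad(set_to_none=True)
for _ in range(accumulation_steps):
    inputs = torch.randn(4, 1024, device="cuda")              # dummy micro-batch
    targets = torch.randint(0, 95, (4,), device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = F.cross_entropy(model(inputs), targets) / accumulation_steps
    scaler.scale(loss).backward()                             # accumulate scaled gradients

scaler.unscale_(optimizer)                                    # unscale before clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)       # gradient clipping at 1.0
scaler.step(optimizer)
scaler.update()
```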
I tested the model against the Stockfish 16 chess engine configured with the `UCI_Elo` parameter (from ≈1,300 to ≈3,200); the first 3 moves of each side were chosen randomly to create different games. Here are the results:
Using these results, I estimated the Elo of the model to be around 1,741 (±11), but the Stockfish UCI Elo metric is a bit unclear, so I don't know to what extent it makes sense to compare it to FIDE, Lichess or Chess.com ratings.
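For illustration, here is a sketch of how such a match could be set up with the python-chess library; the Stockfish path and the `model_pick_move` helper are placeholders, not the repository's actual code:

```python
import random
import chess
import chess.engine

def model_pick_move(board: chess.Board) -> chess.Move:
    # Placeholder: the real code would query DimChess-0.3B for the next move.
    return random.choice(list(board.legal_moves))

# Adjust the path to your local Stockfish binary.
engine = chess.engine.SimpleEngine.popen_uci("/usr/bin/stockfish")
engine.configure({"UCI_LimitStrength": True, "UCI_Elo": 1700})

board = chess.Board()
while not board.is_game_over():
    if board.turn == chess.WHITE:
        move = model_pick_move(board)
    else:
        move = engine.play(board, chess.engine.Limit(time=0.1)).move
    board.push(move)

print(board.result())
engine.quit()
```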
The trained weights of the model are available on Google Drive: you just need to download the `.pt` file of the model and put it in the `models` folder.
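If you want to inspect the checkpoint manually, here is a minimal sketch; the file name is hypothetical and the checkpoint structure depends on how the repository saves its models:

```python
import torch

# Hypothetical file name; use the .pt file you downloaded into the models folder.
checkpoint = torch.load("models/dimchess_0.3b.pt", map_location="cpu")

# Print the top-level structure (e.g. a state dict and training metadata).
print(type(checkpoint))
if isinstance(checkpoint, dict):
    print(list(checkpoint.keys()))
```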
Run the following command to install the dependencies:
```shell
$ pip install -r requirements.txt
```
- Set the `STOCKFISH_PATH` constant in `chess_ai/settings.py` to the path of your Stockfish engine
- Run the `create_data.ipynb` file to create the dataset
- Run the `training.ipynb` file (you can stop the training at any time and resume it later thanks to the checkpoints)
- If you don't have an overpriced 24GB GPU like me, the default settings (those used to train DimChess-0.3B) may not work for you. You can try to:
  - Reduce the batch size (less stable and a worse lowest point)
  - Increase the accumulation steps (fixes the previous problems but is slower)
  - Reduce some architecture parameters (a worse lowest point)
- Run the `testing.ipynb` file to use the models you downloaded or trained
- Angel Uriot: Creator of the project.