Training framework for small language models using SimpleStories, a large-scale synthetic dataset of over 2 million short stories in simple language.
Paper: Parameterized Synthetic Text Generation with SimpleStories
Models & Dataset: 🤗 SimpleStories on Hugging Face
Note: This implementation removes the morphological analysis functionality described in the paper (page 5), where common English affixes (prefixes like "un", "re" and suffixes like "ed", "ing", "ly") were included as part of the tokenizer's initial alphabet. Empirical testing showed the WordPiece trainer naturally discovers these morphemes during training, making explicit seeding redundant.
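As an illustration, a WordPiece tokenizer trained with the Hugging Face `tokenizers` library picks up such subwords on its own. The snippet below is a minimal sketch under that assumption; the vocabulary size, special tokens, and corpus file are placeholders and do not reproduce the repository's exact tokenizer settings.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Build a WordPiece tokenizer without seeding any affixes into the initial alphabet.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(vocab_size=4096, special_tokens=["[UNK]", "[EOS]"])  # placeholder settings
tokenizer.train(files=["stories.txt"], trainer=trainer)  # hypothetical corpus file

# Common morphemes such as "##ing", "##ed", and "##ly" typically appear in the
# learned vocabulary anyway, which is why explicit seeding was dropped.
print([tok for tok in tokenizer.get_vocab() if tok in {"##ing", "##ed", "##ly"}])
```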
From the root of the repository, run one of:
make install-dev # To install the package, dev requirements and pre-commit hooks
make install # To just install the package (runs `pip install -e .`)

Suggested extensions and settings for VSCode are provided in .vscode/. To use the suggested
settings, copy .vscode/settings-example.json to .vscode/settings.json.
There are various make commands that may be helpful
make check # Run pre-commit on all files (i.e. pyright, ruff linter, and ruff formatter)
make type # Run pyright on all files
make format # Run ruff linter and formatter on all files
make test # Run tests that aren't marked `slow`
make test-all # Run all tests

To train a model, run

python -m simple_stories_train.train [PATH/TO/CONFIG.yaml] [--key1 value1 --key2 value2 ...]

where
- `PATH/TO/CONFIG.yaml` contains the training config. If no path is provided, a default config will be used.
- `--key1 value1 --key2 value2 ...` override values in the config. Note that if you wish to update a nested value, you must use dotted notation (e.g. `--train_dataset_config.name my_dataset`).
If running on CPU, you may need to set --compile=False.
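For intuition, dotted-key overrides can be applied to a nested config roughly as sketched below; the helper and the config layout are hypothetical and may not match the repository's actual implementation.

```python
from typing import Any

def apply_override(config: dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set a nested config value from a dotted key such as 'train_dataset_config.name'."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node[key]  # descend into the nested mapping
    node[leaf] = value

# Hypothetical config layout, for illustration only.
config = {"lr": 3e-4, "train_dataset_config": {"name": "SimpleStories", "split": "train"}}
apply_override(config, "train_dataset_config.name", "my_dataset")
assert config["train_dataset_config"]["name"] == "my_dataset"
```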
To run on multiple GPUs, use
torchrun --standalone --nproc_per_node=N -m simple_stories_train.train ...
where N is the number of GPUs to use.
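As background, torchrun sets environment variables such as RANK, LOCAL_RANK, and WORLD_SIZE for each spawned process, and a DDP training script typically reads them along the lines below. This is a minimal sketch, not the repository's actual setup code.

```python
import os

import torch
import torch.distributed as dist

# torchrun sets these environment variables for every spawned process.
rank = int(os.environ.get("RANK", 0))
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

if world_size > 1:
    # NCCL is the usual backend for multi-GPU training.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

# model = MyModel().to(local_rank)  # hypothetical model
# model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```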
To submit training jobs to a SLURM cluster:
# Submit to SLURM (8 GPUs by default)
sst-train --config_path simple_stories_train/configs/your_config.yaml
# Custom GPU count and partition
sst-train --config_path ... --n_gpus 4 --partition h200-dev --time 24:00:00
# Run locally instead of submitting to SLURM
sst-train --config_path ... --local

Options:
- `--config_path`: Path to training config YAML (required)
- `--n_gpus`: Number of GPUs (default: 8 for SLURM, 1 for local)
- `--partition`: SLURM partition name (default: 'h200-reserved-default')
- `--time`: Job time limit in HH:MM:SS (default: '72:00:00')
- `--job_name`: Custom job name
- `--local`: Run locally instead of submitting to SLURM
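Conceptually, a launcher like this can be thought of as writing a batch script and handing it to sbatch. The sketch below is a simplified illustration under that assumption (the sbatch directives, especially the GPU request, are placeholders), not the actual sst-train implementation.

```python
import subprocess
from pathlib import Path

def submit(config_path: str, n_gpus: int = 8, partition: str = "h200-reserved-default",
           time: str = "72:00:00", job_name: str = "sst-train") -> None:
    """Write a minimal sbatch script and submit it (illustrative only)."""
    script = f"""#!/bin/bash
#SBATCH --job-name={job_name}
#SBATCH --partition={partition}
#SBATCH --gres=gpu:{n_gpus}
#SBATCH --time={time}

torchrun --standalone --nproc_per_node={n_gpus} -m simple_stories_train.train {config_path}
"""
    path = Path("submit_train.sh")
    path.write_text(script)
    subprocess.run(["sbatch", str(path)], check=True)
```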
To track training with Weights & Biases, you can set the WANDB_PROJECT and WANDB_API_KEY variables in
.env. API keys can be obtained from your Weights & Biases account settings.
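For example, with the python-dotenv and wandb packages, the variables can be loaded and used roughly as follows; this is an illustrative sketch, and the training script may wire this up differently.

```python
import os

import wandb
from dotenv import load_dotenv

# Load WANDB_PROJECT and WANDB_API_KEY from the .env file at the repository root.
load_dotenv()

wandb.login(key=os.environ["WANDB_API_KEY"])
run = wandb.init(project=os.environ["WANDB_PROJECT"], config={"lr": 3e-4})  # config is illustrative
run.log({"loss": 0.0})
run.finish()
```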
- The training script is based on the efficient `train_gpt2.py` in llm.c (MIT licensed, (c) 2024 Andrej Karpathy)
- Some model architecture implementations are based on TransformerLens (MIT licensed, (c) 2022 TransformerLensOrg)