Training framework for small language models using SimpleStories, a large-scale synthetic dataset of over 2 million short stories in simple language.
Paper: Parameterized Synthetic Text Generation with SimpleStories
Models & Dataset: 🤗 SimpleStories on Hugging Face
Note: This implementation removes the morphological analysis functionality described in the paper (page 5), where common English affixes (prefixes like "un", "re" and suffixes like "ed", "ing", "ly") were included as part of the tokenizer's initial alphabet. Empirical testing showed the WordPiece trainer naturally discovers these morphemes during training, making explicit seeding redundant.
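As an illustration, a WordPiece tokenizer trained with the Hugging Face `tokenizers` library picks up such subwords on its own. The snippet below is a minimal sketch under that assumption; the vocabulary size, special tokens, and corpus file are placeholders and do not reproduce the repository's exact tokenizer settings.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Build a WordPiece tokenizer without seeding any affixes into the initial alphabet.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(vocab_size=4096, special_tokens=["[UNK]", "[EOS]"])  # placeholder settings
tokenizer.train(files=["stories.txt"], trainer=trainer)  # hypothetical corpus file

# Common morphemes such as "##ing", "##ed", and "##ly" typically appear in the
# learned vocabulary anyway, which is why explicit seeding was dropped.
print([tok for tok in tokenizer.get_vocab() if tok in {"##ing", "##ed", "##ly"}])
```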
From the root of the repository, run one of:
make install-dev # To install the package, dev requirements and pre-commit hooks
make install # To just install the package (runs `pip install -e .`)

Suggested extensions and settings for VSCode are provided in .vscode/. To use the suggested
settings, copy .vscode/settings-example.json to .vscode/settings.json.
There are various make commands that may be helpful
make check # Run pre-commit on all files (i.e. pyright, ruff linter, and ruff formatter)
make type # Run pyright on all files
make format # Run ruff linter and formatter on all files
make test # Run tests that aren't marked `slow`
make test-all # Run all tests

To train a model, run

python -m simple_stories_train.train [PATH/TO/CONFIG.yaml] [--key1 value1 --key2 value2 ...]

where
- `PATH/TO/CONFIG.yaml` contains the training config. If no path is provided, a default config will be used.
- `--key1 value1 --key2 value2 ...` override values in the config. Note that if you wish to update a nested value, you must use dotted notation (e.g. `--train_dataset_config.name my_dataset`).
If running on CPU, you may need to set --compile=False.
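For intuition, dotted-key overrides can be applied to a nested config roughly as sketched below; the helper and the config layout are hypothetical and may not match the repository's actual implementation.

```python
from typing import Any

def apply_override(config: dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set a nested config value from a dotted key such as 'train_dataset_config.name'."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node[key]  # descend into the nested mapping
    node[leaf] = value

# Hypothetical config layout, for illustration only.
config = {"lr": 3e-4, "train_dataset_config": {"name": "SimpleStories", "split": "train"}}
apply_override(config, "train_dataset_config.name", "my_dataset")
assert config["train_dataset_config"]["name"] == "my_dataset"
```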
To run on multiple GPUs, use
torchrun --standalone --nproc_per_node=N -m simple_stories_train.train ...
where N is the number of GPUs to use.
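As background, torchrun sets environment variables such as RANK, LOCAL_RANK, and WORLD_SIZE for each spawned process, and a DDP training script typically reads them along the lines below. This is a minimal sketch, not the repository's actual setup code.

```python
import os

import torch
import torch.distributed as dist

# torchrun sets these environment variables for every spawned process.
rank = int(os.environ.get("RANK", 0))
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

if world_size > 1:
    # NCCL is the usual backend for multi-GPU training.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

# model = MyModel().to(local_rank)  # hypothetical model
# model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```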
To submit training jobs to a SLURM cluster:
# Submit to SLURM (8 GPUs by default)
sst-train --config_path simple_stories_train/configs/your_config.yaml
# Custom GPU count and partition
sst-train --config_path ... --n_gpus 4 --partition h200-dev --time 24:00:00
# Run locally instead of submitting to SLURM
sst-train --config_path ... --local

Options:
- `--config_path`: Path to training config YAML (required)
- `--n_gpus`: Number of GPUs (default: 8 for SLURM, 1 for local)
- `--partition`: SLURM partition name (default: 'h200-reserved-default')
- `--time`: Job time limit in HH:MM:SS (default: '72:00:00')
- `--job_name`: Custom job name
- `--local`: Run locally instead of submitting to SLURM
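Conceptually, a launcher like this can be thought of as writing a batch script and handing it to sbatch. The sketch below is a simplified illustration under that assumption (the sbatch directives, especially the GPU request, are placeholders), not the actual sst-train implementation.

```python
import subprocess
from pathlib import Path

def submit(config_path: str, n_gpus: int = 8, partition: str = "h200-reserved-default",
           time: str = "72:00:00", job_name: str = "sst-train") -> None:
    """Write a minimal sbatch script and submit it (illustrative only)."""
    script = f"""#!/bin/bash
#SBATCH --job-name={job_name}
#SBATCH --partition={partition}
#SBATCH --gres=gpu:{n_gpus}
#SBATCH --time={time}

torchrun --standalone --nproc_per_node={n_gpus} -m simple_stories_train.train {config_path}
"""
    path = Path("submit_train.sh")
    path.write_text(script)
    subprocess.run(["sbatch", str(path)], check=True)
```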
To track training with Weights & Biases, you can set the WANDB_PROJECT and WANDB_API_KEY variables in
.env. API keys can be obtained from your Weights & Biases account settings.
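For example, with the python-dotenv and wandb packages, the variables can be loaded and used roughly as follows; this is an illustrative sketch, and the training script may wire this up differently.

```python
import os

import wandb
from dotenv import load_dotenv

# Load WANDB_PROJECT and WANDB_API_KEY from the .env file at the repository root.
load_dotenv()

wandb.login(key=os.environ["WANDB_API_KEY"])
run = wandb.init(project=os.environ["WANDB_PROJECT"], config={"lr": 3e-4})  # config is illustrative
run.log({"loss": 0.0})
run.finish()
```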
- The training script is based on the efficient `train_gpt2.py` in llm.c (MIT licensed, (c) 2024 Andrej Karpathy)
- Some model architecture implementations are based on TransformerLens (MIT licensed, (c) 2022 TransformerLensOrg)