simple_stories_train

Training framework for small language models using SimpleStories, a large-scale synthetic dataset of over 2 million short stories in simple language.

Paper: Parameterized Synthetic Text Generation with SimpleStories
Models & Dataset: 🤗 SimpleStories on Hugging Face

Note: This implementation removes the morphological analysis functionality described in the paper (page 5), where common English affixes (prefixes like "un", "re" and suffixes like "ed", "ing", "ly") were included as part of the tokenizer's initial alphabet. Empirical testing showed the WordPiece trainer naturally discovers these morphemes during training, making explicit seeding redundant.
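To see this effect in isolation, here is a minimal sketch using the Hugging Face tokenizers library (an assumption for illustration; it may not match this repo's actual tokenizer code) that trains a WordPiece vocabulary from raw text with no affix seeding:

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Train a small WordPiece vocab from scratch; no affixes are seeded.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(vocab_size=300, special_tokens=["[UNK]"])

corpus = [
    "the dog jumped and played happily",
    "she was jumping and playing and laughing",
    "they quickly replayed the happy story",
]
tokenizer.train_from_iterator(corpus, trainer)

# With a realistic corpus, continuation pieces such as "##ing" and "##ed"
# typically show up in the learned vocabulary on their own.
print(tokenizer.encode("replaying").tokens)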

Installation

From the root of the repository, run one of

make install-dev  # To install the package, dev requirements and pre-commit hooks
make install  # To just install the package (runs `pip install -e .`)

Development

Suggested extensions and settings for VSCode are provided in .vscode/. To use the suggested settings, copy .vscode/settings-example.json to .vscode/settings.json.

The following make commands may be helpful:

make check  # Run pre-commit on all files (i.e. pyright, ruff linter, and ruff formatter)
make type  # Run pyright on all files
make format  # Run ruff linter and formatter on all files
make test  # Run tests that aren't marked `slow`
make test-all  # Run all tests

Usage

Training a model

python -m simple_stories_train.train [PATH/TO/CONFIG.yaml] [--key1 value1 --key2 value2 ...]

where

  • PATH/TO/CONFIG.yaml contains the training config. If no path is provided, a default config will be used.
  • --key1 value1 --key2 value2 ... override values in the config. Note that if you wish to update a nested value, you must use dotted notation (e.g. --train_dataset_config.name my_dataset).
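To illustrate the dotted notation, here is a minimal sketch of how such an override could resolve against a nested config. This is a hypothetical helper for illustration, not the repo's actual argument parsing:

def apply_override(config: dict, dotted_key: str, value) -> None:
    # Walk to the parent of the final key, then assign, so that
    # "train_dataset_config.name" updates config["train_dataset_config"]["name"].
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node[key]
    node[leaf] = value

config = {"lr": 1e-3, "train_dataset_config": {"name": "default"}}
apply_override(config, "train_dataset_config.name", "my_dataset")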

If running on CPU, you may need to set --compile=False.

To run on multiple GPUs, use

torchrun --standalone --nproc_per_node=N -m simple_stories_train.train ...

where N is the number of GPUs to use.

SLURM Cluster Submission

To submit training jobs to a SLURM cluster:

# Submit to SLURM (8 GPUs by default)
sst-train --config_path simple_stories_train/configs/your_config.yaml

# Custom GPU count and partition
sst-train --config_path ... --n_gpus 4 --partition h200-dev --time 24:00:00

# Run locally instead of submitting to SLURM
sst-train --config_path ... --local

Options:

  • --config_path: Path to training config YAML (required)
  • --n_gpus: Number of GPUs (default: 8 for SLURM, 1 for local)
  • --partition: SLURM partition name (default: 'h200-reserved-default')
  • --time: Job time limit in HH:MM:SS (default: '72:00:00')
  • --job_name: Custom job name
  • --local: Run locally instead of submitting to SLURM

Logging with Weights & Biases

To track training with Weights & Biases, set the WANDB_PROJECT and WANDB_API_KEY environment variables in a .env file. API keys can be obtained from your Weights & Biases account settings.
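For reference, here is a minimal sketch of the equivalent setup in Python, assuming the python-dotenv package; the training script itself may load .env differently:

import os

import wandb
from dotenv import load_dotenv  # pip install python-dotenv

# Copy WANDB_PROJECT and WANDB_API_KEY from .env into the process environment.
load_dotenv()
wandb.login(key=os.environ["WANDB_API_KEY"])
run = wandb.init(project=os.environ["WANDB_PROJECT"])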

Acknowledgments

  • The training script is based on the efficient train_gpt2.py in llm.c (MIT license, (c) 2024 Andrej Karpathy)
  • Some model architecture implementations are based on TransformerLens (MIT license, (c) 2022 TransformerLensOrg)
