Onboarding tutorial and related improvements #205

Merged (35 commits, Jun 27, 2023)

Changes from 34 commits

Commits
- `c938e0d` add dict mapping to prompt response (samhavens, May 23, 2023)
- `ca5f7cf` comments and safety check (samhavens, May 23, 2023)
- `d3e4006` add example of fting instruct to summarize (samhavens, May 23, 2023)
- `c3f2b36` gpt2 llama (samhavens, May 23, 2023)
- `7d3328b` check for dictconfig (samhavens, May 23, 2023)
- `47203d6` auto microbatch in example (samhavens, May 23, 2023)
- `dbb6ddf` Check in tutorial draft (alextrott16, May 23, 2023)
- `4925ad5` Add Workflow 1 (alextrott16, May 23, 2023)
- `28eeb08` Rename tutorial.md to TUTORIAL.md (alextrott16, May 23, 2023)
- `b7032c3` Add mpt-7b yaml to finetune_example (alextrott16, May 23, 2023)
- `58a8f74` Add Workflow 4 (alextrott16, May 23, 2023)
- `56aa05b` Minor edits (alextrott16, May 23, 2023)
- `3e3017f` Add Workflow 3 (alextrott16, May 23, 2023)
- `4cd0433` fix domain fine tune example (samhavens, May 24, 2023)
- `25cdf49` add a bit more about sft (samhavens, May 24, 2023)
- `1b4d32f` Add top-level section text (alextrott16, May 24, 2023)
- `1d43cf3` Merge branch 'tutorial' of https://github.com/mosaicml/llm-foundry in… (alextrott16, May 24, 2023)
- `9cbac2d` Minor edits (alextrott16, May 24, 2023)
- `e0240e4` YAMLs section (alextrott16, May 24, 2023)
- `5b4c395` Merge branch 'main' into tutorial (alextrott16, May 24, 2023)
- `7fb7c7b` Quickstart reorder (alextrott16, May 24, 2023)
- `c8e8274` call out tutorial in main readme (alextrott16, May 24, 2023)
- `2080c61` clarify install (alextrott16, May 24, 2023)
- `65601ab` Fix paths with scripts/yamls to scripts/train/yamls (sashaDoubov, May 26, 2023)
- `238e9dd` Need to update max_seq_len for model if changed during finetuning (sashaDoubov, May 26, 2023)
- `ce9150e` add attn_impl explination (vchiley, Jun 1, 2023)
- `1c0746f` note sm89+ llvm warning (vchiley, Jun 1, 2023)
- `bfbc478` Update TUTORIAL.md (vchiley, Jun 1, 2023)
- `08a127f` Merge branch 'main' into tutorial (alextrott16, Jun 6, 2023)
- `f0e0c3d` Tutorial edits (alextrott16, Jun 6, 2023)
- `00785b5` Update TUTORIAL.md (jacobfulano, Jun 8, 2023)
- `dc96a82` updt tutorial with autocast (vchiley, Jun 15, 2023)
- `01ad300` Merge branch 'main' into tutorial (alextrott16, Jun 26, 2023)
- `44fc1f1` check in tutorial (alextrott16, Jun 26, 2023)
- `e8f0ee2` Minor fixes (alextrott16, Jun 26, 2023)
40 changes: 31 additions & 9 deletions README.md
@@ -38,6 +38,7 @@ You'll find in this repo:
* `inference/benchmarking` - profile inference latency and throughput
* `eval/` - evaluate LLMs on academic (or custom) in-context-learning tasks
* `mcli/` - launch any of these workloads using [MCLI](https://docs.mosaicml.com/projects/mcli/en/latest/) and the [MosaicML platform](https://www.mosaicml.com/platform)
* `TUTORIAL.md` - a deeper dive into the repo, example workflows, and FAQs

# MPT

@@ -119,15 +120,29 @@ You can select a specific commit hash such as `mosaicml/llm-foundry:1.13.1_cu117

This assumes you already have PyTorch and CMake installed.

To get started, clone this repo and install the requirements:
To get started, clone the repo and set up your environment. Instructions to do so differ slightly depending on whether you're using Docker.
### With Docker (recommended)

We *strongly* recommend working with LLM Foundry inside a Docker container (see our recommended Docker image above). If you are doing so, follow these steps to clone the repo and install the requirements.

<!--pytest.mark.skip-->
```bash
git clone https://github.com/mosaicml/llm-foundry.git
cd llm-foundry
pip install -e ".[gpu]" # or pip install -e . if no NVIDIA GPU
```
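After installing, a quick sanity check can confirm that PyTorch sees your GPU before you move on. This snippet is illustrative only and is not part of the repo's README; it just uses standard PyTorch calls:

<!--pytest.mark.skip-->
```python
# Minimal environment sanity check (illustrative, not part of llm-foundry).
import torch

print(f'PyTorch version: {torch.__version__}')
print(f'CUDA available:  {torch.cuda.is_available()}')
if torch.cuda.is_available():
    # The ".[gpu]" extras assume an NVIDIA GPU with a working CUDA setup.
    print(f'Device:          {torch.cuda.get_device_name(0)}')
```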

### Without Docker (not recommended)

If you choose not to use Docker, you should create and use a virtual environment.

<!--pytest.mark.skip-->
```bash
git clone https://github.com/mosaicml/llm-foundry.git
cd llm-foundry

# Optional: we highly recommend creating and using a virtual environment
python -m venv llmfoundry-venv
# Create and activate a virtual environment
python3 -m venv llmfoundry-venv
source llmfoundry-venv/bin/activate

pip install -e ".[gpu]" # or pip install -e . if no NVIDIA GPU
@@ -136,15 +151,12 @@ pip install -e ".[gpu]"  # or pip install -e . if no NVIDIA GPU

# Quickstart

> **Note**
> Make sure to go through the installation steps above before trying the quickstart!

Here is an end-to-end workflow for preparing a subset of the C4 dataset, training an MPT-125M model for 10 batches,
converting the model to HuggingFace format, evaluating the model on the Winograd challenge, and generating responses to prompts.

If you have a write-enabled [HuggingFace auth token](https://huggingface.co/docs/hub/security-tokens), you can optionally upload your model to the Hub! Just export your token like this:
```bash
export HUGGING_FACE_HUB_TOKEN=your-auth-token
```
and uncomment the line containing `--hf_repo_for_upload ...`.

**(Remember this is a quickstart just to demonstrate the tools -- To get good quality, the LLM must be trained for longer than 10 batches 😄)**

<!--pytest.mark.skip-->
@@ -191,6 +203,16 @@ python inference/hf_generate.py \

Note: the `composer` command used above to train the model refers to the [Composer](https://github.com/mosaicml/composer) library's distributed launcher.

If you have a write-enabled [HuggingFace auth token](https://huggingface.co/docs/hub/security-tokens), you can optionally upload your model to the Hub! Just export your token like this:
```bash
export HUGGING_FACE_HUB_TOKEN=your-auth-token
```
and uncomment the line containing `--hf_repo_for_upload ...` in the above call to `inference/convert_composer_to_hf.py`.
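As a quick, optional check before uploading, you can confirm the exported token is valid with `huggingface_hub` (which should already be installed alongside `transformers`). This is an illustrative snippet, not part of the repo:

<!--pytest.mark.skip-->
```python
# Optional sanity check (illustrative): confirm the exported HF token works.
import os

from huggingface_hub import HfApi

token = os.environ.get('HUGGING_FACE_HUB_TOKEN')
if token is None:
    print('HUGGING_FACE_HUB_TOKEN is not set; skipping upload.')
else:
    # whoami() raises an error if the token is invalid or expired.
    user = HfApi().whoami(token=token)
    print(f"Token is valid for user: {user['name']}")
```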

# Learn more about LLM Foundry!

Check out [TUTORIAL.md](https://github.com/mosaicml/llm-foundry/blob/main/TUTORIAL.md) to keep learning about working with LLM Foundry. The tutorial highlights example workflows, points you to other resources throughout the repo, and answers frequently asked questions!

# Contact Us
If you run into any problems with the code, please file GitHub issues directly in this repo.

344 changes: 344 additions & 0 deletions TUTORIAL.md

Large diffs are not rendered by default.

44 changes: 42 additions & 2 deletions llmfoundry/data/finetuning/tasks.py
@@ -161,6 +161,40 @@ def print_registered_tasks(self):
        tasks = sorted(self._task_preprocessing_registry.keys())
        print('\n'.join(tasks))

    def get_preprocessing_fn_from_dict(self, mapping: dict):
        """Get a preprocessing function from a dictionary.

        The dictionary maps the keys "prompt" and "response" to column names
        in the dataset. For example,
        ```yaml
        preprocessing_fn:
            prompt: text
            response: summary
        ```
        would treat the `text` column as the prompt and the `summary` column
        as the response.

        Args:
            mapping (dict): A dictionary mapping "prompt" and "response" to
                column names in the dataset.

        Returns:
            Callable: The preprocessing function.

        Raises:
            ValueError: If the mapping does not have keys "prompt" and "response".
        """

        def _preprocessor(example: Dict[str, Any]) -> Dict[str, str]:
            if list(mapping.keys()) != ['prompt', 'response']:
                raise ValueError(
                    f'Expected {mapping=} to have keys "prompt" and "response".'
                )
            return {
                'prompt': example[mapping['prompt']],
                'response': example[mapping['response']]
            }

        return _preprocessor

    def get_preprocessing_fn_from_str(self,
                                      preprocessor: Optional[str],
                                      dataset_name: Optional[str] = None,
@@ -233,8 +267,14 @@ def build_from_hf(self, cfg: DictConfig, tokenizer: Tokenizer):
        dataset_name = cfg.hf_name
        split = cfg.split
        kwargs = cfg.get('hf_kwargs', {})
        preprocessing_fn = self.get_preprocessing_fn_from_str(
            cfg.get('preprocessing_fn'), dataset_name, verbose=True)
        proto_preprocessing_fn = cfg.get('preprocessing_fn')
        if isinstance(proto_preprocessing_fn, dict) or isinstance(
                proto_preprocessing_fn, DictConfig):
            preprocessing_fn = self.get_preprocessing_fn_from_dict(
                proto_preprocessing_fn)
        else:
            preprocessing_fn = self.get_preprocessing_fn_from_str(
                proto_preprocessing_fn, dataset_name, verbose=True)

        dataset = hf_datasets.load_dataset(dataset_name, split=split, **kwargs)

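To make the new dict-based option concrete, here is an illustrative sketch of how a `preprocessing_fn` mapping behaves once it is turned into a callable. The column names below are placeholders, not names from the repo:

```python
# Illustrative sketch (not repo code): what a dict-based preprocessing_fn does.
# The mapping's keys are fixed ("prompt", "response"); its values name dataset columns.
mapping = {'prompt': 'text', 'response': 'summary'}

def preprocessor(example: dict) -> dict:
    # Equivalent to the _preprocessor returned by get_preprocessing_fn_from_dict.
    return {
        'prompt': example[mapping['prompt']],
        'response': example[mapping['response']],
    }

raw_example = {'text': 'Summarize: The cat sat on the mat.', 'summary': 'A cat sat down.'}
print(preprocessor(raw_example))
# {'prompt': 'Summarize: The cat sat on the mat.', 'response': 'A cat sat down.'}
```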
7 changes: 4 additions & 3 deletions scripts/train/finetune_example/README.md
@@ -13,14 +13,15 @@ The contents of this directory provide a concrete example of finetuning an LLM o
Here, we have a minimal example that includes all the necessary pieces:
- `train.jsonl`: Our local dataset. (It is actually just a snippet of the ARC Easy ICL evaluation set, so it's not something we'd want to train on for real.)
- `preprocessing.py`: A Python file that defines the "preprocessing function" we will use to format our dataset into the required "prompt"/"response" structure. (A hypothetical sketch of such a function appears after this list.)
- `gpt2-arc-easy.yaml`: The configuration YAML for finetuning a pretrained gpt2 model on our local ARC Easy snippet with our custom preprocessing function.
- `gpt2-arc-easy--cpu.yaml`: The configuration YAML for finetuning a pretrained gpt2 model on our local ARC Easy snippet with our custom preprocessing function. You can run this on CPU.
- `mpt-7b-arc-easy--gpu.yaml`: The configuration YAML for finetuning MPT-7B on our local ARC Easy snippet with our custom preprocessing function. This requires GPU(s).
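For orientation, here is a minimal sketch of what a preprocessing function like the one in `preprocessing.py` could look like. It is hypothetical: the actual function name, prompt template, and field names used in the repo's ARC Easy snippet may differ, so consult `finetune_example/preprocessing.py` for the real implementation.

```python
# Hypothetical sketch of a "preprocessing function" for multiple-choice data.
# The field names (question, choices, answerKey) and the prompt template are
# assumptions for illustration only.
def multiple_choice(inp: dict) -> dict:
    choices = inp['choices']
    options = '\n'.join(
        f'{label}. {text}'
        for label, text in zip(choices['label'], choices['text']))
    prompt = f"{inp['question']}\nOptions:\n{options}\nAnswer: "
    # The response is the text of the correct choice.
    answer_idx = choices['label'].index(inp['answerKey'])
    return {'prompt': prompt, 'response': choices['text'][answer_idx]}
```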

## Quick start

<!--pytest.mark.skip-->
```bash
cd llm-foundry/scripts/train
composer train.py finetune_example/gpt2-arc-easy.yaml
composer train.py finetune_example/gpt2-arc-easy--cpu.yaml
```
That's it :)

@@ -98,7 +99,7 @@ Now we have a local dataset and a preprocessing function that will map examples

## Our training YAML

This is already taken care of in `gpt2-arc-easy.yaml`, specifically in the `train_loader` section that controls how the train dataloader is built in the `train.py` script. (If you also have an accompanying validation dataset, you'd make similar changes in the `eval_loader` section). Let's take a closer look.
This is already taken care of in the example YAMLs, specifically in the `train_loader` section that controls how the train dataloader is built in the `train.py` script. (If you also have an accompanying validation dataset, you'd make similar changes in the `eval_loader` section). Let's take a closer look.

<!--pytest.mark.skip-->
```yaml
116 changes: 116 additions & 0 deletions scripts/train/finetune_example/mpt-7b-arc-easy--gpu.yaml
@@ -0,0 +1,116 @@
max_seq_len: 2048
global_seed: 17

# Run Name
run_name:  # If left blank, will be read from env var $COMPOSER_RUN_NAME

# Model
model:
  name: hf_causal_lm
  pretrained_model_name_or_path: mosaicml/mpt-7b
  pretrained: true  # false: only use the architecture; true: initialize with pretrained weights
  config_overrides:
    attn_config:
      attn_impl: triton
      # Set this to `true` if using `train_loader.dataset.packing_ratio` below
      attn_uses_sequence_id: false

# Tokenizer
tokenizer:
  name: mosaicml/mpt-7b
  kwargs:
    model_max_length: ${max_seq_len}

# Dataloaders
train_loader:
  name: finetuning
  dataset:
    ############
    hf_name: json
    hf_kwargs:
      # Note: absolute paths for data_dir are more reliable;
      # relative paths will be interpreted relative to whatever your
      # working directory is when you run `train.py`
      data_dir: finetune_example
    # Note: `scripts/train` will be the working directory when resolving
    # the preprocessing_fn import path
    preprocessing_fn: finetune_example.preprocessing:multiple_choice
    split: train
    ############
    shuffle: true
    max_seq_len: ${max_seq_len}
    decoder_only_format: true
    # # Use `python llmfoundry/data/packing.py --yaml-path /path/to/this/yaml/ ...`
    # # to profile this run's optimal packing_ratio as it depends on GPU count,
    # # batch size, sequence length
    # packing_ratio:
  drop_last: true
  num_workers: 8

# Optimization
scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba
  alpha_f: 0.1

optimizer:
  name: decoupled_adamw
  lr: 6.0e-4
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0

algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0

max_duration: 1ep
eval_interval: 1
eval_first: false
eval_subset_num_batches: -1
global_train_batch_size: 8

# System
seed: ${global_seed}
device_eval_batch_size: 1
device_train_microbatch_size: 1
# device_train_microbatch_size: auto
precision: amp_bf16

# FSDP
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  verbose: false

# Logging
progress_bar: false
log_to_console: true
console_log_interval: 1ba

callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}

# loggers:
#   wandb: {}

# Checkpoint to local filesystem or remote object store
# save_interval: 500ba
# save_num_checkpoints_to_keep: 1  # Important, this cleans up checkpoints saved to DISK
# save_folder: ./{run_name}/checkpoints
# save_folder: s3://my-bucket/my-folder/{run_name}/checkpoints

# Load from local filesystem or remote object store
# load_path: ./gpt-125m/checkpoints/latest-rank{rank}.pt
# load_path: s3://my-bucket/my-folder/gpt-125m/checkpoints/latest-rank{rank}.pt
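Because every key in these YAMLs, including interpolations like `${max_seq_len}`, is resolved by the training script before use, it can be handy to load and inspect a config yourself. A minimal sketch, assuming `omegaconf` is installed (it is used by the train script) and that the path below matches your checkout:

```python
# Illustrative sketch: load a training YAML and resolve its interpolations.
from omegaconf import OmegaConf

# Path is an assumption; adjust to where you cloned the repo.
cfg = OmegaConf.load('scripts/train/finetune_example/mpt-7b-arc-easy--gpu.yaml')
OmegaConf.resolve(cfg)  # turns ${max_seq_len} and ${global_seed} into concrete values

print(cfg.model.pretrained_model_name_or_path)  # mosaicml/mpt-7b
print(cfg.train_loader.dataset.max_seq_len)     # 2048, via ${max_seq_len}
print(OmegaConf.to_yaml(cfg.train_loader))      # dump the resolved dataloader section
```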
118 changes: 118 additions & 0 deletions scripts/train/yamls/finetune/mpt-7b_domain_adapt.yaml
@@ -0,0 +1,118 @@
data_local: ./my-adaptation-data
data_remote:  # If blank, files must be present in data_local
max_seq_len: 4096
global_seed: 17

# Run Name
run_name:  # If left blank, will be read from env var $COMPOSER_RUN_NAME

# Model
model:
  name: hf_causal_lm
  pretrained: true
  pretrained_model_name_or_path: mosaicml/mpt-7b
  config_overrides:
    max_seq_len: ${max_seq_len}
    attn_config:
      attn_impl: triton
      attn_uses_sequence_id: false

# Tokenizer
tokenizer:
  name: mosaicml/mpt-7b
  kwargs:
    model_max_length: ${max_seq_len}


# Dataloaders
train_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: train_small
    shuffle: true
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
  drop_last: true
  num_workers: 8

eval_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: val_small
    shuffle: false
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
  drop_last: false
  num_workers: 8

# Optimization
scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba
  alpha_f: 0.1

optimizer:
  name: decoupled_adamw
  lr: 5.0e-5
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0

algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0

max_duration: 3195ba  # ~ 6.7B tokens
eval_interval: 500ba
eval_first: false
eval_subset_num_batches: -1
global_train_batch_size: 1024

# System
seed: ${global_seed}
device_eval_batch_size: 8
device_train_microbatch_size: 8
# device_train_microbatch_size: auto
precision: amp_bf16

# FSDP
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  verbose: false

# Logging
progress_bar: false
log_to_console: true
console_log_interval: 1ba

callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}

# loggers:
#   wandb: {}

# Checkpoint to local filesystem or remote object store
save_interval: 1000ba
save_num_checkpoints_to_keep: 1  # Important, this cleans up checkpoints saved to DISK
save_folder: ./{run_name}/checkpoints
# save_folder: s3://my-bucket/my-folder/{run_name}/checkpoints

# Load from local filesystem or remote object store
# load_path: ./gpt-7b/checkpoints/latest-rank{rank}.pt
# load_path: s3://my-bucket/my-folder/gpt-7b/checkpoints/latest-rank{rank}.pt
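The `max_duration: 3195ba` line carries an approximate token budget in its comment. A rough back-of-the-envelope check, as a sketch only: total tokens scale as batches times global batch size times average tokens per sample, and the "~ 6.7B tokens" note corresponds to roughly 2048 tokens per sample; the exact count in practice depends on how full each 4096-token sequence is.

```python
# Back-of-the-envelope token budget for this config (illustrative only).
batches = 3195              # max_duration: 3195ba
global_batch_size = 1024    # global_train_batch_size
tokens_per_sample = 2048    # assumption; the "~ 6.7B tokens" note implies ~2048 tokens/sample

approx_tokens = batches * global_batch_size * tokens_per_sample
print(f'{approx_tokens / 1e9:.1f}B tokens')  # ~6.7B

# If every sample were packed to the full max_seq_len of 4096,
# the ceiling would instead be about 13.4B tokens.
print(f'{batches * global_batch_size * 4096 / 1e9:.1f}B tokens')
```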