Onboarding tutorial and related improvements #205

Merged (35 commits, Jun 27, 2023)

Changes from 34 commits

Commits
- `c938e0d` add dict mapping to prompt response (samhavens, May 23, 2023)
- `ca5f7cf` comments and safety check (samhavens, May 23, 2023)
- `d3e4006` add example of fting instruct to summarize (samhavens, May 23, 2023)
- `c3f2b36` gpt2 llama (samhavens, May 23, 2023)
- `7d3328b` check for dictconfig (samhavens, May 23, 2023)
- `47203d6` auto microbatch in example (samhavens, May 23, 2023)
- `dbb6ddf` Check in tutorial draft (alextrott16, May 23, 2023)
- `4925ad5` Add Workflow 1 (alextrott16, May 23, 2023)
- `28eeb08` Rename tutorial.md to TUTORIAL.md (alextrott16, May 23, 2023)
- `b7032c3` Add mpt-7b yaml to finetune_example (alextrott16, May 23, 2023)
- `58a8f74` Add Workflow 4 (alextrott16, May 23, 2023)
- `56aa05b` Minor edits (alextrott16, May 23, 2023)
- `3e3017f` Add Workflow 3 (alextrott16, May 23, 2023)
- `4cd0433` fix domain fine tune example (samhavens, May 24, 2023)
- `25cdf49` add a bit more about sft (samhavens, May 24, 2023)
- `1b4d32f` Add top-level section text (alextrott16, May 24, 2023)
- `1d43cf3` Merge branch 'tutorial' of https://github.com/mosaicml/llm-foundry in… (alextrott16, May 24, 2023)
- `9cbac2d` Minor edits (alextrott16, May 24, 2023)
- `e0240e4` YAMLs section (alextrott16, May 24, 2023)
- `5b4c395` Merge branch 'main' into tutorial (alextrott16, May 24, 2023)
- `7fb7c7b` Quickstart reorder (alextrott16, May 24, 2023)
- `c8e8274` call out tutorial in main readme (alextrott16, May 24, 2023)
- `2080c61` clarify install (alextrott16, May 24, 2023)
- `65601ab` Fix paths with scripts/yamls to scripts/train/yamls (sashaDoubov, May 26, 2023)
- `238e9dd` Need to update max_seq_len for model if changed during finetuning (sashaDoubov, May 26, 2023)
- `ce9150e` add attn_impl explination (vchiley, Jun 1, 2023)
- `1c0746f` note sm89+ llvm warning (vchiley, Jun 1, 2023)
- `bfbc478` Update TUTORIAL.md (vchiley, Jun 1, 2023)
- `08a127f` Merge branch 'main' into tutorial (alextrott16, Jun 6, 2023)
- `f0e0c3d` Tutorial edits (alextrott16, Jun 6, 2023)
- `00785b5` Update TUTORIAL.md (jacobfulano, Jun 8, 2023)
- `dc96a82` updt tutorial with autocast (vchiley, Jun 15, 2023)
- `01ad300` Merge branch 'main' into tutorial (alextrott16, Jun 26, 2023)
- `44fc1f1` check in tutorial (alextrott16, Jun 26, 2023)
- `e8f0ee2` Minor fixes (alextrott16, Jun 26, 2023)
40 changes: 31 additions & 9 deletions README.md
@@ -38,6 +38,7 @@ You'll find in this repo:
* `inference/benchmarking` - profile inference latency and throughput
* `eval/` - evaluate LLMs on academic (or custom) in-context-learning tasks
* `mcli/` - launch any of these workloads using [MCLI](https://docs.mosaicml.com/projects/mcli/en/latest/) and the [MosaicML platform](https://www.mosaicml.com/platform)
* `TUTORIAL.md` - a deeper dive into the repo, example workflows, and FAQs

# MPT

@@ -119,15 +120,29 @@ You can select a specific commit hash such as `mosaicml/llm-foundry:1.13.1_cu117

This assumes you already have PyTorch and CMake installed.

To get started, clone this repo and install the requirements:
To get started, clone the repo and set up your environment. Instructions to do so differ slightly depending on whether you're using Docker.
### With Docker (recommended)

We *strongly* recommend working with LLM Foundry inside a Docker container (see our recommended Docker image above). If you are doing so, follow these steps to clone the repo and install the requirements.

<!--pytest.mark.skip-->
```bash
git clone https://github.com/mosaicml/llm-foundry.git
cd llm-foundry
pip install -e ".[gpu]" # or pip install -e . if no NVIDIA GPU
```
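After installing, a quick sanity check can confirm that PyTorch sees your GPU before you move on. This snippet is illustrative only and is not part of the repo's README; it just uses standard PyTorch calls:

<!--pytest.mark.skip-->
```python
# Minimal environment sanity check (illustrative, not part of llm-foundry).
import torch

print(f'PyTorch version: {torch.__version__}')
print(f'CUDA available:  {torch.cuda.is_available()}')
if torch.cuda.is_available():
    # The ".[gpu]" extras assume an NVIDIA GPU with a working CUDA setup.
    print(f'Device:          {torch.cuda.get_device_name(0)}')
```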

### Without Docker (not recommended)

If you choose not to use Docker, you should create and use a virtual environment.

<!--pytest.mark.skip-->
```bash
git clone https://github.com/mosaicml/llm-foundry.git
cd llm-foundry

# Optional: we highly recommend creating and using a virtual environment
python -m venv llmfoundry-venv
# Create and activate a virtual environment
python3 -m venv llmfoundry-venv
source llmfoundry-venv/bin/activate

pip install -e ".[gpu]" # or pip install -e . if no NVIDIA GPU
@@ -136,15 +151,12 @@ pip install -e ".[gpu]"  # or pip install -e . if no NVIDIA GPU

# Quickstart

> **Note**
> Make sure to go through the installation steps above before trying the quickstart!

Here is an end-to-end workflow for preparing a subset of the C4 dataset, training an MPT-125M model for 10 batches,
converting the model to HuggingFace format, evaluating the model on the Winograd challenge, and generating responses to prompts.

If you have a write-enabled [HuggingFace auth token](https://huggingface.co/docs/hub/security-tokens), you can optionally upload your model to the Hub! Just export your token like this:
```bash
export HUGGING_FACE_HUB_TOKEN=your-auth-token
```
and uncomment the line containing `--hf_repo_for_upload ...`.

**(Remember this is a quickstart just to demonstrate the tools -- To get good quality, the LLM must be trained for longer than 10 batches 😄)**

<!--pytest.mark.skip-->
@@ -191,6 +203,16 @@ python inference/hf_generate.py \

Note: the `composer` command used above to train the model refers to the [Composer](https://github.com/mosaicml/composer) library's distributed launcher.

If you have a write-enabled [HuggingFace auth token](https://huggingface.co/docs/hub/security-tokens), you can optionally upload your model to the Hub! Just export your token like this:
```bash
export HUGGING_FACE_HUB_TOKEN=your-auth-token
```
and uncomment the line containing `--hf_repo_for_upload ...` in the above call to `inference/convert_composer_to_hf.py`.
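As a quick, optional check before uploading, you can confirm the exported token is valid with `huggingface_hub` (which should already be installed alongside `transformers`). This is an illustrative snippet, not part of the repo:

<!--pytest.mark.skip-->
```python
# Optional sanity check (illustrative): confirm the exported HF token works.
import os

from huggingface_hub import HfApi

token = os.environ.get('HUGGING_FACE_HUB_TOKEN')
if token is None:
    print('HUGGING_FACE_HUB_TOKEN is not set; skipping upload.')
else:
    # whoami() raises an error if the token is invalid or expired.
    user = HfApi().whoami(token=token)
    print(f"Token is valid for user: {user['name']}")
```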

# Learn more about LLM Foundry!

Check out [TUTORIAL.md](https://github.com/mosaicml/llm-foundry/blob/main/TUTORIAL.md) to keep learning about working with LLM Foundry. The tutorial highlights example workflows, points you to other resources throughout the repo, and answers frequently asked questions!

# Contact Us
If you run into any problems with the code, please file GitHub issues directly in this repo.

344 changes: 344 additions & 0 deletions TUTORIAL.md

Large diffs are not rendered by default.

44 changes: 42 additions & 2 deletions llmfoundry/data/finetuning/tasks.py
@@ -161,6 +161,40 @@ def print_registered_tasks(self):
        tasks = sorted(self._task_preprocessing_registry.keys())
        print('\n'.join(tasks))

    def get_preprocessing_fn_from_dict(self, mapping: dict):
        """Get a preprocessing function from a dictionary.

        The dictionary maps the keys "prompt" and "response" to column names
        in the dataset. For example,
        ```yaml
        preprocessing_fn:
            prompt: text
            response: summary
        ```
        would treat the `text` column as the prompt and the `summary` column
        as the response.

        Args:
            mapping (dict): A dictionary mapping "prompt" and "response" to
                column names in the dataset.

        Returns:
            Callable: The preprocessing function.

        Raises:
            ValueError: If the mapping does not have keys "prompt" and "response".
        """

        def _preprocessor(example: Dict[str, Any]) -> Dict[str, str]:
            if list(mapping.keys()) != ['prompt', 'response']:
                raise ValueError(
                    f'Expected {mapping=} to have keys "prompt" and "response".'
                )
            return {
                'prompt': example[mapping['prompt']],
                'response': example[mapping['response']]
            }

        return _preprocessor

    def get_preprocessing_fn_from_str(self,
                                      preprocessor: Optional[str],
                                      dataset_name: Optional[str] = None,
@@ -233,8 +267,14 @@ def build_from_hf(self, cfg: DictConfig, tokenizer: Tokenizer):
        dataset_name = cfg.hf_name
        split = cfg.split
        kwargs = cfg.get('hf_kwargs', {})
        preprocessing_fn = self.get_preprocessing_fn_from_str(
            cfg.get('preprocessing_fn'), dataset_name, verbose=True)
        proto_preprocessing_fn = cfg.get('preprocessing_fn')
        if isinstance(proto_preprocessing_fn, dict) or isinstance(
                proto_preprocessing_fn, DictConfig):
            preprocessing_fn = self.get_preprocessing_fn_from_dict(
                proto_preprocessing_fn)
        else:
            preprocessing_fn = self.get_preprocessing_fn_from_str(
                proto_preprocessing_fn, dataset_name, verbose=True)

        dataset = hf_datasets.load_dataset(dataset_name, split=split, **kwargs)

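To make the new dict-based option concrete, here is an illustrative sketch of how a `preprocessing_fn` mapping behaves once it is turned into a callable. The column names below are placeholders, not names from the repo:

```python
# Illustrative sketch (not repo code): what a dict-based preprocessing_fn does.
# The mapping's keys are fixed ("prompt", "response"); its values name dataset columns.
mapping = {'prompt': 'text', 'response': 'summary'}

def preprocessor(example: dict) -> dict:
    # Equivalent to the _preprocessor returned by get_preprocessing_fn_from_dict.
    return {
        'prompt': example[mapping['prompt']],
        'response': example[mapping['response']],
    }

raw_example = {'text': 'Summarize: The cat sat on the mat.', 'summary': 'A cat sat down.'}
print(preprocessor(raw_example))
# {'prompt': 'Summarize: The cat sat on the mat.', 'response': 'A cat sat down.'}
```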
7 changes: 4 additions & 3 deletions scripts/train/finetune_example/README.md
@@ -13,14 +13,15 @@ The contents of this directory provide a concrete example of finetuning an LLM o
Here, we have a minimal example that includes all the necessary pieces:
- `train.jsonl`: Our local dataset. (It is actually just a snippet of the ARC Easy ICL evaluation set, so it's not something we'd want to train on for real.)
- `preprocessing.py`: A Python file that defines the "preprocessing function" we will use to format our dataset into the required "prompt"/"response" structure. (A hypothetical sketch of such a function appears after this list.)
- `gpt2-arc-easy.yaml`: The configuration YAML for finetuning a pretrained gpt2 model on our local ARC Easy snippet with our custom preprocessing function.
- `gpt2-arc-easy--cpu.yaml`: The configuration YAML for finetuning a pretrained gpt2 model on our local ARC Easy snippet with our custom preprocessing function. You can run this on CPU.
- `mpt-7b-arc-easy--gpu.yaml`: The configuration YAML for finetuning MPT-7B on our local ARC Easy snippet with our custom preprocessing function. This requires GPU(s).
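For orientation, here is a minimal sketch of what a preprocessing function like the one in `preprocessing.py` could look like. It is hypothetical: the actual function name, prompt template, and field names used in the repo's ARC Easy snippet may differ, so consult `finetune_example/preprocessing.py` for the real implementation.

```python
# Hypothetical sketch of a "preprocessing function" for multiple-choice data.
# The field names (question, choices, answerKey) and the prompt template are
# assumptions for illustration only.
def multiple_choice(inp: dict) -> dict:
    choices = inp['choices']
    options = '\n'.join(
        f'{label}. {text}'
        for label, text in zip(choices['label'], choices['text']))
    prompt = f"{inp['question']}\nOptions:\n{options}\nAnswer: "
    # The response is the text of the correct choice.
    answer_idx = choices['label'].index(inp['answerKey'])
    return {'prompt': prompt, 'response': choices['text'][answer_idx]}
```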

## Quick start

<!--pytest.mark.skip-->
```bash
cd llm-foundry/scripts/train
composer train.py finetune_example/gpt2-arc-easy.yaml
composer train.py finetune_example/gpt2-arc-easy--cpu.yaml
```
That's it :)

@@ -98,7 +99,7 @@ Now we have a local dataset and a preprocessing function that will map examples

## Our training YAML

This is already taken care of in `gpt2-arc-easy.yaml`, specifically in the `train_loader` section that controls how the train dataloader is built in the `train.py` script. (If you also have an accompanying validation dataset, you'd make similar changes in the `eval_loader` section). Let's take a closer look.
This is already taken care of in the example YAMLs, specifically in the `train_loader` section that controls how the train dataloader is built in the `train.py` script. (If you also have an accompanying validation dataset, you'd make similar changes in the `eval_loader` section). Let's take a closer look.

<!--pytest.mark.skip-->
```yaml
116 changes: 116 additions & 0 deletions scripts/train/finetune_example/mpt-7b-arc-easy--gpu.yaml
@@ -0,0 +1,116 @@
max_seq_len: 2048
global_seed: 17

# Run Name
run_name:  # If left blank, will be read from env var $COMPOSER_RUN_NAME

# Model
model:
  name: hf_causal_lm
  pretrained_model_name_or_path: mosaicml/mpt-7b
  pretrained: true  # false: only use the architecture; true: initialize with pretrained weights
  config_overrides:
    attn_config:
      attn_impl: triton
      # Set this to `true` if using `train_loader.dataset.packing_ratio` below
      attn_uses_sequence_id: false

# Tokenizer
tokenizer:
  name: mosaicml/mpt-7b
  kwargs:
    model_max_length: ${max_seq_len}

# Dataloaders
train_loader:
  name: finetuning
  dataset:
    ############
    hf_name: json
    hf_kwargs:
      # Note: absolute paths for data_dir are more reliable;
      # relative paths will be interpreted relative to whatever your
      # working directory is when you run `train.py`
      data_dir: finetune_example
    # Note: `scripts/train` will be the working directory when resolving
    # the preprocessing_fn import path
    preprocessing_fn: finetune_example.preprocessing:multiple_choice
    split: train
    ############
    shuffle: true
    max_seq_len: ${max_seq_len}
    decoder_only_format: true
    # # Use `python llmfoundry/data/packing.py --yaml-path /path/to/this/yaml/ ...`
    # # to profile this run's optimal packing_ratio as it depends on GPU count,
    # # batch size, sequence length
    # packing_ratio:
  drop_last: true
  num_workers: 8

# Optimization
scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba
  alpha_f: 0.1

optimizer:
  name: decoupled_adamw
  lr: 6.0e-4
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0

algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0

max_duration: 1ep
eval_interval: 1
eval_first: false
eval_subset_num_batches: -1
global_train_batch_size: 8

# System
seed: ${global_seed}
device_eval_batch_size: 1
device_train_microbatch_size: 1
# device_train_microbatch_size: auto
precision: amp_bf16

# FSDP
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  verbose: false

# Logging
progress_bar: false
log_to_console: true
console_log_interval: 1ba

callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}

# loggers:
#   wandb: {}

# Checkpoint to local filesystem or remote object store
# save_interval: 500ba
# save_num_checkpoints_to_keep: 1  # Important, this cleans up checkpoints saved to DISK
# save_folder: ./{run_name}/checkpoints
# save_folder: s3://my-bucket/my-folder/{run_name}/checkpoints

# Load from local filesystem or remote object store
# load_path: ./gpt-125m/checkpoints/latest-rank{rank}.pt
# load_path: s3://my-bucket/my-folder/gpt-125m/checkpoints/latest-rank{rank}.pt
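Because every key in these YAMLs, including interpolations like `${max_seq_len}`, is resolved by the training script before use, it can be handy to load and inspect a config yourself. A minimal sketch, assuming `omegaconf` is installed (it is used by the train script) and that the path below matches your checkout:

```python
# Illustrative sketch: load a training YAML and resolve its interpolations.
from omegaconf import OmegaConf

# Path is an assumption; adjust to where you cloned the repo.
cfg = OmegaConf.load('scripts/train/finetune_example/mpt-7b-arc-easy--gpu.yaml')
OmegaConf.resolve(cfg)  # turns ${max_seq_len} and ${global_seed} into concrete values

print(cfg.model.pretrained_model_name_or_path)  # mosaicml/mpt-7b
print(cfg.train_loader.dataset.max_seq_len)     # 2048, via ${max_seq_len}
print(OmegaConf.to_yaml(cfg.train_loader))      # dump the resolved dataloader section
```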
118 changes: 118 additions & 0 deletions scripts/train/yamls/finetune/mpt-7b_domain_adapt.yaml
@@ -0,0 +1,118 @@
data_local: ./my-adaptation-data
data_remote:  # If blank, files must be present in data_local
max_seq_len: 4096
global_seed: 17

# Run Name
run_name:  # If left blank, will be read from env var $COMPOSER_RUN_NAME

# Model
model:
  name: hf_causal_lm
  pretrained: true
  pretrained_model_name_or_path: mosaicml/mpt-7b
  config_overrides:
    max_seq_len: ${max_seq_len}
    attn_config:
      attn_impl: triton
      attn_uses_sequence_id: false

# Tokenizer
tokenizer:
  name: mosaicml/mpt-7b
  kwargs:
    model_max_length: ${max_seq_len}


# Dataloaders
train_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: train_small
    shuffle: true
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
  drop_last: true
  num_workers: 8

eval_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: val_small
    shuffle: false
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
  drop_last: false
  num_workers: 8

# Optimization
scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba
  alpha_f: 0.1

optimizer:
  name: decoupled_adamw
  lr: 5.0e-5
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0

algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0

max_duration: 3195ba  # ~ 6.7B tokens
eval_interval: 500ba
eval_first: false
eval_subset_num_batches: -1
global_train_batch_size: 1024

# System
seed: ${global_seed}
device_eval_batch_size: 8
device_train_microbatch_size: 8
# device_train_microbatch_size: auto
precision: amp_bf16

# FSDP
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  verbose: false

# Logging
progress_bar: false
log_to_console: true
console_log_interval: 1ba

callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}

# loggers:
#   wandb: {}

# Checkpoint to local filesystem or remote object store
save_interval: 1000ba
save_num_checkpoints_to_keep: 1  # Important, this cleans up checkpoints saved to DISK
save_folder: ./{run_name}/checkpoints
# save_folder: s3://my-bucket/my-folder/{run_name}/checkpoints

# Load from local filesystem or remote object store
# load_path: ./gpt-7b/checkpoints/latest-rank{rank}.pt
# load_path: s3://my-bucket/my-folder/gpt-7b/checkpoints/latest-rank{rank}.pt
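The `max_duration: 3195ba` line carries an approximate token budget in its comment. A rough back-of-the-envelope check, as a sketch only: total tokens scale as batches times global batch size times average tokens per sample, and the "~ 6.7B tokens" note corresponds to roughly 2048 tokens per sample; the exact count in practice depends on how full each 4096-token sequence is.

```python
# Back-of-the-envelope token budget for this config (illustrative only).
batches = 3195              # max_duration: 3195ba
global_batch_size = 1024    # global_train_batch_size
tokens_per_sample = 2048    # assumption; the "~ 6.7B tokens" note implies ~2048 tokens/sample

approx_tokens = batches * global_batch_size * tokens_per_sample
print(f'{approx_tokens / 1e9:.1f}B tokens')  # ~6.7B

# If every sample were packed to the full max_seq_len of 4096,
# the ceiling would instead be about 13.4B tokens.
print(f'{batches * global_batch_size * 4096 / 1e9:.1f}B tokens')
```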