
This directory contains the code for pretraining Llama. The model definition is from [gpt-fast](https://github.com/pytorch-labs/gpt-fast). It is slightly modified to remove the KV cache, which is not needed during pretraining.
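For intuition, here is a minimal sketch of what dropping the KV cache amounts to (illustrative only, not the actual gpt-fast code): during pretraining every forward pass sees the full sequence, so attention is computed directly over it with a causal mask and there is no incremental decoding state to track.

```python
import torch.nn.functional as F

def attention(q, k, v):
    # Pretraining processes whole sequences at once, so there is no
    # step-by-step decoding and no past keys/values to cache.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```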

The tokenizer is from the original [Llama repo](https://github.com/facebookresearch/llama) and uses sentencepiece under the hood. Instead of training the tokenizer from scratch, the `tokenizer.bin` file from the Llama 2 release is used.
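As a rough illustration of what "sentencepiece under the hood" means (the path and calls below are assumptions for this sketch, not code from this repo):

```python
from sentencepiece import SentencePieceProcessor

sp = SentencePieceProcessor()
sp.Load("transformer_nuggets/llama/data/tokenizer.model")  # path assumed from the setup below

ids = sp.EncodeAsIds("Hello, Llama!")
print(sp.DecodeIds(ids))  # round-trips back to the original text
```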

The training loop can be found in [`train.py`](./train.py). It expects that the [`prepare_data.py`](./prepare_data.py) script has been run to generate the training data, which is expected to be in the `data/` directory.

### Usage

The following paths assume you are in the top-level `transformer_nuggets/` directory.
#### Prepare Data

Then run the following command:

```shell
mkdir -p transformer_nuggets/llama/data

python transformer_nuggets/llama/prepare_data.py \
    --tokenizer_path=transformer_nuggets/llama/data/tokenizer.model \
    --output_dir=transformer_nuggets/llama/data/
```
This should take around 3 minutes to run and prepare the training data.

#### Train Model
To edit the training configs, take a look at [`train.py`](./train.py). The `entrypoint` function constructs the hyperparameter configs as well as the training configs. By default this will train a 7B model and save the checkpoints to `transformer_nuggets/llama/data/out/`. It will also save the loss logs to `transformer_nuggets/llama/data/logs`.
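As a hedged sketch of the kind of config object `entrypoint` builds (the real dataclass names, fields, and defaults in `train.py` may differ):

```python
from dataclasses import dataclass

# Hypothetical config; check train.py for the actual objects and defaults.
@dataclass
class TrainingConfig:
    batch_size: int = 8
    max_iters: int = 600_000
    checkpoint_dir: str = "transformer_nuggets/llama/data/out/"

config = TrainingConfig(batch_size=4)  # e.g. shrink the batch for a smaller GPU
```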

```shell
python transformer_nuggets/llama/train.py \
    --fp8_linear_type "delayed" --compile True
```
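The `--fp8_linear_type "delayed"` flag presumably selects the delayed-scaling variant of the fp8 linear layers, and `--compile True` turns on `torch.compile`; see [`train.py`](./train.py) for the full set of options.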


### Notes
To get the Llama 2 tokenizer, go to https://huggingface.co/meta-llama/Llama-2-7b and go through the steps to obtain access. This will get you the pretrained weights as well as the tokenizer.
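Once access is granted, one possible way to fetch the tokenizer programmatically (a sketch using `huggingface_hub`; the exact filename in the repo is an assumption):

```python
from huggingface_hub import hf_hub_download

# Assumes you are logged in (`huggingface-cli login`) and have been granted access.
path = hf_hub_download(
    repo_id="meta-llama/Llama-2-7b",
    filename="tokenizer.model",  # filename assumed; check the repo's file listing
    local_dir="transformer_nuggets/llama/data/",
)
print(path)
```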
