diff --git a/README.md b/README.md
index d28607fae3..b7f1e5b7aa 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
 # TRL - Transformer Reinforcement Learning
-TRL Banner
+TRL Banner


diff --git a/docs/source/alignprop_trainer.mdx b/docs/source/alignprop_trainer.mdx
index d76b5665da..a4c6b007ef 100644
--- a/docs/source/alignprop_trainer.mdx
+++ b/docs/source/alignprop_trainer.mdx
@@ -7,7 +7,7 @@
 If your reward function is differentiable, directly backpropagating gradients from the reward models to the diffusion model is significantly more sample and compute efficient (25x) than doing policy gradient algorithm like DDPO. AlignProp does full backpropagation through time, which allows updating the earlier steps of denoising via reward backpropagation.
-
+
 ## Getting started with `examples/scripts/alignprop.py`
diff --git a/docs/source/ddpo_trainer.mdx b/docs/source/ddpo_trainer.mdx
index 20dbbe82b1..0682144edb 100644
--- a/docs/source/ddpo_trainer.mdx
+++ b/docs/source/ddpo_trainer.mdx
@@ -6,9 +6,9 @@
 | Before | After DDPO finetuning |
 | --- | --- |
-| | |
-| | |
-| | |
+| | |
+| | |
+| | |
 ## Getting started with Stable Diffusion finetuning with reinforcement learning
diff --git a/docs/source/detoxifying_a_lm.mdx b/docs/source/detoxifying_a_lm.mdx
index 4fb3741f43..fe97422889 100644
--- a/docs/source/detoxifying_a_lm.mdx
+++ b/docs/source/detoxifying_a_lm.mdx
@@ -83,7 +83,7 @@ As a compromise between the two we took for a context window of 10 to 15 tokens
-
+
 ### How to deal with OOM issues
@@ -101,7 +101,7 @@ and the optimizer will take care of computing the gradients in `bfloat16` precis
 - Use shared layers: Since PPO algorithm requires to have both the active and reference model to be on the same device, we have decided to use shared layers to reduce the memory footprint of the model. This can be achieved by specifying `num_shared_layers` argument when calling the `create_reference_model()` function. For example, if you want to share the first 6 layers of the model, you can do it like this:
-
+
 ```python
@@ -124,13 +124,13 @@ We have decided to keep 3 models in total that correspond to our best models:
 We have used different learning rates for each model, and have found out that the largest models were quite hard to train and can easily lead to collapse mode if the learning rate is not chosen correctly (i.e. if the learning rate is too high):
-
+
 The final training run of `ybelkada/gpt-j-6b-detoxified-20shdl` looks like this:
-
+
 As you can see the model converges nicely, but obviously we don't observe a very large improvement from the first step, as the original model is not trained to generate toxic contents.
@@ -138,7 +138,7 @@ As you can see the model converges nicely, but obviously we don't observe a very
 Also we have observed that training with larger `mini_batch_size` leads to smoother convergence and better results on the test set:
-
+
 ## Results
@@ -159,7 +159,7 @@ We report the toxicity score of 400 sampled examples, compute its mean and stand
-
+
 Toxicity score with respect to the size of the model.
@@ -167,7 +167,7 @@ We report the toxicity score of 400 sampled examples, compute its mean and stand
 Below are few generation examples of `gpt-j-6b-detox` model:
-
+
 The evaluation script can be found [here](https://github.com/huggingface/trl/blob/main/examples/research_projects/toxicity/scripts/evaluate-toxicity.py).
diff --git a/docs/source/dpo_trainer.mdx b/docs/source/dpo_trainer.mdx
index 64ce87672f..103326fd9b 100644
--- a/docs/source/dpo_trainer.mdx
+++ b/docs/source/dpo_trainer.mdx
@@ -59,7 +59,7 @@ accelerate launch train_dpo.py
 Distributed across 8 GPUs, the training takes approximately 3 minutes. You can verify the training progress by checking the reward graph. An increasing trend in the reward margin indicates that the model is improving and generating better responses over time.
-![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/dpo-qwen2-reward-margin.png)
+![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/dpo-qwen2-reward-margin.png)
 To see how the [trained model](https://huggingface.co/trl-lib/Qwen2-0.5B-DPO) performs, you can use the [TRL Chat CLI](clis#chat-interface).
diff --git a/docs/source/how_to_train.md b/docs/source/how_to_train.md
index bac324e43b..6ac55079b7 100644
--- a/docs/source/how_to_train.md
+++ b/docs/source/how_to_train.md
@@ -18,7 +18,7 @@ When training RL models, optimizing solely for reward may lead to unexpected beh
 However, the RL model being optimized against the reward model may learn patterns that yield high reward but do not represent good language. This can result in extreme cases where the model generates texts with excessive exclamation marks or emojis to maximize the reward. In some worst-case scenarios, the model may generate patterns completely unrelated to natural language yet receive high rewards, similar to adversarial attacks.
-
+

Figure: Samples without a KL penalty from https://huggingface.co/papers/1909.08593.

diff --git a/docs/source/index.mdx b/docs/source/index.mdx
index 217506448a..34cf10c476 100644
--- a/docs/source/index.mdx
+++ b/docs/source/index.mdx
@@ -1,5 +1,5 @@
-
+
 # TRL - Transformer Reinforcement Learning
diff --git a/docs/source/kto_trainer.mdx b/docs/source/kto_trainer.mdx
index 7b79268410..05de7a026d 100644
--- a/docs/source/kto_trainer.mdx
+++ b/docs/source/kto_trainer.mdx
@@ -51,7 +51,7 @@ accelerate launch train_kto.py
 Distributed across 8 x H100 GPUs, the training takes approximately 30 minutes. You can verify the training progress by checking the reward graph. An increasing trend in the reward margin indicates that the model is improving and generating better responses over time.
-![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/kto-qwen2-reward-margin.png)
+![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/kto-qwen2-reward-margin.png)
 To see how the [trained model](https://huggingface.co/trl-lib/Qwen2-0.5B-KTO) performs, you can use the [TRL Chat CLI](clis#chat-interface).
diff --git a/docs/source/learning_tools.mdx b/docs/source/learning_tools.mdx
index 7d693dd2c9..add4844e2b 100644
--- a/docs/source/learning_tools.mdx
+++ b/docs/source/learning_tools.mdx
@@ -69,7 +69,7 @@ The rough idea is as follows:
 )
 ```
 4. Then generate some data such as `tasks = ["\n\nWhat is 13.1-3?", "\n\nWhat is 4*3?"]` and run the environment with `queries, responses, masks, rewards, histories = env.run(tasks)`. The environment will look for the `` token in the prompt and append the tool output to the response; it will also return the mask associated with the response. You can further use the `histories` to visualize the interaction between the model and the tool; `histories[0].show_text()` will show the text with color-coded tool output and `histories[0].show_tokens(tokenizer)` will show visualize the tokens.
- ![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/learning_tools.png)
+ ![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/learning_tools.png)
 1. Finally, we can train the model with `train_stats = ppo_trainer.step(queries, responses, rewards, masks)`. The trainer will use the mask to ignore the tool output when computing the loss, make sure to pass that argument to `step`.
 ## Experiment results
@@ -102,7 +102,7 @@ python -m openrlbenchmark.rlops_multi_metrics \
 --scan-history
 ```
-![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/learning_tools_chart.png)
+![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/learning_tools_chart.png)
 As we can see, while 1-2 experiments crashed for some reason, most of the runs obtained near perfect proficiency in the calculator task.
@@ -147,7 +147,7 @@ The frame of rackets for all sports was traditionally made of solid wood (later
 We then basically deployed this snippet as a Hugging Face space [here](https://huggingface.co/spaces/vwxyzjn/pyserini-wikipedia-kilt-doc), so that we can use the space as a `transformers.Tool` later.
-![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/pyserini.png)
+![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/pyserini.png)
 ### Experiment settings
@@ -181,7 +181,7 @@ Q: """
 Our experiments show that the agent can learn to use the wiki tool to answer questions. The learning curves would go up mostly, but one of the experiment did crash.
-![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/triviaqa_learning_curves.png)
+![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/triviaqa_learning_curves.png)
 Wandb report is [here](https://wandb.ai/costa-huang/cleanRL/reports/TriviaQA-Final-Experiments--Vmlldzo1MjY0ODk5) for further inspection.
@@ -191,13 +191,13 @@ Note that the correct rate of the trained model is on the low end, which could b
 * **incorrect searches:** When given the question `"What is Bruce Willis' real first name?"` if the model searches for `Bruce Willis`, our wiki tool returns "Patrick Poivey (born 18 February 1948) is a French actor. He is especially known for his voice: he is the French dub voice of Bruce Willis since 1988.` But a correct search should be `Walter Bruce Willis (born March 19, 1955) is an American former actor. He achieved fame with a leading role on the comedy-drama series Moonlighting (1985–1989) and appeared in over a hundred films, gaining recognition as an action hero after his portrayal of John McClane in the Die Hard franchise (1988–2013) and other roles.[1][2]"
- ![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/real_first_name.png)
+ ![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/real_first_name.png)
 * **unnecessarily long response**: The wiki tool by default sometimes output very long sequences. E.g., when the wiki tool searches for "Brown Act"
 * Our wiki tool returns "The Ralph M. Brown Act, located at California Government Code 54950 "et seq.", is an act of the California State Legislature, authored by Assemblymember Ralph M. Brown and passed in 1953, that guarantees the public's right to attend and participate in meetings of local legislative bodies."
 * [ToolFormer](https://huggingface.co/papers/2302.04761)'s wiki tool returns "The Ralph M. Brown Act is an act of the California State Legislature that guarantees the public's right to attend and participate in meetings of local legislative bodies." which is more succinct.
- ![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/brown_act.png)
+ ![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/brown_act.png)
 ## (Early Experiments 🧪): solving math puzzles with python interpreter
@@ -230,4 +230,4 @@ Q: """
 Training experiment can be found at https://wandb.ai/lvwerra/trl-gsm8k/runs/a5odv01y
-![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/gms8k_learning_curve.png)
+![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/gms8k_learning_curve.png)
diff --git a/docs/source/lora_tuning_peft.mdx b/docs/source/lora_tuning_peft.mdx
index 8906107c8e..01bef1e9e0 100644
--- a/docs/source/lora_tuning_peft.mdx
+++ b/docs/source/lora_tuning_peft.mdx
@@ -118,7 +118,7 @@ The `trl` library also supports naive pipeline parallelism (NPP) for large model
 This paradigm, termed as "Naive Pipeline Parallelism" (NPP) is a simple way to parallelize the model across multiple GPUs. We load the model and the adapters across multiple GPUs and the activations and gradients will be naively communicated across the GPUs. This supports `int8` models as well as other `dtype` models.
-
+
 ### How to use NPP?
diff --git a/docs/source/nash_md_trainer.md b/docs/source/nash_md_trainer.md
index 881e57e69c..58fcca8c36 100644
--- a/docs/source/nash_md_trainer.md
+++ b/docs/source/nash_md_trainer.md
@@ -111,7 +111,7 @@ trainer.add_callback(completions_callback)
 This callback logs the model's generated completions directly to Weights & Biases.
-![Logged Completions](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/wandb_completions.png)
+![Logged Completions](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/wandb_completions.png)
 ## Example script
diff --git a/docs/source/online_dpo_trainer.md b/docs/source/online_dpo_trainer.md
index 49e40957c1..8ba147e780 100644
--- a/docs/source/online_dpo_trainer.md
+++ b/docs/source/online_dpo_trainer.md
@@ -51,7 +51,7 @@ accelerate launch train_online_dpo.py
 Distributed across 8 GPUs, the training takes approximately 1 hour. You can verify the training progress by checking the reward graph. An increasing trend in both the reward for rejected and chosen completions indicates that the model is improving and generating better responses over time.
-![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/online-dpo-qwen2.png)
+![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/online-dpo-qwen2.png)
 To see how the [trained model](https://huggingface.co/trl-lib/Qwen2-0.5B-OnlineDPO) performs, you can use the [TRL Chat CLI](clis#chat-interface).
@@ -110,7 +110,7 @@ trainer.add_callback(completions_callback)
 This callback logs the model's generated completions directly to Weights & Biases.
-![Logged Completions](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/wandb_completions.png)
+![Logged Completions](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/wandb_completions.png)
 ## Example script
@@ -265,7 +265,7 @@ plt.tight_layout()
 plt.show()
 ```
-![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/online_dpo_scaling.png)
+![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/online_dpo_scaling.png)
 The online DPO checkpoint gets increasingly more win rate as we scale up the model sizes. This is a good sign that the online DPO implementation is working as intended.
diff --git a/docs/source/orpo_trainer.md b/docs/source/orpo_trainer.md
index 78a68e077c..bea95c485b 100644
--- a/docs/source/orpo_trainer.md
+++ b/docs/source/orpo_trainer.md
@@ -54,7 +54,7 @@ accelerate launch train_orpo.py
 Distributed across 8 GPUs, the training takes approximately 30 minutes. You can verify the training progress by checking the reward graph. An increasing trend in the reward margin indicates that the model is improving and generating better responses over time.
-![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/orpo-qwen2-reward-margin.png)
+![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/orpo-qwen2-reward-margin.png)
 To see how the [trained model](https://huggingface.co/trl-lib/Qwen2-0.5B-ORPO) performs, you can use the [TRL Chat CLI](clis#chat-interface).
diff --git a/docs/source/ppo_trainer.md b/docs/source/ppo_trainer.md
index a1cdc6529b..1e0faf663f 100644
--- a/docs/source/ppo_trainer.md
+++ b/docs/source/ppo_trainer.md
@@ -66,7 +66,7 @@ The logged metrics are as follows. Here is an example [tracked run at Weights an
 To help you understand what your model is doing, we periodically log some sample completions from the model. Here is an example of a completion. In an example [tracked run at Weights and Biases](https://wandb.ai/huggingface/trl/runs/dd2o3g35), it looks like the following, allowing you to see the model's response at different stages of training. By default we generate `--num_sample_generations 10` during training, but you can customize the number of generations.
-![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/ppov2_completions.gif?download=true)
+![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/ppov2_completions.gif)
 In the logs the sampled generations look like
@@ -210,7 +210,7 @@ The PPO checkpoint gets a 64.7% preferred rate vs the 33.0% preference rate of t
 Metrics:
-![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/benchmark/pr-1540/ppov2.png)
+![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/ppov2.png)
 ```bash
diff --git a/docs/source/quickstart.mdx b/docs/source/quickstart.mdx
index 6d653ef5f3..f310a101d8 100644
--- a/docs/source/quickstart.mdx
+++ b/docs/source/quickstart.mdx
@@ -9,7 +9,7 @@ Fine-tuning a language model via PPO consists of roughly three steps:
 3. **Optimization**: This is the most complex part. In the optimisation step the query/response pairs are used to calculate the log-probabilities of the tokens in the sequences. This is done with the model that is trained and a reference model, which is usually the pre-trained model before fine-tuning. The KL-divergence between the two outputs is used as an additional reward signal to make sure the generated responses don't deviate too far from the reference language model. The active language model is then trained with PPO.
 The full process is illustrated in the following figure:
-
+
 ## Minimal example
diff --git a/docs/source/rloo_trainer.md b/docs/source/rloo_trainer.md
index 71d189be7d..127f297321 100644
--- a/docs/source/rloo_trainer.md
+++ b/docs/source/rloo_trainer.md
@@ -68,7 +68,7 @@ The logged metrics are as follows. Here is an example [tracked run at Weights an
 To help you understand what your model is doing, we periodically log some sample completions from the model. Here is an example of a completion. In an example [tracked run at Weights and Biases](https://wandb.ai/huggingface/trl/runs/u2sqci34), it looks like the following, allowing you to see the model's response at different stages of training. By default we generate `--num_sample_generations 10` during training, but you can customize the number of generations.
-![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/ppov2_completions.gif)
+![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/ppov2_completions.gif)
 In the logs the sampled generations look like
@@ -251,7 +251,7 @@ The RLOO checkpoint gets a 51.2% preferred rate vs the 33.0% preference rate of
 Metrics:
-![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/benchmark/pr-1540/rloo.png)
+![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/rloo.png)
 ```bash
diff --git a/docs/source/sft_trainer.mdx b/docs/source/sft_trainer.mdx
index e83088ee09..6921946c89 100644
--- a/docs/source/sft_trainer.mdx
+++ b/docs/source/sft_trainer.mdx
@@ -502,7 +502,7 @@ NEFTune is a technique to boost the performance of chat models and was introduce
 > Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings. NEFTune also improves over strong baselines on modern instruction datasets. Models trained with Evol-Instruct see a 10% improvement, with ShareGPT an 8% improvement, and with OpenPlatypus an 8% improvement. Even powerful models further refined with RLHF such as LLaMA-2-Chat benefit from additional training with NEFTune.
-
+
 To use it in `SFTTrainer` simply pass `neftune_noise_alpha` when creating your `SFTConfig` instance. Note that to avoid any surprising behaviour, NEFTune is disabled after training to retrieve back the original behaviour of the embedding layer.
@@ -527,7 +527,7 @@ trainer.train()
 We have tested NEFTune by training `mistralai/Mistral-7B-v0.1` on the [OpenAssistant dataset](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) and validated that using NEFTune led to a performance boost of ~25% on MT Bench.
-
+
 Note however, that the amount of performance gain is _dataset dependent_ and in particular, applying NEFTune on synthetic datasets like [UltraChat](https://huggingface.co/datasets/stingning/ultrachat) typically produces smaller gains.
diff --git a/docs/source/text_environments.md b/docs/source/text_environments.md
index 851020e0f5..c7b0bd0cfd 100644
--- a/docs/source/text_environments.md
+++ b/docs/source/text_environments.md
@@ -3,7 +3,7 @@
 Text environments provide a learning ground for language agents. It allows a language model to use tools to accomplish a task such as using a Python interpreter to answer math questions or using a search index for trivia questions. Having access to tools allows language models to solve tasks that would be very hard for the models itself but can be trivial for the appropriate tools. A good example is arithmetics of large numbers that become a simple copy-paste task once you have access to a calculator.
-
+
 Let's dive into how text environments work and start with tools!
@@ -179,13 +179,13 @@ When the model interacts inside the `TextEnvironment` it can be useful to visual
 You can see that the prompt is highlighted in gray, whereas system segments such as query and tool responses are highlighted in green. All segments generated by the model are highlighted in blue and in addition to the pure text output the reward is displayed as additional text in plum. Here an example of `show_text`:
-
+
 Sometimes there can be tricky tokenization related issues that are hidden when showing the decoded text. Thus `TextHistory` also offers an option to display the same highlighting on the tokens directly with `show_tokens`:
-
+
 Note that you can turn on the colour legend by passing `show_legend=True`.
diff --git a/docs/source/using_llama_models.mdx b/docs/source/using_llama_models.mdx
index cf602d2030..420caf1948 100644
--- a/docs/source/using_llama_models.mdx
+++ b/docs/source/using_llama_models.mdx
@@ -19,7 +19,7 @@ Now we can fit very large models into a single GPU, but the training might still
 The simplest strategy in this scenario is data parallelism: we replicate the same training setup into separate GPUs and pass different batches to each GPU. With this, you can parallelize the forward/backward passes of the model and scale with the number of GPUs.
-![chapter10_ddp.png](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/blog/stackllama/chapter10_ddp.png)
+![chapter10_ddp.png](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/chapter10_ddp.png)
 We use either the `transformers.Trainer` or `accelerate`, which both support data parallelism without any code changes, by simply passing arguments when calling the scripts with `torchrun` or `accelerate launch`. The following runs a training script with 8 GPUs on a single machine with `accelerate` and `torchrun`, respectively.
@@ -38,7 +38,7 @@ The [StackExchange dataset](https://huggingface.co/datasets/HuggingFaceH4/stack-
 There is nothing special about fine-tuning the model before doing RLHF - it’s just the causal language modeling objective from pretraining that we apply here. To use the data efficiently, we use a technique called packing: instead of having one text per sample in the batch and then padding to either the longest text or the maximal context of the model, we concatenate a lot of texts with a EOS token in between and cut chunks of the context size to fill the batch without any padding.
-![chapter10_preprocessing-clm.png](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/blog/stackllama/chapter10_preprocessing-clm.png)
+![chapter10_preprocessing-clm.png](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/chapter10_preprocessing-clm.png)
 With this approach the training is much more efficient as each token that is passed through the model is also trained in contrast to padding tokens which are usually masked from the loss. If you don't have much data and are more concerned about occasionally cutting off some tokens that are overflowing the context you can also use a classical data loader.
diff --git a/docs/source/xpo_trainer.mdx b/docs/source/xpo_trainer.mdx
index 7516b9218d..07a76f36dc 100644
--- a/docs/source/xpo_trainer.mdx
+++ b/docs/source/xpo_trainer.mdx
@@ -110,7 +110,7 @@ trainer.add_callback(completions_callback)
 This callback logs the model's generated completions directly to Weights & Biases.
-![Logged Completions](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/wandb_completions.png)
+![Logged Completions](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/wandb_completions.png)
 ## Example script
diff --git a/examples/notebooks/gpt2-sentiment.ipynb b/examples/notebooks/gpt2-sentiment.ipynb
index 95f625f4f0..a5b6edc821 100644
--- a/examples/notebooks/gpt2-sentiment.ipynb
+++ b/examples/notebooks/gpt2-sentiment.ipynb
@@ -13,7 +13,7 @@
 "metadata": {},
 "source": [
 "
\n", - "\n", + "\n", "

Figure: Experiment setup to tune GPT2. The yellow arrows are outside the scope of this notebook, but the trained models are available through Hugging Face.

\n", "
\n", "\n",