
Improve docs #612

Merged · 13 commits merged into main from improve-docs on Aug 8, 2023

Conversation

@lvwerra (Member) commented on Aug 3, 2023

The goal of this PR is to improve the overall docs:

  • add links to blog posts (maybe we could add thumbnails?)
  • improve the README
  • explain how to generate text and which metrics to look for in PPO training
  • explain how to use a trained model for inference (Add more docs on inference #599)

@HuggingFaceDocBuilderDev commented on Aug 4, 2023

The documentation is not available anymore as the PR was closed or merged.

@lvwerra marked this pull request as ready for review on August 4, 2023
@lvwerra requested a review from vwxyzjn on August 4, 2023
@vwxyzjn (Contributor) left a comment

Looks really good! Thanks @lvwerra. I left some minor comments.

Comment on lines 5 to 10
When performing classical supervised fine-tuning of language models, the loss (especially the validation loss) serves as a good indicator of the training progress. However, in Reinforcement Learning (RL), the loss becomes less informative about the model's performance, and its value may fluctuate while the actual performance improves.

To address this, we recommend focusing on two key metrics:

- **Mean Reward**: The primary goal is to maximize the reward achieved by the model during RL training.
- **Objective KL Divergence**: KL divergence (Kullback-Leibler divergence) measures the dissimilarity between two probability distributions. In the context of RL training, we use it to quantify the difference between the current model and a reference model. Ideally, we want to keep the KL divergence between 0 and 10 to ensure the model's generated text remains close to what the reference model produces. A minimal sketch of estimating this KL is shown below.
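To make the KL metric concrete, here is a minimal sketch of estimating it between the current policy and a frozen reference model. This is not TRL's exact implementation; the `gpt2` checkpoint, the fixed prompt, and the simple log-ratio estimator are illustrative assumptions (in PPO training the tokens would be sampled from the policy, and the two models would differ).

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative checkpoint; in PPO training the reference model is a frozen copy
# of the policy taken at the start of training.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
policy = AutoModelForCausalLM.from_pretrained(model_name)     # model being optimized
reference = AutoModelForCausalLM.from_pretrained(model_name)  # frozen reference

# A fixed string keeps the sketch short; a proper estimate uses tokens
# sampled from the policy.
text = "The movie was surprisingly good and I enjoyed it"
inputs = tokenizer(text, return_tensors="pt")

def token_logprobs(model, inputs):
    """Log-probability of each observed token under the given model."""
    with torch.no_grad():
        logits = model(**inputs).logits[:, :-1]         # predictions for tokens 1..n
    logprobs = torch.log_softmax(logits, dim=-1)
    targets = inputs["input_ids"][:, 1:].unsqueeze(-1)  # tokens actually observed
    return torch.gather(logprobs, 2, targets).squeeze(-1)

# Per-token KL estimate: log p_policy(x) - log p_reference(x), averaged over tokens.
# With identical weights this is ~0; it grows as the policy drifts during training.
kl_per_token = token_logprobs(policy, inputs) - token_logprobs(reference, inputs)
print(f"approximate KL: {kl_per_token.mean().item():.4f}")
```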
Contributor:

Maybe we can merge this and https://github.com/lvwerra/trl/blob/main/docs/source/logging.mdx? I think another objective is entropy, where we would want the model to be as chaotic as possible.

Member Author:

I added a link to the logging page, what do you think?


When training RL models, optimizing solely for reward may lead to unexpected behaviors, where the model exploits the environment in ways that don't align with good language generation. In the case of RLHF, we use a reward model trained to predict whether a generated text is highly ranked by humans.

However, the RL model being optimized against the reward model may learn patterns that yield high reward but do not represent good language. This can result in extreme cases where the model generates texts with excessive exclamation marks or emojis to maximize the reward. In some worst-case scenarios, the model may generate patterns completely unrelated to natural language yet receive high rewards, similar to adversarial attacks.
Contributor:

We can probably give some examples/references, e.g. Table 10 from https://arxiv.org/pdf/1909.08593.pdf
[image: screenshot of Table 10 from the linked paper]

Member Author:

Good idea, added!

Comment on lines +51 to +57
Debugging the RL pipeline can be challenging due to its complexity. Here are some tips and suggestions to make the process easier:

- **Start from a working example**: Begin with a working example from the trl repository and gradually modify it to fit your specific use case. Changing everything at once makes it difficult to identify the source of potential issues. For example, start by replacing the model in the example, and once you have figured out the best hyperparameters, switch to your own dataset and reward model.
- **Start small, scale later**: Training large models can be very slow and take hours or days until you see any improvement. That is not a convenient timescale for debugging, so use small model variants during the development phase and scale up once that works. That said, be careful: small models might not have the capacity to solve a complicated task either.
- **Start simple**: Start with a minimal example and build complexity from there. Your use case might require, for example, a complicated reward function consisting of many different signals; try to optimize a single signal first and add more complexity after that.
- **Inspect the generations**: It is always a good idea to inspect what the model is generating. Maybe there is a bug in your post-processing or your prompt, or bad settings may cut off generations too soon. These issues are very hard to see in the metrics but very obvious if you look at the generations.
- **Inspect the reward model**: If your reward is not improving over time, there may be an issue with the reward model. Look at extreme cases to see if it does what it should: e.g. in the sentiment case, check whether simple positive and negative examples really get different rewards, and look at the distribution of rewards on your dataset. Finally, the reward may be dominated by the query, which the model can't affect, so you might need to normalize it (e.g. the reward of query+response minus the reward of the query alone; see the sketch after this list).
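To illustrate the normalization mentioned in the last tip, here is a minimal sketch assuming a sentiment classifier as the reward model. The `lvwerra/distilbert-imdb` checkpoint, the `POSITIVE` label, and the example strings are assumptions for illustration, not part of the PR.

```python
from transformers import pipeline

# Assumed reward model: a sentiment classifier, as in the docs' sentiment example.
sentiment_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb")

def positive_score(text):
    """Use the score of the POSITIVE class as a scalar reward."""
    outputs = sentiment_pipe(text, top_k=None)
    return next(o["score"] for o in outputs if o["label"] == "POSITIVE")

query = "The restaurant on the corner"           # part the model cannot influence
response = " was an absolute delight to visit."  # part generated by the model

# Normalize by subtracting the query-only reward, so the model is only credited
# for the text it actually generated.
raw_reward = positive_score(query + response)
query_reward = positive_score(query)
normalized_reward = raw_reward - query_reward
print(raw_reward, query_reward, normalized_reward)
```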
Contributor:

This is very nicely done!

Member Author:

Thanks :)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name_or_path = "path/to/your/model/or/name/on/hub"
# Load the trained model and its tokenizer for inference.
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
```
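For completeness, a brief usage sketch that reuses the `tokenizer` and `model` loaded above; the prompt and sampling settings are illustrative and not part of the PR:

```python
# Reusing `tokenizer` and `model` from the snippet above.
prompt = "The movie was"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```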
Contributor:

Maybe we can provide a working example and put # "path/to/your/model/or/name/on/hub"

Member Author:

fixed, except for the adapter as i don't have a good example at hand :)

@lvwerra changed the title from "WIP: Improve docs" to "Improve docs" on Aug 8, 2023
@lvwerra merged commit 3f1477c into main on Aug 8, 2023
@lvwerra deleted the improve-docs branch on August 8, 2023