TextEnvironments #424

lvwerra · 2023-06-09T16:06:10Z

This PR adds a multi-turn text environment to TRL.

Target API

env = Environment(model, tokenizer, prompt, tools)

for tasks in ppo_trainer.dataloader:
    histories = env.run(tasks)
    tokens, mask = histories.get_tokens()
    # keyword arguments must come after positional ones
    ppo_trainer.step(tokens, histories.rewards, mask=mask)

# alternatively, this would probably be much nicer
for tasks in ppo_trainer.dataloader:
    tokens, masks, rewards, history = env.run(tasks)
    ppo_trainer.step(tokens, masks, rewards)
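
The `tokens, mask` pair in the target API could be produced along the following lines. This is a minimal sketch, not the implementation in this PR: `build_tokens_and_mask` and the character-level `toy_encode` are hypothetical stand-ins (a real version would use the model's tokenizer), and it assumes the convention that only model-generated tokens get mask 1 while prompt and tool-response tokens get mask 0.

```python
from typing import Callable, List, Tuple


def build_tokens_and_mask(
    segments: List[Tuple[str, bool]],
    encode: Callable[[str], List[int]],
) -> Tuple[List[int], List[int]]:
    """Concatenate encoded segments and mark which tokens the model generated.

    Each segment is (text, from_model). Tokens from the model get mask 1;
    tokens from the prompt or a tool response get mask 0.
    """
    tokens: List[int] = []
    mask: List[int] = []
    for text, from_model in segments:
        ids = encode(text)
        tokens.extend(ids)
        mask.extend([1 if from_model else 0] * len(ids))
    return tokens, mask


def toy_encode(text: str) -> List[int]:
    # Hypothetical stand-in for a real tokenizer: one "token" per character.
    return [ord(ch) for ch in text]


segments = [
    ("What is 387 * 228?\n", False),                           # task prompt
    ("<request><SimpleCalculatorTool>387 * 228<call>", True),  # model output
    ("88236.0<response>", False),                              # tool output
    ("Result = 88236.0 <submit>", True),                       # model output
]
tokens, mask = build_tokens_and_mask(segments, toy_encode)
```

Keeping the mask aligned token-for-token with the concatenated sequence means it can be passed straight alongside the tokens to the trainer.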

Todos

text to token+mask conversion | add v1 masking #429
penalty for tool calls
masking in PPOTrainer after fwd pass | add v1 masking #429
end-to-end example calculator
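
For the masking items in the list above, one way to picture "masking after the forward pass" is a masked mean over per-token losses, so that masked-out positions (such as tool responses) do not contribute to the update. The sketch below uses plain Python lists in place of tensors; `masked_mean` is written here for illustration only and makes no claim about how PPOTrainer implements masking.

```python
from typing import Sequence


def masked_mean(values: Sequence[float], mask: Sequence[int]) -> float:
    """Average `values` over the positions where `mask` is 1."""
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    kept = sum(mask)
    if kept == 0:
        raise ValueError("mask selects no positions")
    return sum(v * m for v, m in zip(values, mask)) / kept


# Per-token losses for a 5-token response; positions 2 and 3 are tool output.
per_token_loss = [0.5, 1.5, 9.0, 9.0, 1.0]
mask = [1, 1, 0, 0, 1]
loss = masked_mean(per_token_loss, mask)  # averages 0.5, 1.5 and 1.0 only
```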

Current working example

from trl import TextEnvironment, TextHistory, AutoModelForCausalLMWithValueHead
from transformers import AutoModelForCausalLM, AutoTokenizer, load_tool

tool = load_tool("ybelkada/simple-calculator")

model_id = "gpt2-xl"

model = AutoModelForCausalLMWithValueHead.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

prompt = """\
What is 12.1 + 1 - 3?
<request><SimpleCalculatorTool>12.1 + 1<call>13.1<response>
<request><SimpleCalculatorTool>13.1 - 3<call>10.1<response>
Result = 10.1 <submit>

"""

reward_fn = lambda x: 1

env = TextEnvironment(model, tokenizer, [tool], reward_fn, prompt, generation_kwargs={"max_new_tokens": 32})
h = env.run(["What is 387 * 228?"])

h[0].show()

Result:

What is 12.1 + 1 - 3?
<request><SimpleCalculatorTool>12.1 + 1<call>13.1<response>
<request><SimpleCalculatorTool>13.1 - 3<call>10.1<response>
Result = 10.1 <submit>

What is 387 * 228?

<request><SimpleCalculatorTool>387 * 228<call>88236.0<response>

Result = 88236.0 <submit>
Reward: 1

Here, the 88236.0<response> segment was generated by the tool call.
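
The flow in this transcript (the model emits a request ending in `<call>`, then the environment runs the tool and appends its output followed by `<response>`) can be sketched as follows. This is an illustrative sketch under assumptions, not the code in this PR: `run_tool_call`, the regular expression, and the toy `multiply` tool are hypothetical; only the `<request>`, `<call>` and `<response>` markers are taken from the example above.

```python
import re
from typing import Callable, Dict, Optional

# Matches "<request><ToolName>query<call>"; DOTALL lets the query span lines.
REQUEST_RE = re.compile(r"<request><(\w+)>(.*?)<call>\s*", re.DOTALL)


def run_tool_call(text: str, tools: Dict[str, Callable[[str], str]]) -> Optional[str]:
    """Answer the last request in `text` if it is still waiting for a response.

    Returns the tool output followed by "<response>", or None if the text
    does not end with a pending "<request>...<call>".
    """
    start = text.rfind("<request>")
    if start == -1:
        return None
    match = REQUEST_RE.fullmatch(text[start:])
    if match is None:
        return None
    name, query = match.group(1), match.group(2)
    tool = tools.get(name)
    if tool is None:
        return f"Unknown tool {name}.<response>"
    try:
        return f"{tool(query)}<response>"
    except Exception as exc:  # report tool failures to the model instead of crashing
        return f"Tool error: {exc}<response>"


def multiply(query: str) -> str:
    # Toy stand-in for a calculator tool; handles only "a * b".
    left, right = query.split("*")
    return str(float(left) * float(right))


generated = "<request><SimpleCalculatorTool>387 * 228<call>"
appended = run_tool_call(generated, {"SimpleCalculatorTool": multiply})
```

Returning an error string instead of raising keeps the episode going, so the model sees the failure in its context and the reward function can still score the attempt.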

HuggingFaceDocBuilderDev · 2023-06-09T16:11:10Z

The documentation is not available anymore as the PR was closed or merged.

trl/environment/base.py

* add v1 masking * working v1 * adapt from suggestion * avoid warning `Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.` * fix masking - mask the responses from API call only * quality * address comments * Update trl/environment/base.py Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com> * adapt a bit * wip on tokenization/masking in textenv * small fixes * update viz * add example * print debug text and pass masks * style * format and move tensor to device * update example * update example * This seems to work * fix masking * fix rich output to console --------- Co-authored-by: Costa Huang <costa.huang@outlook.com> Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com> Co-authored-by: leandro <leandro.vonwerra@spoud.io>

* add v1 masking * working v1 * adapt from suggestion * avoid warning `Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.` * fix masking - mask the responses from API call only * quality * address comments * Update trl/environment/base.py Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com> * adapt a bit * wip on tokenization/masking in textenv * small fixes * update viz * add example * print debug text and pass masks * style * format and move tensor to device * update example * update example * This seems to work * fix masking * fix rich output to console * fix batched generation * improve stopping criteria * improve error handling in tool call --------- Co-authored-by: younesbelkada <younesbelkada@gmail.com> Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: Costa Huang <costa.huang@outlook.com>

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

vwxyzjn · 2023-08-28T19:46:31Z

Thanks @younesbelkada for the comments, I have addressed most of them. @lvwerra the PR looks good. Quick question: do we want to merge docs/source/learning_tools.mdx and docs/source/text_environments.md?

lvwerra · 2023-08-29T05:30:34Z

> Quick question: do we want to merge docs/source/learning_tools.mdx and docs/source/text_environments.md?

I thought we could have text_environments.md as the basic doc for how TextEnvs work and the learning_tools.mdx as the more hands-on guide also linking the experiment scripts. Wdyt?

Also @vwxyzjn, would you mind fixing the scripts so they pass the quality checks? Alternatively, we could exclude them from the quality tests; no strong opinion.

docs/source/learning_tools.mdx

younesbelkada

Looking great! Thanks for all your efforts. I left 4 nits; otherwise LGTM!

examples/research_projects/tools/calculator.py

examples/research_projects/tools/python_interpreter.py

examples/research_projects/tools/triviaqa.py

trl/trainer/ppo_trainer.py

* WIP skeleton * minimal working poc * cleanup * rename variables * quick typo fix * add v1 masking (huggingface#429) * add v1 masking * working v1 * adapt from suggestion * avoid warning `Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.` * fix masking - mask the responses from API call only * quality * address comments * Update trl/environment/base.py Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com> * adapt a bit * wip on tokenization/masking in textenv * small fixes * update viz * add example * print debug text and pass masks * style * format and move tensor to device * update example * update example * This seems to work * fix masking * fix rich output to console --------- Co-authored-by: Costa Huang <costa.huang@outlook.com> Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com> Co-authored-by: leandro <leandro.vonwerra@spoud.io> * Add masking (huggingface#461) * add v1 masking * working v1 * adapt from suggestion * avoid warning `Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.` * fix masking - mask the responses from API call only * quality * address comments * Update trl/environment/base.py Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com> * adapt a bit * wip on tokenization/masking in textenv * small fixes * update viz * add example * print debug text and pass masks * style * format and move tensor to device * update example * update example * This seems to work * fix masking * fix rich output to console * fix batched generation * improve stopping criteria * improve error handling in tool call --------- Co-authored-by: younesbelkada <younesbelkada@gmail.com> Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: Costa Huang <costa.huang@outlook.com> * fix uknown tool * fix rewards and increase bs * remove unused script * ugly WIP fix * do not return modified obj for in-place operations * do not return modified obj for 
in-place operations * clean up stopping criterium * push updates * push update * format, add docs * rename file * add kwargs to reward fn * simplify example * simplify example * bug fix * add a trivia example * pre-commit * max tool response length * fix regex for multi-line * refactor tool exceptions * fix exceptions in tool * add docs * fix style * make rich optional * add docstrings * add tests * add TextEnv tests (WIP) * update triviaqa code * update docs * refactor text env * update tests (WIP) * add end2end test * update docs * upload tool demo * refactor * customizable system prompt * add text env docs * update index and toc * fix `TextHistory` show methods * add max length * fix style * fix typo * refactor to kwargs in init and tasks to queries * kwargs for reward docs * Update examples/triviaqa.py Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * Update examples/tool_demo.py Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * Update docs/source/learning_tools.mdx Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * Update docs/source/learning_tools.mdx Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * Update docs/source/learning_tools.mdx Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * Update docs/source/text_environments.md Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * Update examples/triviaqa.py Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * Update examples/triviaqa.py Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * move to tool folder * remove assets * remove tool demo * move rich import test to import utils * add copyright * fixes for masks in ppo trainer * add text env api docs * make precommit + add ppo test with mask * move examples and add python * fix style * update triviaqa example 
* add more docs * update docs * Update docs/source/learning_tools.mdx * Apply suggestions from code review * precommit --------- Co-authored-by: Costa Huang <costa.huang@outlook.com> Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: younesbelkada <younesbelkada@gmail.com> Co-authored-by: leandro von werra <leandro@hf.co>

leandro added 3 commits May 31, 2023 19:57

WIP skeleton

ded9b3b

minimal working poc

dbbbdb3

cleanup

7ab947b

vwxyzjn reviewed Jun 12, 2023

trl/environment/base.py (outdated, resolved)

trl/environment/base.py (outdated, resolved)

trl/environment/base.py (outdated, resolved)

vwxyzjn added 4 commits June 12, 2023 14:43

rename variables

85e0af2

quick typo fix

600d08a

Merge branch 'main' into envs

0f9454e

Merge branch 'main' into envs

01244ec

vwxyzjn mentioned this pull request Jun 20, 2023

More complex reward mechanism #419

Closed

younesbelkada and others added 20 commits June 23, 2023 11:31

fix uknown tool

ad29c30

fix rewards and increase bs

528ba0f

remove unused script

b8a384f

ugly WIP fix

b0513c0

do not return modified obj for in-place operations

4700a86

do not return modified obj for in-place operations

461039e

clean up stopping criterium

4dfcf86

push updates

146e56a

push update

2d7aa57

Merge branch 'envs' of https://github.com/lvwerra/trl into envs

fe9173e

format, add docs

ff7d587

rename file

a1e8229

Merge branch 'main' into envs

fada4b5

add kwargs to reward fn

a4d5997

simplify example

cb48e61

simplify example

d3e07b7

bug fix

004855d

add a trivia example

1f2f2bf

vwxyzjn and others added 7 commits August 28, 2023 15:24

Update docs/source/learning_tools.mdx

e309fa8

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

Update docs/source/text_environments.md

b3021bb

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

Update examples/triviaqa.py

7917535

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

Update examples/triviaqa.py

88c4c89

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

move to tool folder

a2ae29c

remove assets

9441604

remove tool demo

9e7a278

leandro and others added 10 commits August 29, 2023 07:38

move rich import test to import utils

4844f60

add copyright

6f5ade7

fixes for masks in ppo trainer

4da7214

add text env api docs

11f53f2

make precommit + add ppo test with mask

1641881

move examples and add python

25b2f95

fix style

6461e96

update triviaqa example

b28d29c

add more docs

70255c0

update docs

53f08de

lvwerra commented Aug 30, 2023

docs/source/learning_tools.mdx (outdated, resolved)

Update docs/source/learning_tools.mdx

bc0a0ca

younesbelkada approved these changes Aug 30, 2023

examples/research_projects/tools/calculator.py (resolved)

examples/research_projects/tools/python_interpreter.py (resolved)

examples/research_projects/tools/triviaqa.py (resolved)

trl/trainer/ppo_trainer.py (outdated, resolved)

younesbelkada and others added 2 commits August 30, 2023 11:28

Apply suggestions from code review

782aee9

precommit

b7351eb

younesbelkada merged commit 9d09b3e into main Aug 30, 2023

younesbelkada deleted the envs branch August 30, 2023 09:44

younesbelkada mentioned this pull request Aug 30, 2023

[PPOTrainer] A workaround for failing log_stats #708

Merged

lvwerra mentioned this pull request Dec 21, 2023

PPO for conversation datasets #1102

Closed
