
Instability with the sentiment analysis example #417

Closed

vwxyzjn opened this issue Jun 8, 2023 · 3 comments

Comments


vwxyzjn commented Jun 8, 2023

Hi all, I ran the latest TRL (after merging #410) with the sentiment analysis example for ten random seeds. I noticed 3 out of 10 experiments experienced some amount of negative KL explosion, though one of them recovered 🫠... Related: #235 (comment)

pip install openrlbenchmark==0.2.1a2
python -m openrlbenchmark.rlops_trl_multi_metrics \
    --filters '?we=costa-huang&wpn=trl&ceik=tracker_project_name&cen=log_with&metrics=env/reward_mean&metrics=env/reward_std&metrics=objective/kl_coef&metrics=objective/kl&metrics=objective/entropy&metrics=ppo/std_scores&metrics=ppo/mean_scores&metrics=ppo/learning_rate&metrics=ppo/mean_non_score_reward&metrics=ppo/loss/value&metrics=ppo/loss/total&metrics=ppo/loss/policy&metrics=ppo/policy/advantages_mean&metrics=ppo/policy/approxkl&metrics=ppo/policy/clipfrac&metrics=ppo/policy/entropy&metrics=ppo/returns/mean&metrics=ppo/returns/var' \
        'wandb?tag=gpt2-sentiment&cl=sentiment analysis (PR-410)' \
    --env-ids trl \
    --check-empty-runs \
    --pc.ncols 5 \
    --pc.ncols-legend 1 \
    --output-filename static/0compare \
    --scan-history
[Plots: the tracked metrics across the ten seeds, produced by the command above]
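
For context on the "negative KL" above: TRL logs a sample-based KL estimate built from per-token log-probability differences between the policy and the frozen reference model, and such an estimate can dip below zero even though the true KL divergence is non-negative. A minimal standalone sketch of that estimator (illustrative numbers, not TRL internals):

import torch

# Per-token log-probabilities of one sampled response under the
# trained policy and under the frozen reference model (made-up values).
logprobs = torch.tensor([-2.1, -0.5, -3.0])
ref_logprobs = torch.tensor([-1.8, -0.9, -2.2])

# Sample-based estimate: mean of log p(x) - log p_ref(x) over sampled
# tokens. Non-negative in expectation, but any single batch can come
# out negative, which is what a "negative KL" spike reflects.
kl_estimate = (logprobs - ref_logprobs).mean()
print(kl_estimate)  # tensor(-0.2333)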

lvwerra commented Jun 19, 2023

Can you try without batched generation (in case you are using it)?


vwxyzjn commented Jun 20, 2023

https://wandb.ai/costa-huang/trl/runs/dak1k9il/code?workspace=user-costa-huang has the code associated with the run. The relevant code is:

import torch
from tqdm import tqdm
from trl.core import LengthSampler

# ppo_trainer, tokenizer, sentiment_pipe, and sent_kwargs are set up
# earlier in the script (see the linked run for the full code).

generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
}
output_min_length = 4
output_max_length = 16
output_length_sampler = LengthSampler(output_min_length, output_max_length)

for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    query_tensors = batch["input_ids"]

    # Get response from gpt2
    response_tensors = ppo_trainer.generate(
        query_tensors, return_prompt=False, length_sampler=output_length_sampler, **generation_kwargs
    )
    batch["response"] = tokenizer.batch_decode(response_tensors)

    # Compute sentiment score (output[1] is the POSITIVE-class score
    # returned by the sentiment pipeline)
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
    rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]

    # Run PPO step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)
Does it use batched generation?
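
For comparison, the pre-batched-generation TRL sentiment example issued one generate call per query. A sketch of that unbatched variant, assuming ppo_trainer.generate also accepts a single query tensor as in the earlier examples (sampler and kwargs as defined above):

# Unbatched variant: one generate call per query, so no padding or
# batch-level interaction during generation is involved.
response_tensors = []
for query in query_tensors:
    gen_len = output_length_sampler()
    generation_kwargs["max_new_tokens"] = gen_len
    response = ppo_trainer.generate(query, **generation_kwargs)
    response_tensors.append(response.squeeze()[-gen_len:])
batch["response"] = tokenizer.batch_decode(response_tensors)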


lvwerra commented Jul 5, 2023

This should have been solved with #487, so closing for now.

lvwerra closed this as completed Jul 5, 2023