
Instability with the sentiment analysis example #417

Closed

vwxyzjn opened this issue Jun 8, 2023 · 3 comments

Comments


vwxyzjn commented Jun 8, 2023

Hi all, I ran the latest TRL (after merging #410) with the sentiment analysis example for ten random seeds. I noticed 3 out of 10 experiments experienced some amount of negative KL explosion, though one of them recovered 🫠... Related: #235 (comment)

pip install openrlbenchmark==0.2.1a2
python -m openrlbenchmark.rlops_trl_multi_metrics \
    --filters '?we=costa-huang&wpn=trl&ceik=tracker_project_name&cen=log_with&metrics=env/reward_mean&metrics=env/reward_std&metrics=objective/kl_coef&metrics=objective/kl&metrics=objective/entropy&metrics=ppo/std_scores&metrics=ppo/mean_scores&metrics=ppo/learning_rate&metrics=ppo/mean_non_score_reward&metrics=ppo/loss/value&metrics=ppo/loss/total&metrics=ppo/loss/policy&metrics=ppo/policy/advantages_mean&metrics=ppo/policy/approxkl&metrics=ppo/policy/clipfrac&metrics=ppo/policy/entropy&metrics=ppo/returns/mean&metrics=ppo/returns/var' \
        'wandb?tag=gpt2-sentiment&cl=sentiment analysis (PR-410)' \
    --env-ids trl \
    --check-empty-runs \
    --pc.ncols 5 \
    --pc.ncols-legend 1 \
    --output-filename static/0compare \
    --scan-history
[Plots: the tracked metrics across the ten seeds, produced by the command above]
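
For context on the "negative KL" above: TRL logs a sample-based KL estimate built from per-token log-probability differences between the policy and the frozen reference model, and such an estimate can dip below zero even though the true KL divergence is non-negative. A minimal standalone sketch of that estimator (illustrative numbers, not TRL internals):

import torch

# Per-token log-probabilities of one sampled response under the
# trained policy and under the frozen reference model (made-up values).
logprobs = torch.tensor([-2.1, -0.5, -3.0])
ref_logprobs = torch.tensor([-1.8, -0.9, -2.2])

# Sample-based estimate: mean of log p(x) - log p_ref(x) over sampled
# tokens. Non-negative in expectation, but any single batch can come
# out negative, which is what a "negative KL" spike reflects.
kl_estimate = (logprobs - ref_logprobs).mean()
print(kl_estimate)  # tensor(-0.2333)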

lvwerra commented Jun 19, 2023

Can you try without batched generation (in case you are using it)?


vwxyzjn commented Jun 20, 2023

https://wandb.ai/costa-huang/trl/runs/dak1k9il/code?workspace=user-costa-huang has the code associated with the run. The relevant code is:

import torch
from tqdm import tqdm
from trl.core import LengthSampler

# ppo_trainer, tokenizer, sentiment_pipe, and sent_kwargs are set up
# earlier in the script (see the linked run for the full code).

generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
}
output_min_length = 4
output_max_length = 16
output_length_sampler = LengthSampler(output_min_length, output_max_length)

for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    query_tensors = batch["input_ids"]

    # Get response from gpt2
    response_tensors = ppo_trainer.generate(
        query_tensors, return_prompt=False, length_sampler=output_length_sampler, **generation_kwargs
    )
    batch["response"] = tokenizer.batch_decode(response_tensors)

    # Compute sentiment score (output[1] is the POSITIVE-class score
    # returned by the sentiment pipeline)
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
    rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]

    # Run PPO step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)
Does it use batched generation?
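
For comparison, the pre-batched-generation TRL sentiment example issued one generate call per query. A sketch of that unbatched variant, assuming ppo_trainer.generate also accepts a single query tensor as in the earlier examples (sampler and kwargs as defined above):

# Unbatched variant: one generate call per query, so no padding or
# batch-level interaction during generation is involved.
response_tensors = []
for query in query_tensors:
    gen_len = output_length_sampler()
    generation_kwargs["max_new_tokens"] = gen_len
    response = ppo_trainer.generate(query, **generation_kwargs)
    response_tensors.append(response.squeeze()[-gen_len:])
batch["response"] = tokenizer.batch_decode(response_tensors)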


lvwerra commented Jul 5, 2023

This should have been solved with #487, so closing for now.

lvwerra closed this as completed Jul 5, 2023