There are several optimizations to our PPO recipe which could help push its performance closer to SOTA. There are also several pieces of documentation we could offer alongside this recipe to increase visibility and improve accessibility. These lists are non-comprehensive, and not every item is required.
Documentation
- Recipe documentation page which sufficiently explains how to use the recipe, including:
  - model and dataset requirements
  - explanation of all recipe parameters and algorithm hyperparameters

Optimizations

- Rough benchmarks from DeepSpeed
  - I think the results from this page all use LoRA. Nonetheless, it's one of the only sources of compute usage figures for a modern RLHF implementation.
  - *It's unclear what size of reward model is used here. Throughout the blog post they use reward model sizes << policy model sizes.
  - They also state:
    > For now, we suggest that users use "Total-GPU-Memory-in-GB / 6" as the upper parameter bound in billions for the sum of the actor model and critical model, for safety. Nevertheless, users are welcome to try the real limit.
  - which gives roughly 13.3B parameters (80 / 6 ≈ 13.3) as the combined budget for the actor + critic models on a single A100 80GB.
- (From the DeepSpeed link above) - granted, these aren't strictly performance optimizations:
  - Exponential Moving Average (EMA) collection, where an EMA-based checkpoint can be chosen for the final evaluation (rough sketch after this list).
  - Mixture Training, which mixes the pretraining objective (i.e., next-word prediction) with the PPO objective to prevent performance regression on public benchmarks like SQuAD 2.0 (rough sketch after this list).
- Compile issues @ebsmothers (rough sketch after this list)
  - Enable compile for batched RLHF generation utils #1402
  - Enable compile for trajectory generation step?
  - Enable compile for loss step?
  - ?? how else can we make inference go fast?
- Reference and/or reward model offload to CPU @ebsmothers (rough sketch after this list)
- Optimizer offload to CPU (Add CPU offload optimizer from torchao #1351) (to benchmark once it lands)
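For the EMA item, here's a minimal sketch of what weight collection could look like; the `update_ema` helper, the 0.999 decay, and the toy model are illustrative stand-ins, not anything from torchtune or DeepSpeed:

```python
import copy

import torch
from torch import nn


@torch.no_grad()
def update_ema(ema_model: nn.Module, model: nn.Module, decay: float = 0.999) -> None:
    """Blend current policy weights into a shadow copy kept for final evaluation."""
    for ema_param, param in zip(ema_model.parameters(), model.parameters()):
        # ema = decay * ema + (1 - decay) * param
        ema_param.lerp_(param, 1.0 - decay)


# Toy usage: `policy` stands in for the actor being trained with PPO.
policy = nn.Linear(16, 16)
ema_policy = copy.deepcopy(policy).eval()

for _ in range(10):
    # ...one PPO optimizer step on `policy` would go here...
    with torch.no_grad():
        policy.weight.add_(0.01 * torch.randn_like(policy.weight))
    update_ema(ema_policy, policy)
# Checkpoint `ema_policy` and optionally pick it for the final evaluation.
```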
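For the Mixture Training item, a sketch of the loss combination, assuming a batch of pretraining text is available each PPO step; `mixed_loss` and `ptx_coef` are made-up names and the coefficient is a placeholder:

```python
import torch
import torch.nn.functional as F


def mixed_loss(
    ppo_loss: torch.Tensor,
    lm_logits: torch.Tensor,   # [batch, seq_len, vocab] logits on pretraining text
    lm_labels: torch.Tensor,   # [batch, seq_len] token ids
    ptx_coef: float = 0.1,
) -> torch.Tensor:
    """Combine the PPO objective with a next-token-prediction objective."""
    # Standard causal LM loss: predict token t+1 from tokens <= t.
    shift_logits = lm_logits[:, :-1, :].contiguous()
    shift_labels = lm_labels[:, 1:].contiguous()
    ptx_loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)
    )
    return ppo_loss + ptx_coef * ptx_loss


# Toy usage with random tensors:
logits = torch.randn(2, 8, 100)
labels = torch.randint(0, 100, (2, 8))
loss = mixed_loss(torch.tensor(0.5), logits, labels)
```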
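For the compile items, a hedged sketch of the easier half: compiling the loss math, which has static shapes. The clipped-surrogate form below is the textbook PPO objective rather than the recipe's exact loss:

```python
import torch


def ppo_policy_loss(
    new_logprobs: torch.Tensor,
    old_logprobs: torch.Tensor,
    advantages: torch.Tensor,
    epsilon: float = 0.2,
) -> torch.Tensor:
    """Clipped PPO surrogate objective (negated so it can be minimized)."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return -torch.min(ratio * advantages, clipped * advantages).mean()


# Loss shapes are fixed across PPO epochs, so one compiled graph gets reused.
compiled_loss = torch.compile(ppo_policy_loss)
loss = compiled_loss(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128))

# Trajectory generation is harder to compile: the sequence length changes every
# decoding step, so it needs either dynamic shapes or a fixed-shape
# single-token decode against a static KV cache to avoid recompilation.
```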
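For the reference/reward model offload item, a minimal sketch of parking frozen models on CPU and only moving them to GPU while they're needed; `on_device`, `ref_model`, and `trajectory_tokens` are stand-ins, not recipe code:

```python
import contextlib

import torch
from torch import nn


@contextlib.contextmanager
def on_device(model: nn.Module, device: torch.device):
    """Temporarily move a frozen model onto `device`, then park it back on CPU."""
    model.to(device)
    try:
        yield model
    finally:
        model.to("cpu")
        torch.cuda.empty_cache()  # hand the freed blocks back to the policy/critic

# The reference and reward models are only needed once per rollout, at scoring
# time, so they can live on CPU the rest of the time (trading GPU memory for
# host<->device transfer latency):
#
# with on_device(ref_model, torch.device("cuda")) as ref:
#     ref_logprobs = ref(trajectory_tokens)
```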
cc @kartikayk
Not sure if it will be useful for you, but there are 8-bit and 4-bit AdamW in torchao: https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim. Both support FSDP1/2. 8-bit AdamW should match bnb, and the 4-bit version should match lpmm exactly. They are included in the torchao 0.4 release, but there is a bug in handling the LR schedule (fixed in the main branch).
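A sketch of what the swap could look like, assuming the `AdamW8bit` / `AdamW4bit` classes described in the linked README; the model and hyperparameters are placeholders:

```python
import torch
from torch import nn
from torchao.prototype.low_bit_optim import AdamW8bit  # or AdamW4bit

# Placeholder for the policy model; the optimizer state is quantized to 8 bits,
# which substantially shrinks Adam's memory footprint versus fp32 state.
model = nn.Linear(4096, 4096, device="cuda")
optimizer = AdamW8bit(model.parameters(), lr=1e-5, weight_decay=0.0)

loss = model(torch.randn(8, 4096, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```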