
feat: adds REINFORCE algorithm #357

Merged: 11 commits from reinforce-pr into NVIDIA:main, Nov 22, 2024
Conversation

abukharin3 (Contributor)

What does this PR do?

Adds the REINFORCE algorithm.
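
For background, REINFORCE optimizes the policy by weighting sampled-response log-probabilities with a baseline-corrected reward, while a KL penalty against the initial policy keeps the actor close to the SFT model. The sketch below is illustrative only, not the code added by this PR; the tensor layout, the leave-one-out baseline, and the name reinforce_loss are assumptions:

import torch

def reinforce_loss(logprobs, init_logprobs, rewards, kl_coef=0.01):
    # All inputs have shape (num_prompts, num_rollouts); num_rollouts must be > 1.
    # logprobs:      summed token log-probs of each sampled response (with grad)
    # init_logprobs: same quantity under the frozen initial policy (no grad)
    # rewards:       scalar reward-model score per response
    kl = logprobs.detach() - init_logprobs                  # simple per-sample KL estimate
    shaped = rewards - kl_coef * kl                         # KL-regularized reward
    n = shaped.shape[1]
    baseline = (shaped.sum(dim=1, keepdim=True) - shaped) / (n - 1)  # leave-one-out mean
    advantage = shaped - baseline
    return -(advantage.detach() * logprobs).mean()          # minimizing this ascends E[A * log-prob]

With NUM_ROLLOUTS=16 as in the usage script below, each prompt would contribute 16 samples to such a baseline.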

Changelog

  • Please update the CHANGELOG.md under next version with high level changes in this PR.

Usage

NAME="2p_reinforce"

# PARAMETERS

RM_NEMO_FILE="/path/to/trained_rm.nemo"

ACTOR_NEMO_FILE="/path/to/sft_model.nemo"

TRAIN_DATA_PATH="/path/to/train_prompts.jsonl"
VALID_DATA_PATH="/path/to/test_prompts.jsonl"

RESULTS_DIR="/path/to/results_dir"
mkdir -p $RESULTS_DIR

GPFS="/path/to/nemo-aligner-repo"
MOUNTS="--container-mounts=MOUNTS" # placeholder: set the container mounts you need (illustrative example below)
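# Illustrative example only -- mount the repo, model files, and results dir, e.g.:
# MOUNTS="--container-mounts=${GPFS}:${GPFS},${RM_NEMO_FILE}:${RM_NEMO_FILE},${ACTOR_NEMO_FILE}:${ACTOR_NEMO_FILE},${RESULTS_DIR}:${RESULTS_DIR}"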

CONTAINER=<<>> # placeholder: use the latest NeMo Training container; NeMo-Aligner works in it

PROJECT=reinforce_run

CRITIC_LOG_DIR="${RESULTS_DIR}/critic_results"
CRITIC_OUTFILE="${CRITIC_LOG_DIR}/critic_output_%j_%t.log"
CRITIC_ERRFILE="${CRITIC_LOG_DIR}/critic_error_%j_%t.err"
REWARD_PORT=5567
CRITIC_CONFIG_PATH="${GPFS}/examples/nlp/gpt/conf"
CRITIC_CONFIG_NAME="inference_rm"

CONF_DIR="${GPFS}/examples/nlp/gpt/conf"
CONFIG_NAME="gpt_reinforce_actor"

mkdir -p $CRITIC_LOG_DIR

CRITIC_NAME="${NAME}_critic"

read -r -d '' cmd_critic_inference <<EOF
cd ${GPFS} \
&& export PYTHONPATH="${GPFS}:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python -u examples/nlp/gpt/serve_reward_model.py \
    --config-path=${CRITIC_CONFIG_PATH} \
    --config-name=${CRITIC_CONFIG_NAME} \
    trainer.num_nodes=1 \
    trainer.devices=8 \
    ++model.tensor_model_parallel_size=4 \
    rm_model_file=${RM_NEMO_FILE} \
    inference.port=${REWARD_PORT}
EOF

srun --het-group=0 -o $CRITIC_OUTFILE -e $CRITIC_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_critic_inference}" &

sleep 30

ACTOR_LOG_DIR="${RESULTS_DIR}/actor_results"
CHECKPOINT_DIR="${ACTOR_LOG_DIR}/checkpoints"
TENSORBOARD_DIR="${ACTOR_LOG_DIR}/tensorboard"
ACTOR_OUTFILE="${ACTOR_LOG_DIR}/actor_output_%j_%t.log"
ACTOR_ERRFILE="${ACTOR_LOG_DIR}/actor_error_%j_%t.err"

NUM_ROLLOUTS=16
NORMALIZE="True"
ACTOR_LR="1e-6"
ACTOR_GBS=16
KL=0.01
USE_FLASK=False

mkdir -p $ACTOR_LOG_DIR
mkdir -p $TENSORBOARD_DIR
mkdir -p $CHECKPOINT_DIR

ACTOR_NAME="${NAME}_actor"

host_reward="$(scontrol show hostnames=$SLURM_JOB_NODELIST_HET_GROUP_0 | head -n1)"
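
# Optional readiness check (illustrative; assumes netcat is available): poll the
# reward server port instead of relying solely on the fixed 'sleep 30' above.
until nc -z "${host_reward}" "${REWARD_PORT}"; do
    sleep 10
done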

read -r -d '' cmd_reinforce <<EOF
cd ${GPFS} \
&& export PYTHONPATH="${GPFS}:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python -u examples/nlp/gpt/train_gpt_reinforce_actor.py \
    "model.data.data_prefix={train: [${TRAIN_DATA_PATH}], validation: [${VALID_DATA_PATH}], test: [${VALID_DATA_PATH}]}" \
    pretrained_checkpoint.restore_from_path="${ACTOR_NEMO_FILE}" \
    exp_manager.checkpoint_callback_params.save_top_k=1 \
    exp_manager.explicit_log_dir="${RESULTS_DIR}" \
    trainer.reinforce.max_epochs=1 \
    trainer.reinforce.max_steps=313 \
    trainer.reinforce.val_check_interval=4 \
    trainer.num_nodes=1 \
    trainer.devices=8 \
    trainer.reinforce.trt_llm.enable=True \
    trainer.reinforce.trt_llm.reshard=True \
    trainer.reinforce.trt_llm.unload_engine_train=False \
    ++model.tensor_model_parallel_size=4 \
    ++model.reinforce.num_rollout_samples=${NUM_ROLLOUTS} \
    model.global_batch_size=${ACTOR_GBS} \
    model.micro_batch_size=1 \
    model.optim.lr="\${multiply:${ACTOR_LR},1.001}" \
    model.optim.sched.warmup_steps=0 \
    model.optim.sched.constant_steps=312 \
    model.optim.sched.min_lr=${ACTOR_LR} \
    model.optim.weight_decay=0.01 \
    model.reinforce.rollout_micro_batch_size=16 \
    model.reinforce.forward_micro_batch_size=16 \
    model.reinforce.val_rollout_micro_batch_size=8 \
    model.data.data_impl=jsonl \
    remote_rm.reward_model.ip=${host_reward} \
    remote_rm.reward_model.port=${REWARD_PORT} \
    ++model.reinforce.length_params.max_length=2048 \
    trainer.reinforce.initial_policy_kl_penalty="${KL}" \
    ++model.optim.bucket_cap_mb=200 \
    ++model.dist_ckpt_format=zarr \
    ++model.optim.overlap_grad_sync=False \
    ++model.optim.contiguous_grad_buffer=True \
    ++model.enable_nge=True \
    trainer.reinforce.batch_iterator.use_flask=${USE_FLASK} \
    trainer.reinforce.rollout_batch_seq_length=4096
EOF

srun --het-group=1 -o $ACTOR_OUTFILE -e $ACTOR_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_reinforce}" &

wait
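
The actor reads prompts as JSONL (model.data.data_impl=jsonl). Below is a minimal sketch for producing a toy prompt file, assuming each line is a JSON object with a "text" field as in the analogous PPO prompt format; verify the schema against your data pipeline:

import json

prompts = [
    "Explain the REINFORCE algorithm in one paragraph.",
    "Write a haiku about variance reduction.",
]
with open("train_prompts.jsonl", "w") as f:
    for p in prompts:
        f.write(json.dumps({"text": p}) + "\n")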

Before your PR is "Ready for review"

Pre checks:

  • [Y] Make sure you read and followed Contributor guidelines
  • [N] Did you write any new necessary tests?
  • [Y] Did you add or update any necessary documentation? Make sure to also update the NeMo Framework User Guide which contains the tutorials

Checklist when contributing a new algorithm

  • [Y] Does the trainer resume and restore all model states?
  • [Y] Does the trainer support all parallelism techniques (PP, TP, DP)?
  • [Y] Does the trainer support max_steps=-1 and validation?
  • [Y] Does the trainer only call APIs defined in alignable_interface.py?
  • [&] Does the trainer have proper logging?

@github-actions bot added the documentation (Improvements or additions to documentation), Utils, and Algorithms labels Oct 23, 2024
@abukharin3 abukharin3 changed the title Reinforce pr REINFORCE Oct 23, 2024
@terrykong terrykong changed the title REINFORCE feat: adds REINFORCE algorithm Oct 24, 2024
@zredeaux07 zredeaux07 marked this pull request as ready for review October 24, 2024 18:47
zredeaux07 previously approved these changes Oct 24, 2024

@zredeaux07 left a comment:
LGTM! Just found a few minor nits.

@terrykong (Collaborator) left a comment:

Awesome work!

Please add a functional test for reinforce. You may use the PPO one as a reference.

You also need to rebase onto main to make sure REINFORCE works with the latest state of TRT-LLM (v13).

My biggest concern with this PR is the huge overlap between the models, since almost all the code is shared. I think we should try to create a common class for all models needing generation.
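
To make that suggestion concrete, here is a purely hypothetical sketch (none of these names exist in the repo): a shared base class could own the generation plumbing, leaving each algorithm only its loss.

from abc import ABC, abstractmethod

class GenerationPolicyBase(ABC):
    """Hypothetical shared base for actor models that need rollout generation."""

    def generate_rollouts(self, prompt_batch):
        # The common sampling path (e.g., TRT-LLM-backed generation) would live
        # here once, shared by the PPO and REINFORCE actors instead of duplicated.
        raise NotImplementedError

    @abstractmethod
    def loss(self, rollout_batch):
        # Algorithm-specific: PPO's clipped surrogate vs. REINFORCE's
        # baseline-corrected policy gradient.
        ...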

Review threads (all resolved):

  • docs/user-guide/reinforce.rst
  • examples/nlp/gpt/train_gpt_reinforce_actor.py
  • examples/nlp/gpt/conf/gpt_reinforce_actor.yaml
  • nemo_aligner/algorithms/reinforce.py
  • nemo_aligner/utils/ppo_utils.py
@abukharin3 abukharin3 force-pushed the reinforce-pr branch 2 times, most recently from e851b6d to 01bf36b Compare November 12, 2024 18:00
@terrykong terrykong added the Run CICD Set + un-set to retrigger label Nov 14, 2024
@terrykong (Collaborator) left a comment:

Thanks for addressing, just leaving a list of the outstanding TODOs:

abukharin and others added 8 commits November 22, 2024
@terrykong terrykong added and removed the Run CICD Set + un-set to retrigger label Nov 22, 2024
@terrykong terrykong enabled auto-merge (squash) November 22, 2024 21:42
@terrykong terrykong merged commit 716e503 into NVIDIA:main Nov 22, 2024
16 checks passed