SPG is a new policy gradient algorithm that reduces bias by optimizing sandwiched variational bounds on the reward, and uses a block-wise masking technique to improve training efficiency and stability.
To set up the environment, run:
```bash
conda env create -f env.yml
conda activate spg
```
Then download the base model LLaDA-8B-Instruct into `SAVE_DIR/hf_models/`.
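If you fetch the weights from the Hugging Face Hub, a minimal download sketch could look like the following; the repo id `GSAI-ML/LLaDA-8B-Instruct` and the exact target path are assumptions, so adjust them to your setup:

```bash
# Assumed Hub repo id and local layout; verify both before running.
export SAVE_DIR=/path/to/save_dir
huggingface-cli download GSAI-ML/LLaDA-8B-Instruct \
    --local-dir "$SAVE_DIR/hf_models/LLaDA-8B-Instruct"
```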
The code is inside the `spg` directory. `spg/slurm_scripts` contains the SLURM scripts we used to run the RL experiments on the four benchmarks. You need to change the saving directory `SAVE_DIR` in all the scripts.
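As a hedged launch sketch (the `sed` pattern assumes the scripts set `SAVE_DIR=` as a shell variable, and the script name is a placeholder; use the actual file names in `spg/slurm_scripts`):

```bash
# Point SAVE_DIR at your own storage (assumes the scripts define it as a shell variable).
sed -i 's|^SAVE_DIR=.*|SAVE_DIR=/path/to/save_dir|' spg/slurm_scripts/*.sh
# Submit one RL training job per benchmark (placeholder script name).
sbatch spg/slurm_scripts/<benchmark_script>.sh
```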
Reward dynamics of SPG w/ Mixture during RL training, compared with D1, WD1, and UniGRPO:
The evaluation code is inside the `eval` directory.
- Run the evaluation scripts: `sbatch_eval_llada.sh` for LLaDA-8B-Instruct; `sbatch_eval_llada1.5.sh` for LLaDA-1.5; the files inside `eval_d1` for the d1 baseline; the files inside `eval_eubo` for SPG w/ EUBO; and the files inside `eval_mix` for SPG w/ Mixture. You need to change the saving directory `SAVE_DIR` in all the scripts (see the sketch after this list).
- The evaluation scripts only save the generations; use the parser to calculate accuracy.
- For example, baseline generations are in the `eval_results/eval_results_gsm8k_llada` directory. Use `python parse_and_get_acc.py` to print the accuracy.
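A minimal end-to-end evaluation sketch, assuming the job is submitted from the repo root and that `parse_and_get_acc.py` runs without extra arguments (check the script for any required flags):

```bash
# Evaluate LLaDA-8B-Instruct (edit SAVE_DIR inside the script first).
sbatch eval/sbatch_eval_llada.sh

# Once the job finishes, parse the saved generations and print accuracy.
cd eval
python parse_and_get_acc.py
```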
This codebase is developed on top of d1 (Zhao et al., 2025).
SPG is MIT licensed, as found in the LICENSE file.