SPG is a new policy gradient algorithm that reduces bias by optimizing sandwiched variational bounds on the reward, and uses a block-wise masking technique to improve training efficiency and stability.
To set up the environment, run:
```bash
conda env create -f env.yml
conda activate spg
```
Then download the base model LLaDA-8B-Instruct into `SAVE_DIR/hf_models/`.
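If you fetch the weights from the Hugging Face Hub, a minimal download sketch could look like the following; the repo id `GSAI-ML/LLaDA-8B-Instruct` and the exact target path are assumptions, so adjust them to your setup:

```bash
# Assumed Hub repo id and local layout; verify both before running.
export SAVE_DIR=/path/to/save_dir
huggingface-cli download GSAI-ML/LLaDA-8B-Instruct \
    --local-dir "$SAVE_DIR/hf_models/LLaDA-8B-Instruct"
```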
The code is inside the `spg` directory. `spg/slurm_scripts` contains the SLURM scripts we used to run the RL experiments on the four benchmarks. You need to change the saving directory `SAVE_DIR` in all the scripts.
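As a hedged launch sketch (the `sed` pattern assumes the scripts set `SAVE_DIR=` as a shell variable, and the script name is a placeholder; use the actual file names in `spg/slurm_scripts`):

```bash
# Point SAVE_DIR at your own storage (assumes the scripts define it as a shell variable).
sed -i 's|^SAVE_DIR=.*|SAVE_DIR=/path/to/save_dir|' spg/slurm_scripts/*.sh
# Submit one RL training job per benchmark (placeholder script name).
sbatch spg/slurm_scripts/<benchmark_script>.sh
```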
Reward dynamics of SPG w/ Mixture during RL training, compared with D1, WD1, and UniGRPO:
The evaluation code is inside the `eval` directory.
- Run the evaluation scripts: `sbatch_eval_llada.sh` for LLaDA-8B-Instruct; `sbatch_eval_llada1.5.sh` for LLaDA-1.5; the files inside `eval_d1` for the d1 baseline; the files inside `eval_eubo` for SPG w/ EUBO; and the files inside `eval_mix` for SPG w/ Mixture. You need to change the saving directory `SAVE_DIR` in all the scripts (see the sketch after this list).
- The evaluation scripts only save the generations; use the parser to calculate accuracy.
- For example, baseline generations are in the `eval_results/eval_results_gsm8k_llada` directory. Use `python parse_and_get_acc.py` to print the accuracy.
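A minimal end-to-end evaluation sketch, assuming the job is submitted from the repo root and that `parse_and_get_acc.py` runs without extra arguments (check the script for any required flags):

```bash
# Evaluate LLaDA-8B-Instruct (edit SAVE_DIR inside the script first).
sbatch eval/sbatch_eval_llada.sh

# Once the job finishes, parse the saved generations and print accuracy.
cd eval
python parse_and_get_acc.py
```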
This codebase is developed on top of d1 (Zhao et al., 2025).
SPG is MIT licensed, as found in the LICENSE file.