From f2484c32221d88661bedbff98c860b48e0b0f9bb Mon Sep 17 00:00:00 2001
From: Xuechen Li <12689993+lxuechen@users.noreply.github.com>
Date: Sat, 2 Dec 2023 14:40:33 -0800
Subject: [PATCH] chore: finalize dpo

---
 README.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/README.md b/README.md
index a711e6a0..8b23b90a 100644
--- a/README.md
+++ b/README.md
@@ -294,6 +294,17 @@ bash examples/scripts/rlhf_quark.sh \
 
 ```
 
+### [Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290)
+
+To replicate our DPO results for the AlpacaFarm evaluation suite, run
+
+```bash
+bash examples/scripts/rlhf_quark.sh \
+  \
+  \
+
+```
+
 ### OpenAI models
 
 To run the OpenAI reference models with our prompts and decoding hyperparameters, run
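
The README section added by this patch links to the DPO paper without summarizing the objective. For orientation, below is a minimal sketch of the DPO loss from Rafailov et al. (2023). It illustrates the method the link describes, not alpaca_farm's internal implementation; the function name, argument names, and the `beta` default are assumptions made for this example.

```python
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(y_w | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log p_theta(y_l | x), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # log p_ref(y_w | x), shape (batch,)
    ref_rejected_logps: torch.Tensor,     # log p_ref(y_l | x), shape (batch,)
    beta: float = 0.1,                    # illustrative default; tune per setup
) -> torch.Tensor:
    """DPO loss: -log sigmoid of the margin between implicit rewards."""
    # Implicit reward of each response: beta * log(p_policy / p_ref).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the chosen response's implicit reward above the rejected one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


if __name__ == "__main__":
    # Toy usage with fake per-sequence log-probabilities for a batch of 4 pairs.
    torch.manual_seed(0)
    fake = lambda: torch.randn(4)
    print(dpo_loss(fake(), fake(), fake(), fake()).item())
```

Each log-probability is the sum over the response tokens given the instruction, and `beta` controls how strongly the policy is kept close to the SFT reference model.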