Optimizing Preference Alignment with Differentiable NDCG Ranking

This repository contains the implementation of the paper "Optimizing Preference Alignment with Differentiable NDCG Ranking".

The full code will be available soon; we are currently organizing and refining it, and some portions have already been uploaded.

Overview

This project introduces a novel approach to preference alignment based on a differentiable NDCG (diffNDCG) ranking objective. Our method enables end-to-end training of ranking models while directly optimizing NDCG, leading to improved performance on preference alignment tasks.
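For reference, NDCG (Normalized Discounted Cumulative Gain) is the standard listwise ranking metric; in one common form, for a ranking of k responses with relevance labels rel_i,

    DCG@k  = sum over i = 1..k of (2^{rel_i} - 1) / log2(i + 1)
    NDCG@k = DCG@k / IDCG@k

where IDCG@k is the DCG of the ideal ordering. The metric itself is non-differentiable because it depends on a discrete sort; diffNDCG stands in as a differentiable surrogate so that NDCG can be optimized end to end with gradient descent, as described above.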

Prerequisites

Installation

To set up the project environment:

  1. Clone the repository:
  2. Install the required dependencies; tested versions are listed below, and a setup sketch follows.
transformers                      4.44.0
accelerate                        0.33.0
bitsandbytes                      0.41.0
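
A minimal setup sketch, assuming a standard pip-based environment (adjust for conda or other environment managers):

    # clone the repository
    git clone https://github.com/choucaicai/drpo-align.git
    cd drpo-align

    # install the pinned dependencies listed above
    pip install transformers==4.44.0 accelerate==0.33.0 bitsandbytes==0.41.0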

Usage

Training

  1. The trained SFT model can be found on Hugging Face.
  2. Place the downloaded dataset in the data folder for the default configuration. If you store the dataset in a different directory, modify the corresponding path in src/load_data.py, for example:
HH_RANK_BASE = "your path"
  3. To train the model, run (see the sketch after this list):
bash train.sh
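
A hedged end-to-end sketch of the training workflow; the dataset location and path below follow the defaults described above, and everything else is a placeholder rather than an official name:

    # 1. place the downloaded preference dataset under data/ (default location)
    mkdir -p data

    # 2. if the dataset lives elsewhere, point src/load_data.py at it, e.g.
    #    HH_RANK_BASE = "/path/to/your/dataset"

    # 3. launch training with the provided script
    bash train.sh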

Reward Model Evaluation

  1. To evaluate a trained model, first generate sample responses:
  • cd eval

  • Sampling with the Hugging Face (hf) backend:

    accelerate launch --main_process_port 29500 eval.py \
        --config-path="eval/eval_configs/hf_config" \
        ++mode="sample" \
        ++datasets="[hh_sample512]" \
        ++n_samples=512 \
        ++model.eval_batch_size=4 \
        ++samples_dir="samples/" \
        ++exp_name="model_name" \
        ++model.name_or_path=model_path_name
  • Sampling with the vLLM backend:
    python eval.py \
        --config-path="eval_configs/vllm_config_t1.0" \
        ++mode="vllm_sample" \
        ++datasets="[hh_sample512]" \
        ++n_samples=512 \
        ++model.eval_batch_size=4 \
        ++samples_dir="samples/" \
        ++exp_name="model_name" \
        ++model.name_or_path=model_path_name
  2. Calculate the win rate using the reward model (see the sketch after this list):
  • Download the reward model from Hugging Face.

  • Execute the command python tools/compare_reward.py --samples_file "$file" --output_path "$output_result" to compare sample responses with chosen responses.

  • Execute the command python tools/compare_reward_sft.py --samples_file "$file" --sft_file "$sft_file" --output_path "$output_result" to compare sample responses with other responses such as SFT samples.
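
A hedged sketch of the reward-model comparison step; the sample and output paths below are placeholders for your own files, not paths shipped with the repository:

    cd eval
    file="samples/model_name/sample_outputs.json"     # sampled responses (placeholder)
    sft_file="samples/sft/sample_outputs.json"        # SFT samples to compare against (placeholder)
    output_result="results/reward_winrate.json"       # where to store the win-rate results (placeholder)

    # win rate of sampled responses vs. the dataset's chosen responses
    python tools/compare_reward.py --samples_file "$file" --output_path "$output_result"

    # win rate of sampled responses vs. another set of samples (e.g. SFT outputs)
    python tools/compare_reward_sft.py --samples_file "$file" --sft_file "$sft_file" --output_path "$output_result"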

GPT Evaluation

  1. Modify eval/tools/api_client.py with your OpenAI API credentials:
api_key=  "Your Openai Key"
api_url=  "base url"
  2. Execute the command python tools/compare.py -f "$file" -mc 512 -bk chosen -ck policy --r "$result_file" --judge "$judge_bot" (see the sketch after this list), where:
    • $file is the path to the file containing sampled responses.
    • $result_file specifies the location where the results will be saved.
    • $judge_bot selects the model used for judging; "gpt-4o" is the default.
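
A hedged example invocation with placeholder paths and the default gpt-4o judge:

    cd eval
    file="samples/model_name/sample_outputs.json"   # sampled responses (placeholder)
    result_file="results/gpt_eval.json"             # output location (placeholder)
    python tools/compare.py -f "$file" -mc 512 -bk chosen -ck policy --r "$result_file" --judge "gpt-4o"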

Project Structure

drpo-align
│
├── data/          # Training data
├── eval/          # Evaluation scripts
├── recipes/       # Training configs
├── scripts/       # Training scripts
└── src/
    ├── load_data/   # Data loading and preprocessing
    ├── losses/      # LTR losses
    ├── rank_utils/  # LTR utilities
    ├── scores/      # Score definitions
    └── trainer/     # Listwise trainers
