Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment

Alignment is underspecified with regard to preference data and training objectives. We tackle this along two predominant axes: alignment data and alignment algorithms.

First, we introduce Contrastive Learning from AI Revisions (CLAIR). CLAIR uses a secondary AI system to minimally revise a solution A→A’ such that the resulting preference A < A’ is much more contrastive and precise.
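
To make the data side concrete, here is a minimal Python sketch of the CLAIR loop; the revise_fn callable, the prompt wording, and the field names are illustrative assumptions, not the exact implementation used in the reference notebook below.

def clair_pair(prompt: str, answer: str, revise_fn) -> dict:
    """Build one contrastive preference pair by minimally revising `answer`.

    `revise_fn` is any callable that maps an instruction string to a revised
    answer, e.g. a thin wrapper around a stronger API model.
    """
    instruction = (
        "Minimally revise the following answer so it better addresses the prompt. "
        "Return only the revised answer.\n"
        f"Prompt: {prompt}\n"
        f"Answer: {answer}"
    )
    revised = revise_fn(instruction)
    # The original answer becomes the rejected response and its minimal revision
    # the chosen one, so the pair differs mainly along the revised, relevant axis.
    return {"prompt": prompt, "chosen": revised, "rejected": answer}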

Second, we introduce Anchored Preference Optimization (APO). APO uses simple constraints during training to account for the relationship between the model and preference data.
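
For intuition, here is a minimal PyTorch-style sketch of the paired APO-zero objective as formulated in the paper; the canonical implementation is the one in TRL, and the log-ratio inputs follow the usual DPO convention.

import torch

def apo_zero_loss(chosen_logratio: torch.Tensor,
                  rejected_logratio: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """APO-zero: push chosen likelihood up and rejected likelihood down,
    each anchored at the reference model (a log-ratio of zero).

    chosen_logratio   = log pi_theta(y_w | x) - log pi_ref(y_w | x)
    rejected_logratio = log pi_theta(y_l | x) - log pi_ref(y_l | x)
    """
    loss_chosen = 1 - torch.sigmoid(beta * chosen_logratio)  # increase chosen likelihood
    loss_rejected = torch.sigmoid(beta * rejected_logratio)  # decrease rejected likelihood
    return (loss_chosen + loss_rejected).mean()

Roughly speaking, APO-zero fits the case where the chosen outputs are better than what the model already produces, while APO-down targets the opposite case.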

Figure: Contrastive Learning from AI Revisions (A) and Anchored Preference Optimization (B).

A: Preference pairs can vary along irrelevant axes; Contrastive Learning from AI Revisions (CLAIR) creates a targeted preference signal instead. B: The quality of the model can impact alignment training; Anchored Preference Optimization (APO) explicitly accounts for this.

Compared to conventional methods, we’ve observed a ~2x performance boost on MixEval-Hard for continued alignment of Llama-3-8B-Instruct.

Figure: CLAIR and APO performance boost.

Contrastive Learning From AI Revisions (CLAIR)

We provide a reference implementation of CLAIR in this notebook. Results are cached, so you can run it without an API key.

Open in Colab

Anchored Preference Optimization (APO)

APO is integrated into the TRL library. First, install trl; then run either APO-zero (apo_zero) or APO-down (apo_down) using the trl dpo command.

pip install git+https://github.com/huggingface/trl.git
trl dpo \
    --loss_type apo_zero \
    --dataset_name ContextualAI/ultrafeedback_clair_32k \
    --model_name_or_path facebook/opt-125m \
    --output_dir results
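
If you prefer the Python API over the CLI, the following sketch uses trl's DPOTrainer with the same model, dataset, and loss type; the dataset split and the processing_class argument name are assumptions that may differ across trl versions.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
dataset = load_dataset("ContextualAI/ultrafeedback_clair_32k", split="train")  # split name assumed

config = DPOConfig(
    output_dir="results",
    loss_type="apo_zero",  # or "apo_down"
    beta=0.1,
)
trainer = DPOTrainer(
    model=model,                 # the frozen reference model is created internally
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # older trl versions take `tokenizer=` instead
)
trainer.train()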

Unpaired APO (similar to KTO) is coming soon to TRL:

trl kto \
    --loss_type apo_zero_unpaired \
    --dataset_name ContextualAI/ultrafeedback_clair_32k \
    --model_name_or_path facebook/opt-125m \
    --output_dir results

Citation

If you found CLAIR and APO useful, please cite:

@misc{doosterlinck2024anchored,
      title={Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment}, 
      author={Karel D'Oosterlinck and Winnie Xu and Chris Develder and Thomas Demeester and Amanpreet Singh and Christopher Potts and Douwe Kiela and Shikib Mehri},
      year={2024},
      eprint={2408.06266},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2408.06266}, 
}
