
This is the unofficial repository for the book: Large Language Models: Apply and Implement Strategies for Large Language Models (Apress). The book is based on the content of this repository, but the notebooks are being updated, and I am incorporating new examples and chapters. If you are looking for the official repository for the book, with the original notebooks, you should visit the Apress repository, where you can find all the notebooks in their original format as they appear in the book. Buy it at: [Amazon] [Springer]

In this small project we will align two models, a Microsoft Phi-3 model and gemma-2b-it, with DPO and then publish the resulting models to Hugging Face.

Alignment is usually the final step in creating a model, after fine-tuning. Many people believe that the true revolution of GPT-3.5 was due to the alignment process that OpenAI carried out: Reinforcement Learning from Human Feedback, or RLHF.

RLHF proved to be a highly effective technique for controlling the model's responses, and at first it seemed to be the price any model had to pay to compete with GPT-3.5.

Recently, RLHF has been displaced by a technique that achieves the same result far more efficiently: DPO, or Direct Preference Optimization.

The implementation of DPO that we'll be using in the notebooks is the one developed by Hugging Face in their TRL library, which stands for Transformer Reinforcement Learning. DPO can be considered a reinforcement learning technique, where the model is rewarded during its training phase based on its responses.
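In a nutshell, and as described in the original DPO paper, the loss compares how much more likely the model being trained ($\pi_\theta$) makes the chosen response $y_w$ than the rejected response $y_l$ for a prompt $x$, relative to a frozen reference model $\pi_{\mathrm{ref}}$, with $\beta$ controlling how far the trained model may drift from that reference:

$$
\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$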

This library greatly simplifies the implementation of DPO. All you have to do is specify the model you want to fine-tune and provide it with a dataset in the necessary format.

The dataset to be used should have three columns (an example row is shown after the list):

  • Prompt: The prompt used.
  • Chosen: The desired response.
  • Rejected: An undesired response.
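Purely as an illustration, a single row in that format could look like the snippet below (the prompt and responses are invented placeholders, not taken from the actual dataset):

```python
# Illustrative example of one row in the DPO dataset format.
example = {
    "prompt": "Explain overfitting in one sentence.",
    "chosen": "Overfitting happens when a model memorizes the training data "
              "and fails to generalize to new examples.",
    "rejected": "Overfitting is when a model is trained for too many epochs.",
}
```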

The dataset selected for this example is argilla/distilabel-capybara-dpo-7k-binarized.
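To give an idea of how little code is involved, a training run could look roughly like the sketch below. The output directory, repository id, and hyperparameters are illustrative choices rather than the exact ones used in the notebooks, and the DPOTrainer argument names vary slightly between TRL versions.

```python
# Minimal DPO training sketch with Hugging Face TRL (hyperparameters and
# paths are illustrative; argument names can differ between TRL versions).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "google/gemma-2b-it"  # the model we want to align
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference dataset; depending on the revision you may need to map its
# columns to plain "prompt", "chosen" and "rejected" fields first.
dataset = load_dataset("argilla/distilabel-capybara-dpo-7k-binarized", split="train")

training_args = DPOConfig(
    output_dir="./gemma-2b-it-dpo",   # illustrative output path
    beta=0.1,                         # how strongly preferences are enforced
    per_device_train_batch_size=1,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,                      # a frozen reference copy is created internally
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,       # called `tokenizer=` in older TRL versions
)
trainer.train()

# Once trained, the aligned model can be pushed to the Hugging Face Hub, e.g.:
# trainer.model.push_to_hub("your-username/gemma-2b-it-dpo")  # hypothetical repo id
```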

If you'd like to take a look at the models resulting from two hours of training on an A100 GPU with the Argilla dataset, you can do so on their Hugging Face pages.

| Article | Notebook |
| ------- | -------- |
| WIP | Aligning a Phi-3 model with DPO. |
|  | Aligning a Gemma model with DPO. |