
This is the unofficial repository for the book: Large Language Models: Apply and Implement Strategies for Large Language Models (Apress). The book is based on the content of this repository, but the notebooks are being updated, and I am incorporating new examples and chapters. If you are looking for the official repository for the book, with the original notebooks, you should visit the Apress repository, where you can find all the notebooks in their original format as they appear in the book. Buy it at: [Amazon] [Springer]

In this small project we will align two models, a Microsoft Phi-3 model and gemma-2b-it, with DPO and then publish the resulting models to Hugging Face.

Alignment is usually the final step in creating a model, after fine-tuning. Many people believe that the true revolution of GPT-3.5 was due to the alignment process that OpenAI carried out: Reinforcement Learning from Human Feedback, or RLHF.

RLHF proved to be a highly effective technique for controlling the model's responses, and at first it seemed to be the price any model had to pay to compete with GPT-3.5.

Recently, RLHF has been displaced by a technique that achieves the same result far more efficiently: DPO, or Direct Preference Optimization.

The implementation of DPO that we'll be using in the notebooks is the one developed by Hugging Face in their TRL library, which stands for Transformer Reinforcement Learning. DPO can be considered a reinforcement learning technique, where the model is rewarded during its training phase based on its responses.
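In a nutshell, and as described in the original DPO paper, the loss compares how much more likely the model being trained ($\pi_\theta$) makes the chosen response $y_w$ than the rejected response $y_l$ for a prompt $x$, relative to a frozen reference model $\pi_{\mathrm{ref}}$, with $\beta$ controlling how far the trained model may drift from that reference:

$$
\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$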

This library greatly simplifies the implementation of DPO. All you have to do is specify the model you want to fine-tune and provide it with a dataset in the necessary format.

The dataset to be used should have three columns (an example row is shown after the list):

  • Prompt: The prompt used.
  • Chosen: The desired response.
  • Rejected: An undesired response.
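Purely as an illustration, a single row in that format could look like the snippet below (the prompt and responses are invented placeholders, not taken from the actual dataset):

```python
# Illustrative example of one row in the DPO dataset format.
example = {
    "prompt": "Explain overfitting in one sentence.",
    "chosen": "Overfitting happens when a model memorizes the training data "
              "and fails to generalize to new examples.",
    "rejected": "Overfitting is when a model is trained for too many epochs.",
}
```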

The dataset selected for this example is argilla/distilabel-capybara-dpo-7k-binarized.
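To give an idea of how little code is involved, a training run could look roughly like the sketch below. The output directory, repository id, and hyperparameters are illustrative choices rather than the exact ones used in the notebooks, and the DPOTrainer argument names vary slightly between TRL versions.

```python
# Minimal DPO training sketch with Hugging Face TRL (hyperparameters and
# paths are illustrative; argument names can differ between TRL versions).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "google/gemma-2b-it"  # the model we want to align
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference dataset; depending on the revision you may need to map its
# columns to plain "prompt", "chosen" and "rejected" fields first.
dataset = load_dataset("argilla/distilabel-capybara-dpo-7k-binarized", split="train")

training_args = DPOConfig(
    output_dir="./gemma-2b-it-dpo",   # illustrative output path
    beta=0.1,                         # how strongly preferences are enforced
    per_device_train_batch_size=1,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,                      # a frozen reference copy is created internally
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,       # called `tokenizer=` in older TRL versions
)
trainer.train()

# Once trained, the aligned model can be pushed to the Hugging Face Hub, e.g.:
# trainer.model.push_to_hub("your-username/gemma-2b-it-dpo")  # hypothetical repo id
```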

If you'd like to take a look at the models resulting from two hours of training on an A100 GPU with the Argilla dataset, you can do so on their Hugging Face pages.

| Article | Notebook |
| ------- | -------- |
| WIP | Aligning a Phi-3 model with DPO. |
|  | Aligning a Gemma model with DPO. |