Adding dpo training #1209
base: main
Conversation
Wow! Thank you so much for your PR! I've been wanting to try this feature for a long time! If possible, could you upload the dataset to Hugging Face, just like mlx-community/wikisql? That way, others can get the process running super quickly, and there's no need to worry about finding datasets in the early stages either! Adjust the command as follows:

python -m mlx_lm.lora \
--model mlx-community/Josiefied-Qwen2.5-0.5B-Instruct-abliterated-v1-4bit \
--train \
- --data /Users/gokdenizgulmez/Desktop/dpo_test_data \
+ --data mlx-community/dpo-dataset \
--iters 100 \
--batch-size 1 \
--num-layers 1 \
--val-batches 1 \
--steps-per-report 1 \
--adapter-path /Users/gokdenizgulmez/Desktop/test-dpo \
--max-seq-length 1024 \
--grad-checkpoint \
--training-mode dpo \
--fine-tune-type lora \
--dpo-loss-type sigmoid \
--beta 0.1 \
--steps-per-eval 50
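Picking up on the dataset suggestion above, here is a minimal sketch (not part of this PR) of how such a preference dataset could be assembled and pushed to the Hub with the Hugging Face datasets library. The field names (prompt/chosen/rejected, the usual DPO convention) and the repo id reused from the command above are assumptions, not something this PR prescribes:

from datasets import Dataset

# Hypothetical preference pairs; prompt/chosen/rejected is the common DPO
# convention, but the exact fields expected by this PR are an assumption here.
pairs = [
    {
        "prompt": "What is the capital of France?",
        "chosen": "The capital of France is Paris.",
        "rejected": "France does not have a capital.",
    },
]

ds = Dataset.from_list(pairs)
# The repo id below just mirrors the placeholder used in the command above.
ds.push_to_hub("mlx-community/dpo-dataset")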
@madroidmaq yea, I’ll do that when I’m home 👍
You can get the dataset via
This is amazing! Thanks @Goekdeniz-Guelmez TOP TOP TOP!
Thanks so much!!! @ivanfioravanti
Indeed. Very long-awaited, considering (TBH) the current architectural brittleness of "the state of the art" in HF-based preference optimization. I've been holding off doing any PO training with those tools until it can be done natively in mlx, and I'm glad we now have PRs for this. Thank you for your service, sir.
@chimezie thanks for the response! I completely agree, DPO training was long overdue :) |
Model:
Prompt:
Before:
After:
Training args:
Output:
@Goekdeniz-Guelmez For this PR and #1210, it would be useful to also report the reward accuracies and margins, since those are the primary measures for preference optimization. See how they are calculated in HF's trl DPO trainer, for example.
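For context, here is a minimal sketch (in MLX, not code from this PR or from trl) of how those two metrics are commonly derived, following the trl convention that the implicit reward is the beta-scaled log-probability ratio between the policy and the reference model; all function and variable names here are assumptions for illustration:

import mlx.core as mx

def dpo_reward_metrics(policy_chosen_logps, policy_rejected_logps,
                       ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: beta-scaled log-ratio of policy vs. reference
    # log-probabilities for the chosen and rejected completions.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Reward accuracy: fraction of pairs where the chosen completion
    # ends up with the higher implicit reward.
    accuracy = mx.mean((chosen_rewards > rejected_rewards).astype(mx.float32))

    # Reward margin: average gap between chosen and rejected rewards.
    margin = mx.mean(chosen_rewards - rejected_rewards)
    return accuracy, margin

Logging these alongside the DPO loss at each --steps-per-report interval would surface the signal this comment is asking for.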
You rock @Goekdeniz-Guelmez 🔥
Training:
Output: