This DPO documentation page suggests that the best way to merge adapters is to merge them into a quantized model and then dequantize it, citing this tweet, which offers only anecdotal evidence (no measurements).
This goes against common intuition: QLoRA trains high-precision adapters to compensate for quantization losses, so merging them back into a heavily quantized model is counterintuitive. It also goes against some more in-depth investigations and does not mention principled research approaches such as QALoRA.
Moreover, the script linked as an example does the opposite of what the tweet recommends: it first DEquantizes the model and only then merges (roughly the order sketched below). Maybe there was simply a typo in the referenced tweet.
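For reference, a minimal sketch of that "dequantize first, then merge" order, using standard transformers/PEFT calls. The model and adapter identifiers here are placeholders, not the ones from the linked script:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_id = "base-model-id"             # hypothetical base model checkpoint
adapter_id = "path/to/qlora-adapter"  # hypothetical QLoRA adapter

# Load the base model in high precision (no 4-bit quantization config),
# so the merge happens against full-precision weights.
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

# Attach the LoRA adapter and fold it into the base weights.
model = PeftModel.from_pretrained(base, adapter_id)
merged = model.merge_and_unload()

merged.save_pretrained("merged-model")
```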
I also don't fully understand where the 1-2% performance-loss figure in the next sentence comes from. It may be a coincidence, but it is similar to the reported advantage of QALoRA over regular merging, and QALoRA is not the same thing as what the tweet suggests.
I submitted a pull request with a slightly revised version of this and the next paragraphs (#1325).