From bfe2d2f41dad53d950112b866c6b4cc41b81821b Mon Sep 17 00:00:00 2001
From: MKhalusova
Date: Tue, 18 Apr 2023 08:17:35 -0400
Subject: [PATCH 1/5] WIP LoRA conceptual guide

---
 docs/source/conceptual_guides/lora.mdx | 50 ++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)
 create mode 100644 docs/source/conceptual_guides/lora.mdx

diff --git a/docs/source/conceptual_guides/lora.mdx b/docs/source/conceptual_guides/lora.mdx
new file mode 100644
index 0000000000..a36ee8dab2
--- /dev/null
+++ b/docs/source/conceptual_guides/lora.mdx
@@ -0,0 +1,50 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# LoRA
+
+This conceptual guide gives a brief overview of [LoRA](https://arxiv.org/abs/2106.09685), a technique that accelerates
+the fine-tuning of large models while consuming less memory.
+
+To make the fine-tuning more efficient, the original model's weight matrix is represented with two smaller
+matrices (called **update matrices**) through low-rank decomposition. These new matrices can be trained to adapt to the
+new data while keeping the overall number of changes low. The original weight matrix remains frozen and doesn't receive
+any further adjustments. To produce the final results, both the original and the adapted weights are combined.
+
+This approach has a number of advantages:
+
+* LoRA makes fine-tuning more efficient by drastically reducing the number of trainable parameters
+* The original pre-trained weights are kept frozen, and you can have many lightweight and portable LoRA models for various downstream tasks built on top of them
+* LoRA is orthogonal to many other parameter-efficient methods and can be combined with many of them
+
+In principle, LoRA can be applied to any subset of weight matrices in a neural network to reduce the number of trainable
+parameters. However, for simplicity and further parameter efficiency, in Transformer models LoRA is typically applied to
+attention blocks only. The number of trainable parameters in a LoRA model depends on the size of the low-rank update
+matrices, which is determined mainly by the rank `r` and the shape of the original weight matrix.
+
+
+
+## Common LoRA parameters in PEFT
+
+- `r`: the rank of the update matrices, expressed in `int`. Lower rank results in smaller update matrices with fewer trainable parameters.
+- `target_modules`: The modules to use as the base build LoRA update matrices. E.g. attention blocks.
+- `lora_alpha`: LoRA scaling factor.
+- `bias`: Specifies if the `bias` parameters should be trained. Can be `'none'`, `'all'` or `'lora_only'`.
+- `modules_to_save`: List of modules apart from LoRA layers to be set as trainable and saved in the final checkpoint. These typically include the model's custom head that is randomly initialized for the fine-tuning task.
+
+## LoRA examples
+
+Image classification
+Semantic segmentation
+
+While the original paper focuses on language models, the technique can be applied to any dense layers in deep learning
+models. As such, you can also apply this technique to diffusion models.
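The low-rank update introduced in this patch is compact enough to sketch in a few lines. Below is a minimal, self-contained NumPy illustration of the idea; the shapes, rank, and scaling factor are illustrative assumptions, not PEFT internals.

```python
import numpy as np

d, k, r = 1024, 1024, 8            # original weight is d x k; rank r << min(d, k)

W = np.random.randn(d, k)          # pre-trained weight: stays frozen
A = np.random.randn(r, k) * 0.01   # update matrix A: trainable
B = np.zeros((d, r))               # update matrix B: trainable, zero-initialized so
                                   # training starts from the unmodified base model
lora_alpha = 16                    # scaling factor applied to the update

# Combining the original and the adapted weights produces the final result:
W_adapted = W + (lora_alpha / r) * (B @ A)

print(f"full fine-tuning trains {W.size:,} parameters")        # 1,048,576
print(f"LoRA at r={r} trains {A.size + B.size:,} parameters")  # 16,384
```

Since `A` and `B` together hold only `r * (d + k)` values, the number of trainable parameters grows linearly with the rank `r` rather than with the full weight shape.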
From 47bf02395eb55d8d0681575ae20028b74e45fcc1 Mon Sep 17 00:00:00 2001
From: MKhalusova
Date: Tue, 18 Apr 2023 13:40:26 -0400
Subject: [PATCH 2/5] conceptual guide for LoRA

---
 docs/source/_toctree.yml               |  5 +++++
 docs/source/conceptual_guides/lora.mdx | 29 +++++++++++++++++---------
 2 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 8090d858cd..ac20d9b1c1 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -20,6 +20,11 @@
   - local: task_guides/ptuning-seq-classification
     title: P-tuning for sequence classification
 
+- title: Conceptual guides
+  sections:
+  - local: conceptual_guides/lora
+    title: LoRA
+
 - title: Reference
   sections:
   - local: package_reference/peft_model

diff --git a/docs/source/conceptual_guides/lora.mdx b/docs/source/conceptual_guides/lora.mdx
index a36ee8dab2..afb33734f1 100644
--- a/docs/source/conceptual_guides/lora.mdx
+++ b/docs/source/conceptual_guides/lora.mdx
@@ -15,25 +15,32 @@ specific language governing permissions and limitations under the License.
 This conceptual guide gives a brief overview of [LoRA](https://arxiv.org/abs/2106.09685), a technique that accelerates
 the fine-tuning of large models while consuming less memory.
 
-To make the fine-tuning more efficient, the original model's weight matrix is represented with two smaller
+To make the fine-tuning more efficient, LoRA's approach is to represent the original model's weight matrix with two smaller
 matrices (called **update matrices**) through low-rank decomposition. These new matrices can be trained to adapt to the
 new data while keeping the overall number of changes low. The original weight matrix remains frozen and doesn't receive
 any further adjustments. To produce the final results, both the original and the adapted weights are combined.
 
 This approach has a number of advantages:
 
-* LoRA makes fine-tuning more efficient by drastically reducing the number of trainable parameters
-* The original pre-trained weights are kept frozen, and you can have many lightweight and portable LoRA models for various downstream tasks built on top of them
-* LoRA is orthogonal to many other parameter-efficient methods and can be combined with many of them
+* LoRA makes fine-tuning more efficient by drastically reducing the number of trainable parameters.
+* The original pre-trained weights are kept frozen, which means you can have multiple lightweight and portable LoRA models for various downstream tasks built on top of them.
+* LoRA is orthogonal to many other parameter-efficient methods and can be combined with many of them.
 
 In principle, LoRA can be applied to any subset of weight matrices in a neural network to reduce the number of trainable
 parameters. However, for simplicity and further parameter efficiency, in Transformer models LoRA is typically applied to
-attention blocks only. The number of trainable parameters in a LoRA model depends on the size of the low-rank update
-matrices, which is determined mainly by the rank `r` and the shape of the original weight matrix.
+attention blocks only. The resulting number of trainable parameters in a LoRA model depends on the size of the low-rank
+update matrices, which is determined mainly by the rank `r` and the shape of the original weight matrix.
 
+## Common LoRA parameters in PEFT
+
+As with other methods supported by PEFT, to fine-tune a model using LoRA, you need to:
 
-## Common LoRA parameters in PEFT
 
+1. Instantiate a base model.
+2. Create a configuration (`LoraConfig`) where you define LoRA-specific parameters.
+3. Wrap the base model with `get_peft_model()` to get a trainable `PeftModel`.
+4. Train the `PeftModel` as you normally would train the base model.
+
+`LoraConfig` allows you to control how LoRA is applied to the base model through the following parameters:
 
 - `r`: the rank of the update matrices, expressed in `int`. Lower rank results in smaller update matrices with fewer trainable parameters.
 - `target_modules`: The modules to use as the base build LoRA update matrices. E.g. attention blocks.
 - `lora_alpha`: LoRA scaling factor.
 - `bias`: Specifies if the `bias` parameters should be trained. Can be `'none'`, `'all'` or `'lora_only'`.
 - `modules_to_save`: List of modules apart from LoRA layers to be set as trainable and saved in the final checkpoint. These typically include the model's custom head that is randomly initialized for the fine-tuning task.
 
 ## LoRA examples
 
-Image classification
-Semantic segmentation
+For examples of applying LoRA to various downstream tasks, refer to the following guides:
+
+* [Image classification using LoRA](../task_guides/image_classification_lora)
+* [Semantic segmentation](../task_guides/semantic_segmentation_lora)
 
 While the original paper focuses on language models, the technique can be applied to any dense layers in deep learning
-models. As such, you can also apply this technique to diffusion models.
+models. As such, you can leverage this technique with diffusion models. See the [Dreambooth fine-tuning with LoRA](../task_guides/dreambooth_lora) task guide for an example.

From 04bbff62104369e47dd88a3da8c4a8c1321bb80c Mon Sep 17 00:00:00 2001
From: Maria Khalusova
Date: Wed, 19 Apr 2023 08:35:18 -0400
Subject: [PATCH 3/5] Update docs/source/conceptual_guides/lora.mdx

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/conceptual_guides/lora.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/conceptual_guides/lora.mdx b/docs/source/conceptual_guides/lora.mdx
index afb33734f1..7e080190f8 100644
--- a/docs/source/conceptual_guides/lora.mdx
+++ b/docs/source/conceptual_guides/lora.mdx
@@ -15,7 +15,7 @@ specific language governing permissions and limitations under the License.
 This conceptual guide gives a brief overview of [LoRA](https://arxiv.org/abs/2106.09685), a technique that accelerates
 the fine-tuning of large models while consuming less memory.
 
-To make the fine-tuning more efficient, LoRA's approach is to represent the original model's weight matrix with two smaller
+To make fine-tuning more efficient, LoRA's approach is to represent the original model's weight matrix with two smaller
 matrices (called **update matrices**) through low-rank decomposition. These new matrices can be trained to adapt to the
 new data while keeping the overall number of changes low. The original weight matrix remains frozen and doesn't receive
 any further adjustments. To produce the final results, both the original and the adapted weights are combined.
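The four-step workflow described in this patch maps directly onto PEFT's API. Here is a minimal sketch of it; the checkpoint and the `target_modules` names are illustrative choices for a seq2seq model, not requirements.

```python
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

# 1. Instantiate a base model.
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-large")

# 2. Create a configuration with LoRA-specific parameters.
config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,                          # rank of the update matrices
    lora_alpha=32,                # LoRA scaling factor
    target_modules=["q", "v"],    # attention query/value projections in this model
    lora_dropout=0.1,
    bias="none",
)

# 3. Wrap the base model to get a trainable PeftModel.
model = get_peft_model(model, config)

# 4. Train as usual; only the LoRA parameters receive gradients.
model.print_trainable_parameters()
```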
From 1dd201a7207e1e3730a13073bd06e0227c30c2fe Mon Sep 17 00:00:00 2001
From: Maria Khalusova
Date: Wed, 19 Apr 2023 08:35:34 -0400
Subject: [PATCH 4/5] Update docs/source/conceptual_guides/lora.mdx

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/conceptual_guides/lora.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/conceptual_guides/lora.mdx b/docs/source/conceptual_guides/lora.mdx
index 7e080190f8..4b191b909b 100644
--- a/docs/source/conceptual_guides/lora.mdx
+++ b/docs/source/conceptual_guides/lora.mdx
@@ -43,7 +43,7 @@ As with other methods supported by PEFT, to fine-tune a model using LoRA, you ne
 `LoraConfig` allows you to control how LoRA is applied to the base model through the following parameters:
 
 - `r`: the rank of the update matrices, expressed in `int`. Lower rank results in smaller update matrices with fewer trainable parameters.
-- `target_modules`: The modules to use as the base build LoRA update matrices. E.g. attention blocks.
+- `target_modules`: The modules (for example, attention blocks) to apply the LoRA update matrices.
 - `lora_alpha`: LoRA scaling factor.
 - `bias`: Specifies if the `bias` parameters should be trained. Can be `'none'`, `'all'` or `'lora_only'`.
 - `modules_to_save`: List of modules apart from LoRA layers to be set as trainable and saved in the final checkpoint. These typically include the model's custom head that is randomly initialized for the fine-tuning task.

From ab025448e8a36cf9f5e4f07ab1fad5df49bb71ad Mon Sep 17 00:00:00 2001
From: MKhalusova
Date: Wed, 19 Apr 2023 08:40:46 -0400
Subject: [PATCH 5/5] feedback addressed

---
 docs/source/conceptual_guides/lora.mdx | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/source/conceptual_guides/lora.mdx b/docs/source/conceptual_guides/lora.mdx
index 4b191b909b..5b18303b9e 100644
--- a/docs/source/conceptual_guides/lora.mdx
+++ b/docs/source/conceptual_guides/lora.mdx
@@ -15,7 +15,7 @@ specific language governing permissions and limitations under the License.
 This conceptual guide gives a brief overview of [LoRA](https://arxiv.org/abs/2106.09685), a technique that accelerates
 the fine-tuning of large models while consuming less memory.
 
-To make fine-tuning more efficient, LoRA's approach is to represent the original model's weight matrix with two smaller
+To make fine-tuning more efficient, LoRA's approach is to represent the weight updates with two smaller
 matrices (called **update matrices**) through low-rank decomposition. These new matrices can be trained to adapt to the
 new data while keeping the overall number of changes low. The original weight matrix remains frozen and doesn't receive
 any further adjustments. To produce the final results, both the original and the adapted weights are combined.
@@ -25,6 +25,8 @@ This approach has a number of advantages:
 * LoRA makes fine-tuning more efficient by drastically reducing the number of trainable parameters.
 * The original pre-trained weights are kept frozen, which means you can have multiple lightweight and portable LoRA models for various downstream tasks built on top of them.
 * LoRA is orthogonal to many other parameter-efficient methods and can be combined with many of them.
+* Performance of models fine-tuned using LoRA is comparable to the performance of fully fine-tuned models.
+* LoRA does not add any inference latency because adapter weights can be merged with the base model.
 
 In principle, LoRA can be applied to any subset of weight matrices in a neural network to reduce the number of trainable
 parameters. However, for simplicity and further parameter efficiency, in Transformer models LoRA is typically applied to
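The inference-latency bullet added in the final patch follows from the update being a plain matrix sum: after training, the adapter can be folded back into the base weights. A short sketch using PEFT's `merge_and_unload()` helper, with a placeholder adapter path and an illustrative base checkpoint:

```python
from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel

# Load the frozen base model and attach trained LoRA adapter weights.
# "path/to/lora-adapter" is a placeholder for your own saved adapter.
base_model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-large")
model = PeftModel.from_pretrained(base_model, "path/to/lora-adapter")

# Fold the low-rank update into the base weights (W + scaled B @ A).
# The result is a plain transformers model with no extra LoRA layers,
# and therefore no additional inference latency.
merged_model = model.merge_and_unload()
```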