diff --git a/docs/source/api_ref_modules.rst b/docs/source/api_ref_modules.rst
index cc9a493147..f360b4f02c 100644
--- a/docs/source/api_ref_modules.rst
+++ b/docs/source/api_ref_modules.rst
@@ -71,6 +71,7 @@ PEFT Components
     :nosignatures:
 
     peft.LoRALinear
+    peft.DoRALinear
     peft.AdapterModule
     peft.get_adapter_params
     peft.set_trainable_params
diff --git a/docs/source/recipes/lora_finetune_single_device.rst b/docs/source/recipes/lora_finetune_single_device.rst
index 83d7a385c0..4b4d476058 100644
--- a/docs/source/recipes/lora_finetune_single_device.rst
+++ b/docs/source/recipes/lora_finetune_single_device.rst
@@ -44,6 +44,7 @@ see our documentation for the different PEFT training paradigms we support:
 
 * :ref:`glossary_lora`
 * :ref:`glossary_qlora`
+* :ref:`glossary_dora`
 
 Many of our other memory optimization features can be used in this recipe. You can learn more about all of our memory
 optimization features in our :ref:`memory optimization overview`.
diff --git a/docs/source/tutorials/memory_optimizations.rst b/docs/source/tutorials/memory_optimizations.rst
index 04644093a9..cc71df53ae 100644
--- a/docs/source/tutorials/memory_optimizations.rst
+++ b/docs/source/tutorials/memory_optimizations.rst
@@ -21,6 +21,7 @@ To make things easy, we've summarized these components in the following table:
 
   ":ref:`glossary_opt_in_bwd`", "Helps reduce memory usage when using stateful optimizers, particularly when full-finetuning large models with high gradient memory usage. This is not compatible with ``gradient_accumulation_steps``, so training may slow down due to reduced model throughput."
   ":ref:`glossary_lora`", "When you want to significantly reduce the number of trainable parameters, saving gradient and optimizer memory during training, and significantly speeding up training."
   ":ref:`glossary_qlora`", "When you need even more memory savings than LoRA, at the potential cost of some training speed. Useful for very large models or limited hardware."
+  ":ref:`glossary_dora`", "Like LoRA, DoRA can provide significant memory savings and training speed-ups. DoRA may improve performance over LoRA, particularly at low ranks."
 
 .. note::
@@ -108,7 +109,7 @@ checkpointing, where all activations will either be recomputed later in the back
 
 To enable activation offloading, use the ``enable_activation_offloading`` config entry or flag
 in our lora finetuning single device recipe, e.g. ``enable_activation_offloading=True``. To allow
-usage of streams, make sure you are on a torch version later than PyTorch 2.5.0.dev20240907.
+usage of streams, make sure you are on a torch version later than PyTorch 2.5.0.
 
 .. _glossary_grad_accm:
 
@@ -278,6 +279,7 @@ These are all specified under the ``model`` flag or config entry, i.e:
 
 .. code-block:: yaml
 
   model:
+    _component_: torchtune.models.llama3.lora_llama3_8b
     apply_lora_to_mlp: True
-    model.lora_attn_modules: ["q_proj", "k_proj", "v_proj"]
+    lora_attn_modules: ["q_proj", "k_proj", "v_proj"]
@@ -292,7 +294,24 @@ Secondly, parameters which control the scale of the impact of LoRA on the model:
   to your specific use case. Typically, one jointly changes ``lora_rank`` and ``lora_alpha`` together, where ``lora_alpha ~= 2*lora_rank``.
 * ``lora_dropout`` introduces dropout in the LoRA layers to help regularize training. We default to 0.0 for all of our models.
 
-As above, these parameters are also specified under the ``model`` flag or config entry.
+As above, these parameters are also specified under the ``model`` flag or config entry:
+
+.. code-block:: bash
+
+    tune run lora_finetune_single_device --config llama3/8B_lora_single_device \
+    model.apply_lora_to_mlp=True \
+    model.lora_attn_modules=["q_proj","k_proj","v_proj"] \
+    model.lora_rank=32 \
+    model.lora_alpha=64
+
+.. code-block:: yaml
+
+  model:
+    _component_: torchtune.models.llama3.lora_llama3_8b
+    apply_lora_to_mlp: True
+    lora_attn_modules: ["q_proj", "k_proj", "v_proj"]
+    lora_rank: 32
+    lora_alpha: 64
 
 .. note::
@@ -323,18 +342,98 @@ You can finetune using QLoRA with any of our LoRA recipes, i.e. recipes with the
 QLoRA-enabled model builders, which we support for all our models, and also use the ``qlora_`` prefix, e.g.
 the :func:`torchtune.models.llama3.llama3_8b` model has a corresponding :func:`torchtune.models.llama3.qlora_llama3_8b`.
 We aim to provide a comprehensive set of configurations to allow you to get started with training with QLoRA quickly,
-just specify any config with ``_qlora`` in its name, e.g:
+just specify any config with ``_qlora`` in its name.
+All the rest of the LoRA parameters remain the same for QLoRA - check out the section above on :ref:`LoRA <glossary_lora>`
+to see how to configure these parameters.
+
+To configure from the command line:
 
 .. code-block:: bash
 
-    tune run lora_finetune_single_device --config llama3/8B_qlora_single_device
+    tune run lora_finetune_single_device --config llama3/8B_qlora_single_device \
+    model.apply_lora_to_mlp=True \
+    model.lora_attn_modules=["q_proj","k_proj","v_proj"] \
+    model.lora_rank=32 \
+    model.lora_alpha=64
+
 
-All the rest of the LoRA parameters remain the same for QLoRA - check out the section above on :ref:`LoRA <glossary_lora>`
-to see how to configure.
+or, by modifying a config:
+
+.. code-block:: yaml
+
+  model:
+    _component_: torchtune.models.llama3.qlora_llama3_8b
+    apply_lora_to_mlp: True
+    lora_attn_modules: ["q_proj", "k_proj", "v_proj"]
+    lora_rank: 32
+    lora_alpha: 64
+
+.. _glossary_dora:
+
+Weight-Decomposed Low-Rank Adaptation (DoRA)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+*What's going on here?*
+
+`DoRA <https://arxiv.org/abs/2402.09353>`_ is another PEFT technique which builds on top of LoRA by
+further decomposing the pre-trained weights into two components: magnitude and direction. The magnitude component
+is a learnable vector that adjusts the scale of each output dimension, while the direction component corresponds to
+the original LoRA decomposition and updates the orientation of the weights.
+
+DoRA adds a small overhead to LoRA training due to the addition of the magnitude parameter, but it has been shown to
+improve the performance of LoRA, particularly at low ranks.
+
+*Sounds great! How do I use it?*
+
+Much like LoRA and QLoRA, you can finetune using DoRA with any of our LoRA recipes. We use the same model builders for LoRA
+as we do for DoRA, so you can use the ``lora_`` version of any model builder with ``use_dora=True``. For example, to finetune
+:func:`torchtune.models.llama3.llama3_8b` with DoRA, you would use :func:`torchtune.models.llama3.lora_llama3_8b` with ``use_dora=True``:
+
+.. code-block:: bash
+
+    tune run lora_finetune_single_device --config llama3/8B_lora_single_device \
+    model.use_dora=True
+
+.. code-block:: yaml
+
+  model:
+    _component_: torchtune.models.llama3.lora_llama3_8b
+    use_dora: True
+
+Since DoRA extends LoRA, the parameters for :ref:`customizing LoRA <glossary_lora>` are identical. You can also quantize the
+base model weights like in :ref:`glossary_qlora` by using ``quantize_base=True`` to reap even more memory savings!
+
+.. code-block:: bash
+
+    tune run lora_finetune_single_device --config llama3/8B_lora_single_device \
+    model.apply_lora_to_mlp=True \
+    model.lora_attn_modules=["q_proj","k_proj","v_proj"] \
+    model.lora_rank=16 \
+    model.lora_alpha=32 \
+    model.use_dora=True \
+    model.quantize_base=True
+
+.. code-block:: yaml
+
+  model:
+    _component_: torchtune.models.llama3.lora_llama3_8b
+    apply_lora_to_mlp: True
+    lora_attn_modules: ["q_proj", "k_proj", "v_proj"]
+    lora_rank: 16
+    lora_alpha: 32
+    use_dora: True
+    quantize_base: True
+
+.. note::
+
+   Under the hood, we've enabled DoRA by adding the :class:`~torchtune.modules.peft.DoRALinear` module, which we swap
+   in for :class:`~torchtune.modules.peft.LoRALinear` when ``use_dora=True``.
 
 .. _glossary_distrib:
 
+
 .. TODO
 
 .. Distributed
diff --git a/torchtune/modules/peft/dora.py b/torchtune/modules/peft/dora.py
index 153b3c78e1..52ad9c7321 100644
--- a/torchtune/modules/peft/dora.py
+++ b/torchtune/modules/peft/dora.py
@@ -18,15 +18,14 @@ class DoRALinear(nn.Module, AdapterModule):
-    """LoRA linear layer as introduced in `LoRA: Low-Rank Adaptation of Large Language Models <https://arxiv.org/abs/2106.09685>`_.
-
-    LoRA perturbs a given layer via a low-rank approximation where only
-    the rank decomposition matrices are trainable. In a linear layer instead of
-    :math:`x \\mapsto W_0x` a LoRALinear layer is defined as
-    :math:`x \\mapsto W_0x + (\\alpha / r)BAx`, where :math:`r` is the rank of
-    the matrices :math:`A` and :math:`B` and :math:`\\alpha` is a scaling factor.
-    As in the original implementation, we support dropout before multiplication
-    by the low-rank matrices.
+    """DoRA linear layer as introduced in
+    `DoRA: Weight-Decomposed Low-Rank Adaptation of Large Language Models <https://arxiv.org/abs/2402.09353>`_.
+
+    DoRA (Weight-Decomposed Low-Rank Adaptation) fine-tunes a layer by decomposing the pre-trained weights
+    into two components: magnitude and direction. The magnitude component is a learnable vector that
+    scales each output channel, while the direction component, modified via LoRA, adjusts the orientation
+    of the weights. The magnitude vector rescales each output channel of the normalized, LoRA-updated weight,
+    allowing the model to apply distinct scaling adjustments across different output dimensions.
 
     Args:
         in_dim (int): input dimension
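
To make the magnitude/direction decomposition described in the ``DoRALinear`` docstring concrete, here is a minimal,
self-contained PyTorch sketch of a DoRA-style linear layer. It is illustrative only: the class name ``DoRASketchLinear``,
the initialization choices, and the omission of dropout are assumptions for the example, not torchtune's actual
``DoRALinear`` implementation.

.. code-block:: python

    # Minimal sketch of a DoRA-style linear layer (hypothetical, illustration only;
    # this is NOT torchtune's DoRALinear). It follows the decomposition
    #   W' = m * (W0 + (alpha / r) * B @ A) / ||W0 + (alpha / r) * B @ A||,
    # where the norm is taken per output channel and m is a learnable magnitude vector.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class DoRASketchLinear(nn.Module):
        def __init__(self, in_dim: int, out_dim: int, rank: int, alpha: float):
            super().__init__()
            self.scaling = alpha / rank
            # Frozen pre-trained weight W0, shape (out_dim, in_dim)
            self.weight = nn.Parameter(torch.randn(out_dim, in_dim), requires_grad=False)
            # Trainable low-rank factors: A (rank, in_dim) and B (out_dim, rank); B starts at zero
            self.lora_a = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
            self.lora_b = nn.Parameter(torch.zeros(out_dim, rank))
            # Trainable magnitude, initialized to the per-output-channel norm of W0
            self.magnitude = nn.Parameter(self.weight.norm(p=2, dim=1))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Direction: frozen weight plus the scaled low-rank update
            combined = self.weight + self.scaling * (self.lora_b @ self.lora_a)
            # Normalize each output channel, then rescale it by the learned magnitude
            direction = combined / combined.norm(p=2, dim=1, keepdim=True)
            return F.linear(x, self.magnitude.unsqueeze(-1) * direction)


    layer = DoRASketchLinear(in_dim=16, out_dim=8, rank=4, alpha=8.0)
    print(layer(torch.randn(2, 16)).shape)  # torch.Size([2, 8])

At initialization the low-rank update is zero and the magnitude equals the norm of the frozen weight, so the sketch
reproduces the original layer's output exactly, matching the behavior the DoRA paper describes.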