Commit befdcc7

update memory optimization tutorial (#1948)
Co-authored-by: Felipe Mello <felipemello@fb.com>
Co-authored-by: Salman Mohammadi <salman.mohammadi@outlook.com>
Co-authored-by: ebsmothers <ebs@meta.com>
1 parent b9d78e3 commit befdcc7

File tree: 1 file changed (+50, -42 lines)


docs/source/tutorials/memory_optimizations.rst

Lines changed: 50 additions & 42 deletions
@@ -14,16 +14,16 @@ To make things easy, we've summarized these components in the following table:
    :header: "Component", "When to use?"
    :widths: auto

-   ":ref:`glossary_precision`", "You'll usually want to leave this as its default ``bfloat16``. If you're struggling with training stability or accuracy due to precision, fp32 may help, but will significantly increase memory usage and decrease training speed."
-   ":ref:`glossary_act_ckpt`", "Use when you're memory constrained and need to handle larger batch sizes or longer context lengths. Be aware that it may slow down training speed."
-   ":ref:`glossary_act_off`", "Similar to activation checkpointing, this can be used when memory constrained, but comes at the cost of training speed due to the overhead of moving tensors between GPU VRAM and CPU. This can also be used alongside activation checkpointing."
-   ":ref:`glossary_grad_accm`", "Helpful when memory-constrained to simulate larger batch sizes. Often preferable to activation checkpointing for better training speed."
-   ":ref:`glossary_low_precision_opt`", "When you need to further reduce memory usage beyond using ``bf16`` by reducing the precision in the optimizer states. Note that lower precision optimizers may reduce training stability/accuracy."
-   ":ref:`glossary_opt_in_bwd`", "Helps reduce memory usage when using stateful optimizers, particularly when full-finetuning large models with high gradient memory usage. This is not compatible with ``gradient_accumulation_steps``, so training may slow down due to reduced model throughput."
-   ":ref:`glossary_cpu_offload`", "Offloads optimizer states and (optionally) gradients to CPU, and performs optimizer steps on CPU. This can be used to significantly reduce GPU memory usage at the cost of CPU RAM and training speed, as CPU optimizer steps can be slow and bottleneck training performance."
-   ":ref:`glossary_lora`", "When you want to significantly reduce the number of trainable parameters, saving gradient and optimizer memory during training, and significantly speeding up training."
-   ":ref:`glossary_qlora`", "When you need even more memory savings than LoRA, at the potential cost of some training speed. Useful for very large models or limited hardware."
-   ":ref:`glossary_dora`", "Like LoRA, DoRA can provide significant memory savings and training speed-ups. DoRA may improve performance over LoRA, particularly when using small rank updates."
+   ":ref:`glossary_precision`", "You'll usually want to leave this as its default ``bfloat16``. It uses 2 bytes per model parameter instead of 4 bytes when using ``float32``."
+   ":ref:`glossary_act_ckpt`", "Use when you're memory constrained and want to use a larger model, batch size, or context length. Be aware that it will slow down training speed."
+   ":ref:`glossary_act_off`", "Similar to activation checkpointing, this can be used when memory constrained, but may decrease training speed. This **should** be used alongside activation checkpointing."
+   ":ref:`glossary_grad_accm`", "Helpful when memory-constrained to simulate larger batch sizes. Not compatible with optimizer in backward. Use it when you can already fit at least one sample without OOMing, but not enough of them to reach your target batch size."
+   ":ref:`glossary_low_precision_opt`", "Use when you want to reduce the size of the optimizer state. This is relevant when training large models and using optimizers with momentum, like Adam. Note that lower precision optimizers may reduce training stability/accuracy."
+   ":ref:`glossary_opt_in_bwd`", "Use it when you have large gradients and can fit a large enough batch size, since this is not compatible with ``gradient_accumulation_steps``."
+   ":ref:`glossary_cpu_offload`", "Offloads optimizer states and (optionally) gradients to CPU, and performs optimizer steps on CPU. This can be used to significantly reduce GPU memory usage at the cost of CPU RAM and training speed. Prioritize using it only if the other techniques are not enough."
+   ":ref:`glossary_lora`", "When you want to significantly reduce the number of trainable parameters, saving gradient and optimizer memory during training, and significantly speeding up training. This may reduce training accuracy."
+   ":ref:`glossary_qlora`", "When you are training a large model, since quantization will save 1.5 bytes * (# of model parameters), at the potential cost of some training speed and accuracy."
+   ":ref:`glossary_dora`", "A variant of LoRA that may improve model performance at the cost of slightly more memory."


.. note::
@@ -83,8 +83,7 @@ and in most cases training can slow down quite a bit as a result of this activat

*Sounds great! How do I use it?*

- To enable activation checkpointing, use the ``enable_activation_checkpointing`` config entry or flag
- in any of our recipes, e.g. ``enable_activation_checkpointing=True``.
+ To enable activation checkpointing, use ``enable_activation_checkpointing=True``.

.. _glossary_act_off:

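As a quick illustration of the wording above, enabling activation checkpointing from the command line might look like the following sketch; the recipe and config names are simply the ones used elsewhere in this tutorial, not a requirement:

.. code-block:: bash

    tune run lora_finetune_single_device --config llama3/8B_lora_single_device \
    enable_activation_checkpointing=True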
@@ -104,14 +103,13 @@ This setting is especially helpful for larger batch sizes, or longer context len
While of course it takes runtime and resources to move Tensors from GPU to CPU and back, the implementation in
torchtune uses multiple CUDA streams (when available) in order to overlap the extra communication with the computation
to hide the extra runtime. As the communication workload is variable depending on the number and size of tensors being
- offloaded, it is common to not offload every single activation. In fact, one can use offloading in conjunction with activations
- checkpointing, where all activations will either be recomputed later in the backward or brought back from the CPU.
+ offloaded, we do not recommend using it unless :ref:`glossary_act_ckpt` is also enabled, in which case only the checkpointed
+ tensors will be offloaded.

*Sounds great! How do I use it?*

- To enable activation offloading, use the ``enable_activation_offloading`` config entry or flag
- in our lora finetuning single device recipe, e.g. ``enable_activation_offloading=True``. To allow
- usage of streams, make sure you are on a torch version later than PyTorch 2.5.0.
+ To enable activation offloading, use ``enable_activation_offloading=True``. If you are on a torch
+ version later than PyTorch 2.5.0, multiple CUDA streams will be used automatically.

.. _glossary_grad_accm:

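Since the updated text recommends pairing offloading with checkpointing, a minimal sketch of enabling both together (same illustrative recipe and config as above) would be:

.. code-block:: bash

    tune run lora_finetune_single_device --config llama3/8B_lora_single_device \
    enable_activation_checkpointing=True \
    enable_activation_offloading=True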
@@ -143,10 +141,8 @@ If you're using one of our distributed recipes, simply multiply by the number of

``total_batch_size = batch_size * gradient_accumulation_steps * num_devices``

- Gradient accumulation is especially useful when you are memory constrained. In this case,
- accumulating gradients might give you better training speed than enabling :ref:`activation
- checkpointing <glossary_act_ckpt>`, since activation checkpointing reduces memory consumption at the cost of repeated
- computations.
+ Gradient accumulation is especially useful when you can fit at least one sample on your GPU. In this case, artificially increasing the batch by
+ accumulating gradients might give you faster training speeds than using other memory optimization techniques that trade off memory for speed, like :ref:`activation checkpointing <glossary_act_ckpt>`.

*Sounds great! How do I use it?*

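As a worked sketch of the formula above, assuming a single-device full finetune config that exposes ``batch_size`` and ``gradient_accumulation_steps`` (the recipe and config names below are illustrative), an effective batch size of 32 could be set up like this:

.. code-block:: bash

    # total_batch_size = 8 * 4 * 1 device = 32 samples per optimizer step
    tune run full_finetune_single_device --config llama3/8B_full_single_device \
    batch_size=8 \
    gradient_accumulation_steps=4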
@@ -168,25 +164,35 @@ Lower Precision Optimizers
*What's going on here?*

In addition to :ref:`reducing model and optimizer precision <glossary_precision>` during training, we can further reduce precision in our optimizer states.
- All of our single-device fine-tuning recipes support lower-precision optimizers from the `bitsandbytes <https://huggingface.co/docs/bitsandbytes/main/en/index>`_ library -
- a good place to start might be the ``AdamW8bit`` and ``PagedAdamW8bit`` optimizers, which we've tested our recipes with.
+ All of our recipes support lower-precision optimizers from the `torchao <https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim>`_ library.
+ For single device recipes, we also support `bitsandbytes <https://huggingface.co/docs/bitsandbytes/main/en/index>`_.
+
+ A good place to start might be the :class:`torchao.prototype.low_bit_optim.AdamW8bit` and :class:`bitsandbytes.optim.PagedAdamW8bit` optimizers.
+ Both reduce memory by quantizing the optimizer state dict. Paged optimizers will also offload to CPU if there isn't enough GPU memory available. In practice,
+ you can expect higher memory savings from bnb's PagedAdamW8bit but higher training speed from torchao's AdamW8bit.

*Sounds great! How do I use it?*

- To use this in your recipes, make sure you have installed bitsandbytes (``pip install bitsandbytes``). Then, enable
+ To use this in your recipes, make sure you have installed torchao (``pip install torchao``) or bitsandbytes (``pip install bitsandbytes``). Then, enable
a low precision optimizer using the :ref:`cli_label`:

+
.. code-block:: bash

    tune run <RECIPE> --config <CONFIG> \
-   optimizer=bitsandbytes.optim.PagedAdamW
+   optimizer=torchao.prototype.low_bit_optim.AdamW8bit
+
+ .. code-block:: bash
+
+     tune run <RECIPE> --config <CONFIG> \
+     optimizer=bitsandbytes.optim.PagedAdamW8bit

or by directly :ref:`modifying a config file<config_tutorial_label>`:

.. code-block:: yaml

    optimizer:
-     _component_: bitsandbytes.optim.PagedAdamW
+     _component_: bitsandbytes.optim.PagedAdamW8bit
      lr: 2e-5

.. _glossary_opt_in_bwd:
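For a slightly more concrete sketch than the placeholder ``<RECIPE>``/``<CONFIG>`` above (the recipe and config names here are illustrative, not required), the torchao 8-bit optimizer could be swapped in as:

.. code-block:: bash

    pip install torchao
    tune run full_finetune_single_device --config llama3/8B_full_single_device \
    optimizer=torchao.prototype.low_bit_optim.AdamW8bit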
@@ -213,10 +219,9 @@ To understand how this works, we encourage you to read through the relevant PyTo

.. todo ref full finetune recipe doc

- In torchtune, you can enable this feature using the ``optimizer_in_bwd`` flag, which is currently only supported in our
- single-device full finetune recipe. This feature works best when optimizer memory is particularly large;
- e.g. when using a stateful optimizer with a model with a lot of parameters, and when you don't need to use
- :ref:`gradient accumulation <glossary_grad_accm>`.
+ In torchtune, you can enable this feature using the ``optimizer_in_bwd`` flag. This feature works best when using a stateful optimizer
+ with a model with a lot of parameters, and when you don't need to use :ref:`gradient accumulation <glossary_grad_accm>`.
+ You won't see a meaningful impact when finetuning LoRA recipes, since in this case the number of parameters being updated is small.

.. _glossary_cpu_offload:

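A minimal sketch of turning this on in a single-device full finetune (recipe and config names are illustrative), keeping gradient accumulation at 1 since the two are incompatible:

.. code-block:: bash

    tune run full_finetune_single_device --config llama3/8B_full_single_device \
    optimizer_in_bwd=True \
    gradient_accumulation_steps=1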
@@ -232,6 +237,9 @@ through the `CPUOffloadOptimizer <https://github.com/pytorch/ao/tree/main/torcha
This optimizer can wrap any base optimizer and works by keeping the optimizer states and performing the optimizer step on CPU, thus reducing
GPU memory usage by the size of the optimizer states. Additionally, we can also offload gradients to the CPU by using `offload_gradients=True`.

+ If finetuning on a single device, another option is to use the ``PagedAdamW8bit`` from bitsandbytes, mentioned :ref:`above <glossary_low_precision_opt>`, which will *only* offload to CPU
+ when there is not enough GPU memory available.
+
*Sounds great! How do I use it?*

To use this optimizer in your recipes, set the ``optimizer`` key in your config to :class:`torchao.prototype.low_bit_optim.CPUOffloadOptimizer`, which
@@ -272,10 +280,10 @@ or using it directly in your code, which allows you to change the base optimizer

Some helpful hints from the ``torchao`` `CPUOffloadOptimizer page <https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#optimizer-cpu-offload>`_:

- * The CPU optimizer step is often the bottleneck when optimizer CPU offload is used. To minimize the slowdown, it is recommended to (1) use full ``bf16`` training so that parameters, gradients, and optimizer states are in ``bf16``; and (2) give GPU more work per optimizer step (e.g. larger batch size with activation checkpointing, gradient accumulation).
+ * The CPU optimizer step is often the bottleneck when optimizer CPU offload is used. To minimize the slowdown, it is recommended to (1) use full ``bf16`` training so that parameters, gradients, and optimizer states are in ``bf16``; and (2) give the GPU more work per optimizer step to amortize the offloading time (e.g. larger batch size with activation checkpointing, gradient accumulation).
* Gradient accumulation should always be set to 1 when ``offload_gradients=True``, as gradients are cleared on GPU every backward pass.
* This optimizer works by keeping a copy of parameters and pre-allocating gradient memory on CPU. Therefore, expect your RAM usage to increase by 4x model size.
- * This optimizer is only supported for single-device recipes. To use CPU-offloading in distributed recipes, use ``fsdp_cpu_offload=True`` in any distributed recipe. See :class:`torch.distributed.fsdp.FullyShardedDataParallel` for more details
+ * This optimizer is only supported for single-device recipes. To use CPU-offloading in distributed recipes, use ``fsdp_cpu_offload=True`` instead. See :class:`torch.distributed.fsdp.FullyShardedDataParallel` for more details and `FSDP1 vs FSDP2 <https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md>`_ to see how they differ.


.. _glossary_peft:
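For the distributed alternative mentioned in the last bullet above, a hypothetical invocation (device count and config name chosen purely for illustration) would be along these lines:

.. code-block:: bash

    tune run --nproc_per_node 4 full_finetune_distributed --config llama3/8B_full \
    fsdp_cpu_offload=True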
@@ -339,20 +347,20 @@ These are all specified under the ``model`` flag or config entry, i.e:

    tune run lora_finetune_single_device --config llama3/8B_lora_single_device \
    model.apply_lora_to_mlp=True \
-   model.lora_attn_modules=["q_proj","k_proj","v_proj"]
+   model.lora_attn_modules=["q_proj","k_proj","v_proj","output_proj"]

.. code-block:: yaml

    model:
      _component_: torchtune.models.llama3.lora_llama3_8b
      apply_lora_to_mlp: True
-     model.lora_attn_modules: ["q_proj", "k_proj", "v_proj"]
+     model.lora_attn_modules: ["q_proj", "k_proj", "v_proj", "output_proj"]

Secondly, parameters which control the scale of the impact of LoRA on the model:

* ``lora_rank: int`` affects the scale of the LoRA decomposition, where ``lora_rank << in_dim`` and ``lora_rank << out_dim``
  \- the dimensions of an arbitrary linear layer in the model. Concretely, ``lora_rank`` reduces the number of gradients stored
- in a linear fashion from ``in_dim * out_dim`` to ``lora_rank * (in_dim + out_dim)``. Typically, we have ``lora_rank in [8, 128]``.
+ in a linear fashion from ``in_dim * out_dim`` to ``lora_rank * (in_dim + out_dim)``. Typically, we have ``lora_rank in [8, 256]``.
* ``lora_alpha: float`` affects the magnitude of the LoRA updates. A larger alpha results in larger updates to the base model weights
  , potentially at the cost of training stability, conversely, smaller alpha can stabilize training at the cost of slower learning.
We provide default settings for these parameters which we've tested with all of our models, but we encourage you to adjust them
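To make the ``lora_rank`` savings concrete with a rough worked example: for a single 4096 x 4096 projection (the size of Llama3 8B's ``q_proj``), full finetuning stores gradients for 4096 * 4096 ≈ 16.8M elements, while LoRA with ``lora_rank=16`` stores 16 * (4096 + 4096) ≈ 131K, roughly a 128x reduction for that layer.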
@@ -365,7 +373,7 @@ As above, these parameters are also specified under the ``model`` flag or config

    tune run lora_finetune_single_device --config llama3/8B_lora_single_device \
    model.apply_lora_to_mlp=True \
-   model.lora_attn_modules=["q_proj","k_proj","v_proj"] \
+   model.lora_attn_modules=["q_proj","k_proj","v_proj","output_proj"] \
    model.lora_rank=32 \
    model.lora_alpha=64

@@ -374,7 +382,7 @@ As above, these parameters are also specified under the ``model`` flag or config
    model:
      _component_: torchtune.models.llama3.lora_llama3_8b
      apply_lora_to_mlp: True
-     lora_attn_modules: ["q_proj", "k_proj", "v_proj"]
+     lora_attn_modules: ["q_proj", "k_proj", "v_proj", "output_proj"]
      lora_rank: 32
      lora_alpha: 64

@@ -390,16 +398,16 @@ Quantized Low Rank Adaptation (QLoRA)

*What's going on here?*

- `QLoRA <https://arxiv.org/abs/2305.14314>`_ is an enhancement on top of `LoRA <https://arxiv.org/abs/2106.09685>`_
+ `QLoRA <https://arxiv.org/abs/2305.14314>`_ is a memory enhancement on top of `LoRA <https://arxiv.org/abs/2106.09685>`_
that maintains the frozen model parameters from LoRA in 4-bit quantized precision, thereby reducing memory usage.
This is enabled through a novel 4-bit NormalFloat (NF4) data type proposed by the authors, which allows for 4-8x less
parameter memory usage whilst retaining model accuracy. You can read our tutorial on :ref:`finetuning Llama2 with QLoRA<qlora_finetune_label>`
for a deeper understanding of how it works.

- When considering using QLoRA to reduce memory usage, it's worth noting that QLoRA prevents accuracy degradation during quantization
- by up-casting quantized parameters to the original higher precision datatype during model forward passes - this up-casting may
- incur penalties to training speed. The :ref:`relevant section <qlora_compile_label>` in our QLoRA tutorial demonstrates the usage of ``torch.compile``
- to address this by speeding up training.
+ When considering using QLoRA to reduce memory usage, it's worth noting that QLoRA is slower than LoRA and may not be worth it if
+ the model you are finetuning is small. In numbers, QLoRA saves roughly 1.5 bytes * (# of model parameters). Also, although QLoRA quantizes the model,
+ it minimizes accuracy degradation by up-casting quantized parameters to the original higher precision datatype during model forward passes - this up-casting may incur penalties to training speed.
+ The :ref:`relevant section <qlora_compile_label>` in our QLoRA tutorial demonstrates the usage of ``torch.compile`` to address this by speeding up training.

*Sounds great! How do I use it?*

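To put the "1.5 bytes * (# of model parameters)" figure above in perspective: for an 8B-parameter model this works out to roughly 12 GB of GPU memory saved on the frozen weights, which is why QLoRA tends to pay off for large models and matters much less for small ones.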