2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -146,8 +146,8 @@ torchtune tutorials.
tutorials/qlora_finetune
tutorials/qat_finetune
tutorials/e2e_flow
tutorials/memory_optimizations
tutorials/llama_kd_tutorial
tutorials/memory_optimizations

.. toctree::
:glob:
73 changes: 69 additions & 4 deletions docs/source/tutorials/memory_optimizations.rst
@@ -16,9 +16,11 @@ To make things easy, we've summarized these components in the following table:

":ref:`glossary_precision`", "You'll usually want to leave this as its default ``bfloat16``. If you're struggling with training stability or accuracy due to precision, fp32 may help, but will significantly increase memory usage and decrease training speed."
":ref:`glossary_act_ckpt`", "Use when you're memory constrained and need to handle larger batch sizes or longer context lengths. Be aware that it may slow down training speed."
":ref:`glossary_act_off`", "Similar to activation checkpointing, this can be used when memory constrained, but comes at the cost of training speed due to the overhead of moving tensors between GPU VRAM and CPU. This can also be used alongside activation checkpointing."
":ref:`glossary_grad_accm`", "Helpful when memory-constrained to simulate larger batch sizes. Often preferable to activation checkpointing for better training speed."
":ref:`glossary_low_precision_opt`", "When you need to further reduce memory usage beyond using ``bf16`` by reducing the precision in the optimizer states. Note that lower precision optimizers may reduce training stability/accuracy."
":ref:`glossary_opt_in_bwd`", "Helps reduce memory usage when using stateful optimizers, particularly when full-finetuning large models with high gradient memory usage. This is not compatible with ``gradient_accumulation_steps``, so training may slow down due to reduced model throughput."
":ref:`glossary_cpu_offload`", "Offloads optimizer states and (optionally) gradients to CPU, and performs optimizer steps on CPU. This can be used to significantly reduce GPU memory usage at the cost of CPU RAM and training speed, as CPU optimizer steps can be slow and bottleneck training performance."
":ref:`glossary_lora`", "When you want to significantly reduce the number of trainable parameters, saving gradient and optimizer memory during training, and significantly speeding up training."
":ref:`glossary_qlora`", "When you need even more memory savings than LoRA, at the potential cost of some training speed. Useful for very large models or limited hardware."

@@ -95,7 +97,7 @@ efficiency technique that allows saving GPU VRAM by temporarily moving activatio
them back when needed in the backward pass.

See `PyTorch autograd hook tutorial <https://pytorch.org/tutorials/intermediate/autograd_saved_tensors_hooks_tutorial.html#saving-tensors-to-cpu>`_
for more details about how this is implemented through saved_tensors_hooks.
for more details about how this is implemented through :func:`torch.autograd.graph.saved_tensors_hooks`.
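
As a rough sketch of how such hooks can offload activations (illustrative only, not torchtune's actual implementation; the tiny ``torch.nn.Linear`` model and tensor sizes are hypothetical), each tensor saved for backward is packed onto the CPU and unpacked back to its original device when the backward pass needs it:

.. code-block:: python

    import torch

    def pack_to_cpu(tensor):
        # Save the original device and move the activation to CPU.
        return tensor.device, tensor.to("cpu")

    def unpack_from_cpu(packed):
        # Move the activation back to its original device for the backward pass.
        device, tensor = packed
        return tensor.to(device)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torch.nn.Linear(1024, 1024, device=device)
    x = torch.randn(8, 1024, device=device)

    with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
        loss = model(x).sum()
    loss.backward()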

This setting is especially helpful for larger batch sizes, or longer context lengths when you're memory constrained.
While of course it takes runtime and resources to move Tensors from GPU to CPU and back, the implementation in
@@ -154,10 +156,13 @@ All of our finetuning recipes support simulating larger batch sizes by accumulat

Gradient accumulation should always be set to 1 when :ref:`fusing the optimizer step into the backward pass <glossary_opt_in_bwd>`.
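
As a minimal sketch of the accumulation pattern itself (a toy model and random data, not a torchtune recipe), each micro-batch loss is scaled by the number of accumulation steps and the optimizer only steps once per effective batch:

.. code-block:: python

    import torch
    from torch import nn

    # Toy model and optimizer, purely to illustrate the pattern.
    model = nn.Linear(16, 1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    batch_size = 2
    gradient_accumulation_steps = 8  # effective batch size = 2 * 8 = 16

    for step in range(32):
        batch = torch.randn(batch_size, 16)
        # Scale the loss so the accumulated gradients average over the effective batch.
        loss = model(batch).pow(2).mean() / gradient_accumulation_steps
        loss.backward()  # gradients accumulate in .grad across micro-batches
        if (step + 1) % gradient_accumulation_steps == 0:
            optimizer.step()       # one optimizer update per effective batch
            optimizer.zero_grad()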

Optimizers
----------

.. _glossary_low_precision_opt:

Lower Precision Optimizers
--------------------------
^^^^^^^^^^^^^^^^^^^^^^^^^^

*What's going on here?*

@@ -186,7 +191,7 @@ or by directly :ref:`modifying a config file<config_tutorial_label>`:
.. _glossary_opt_in_bwd:

Fusing Optimizer Step into Backward Pass
----------------------------------------
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*What's going on here?*

@@ -208,10 +213,70 @@ To understand how this works, we encourage you to read through the relevant PyTo
.. todo ref full finetune recipe doc

In torchtune, you can enable this feature using the ``optimizer_in_bwd`` flag, which is currently only supported in our
single-device full finetune recipe. This feature works best when gradient memory is particularly large;
single-device full finetune recipe. This feature works best when optimizer memory is particularly large;
e.g. when using a stateful optimizer with a model with a lot of parameters, and when you don't need to use
:ref:`gradient accumulation <glossary_grad_accm>`.
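
To give a sense of the mechanism (a simplified sketch in the spirit of the PyTorch tutorial above, not torchtune's recipe code; the toy model is hypothetical), one optimizer can be created per parameter and stepped from a post-accumulate-grad hook, so each gradient is applied, and can be freed, as soon as it is produced:

.. code-block:: python

    import torch
    from torch import nn

    # Toy model; in practice this matters most for large models with stateful optimizers.
    model = nn.Linear(16, 1)

    # One optimizer per parameter so each can be stepped independently during backward.
    optimizers = {p: torch.optim.AdamW([p], lr=1e-3) for p in model.parameters()}

    def step_and_clear(param):
        optimizers[param].step()
        optimizers[param].zero_grad()

    for p in model.parameters():
        p.register_post_accumulate_grad_hook(step_and_clear)

    loss = model(torch.randn(4, 16)).pow(2).mean()
    loss.backward()  # each parameter is updated as soon as its gradient is accumulated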

.. _glossary_cpu_offload:

Offloading Optimizer/Gradient states to CPU
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*What's going on here?*

We've mentioned optimizer states above: the memory used by stateful optimizers to maintain gradient statistics. Model gradients are the tensors used to
store gradients during the backward pass. Our single-device recipes support offloading both to CPU
through the `CPUOffloadOptimizer <https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#optimizer-cpu-offload>`_ from ``torchao``.

This optimizer can wrap any base optimizer. It works by keeping the optimizer states on CPU and performing the optimizer step on CPU, thus reducing
GPU memory usage by the size of the optimizer states. Additionally, gradients can also be offloaded to CPU by passing ``offload_gradients=True``.

*Sounds great! How do I use it?*

To use this optimizer in your recipes, set the ``optimizer`` key in your config to :class:`torchao.prototype.low_bit_optim.CPUOffloadOptimizer`, which
will use the :class:`torch.optim.AdamW` optimizer with ``fused=True`` as the base optimizer. For example, to use this optimizer to offload
both optimizer states and gradients to CPU:

.. code-block:: bash

    tune run <RECIPE> --config <CONFIG> \
    optimizer=torchao.prototype.low_bit_optim.CPUOffloadOptimizer \
    optimizer.offload_gradients=True \
    optimizer.lr=4e-5


or by directly :ref:`modifying a config file<config_tutorial_label>`:

.. code-block:: yaml

    optimizer:
      _component_: torchao.prototype.low_bit_optim.CPUOffloadOptimizer
      offload_gradients: True
      # additional keyword arguments can be passed to torch.optim.AdamW
      lr: 4e-5

or using it directly in your code, which allows you to change the base optimizer:

.. code-block:: python

    from torchao.prototype.low_bit_optim import CPUOffloadOptimizer
    from torch.optim import Adam

    optimizer = CPUOffloadOptimizer(
        model.parameters(),  # your model here
        Adam,
        lr=1e-5,
        fused=True,
    )

Some helpful hints from the ``torchao`` `CPUOffloadOptimizer page <https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#optimizer-cpu-offload>`_:

* The CPU optimizer step is often the bottleneck when optimizer CPU offload is used. To minimize the slowdown, it is recommended to (1) use full ``bf16`` training so that parameters, gradients, and optimizer states are in ``bf16``; and (2) give GPU more work per optimizer step (e.g. larger batch size with activation checkpointing, gradient accumulation).
* Gradient accumulation should always be set to 1 when ``offload_gradients=True``, as gradients are cleared on GPU every backward pass.
* This optimizer works by keeping a copy of parameters and pre-allocating gradient memory on CPU. Therefore, expect your RAM usage to increase by 4x model size (see the short back-of-envelope calculation after this list).
Member:

Is it exactly 4x model size? Or roughly?

Contributor Author:

According to the ao docs yeah

    To minimize the amount of CPU<->GPU data transfer, we keep a copy of parameters and pre-allocate gradients memory on CPU. Therefore, expect your RAM usage to increase by 2x model size + optimizer state (which is 2x model size for Adam).

and since we always use Adam it is 4x.

* This optimizer is only supported for single-device recipes. To use CPU offloading in distributed recipes, use ``fsdp_cpu_offload=True`` instead. See :class:`torch.distributed.fsdp.FullyShardedDataParallel` for more details.
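
To make the RAM hint above concrete, here is a rough back-of-envelope calculation (the 8B-parameter model size is a hypothetical example; the 4x factor assumes an Adam-style base optimizer, as noted in the ``torchao`` docs):

.. code-block:: python

    # Hypothetical example: an 8B-parameter model trained fully in bf16.
    num_params = 8e9
    bytes_per_param = 2  # bf16

    model_size_gb = num_params * bytes_per_param / 1e9  # ~16 GB
    # CPU copy of params + pre-allocated grads (~2x model size)
    # + Adam exp_avg / exp_avg_sq states (~2x model size) => ~4x total
    extra_cpu_ram_gb = 4 * model_size_gb  # ~64 GB of additional CPU RAM

    print(f"Expect roughly {extra_cpu_ram_gb:.0f} GB of extra CPU RAM")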


.. _glossary_peft:

Parameter Efficient Fine-Tuning (PEFT)