docs/source/tutorials/memory_optimizations.rst
16 additions & 2 deletions
@@ -16,11 +16,11 @@ To make things easy, we've summarized these components in the following table:
  ":ref:`glossary_precision`", "You'll usually want to leave this as its default ``bfloat16``. If you're struggling with training stability or accuracy due to precision, fp32 may help, but will significantly increase memory usage and decrease training speed."
  ":ref:`glossary_act_ckpt`", "Use when you're memory constrained and need to handle larger batch sizes or longer context lengths. Be aware that it may slow down training speed."
- ":ref:`glossary_act_off`", "Similar to activation checkpointing, this can be used when memory constrained, but comes at the cost of training speed due to the overhead from moving tensors between GPU VRAM and CPU. This can also be used alongside activation checkpointing."
+ ":ref:`glossary_act_off`", "Similar to activation checkpointing, this can be used when memory constrained, but comes at the cost of training speed due to the overhead of moving tensors between GPU VRAM and CPU. This can also be used alongside activation checkpointing."
  ":ref:`glossary_grad_accm`", "Helpful when memory-constrained to simulate larger batch sizes. Often preferable to activation checkpointing for better training speed."
  ":ref:`glossary_low_precision_opt`", "When you need to further reduce memory usage beyond using ``bf16`` by reducing the precision in the optimizer states. Note that lower precision optimizers may reduce training stability/accuracy."
  ":ref:`glossary_opt_in_bwd`", "Helps reduce memory usage when using stateful optimizers, particularly when full-finetuning large models with high gradient memory usage. This is not compatible with ``gradient_accumulation_steps``, so training may slow down due to reduced model throughput."
- ":ref:`glossary_cpu_offload`", "Offloads optimizer states, and, optionally, gradients, to CPU, and performs optimizer steps on CPU. This can be used to significantly reduce GPU memory usage at the cost of CPU RAM and training speed, as CPU optimizer steps can be slow and bottleneck training performance."
+ ":ref:`glossary_cpu_offload`", "Offloads optimizer states and (optionally) gradients to CPU, and performs optimizer steps on CPU. This can be used to significantly reduce GPU memory usage at the cost of CPU RAM and training speed, as CPU optimizer steps can be slow and bottleneck training performance."
  ":ref:`glossary_lora`", "When you want to significantly reduce the number of trainable parameters, saving gradient and optimizer memory during training, and significantly speeding up training."
  ":ref:`glossary_qlora`", "When you need even more memory savings than LoRA, at the potential cost of some training speed. Useful for very large models or limited hardware."
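As a concrete illustration of the :ref:`glossary_grad_accm` row above, here is a minimal sketch of gradient accumulation in plain PyTorch (not torchtune's recipe code); the toy model, data, learning rate, and step count are placeholder assumptions:

.. code-block:: python

    import torch

    # Placeholder model, optimizer, and data -- stand-ins for your own setup.
    model = torch.nn.Linear(128, 128)
    optimizer = torch.optim.AdamW(model.parameters(), lr=4e-5)
    batches = [(torch.randn(4, 128), torch.randn(4, 128)) for _ in range(8)]

    accumulation_steps = 4  # effective batch size = 4 micro-batches per optimizer step

    for i, (inputs, targets) in enumerate(batches):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        # Scale so the accumulated gradient averages over the effective batch.
        (loss / accumulation_steps).backward()
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()       # one optimizer step per accumulation window
            optimizer.zero_grad()

The gradients from each micro-batch simply add up in ``.grad``, so the memory cost stays at the micro-batch level while the update behaves like a larger batch.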
@@ -255,6 +255,20 @@ or by directly :ref:`modifying a config file<config_tutorial_label>`:
  # additional key-word arguments can be passed to torch.optim.AdamW
  lr: 4e-5

+ or by using it directly in your code, which allows you to change the base optimizer:
+
+ .. code-block:: python
+
+     from torchao.prototype.low_bit_optim import CPUOffloadOptimizer
+     from torch.optim import Adam
+
+     optimizer = CPUOffloadOptimizer(
+         model.parameters(),  # your model here
+         Adam,
+         lr=1e-5,
+         fused=True
+     )
+
  Some helpful hints from the ``torchao`` `CPUOffloadOptimizer page <https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#optimizer-cpu-offload>`_:

  * The CPU optimizer step is often the bottleneck when optimizer CPU offload is used. To minimize the slowdown, it is recommended to (1) use full ``bf16`` training so that parameters, gradients, and optimizer states are in ``bf16``; and (2) give the GPU more work per optimizer step (e.g. larger batch size with activation checkpointing, gradient accumulation).
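To make those hints concrete, here is a minimal sketch (not from the torchtune recipes) of driving ``CPUOffloadOptimizer`` by hand with a bf16 model: the toy model, shapes, loss, and learning rate are illustrative assumptions, and a CUDA device is assumed since the point is to free GPU memory.

.. code-block:: python

    import torch
    from torch.optim import AdamW
    from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

    # Placeholder model -- in practice this is the model you're fine-tuning.
    model = torch.nn.Linear(4096, 4096, device="cuda", dtype=torch.bfloat16)

    # Hint (1): full bf16 training -- parameters (and hence gradients) are bf16,
    # so the tensors copied between GPU and CPU are half the size of fp32.
    optimizer = CPUOffloadOptimizer(
        model.parameters(),  # bf16 parameters on GPU
        AdamW,               # base optimizer; its states live on CPU
        lr=4e-5,
    )

    # Hint (2): give the GPU more work per optimizer step, e.g. a larger batch
    # (with activation checkpointing) or gradient accumulation before step().
    inputs = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)
    loss = model(inputs).float().pow(2).mean()  # toy loss for illustration
    loss.backward()
    optimizer.step()   # optimizer math runs on CPU
    optimizer.zero_grad()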