docs/source/tutorials/memory_optimizations.rst
16 additions & 2 deletions
@@ -16,11 +16,11 @@ To make things easy, we've summarized these components in the following table:
  ":ref:`glossary_precision`", "You'll usually want to leave this as its default ``bfloat16``. If you're struggling with training stability or accuracy due to precision, fp32 may help, but will significantly increase memory usage and decrease training speed."
  ":ref:`glossary_act_ckpt`", "Use when you're memory constrained and need to handle larger batch sizes or longer context lengths. Be aware that it may slow down training speed."
- ":ref:`glossary_act_off`", "Similar to activation checkpointing, this can be used when memory constrained, but comes at the cost of training speed due to the overhead from moving tensors between GPU VRAM and CPU. This can also be used alongside activation checkpointing."
+ ":ref:`glossary_act_off`", "Similar to activation checkpointing, this can be used when memory constrained, but comes at the cost of training speed due to the overhead of moving tensors between GPU VRAM and CPU. This can also be used alongside activation checkpointing."
  ":ref:`glossary_grad_accm`", "Helpful when memory-constrained to simulate larger batch sizes. Often preferable to activation checkpointing for better training speed."
  ":ref:`glossary_low_precision_opt`", "When you need to further reduce memory usage beyond using ``bf16`` by reducing the precision in the optimizer states. Note that lower precision optimizers may reduce training stability/accuracy."
  ":ref:`glossary_opt_in_bwd`", "Helps reduce memory usage when using stateful optimizers, particularly when full-finetuning large models with high gradient memory usage. This is not compatible with ``gradient_accumulation_steps``, so training may slow down due to reduced model throughput."
- ":ref:`glossary_cpu_offload`", "Offloads optimizer states, and, optionally, gradients, to CPU, and performs optimizer steps on CPU. This can be used to significantly reduce GPU memory usage at the cost of CPU RAM and training speed, as CPU optimizer steps can be slow and bottleneck training performance."
+ ":ref:`glossary_cpu_offload`", "Offloads optimizer states and (optionally) gradients to CPU, and performs optimizer steps on CPU. This can be used to significantly reduce GPU memory usage at the cost of CPU RAM and training speed, as CPU optimizer steps can be slow and bottleneck training performance."
  ":ref:`glossary_lora`", "When you want to significantly reduce the number of trainable parameters, saving gradient and optimizer memory during training, and significantly speeding up training."
  ":ref:`glossary_qlora`", "When you need even more memory savings than LoRA, at the potential cost of some training speed. Useful for very large models or limited hardware."
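As a concrete illustration of the :ref:`glossary_grad_accm` row above, here is a minimal sketch of gradient accumulation in plain PyTorch (not torchtune's recipe code); the toy model, data, learning rate, and step count are placeholder assumptions:

.. code-block:: python

    import torch

    # Placeholder model, optimizer, and data -- stand-ins for your own setup.
    model = torch.nn.Linear(128, 128)
    optimizer = torch.optim.AdamW(model.parameters(), lr=4e-5)
    batches = [(torch.randn(4, 128), torch.randn(4, 128)) for _ in range(8)]

    accumulation_steps = 4  # effective batch size = 4 micro-batches per optimizer step

    for i, (inputs, targets) in enumerate(batches):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        # Scale so the accumulated gradient averages over the effective batch.
        (loss / accumulation_steps).backward()
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()       # one optimizer step per accumulation window
            optimizer.zero_grad()

The gradients from each micro-batch simply add up in ``.grad``, so the memory cost stays at the micro-batch level while the update behaves like a larger batch.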
@@ -255,6 +255,20 @@ or by directly :ref:`modifying a config file<config_tutorial_label>`:
  # additional key-word arguments can be passed to torch.optim.AdamW
  lr: 4e-5

+ or by using it directly in your code, which allows you to change the base optimizer:
+
+ .. code-block:: python
+
+     from torchao.prototype.low_bit_optim import CPUOffloadOptimizer
+     from torch.optim import Adam
+
+     optimizer = CPUOffloadOptimizer(
+         model.parameters(),  # your model here
+         Adam,
+         lr=1e-5,
+         fused=True
+     )
+
  Some helpful hints from the ``torchao`` `CPUOffloadOptimizer page <https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#optimizer-cpu-offload>`_:

  * The CPU optimizer step is often the bottleneck when optimizer CPU offload is used. To minimize the slowdown, it is recommended to (1) use full ``bf16`` training so that parameters, gradients, and optimizer states are in ``bf16``; and (2) give the GPU more work per optimizer step (e.g. larger batch size with activation checkpointing, gradient accumulation).
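To make those hints concrete, here is a minimal sketch (not from the torchtune recipes) of driving ``CPUOffloadOptimizer`` by hand with a bf16 model: the toy model, shapes, loss, and learning rate are illustrative assumptions, and a CUDA device is assumed since the point is to free GPU memory.

.. code-block:: python

    import torch
    from torch.optim import AdamW
    from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

    # Placeholder model -- in practice this is the model you're fine-tuning.
    model = torch.nn.Linear(4096, 4096, device="cuda", dtype=torch.bfloat16)

    # Hint (1): full bf16 training -- parameters (and hence gradients) are bf16,
    # so the tensors copied between GPU and CPU are half the size of fp32.
    optimizer = CPUOffloadOptimizer(
        model.parameters(),  # bf16 parameters on GPU
        AdamW,               # base optimizer; its states live on CPU
        lr=4e-5,
    )

    # Hint (2): give the GPU more work per optimizer step, e.g. a larger batch
    # (with activation checkpointing) or gradient accumulation before step().
    inputs = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)
    loss = model(inputs).float().pow(2).mean()  # toy loss for illustration
    loss.backward()
    optimizer.step()   # optimizer math runs on CPU
    optimizer.zero_grad()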