[RFC][DOCS] Recipe documentation #1230
.. _lora_finetune_recipe_label:

=============================
LoRA Single Device Finetuning
=============================

This recipe supports finetuning on next-token prediction tasks using parameter-efficient fine-tuning (PEFT) techniques
such as `LoRA <https://arxiv.org/abs/2106.09685>`_ and `QLoRA <https://arxiv.org/abs/2305.14314>`_. These techniques
significantly reduce memory consumption during training whilst still maintaining competitive performance.

We provide pre-tested out-of-the-box configs which you can get up and running with the latest `Llama models <https://llama.meta.com/>`_
in just two steps:
.. note::

    You may need to be granted access to the Llama model you're interested in. See
    :ref:`here <download_llama_label>` for details on accessing gated repositories.

.. code-block:: bash

    tune download meta-llama/Meta-Llama-3.1-8B-Instruct \
        --output-dir /tmp/Meta-Llama-3.1-8B-Instruct \
        --ignore-patterns "original/consolidated.00.pth"

    tune run lora_finetune_single_device \
        --config llama3_1/8B_lora_single_device
You can quickly customize this recipe through the :ref:`cli_label`. For example, when fine-tuning with LoRA, you can adjust the layers to which LoRA is applied,
and the scale of LoRA's impact during training:
.. code-block:: bash

    tune run lora_finetune_single_device \
        --config llama3_1/8B_lora_single_device \
        --model.lora_attn_modules=["q_proj","k_proj","v_proj"] \
        --model.apply_lora_to_mlp=True \
        --model.lora_rank=64 \
        --model.lora_alpha=128
This particular configuration results in a more aggressive LoRA policy, which trades
increased memory usage and slower training for potentially higher accuracy.
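To build intuition for what ``lora_rank`` and ``lora_alpha`` control, here is a minimal NumPy sketch of the LoRA idea. This is purely illustrative, not torchtune's implementation; the variable names are hypothetical:

```python
import numpy as np

# Minimal LoRA sketch: the frozen weight W (out x in) is augmented by a
# low-rank product B @ A, scaled by alpha / rank. Only A and B are trained.
rng = np.random.default_rng(0)
in_dim, out_dim, rank, alpha = 16, 16, 4, 8

W = rng.normal(size=(out_dim, in_dim))      # frozen pretrained weight
A = rng.normal(size=(rank, in_dim)) * 0.01  # trainable low-rank factor
B = np.zeros((out_dim, rank))               # trainable, initialized to zero

def lora_forward(x):
    # Base linear output plus the scaled low-rank correction.
    return x @ W.T + (x @ A.T @ B.T) * (alpha / rank)

x = rng.normal(size=(2, in_dim))
# With B initialized to zero, the LoRA output equals the frozen base output.
assert np.allclose(lora_forward(x), x @ W.T)
```

Only ``A`` and ``B`` (``rank * (in_dim + out_dim)`` values) are trained, versus ``out_dim * in_dim`` for full fine-tuning, which is why raising ``lora_rank`` to 64 as above increases both capacity and memory use.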
For a deeper understanding of the different levers you can pull when using this recipe,
see our documentation for the different PEFT training paradigms we support:

* :ref:`glossary_lora`
* :ref:`glossary_qlora`
Many of our other memory optimization features can be used in this recipe, too:

* Adjust :ref:`model precision <glossary_precision>`.
* Use :ref:`activation checkpointing <glossary_act_ckpt>`.
* Enable :ref:`gradient accumulation <glossary_grad_accm>`.
* Use :ref:`lower precision optimizers <glossary_low_precision_opt>`. However, note that since LoRA
  significantly reduces memory usage due to gradient state, you will likely not need this
  feature.

You can learn more about all of our memory optimization features in our :ref:`memory optimization overview <memory_optimization_overview_label>`.
Interested in seeing this recipe in action? Check out some of our tutorials showing how it can be used:

* :ref:`Finetuning Llama2 with LoRA <lora_finetune_label>`
* :ref:`End-to-End Workflow with torchtune <dataset_tutorial_label>`
* :ref:`Fine-tuning Llama3 with Chat Data <chat_tutorial_label>`
* :ref:`Meta Llama3 in torchtune <llama3_label>`
* :ref:`Fine-Tune Your First LLM <finetune_llama_label>`
.. _qat_distributed_recipe_label:

=============================================
Distributed Quantization-Aware Training (QAT)
=============================================

QAT allows for taking advantage of memory-saving optimizations from quantization at inference time, without significantly
degrading model performance. In torchtune, we use `torchao <https://github.com/pytorch/ao>`_ to implement QAT.
This works by :ref:`simulating quantization numerics during fine-tuning <what_is_qat_label>`. While this may introduce memory and
compute overheads during training, our tests found that QAT significantly reduced performance degradation in evaluations of
quantized models, without compromising on model size reduction gains.

.. note::

    The `PyTorch blog post <https://pytorch.org/blog/quantization-aware-training/>`_ on QAT provides further insight into how QAT works.

We provide pre-tested out-of-the-box configs which you can get up and running with the latest `Llama models <https://llama.meta.com/>`_
in just two steps:

.. note::

    You may need to be granted access to the Llama model you're interested in. See
    :ref:`here <download_llama_label>` for details on accessing gated repositories.
.. code-block:: bash

    tune download meta-llama/Meta-Llama-3-8B-Instruct \
        --output-dir /tmp/Meta-Llama-3-8B-Instruct \
        --ignore-patterns "original/consolidated.00.pth" \
        --hf-token <HF_TOKEN>

    tune run --nproc_per_node 6 qat_distributed \
        --config llama3/8B_qat_full

.. note::

    This workload requires at least 6 GPUs, each with VRAM of at least 80GB.
Currently, the main lever you can pull for QAT is *delayed fake quantization*,
which allows for control over the step after which fake quantization begins.

Empirically, allowing the model to finetune without fake quantization initially allows the
weight and activation values to stabilize before fake quantizing them, potentially leading
to improved quantized accuracy. This can be specified through ``fake_quant_after_n_steps``. To
give you a rough idea of how to configure this parameter, we've achieved best results with
``fake_quant_after_n_steps ~= total_steps // 2``.
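As a rough illustration of what "fake quantization" means here, the sketch below rounds weights onto an int4 grid and immediately dequantizes them, so training sees quantization error while the weights stay in floating point. This is a simplified symmetric per-tensor sketch, not torchao's implementation:

```python
import numpy as np

def fake_quantize(w, n_bits=4):
    # Simplified fake quantization sketch: map w onto 2**n_bits integer
    # levels, then immediately dequantize back to float, so the model
    # "sees" quantization error during the forward pass.
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale  # still float, but restricted to the int4 grid

w = np.array([0.31, -0.52, 0.07, 0.99])
w_fq = fake_quantize(w)
# w_fq remains a float array; the rounding error is at most scale / 2.
```

Delaying this step (via ``fake_quant_after_n_steps``) means the early training steps use ``w`` directly, and only later steps use ``fake_quantize(w)``.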
In the future we plan to support different quantization strategies. For now, note that you'll need at least
``torch>=2.4.0`` to use the `Int8DynActInt4WeightQATQuantizer <https://github.com/pytorch/ao/blob/08024c686fdd3f3dc2817094f817f54be7d3c4ac/torchao/quantization/prototype/qat/api.py#L35>`_
strategy. Generally, the pipeline for training, quantizing, and evaluating a model using QAT is:

#. Run the ``qat_distributed`` recipe using the above command, or by following the tutorial. By default, this will use ``Int8DynActInt4WeightQATQuantizer``.
#. This produces an un-quantized model in the original data type. To get an actual quantized model, follow this with
   ``tune run quantize`` while specifying the same quantizer in the config, e.g.
.. code-block:: yaml

    # QAT specific args
    quantizer:
      _component_: torchtune.utils.quantization.Int8DynActInt4WeightQATQuantizer
      groupsize: 256
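To illustrate what ``groupsize`` controls (a simplified sketch, not torchao's actual kernel): group quantization computes one scale per contiguous group of weights rather than one per tensor, which typically tracks local weight magnitudes better and reduces quantization error:

```python
import numpy as np

def group_quantize(w, groupsize=4, n_bits=4):
    # Simplified sketch: split a 1-D weight vector into groups of
    # `groupsize` values and quantize each group with its own scale.
    qmax = 2 ** (n_bits - 1) - 1
    groups = w.reshape(-1, groupsize)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def group_dequantize(q, scales):
    # Reconstruct an approximation of the original weights.
    return (q * scales).reshape(-1)

w = np.linspace(-1.0, 1.0, 16)
q, scales = group_quantize(w)
w_hat = group_dequantize(q, scales)
```

A larger ``groupsize`` (such as 256 above) stores fewer scales and so compresses better, at the cost of coarser per-group resolution.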
#. :ref:`Evaluate <qat_eval_label>` or `run inference <https://github.com/pytorch/torchtune/blob/main/recipes/quantization.md#generate>`_
   using your quantized model by specifying the corresponding post-training quantizer:

.. code-block:: yaml

    quantizer:
      _component_: torchtune.utils.quantization.Int8DynActInt4WeightQuantizer
      groupsize: 256
.. note::

    We're using config files to show how to customize the recipe in these examples. Check out the
    :ref:`configs tutorial <config_tutorial_label>` to learn more.

Many of our other memory optimization features can be used in this recipe, too:

* Adjust :ref:`model precision <glossary_precision>`.
* Use :ref:`activation checkpointing <glossary_act_ckpt>`.
* Enable :ref:`gradient accumulation <glossary_grad_accm>`.
* Use :ref:`lower precision optimizers <glossary_low_precision_opt>`.

You can learn more about all of our memory optimization features in our :ref:`memory optimization overview <memory_optimization_overview_label>`.

Interested in seeing this recipe in action? Check out some of our tutorials showing how it can be used:

* :ref:`qat_finetune_label`
.. _recipes_overview_label:

================
Recipes Overview
================
Recipes are the primary entry points for torchtune users.
These can be thought of as **hackable, singularly-focused scripts for interacting with LLMs**, including fine-tuning,
inference, evaluation, and quantization.

Each recipe consists of three components:

* **Configurable parameters**, specified through yaml configs and command-line overrides
* **Recipe script**, the entry point which puts everything together, including parsing and validating configs, setting up the environment, and correctly using the recipe class
* **Recipe class**, the core logic needed for fine-tuning, exposed through a set of APIs
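As a purely illustrative sketch of how these three components relate (the class and function names here are hypothetical, not torchtune's actual API):

```python
# Hypothetical sketch of a recipe's structure; names are illustrative
# and do not correspond to torchtune's actual API.

class FinetuneRecipe:
    """Recipe class: core fine-tuning logic behind a small set of APIs."""

    def __init__(self, cfg: dict):
        self.cfg = cfg
        self.steps_run = 0

    def setup(self):
        # In a real recipe: build the model, optimizer, and dataloader
        # from self.cfg here.
        pass

    def train(self):
        for _ in range(self.cfg["max_steps"]):
            self.steps_run += 1  # stand-in for one optimizer step


def main(cfg: dict):
    """Recipe script: validate the config, set up, and drive the class."""
    assert "max_steps" in cfg, "config must define max_steps"
    recipe = FinetuneRecipe(cfg)
    recipe.setup()
    recipe.train()
    return recipe


# Configurable parameters: normally a yaml file plus CLI overrides,
# collapsed here to a plain dict for illustration.
recipe = main({"max_steps": 3})
```

In a real recipe the config dict would be parsed from yaml and command-line overrides before being handed to the script's entry point.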
.. note::

    To learn more about the concept of "recipes", check out our technical deep-dive: :ref:`recipe_deepdive`.

Supervised Finetuning
---------------------

torchtune provides built-in recipes for finetuning on a single device, or on multiple devices with `FSDP <https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/>`_,
using a variety of :ref:`memory optimization features <memory_optimization_overview_label>`. Our fine-tuning recipes support all of our models and all our dataset types.
This includes continued pre-training, and various supervised fine-tuning paradigms, which can be customized through our datasets. Check out our
:ref:`dataset tutorial <dataset_tutorial_label>` for more information.
Our supervised fine-tuning recipes include:

* :ref:`Single-device <lora_finetune_recipe_label>` LoRA fine-tuning.
* :ref:`Distributed Quantization-Aware Training <qat_distributed_recipe_label>`.
.. Alignment finetuning
.. --------------------
.. Interested in alignment fine-tuning? You've come to the right place! We support the following alignment techniques:

.. Direct Preference Optimization (DPO) Fine-Tuning
.. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. `Direct Preference Optimization <https://arxiv.org/abs/2305.18290>`_ (DPO) style techniques allow for aligning language models with respect
.. to a reward model objective function without the use of reinforcement learning. We support DPO preference fine-tuning with:

.. * :ref:`Single-device <lora_finetune_recipe_label>` and :ref:`multi-device <lora_finetune_recipe_label>` LoRA finetuning.

.. note::

    Want to learn more about a certain recipe, but can't find the documentation here?
    Not to worry! Our recipe documentation is currently under construction - come back soon
    to see documentation of your favourite fine-tuning techniques.

.. interested in contributing documentation? Check out our issue here TODO (SalmanMohammadi)