Remove ChatFormat, InstructTemplate, old message converters #1895
@@ -0,0 +1,54 @@

.. _packing_usage_label:
Contributor: you snuck this in here you sneaky lil man i love it

Member: heh nice

Collaborator (Author): ok I just now realized the joke
==============
Sample packing
==============
Sample packing involves concatenating multiple samples from your dataset into a single sequence, up to a maximum
sequence length. This requires some pre-processing of the dataset, which may
slow down time-to-first-batch but can introduce significant training speedups
depending on the dataset. In torchtune, sample packing is done by iterating through your dataset and performing
greedy packing upon dataset initialization. You can use sample packing with any of the single dataset builders by passing in
:code:`packed=True`.
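For intuition, the packing strategy boils down to the following greedy loop over tokenized samples. This is a
simplified sketch, not torchtune's actual ``PackedDataset`` implementation; it ignores details such as padding
the final pack and tracking per-sample boundaries for masking.

.. code-block:: python

    # Simplified greedy packing sketch: append samples to the current pack until
    # the next one would overflow max_seq_len, then start a new pack.
    def greedy_pack(tokenized_samples, max_seq_len):
        packs, current = [], []
        for tokens in tokenized_samples:
            tokens = tokens[:max_seq_len]  # assume over-long samples are truncated
            if len(current) + len(tokens) > max_seq_len:
                packs.append(current)
                current = []
            current = current + tokens
        if current:
            packs.append(current)
        return packs

    packs = greedy_pack([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_seq_len=6)
    print(packs)  # [[1, 2, 3, 4, 5], [6, 7, 8, 9]]
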
To set the max sequence length to pack to, make sure to define ``max_seq_len`` on your tokenizer.
.. code-block:: python

    from torchtune.datasets import alpaca_dataset, PackedDataset
    from torchtune.models.llama3 import llama3_tokenizer

    # Load in tokenizer
    tokenizer = llama3_tokenizer(
        path="/tmp/Llama-3.2-1B-Instruct/original/tokenizer.model",
        max_seq_len=8192,
    )
    dataset = alpaca_dataset(
        tokenizer=tokenizer,
        packed=True,
    )
    print(isinstance(dataset, PackedDataset))  # True
.. code-block:: yaml

    # YAML config
    tokenizer:
      _component_: torchtune.models.llama3.llama3_tokenizer
      path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
      max_seq_len: 8192

    dataset:
      _component_: torchtune.datasets.alpaca_dataset
      packed: True
.. code-block:: bash

    # Command line
    tune run full_finetune_single_device --config llama3_2/1B_full_single_device \
    dataset.packed=True tokenizer.max_seq_len=8192
torchtune will automatically handle document masking and relative position IDs when sample packing is enabled,
so that tokens from different packed samples cannot cross-attend. This is done via PyTorch's `Flex Attention <https://pytorch.org/blog/flexattention/#document-maskingjagged-sequences>`_,
which enables the use of flash attention with non-causal masks. If your hardware does not support Flex Attention
(for CUDA devices, it must be Turing or above), standard SDPA with memory-efficient attention will be used as a fallback,
while retaining the document masking and relative position IDs.
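
To illustrate the idea, here is a minimal sketch of causal document masking with Flex Attention, following the
pattern from the linked blog post. This is illustrative only, not torchtune's internal code; the ``document_ids``
tensor, shapes, and sizes below are made up for the example, and the block mask is built on a CUDA device.

.. code-block:: python

    import torch
    from torch.nn.attention.flex_attention import create_block_mask, flex_attention

    device = "cuda"  # Flex Attention block masks target CUDA devices
    seq_len, n_heads, head_dim = 256, 4, 64

    # Which packed sample each token position belongs to, e.g. two samples of 128 tokens
    document_ids = torch.zeros(seq_len, dtype=torch.int32, device=device)
    document_ids[128:] = 1

    def causal_document_mask(b, h, q_idx, kv_idx):
        # Tokens attend causally, and only within their own packed sample
        return (q_idx >= kv_idx) & (document_ids[q_idx] == document_ids[kv_idx])

    block_mask = create_block_mask(
        causal_document_mask, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len, device=device
    )

    q, k, v = (torch.randn(1, n_heads, seq_len, head_dim, device=device) for _ in range(3))
    out = flex_attention(q, k, v, block_mask=block_mask)
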
nit: Can we call these ToMessage transforms so that they immediately convey that they convert data to message format?

I've used "Message transforms" throughout the docs, so I'll leave updating all those references for a future PR and keep this as is.