Merged
16 changes: 0 additions & 16 deletions docs/source/api_ref_data.rst
@@ -6,8 +6,6 @@ torchtune.data

.. currentmodule:: torchtune.data

.. _chat_formats:

Text templates
--------------

@@ -18,14 +16,12 @@ and models.
:toctree: generated/
:nosignatures:

InstructTemplate
GrammarErrorCorrectionTemplate
SummarizeTemplate
QuestionAnswerTemplate
PromptTemplate
PromptTemplateInterface
ChatMLTemplate
ChatFormat

Types
-----
@@ -37,18 +33,6 @@ Types
Message
Role

Converters
----------

Converts data from common JSON formats into a torchtune :class:`Message`.

.. autosummary::
:toctree: generated/
:nosignatures:

get_sharegpt_messages
get_openai_messages

.. _message_transforms_ref:

Message transforms
Member:

nit: Can we call these ToMessage transforms to convey immediately that they convert data to message format?

Collaborator (Author):

I've used "Message transforms" throughout the docs, so I'll leave updating all those references for a future PR and keep this as is.

4 changes: 2 additions & 2 deletions docs/source/api_ref_datasets.rst
@@ -6,11 +6,11 @@ torchtune.datasets

.. currentmodule:: torchtune.datasets

For a detailed general usage guide, please see our :ref:`datasets tutorial <dataset_tutorial_label>`.
For a detailed general usage guide, please see :ref:`datasets_overview`.


Text datasets
------------------
-------------

torchtune supports several widely used text-only datasets to help quickly bootstrap your fine-tuning.

54 changes: 54 additions & 0 deletions docs/source/basics/packing.rst
@@ -0,0 +1,54 @@
.. _packing_usage_label:
Contributor:

you snuck this in here you sneaky lil man

i love it

Member:

heh nice

Collaborator (Author):

ok I just now realized the joke


==============
Sample packing
==============

Sample packing involves concatenating multiple samples from your dataset into a single sequence, up to a maximum
sequence length. This requires some pre-processing of the dataset, which may
slow down time-to-first-batch, but can yield significant training speedups
depending on the dataset. In torchtune, sample packing is done by iterating through your dataset and performing
greedy packing upon dataset initialization. You can use sample packing with any of the single dataset builders by passing in
:code:`packed=True`.

To set the max sequence length to pack to, make sure to define ``max_seq_len`` on your tokenizer.

.. code-block:: python

from torchtune.datasets import alpaca_dataset, PackedDataset
from torchtune.models.llama3 import llama3_tokenizer

# Load in tokenizer
tokenizer = llama3_tokenizer(
path="/tmp/Llama-3.2-1B-Instruct/original/tokenizer.model",
max_seq_len=8192,
)
dataset = alpaca_dataset(
tokenizer=tokenizer,
packed=True,
)
print(isinstance(dataset, PackedDataset)) # True

.. code-block:: yaml

# YAML config
tokenizer:
_component_: torchtune.models.llama3.llama3_tokenizer
path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
max_seq_len: 8192

dataset:
_component_: torchtune.datasets.alpaca_dataset
packed: True

.. code-block:: bash

# Command line
tune run full_finetune_single_device --config llama3_2/1B_full_single_device \
dataset.packed=True tokenizer.max_seq_len=8192
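
Under the hood, the greedy packing performed at dataset initialization follows roughly the pattern sketched below. This is a simplified illustration rather than torchtune's actual ``PackedDataset`` implementation; the ``greedy_pack`` helper and its truncation behavior are assumptions made for the example.

.. code-block:: python

    def greedy_pack(tokenized_samples, max_seq_len):
        """Greedily concatenate token lists into packs of at most max_seq_len tokens."""
        packs, current = [], []
        for tokens in tokenized_samples:
            # If adding this sample would overflow the current pack, flush it and start a new one
            if current and len(current) + len(tokens) > max_seq_len:
                packs.append(current)
                current = []
            # Truncate any single sample that is longer than max_seq_len on its own
            current.extend(tokens[:max_seq_len])
        if current:
            packs.append(current)
        return packs

    print(greedy_pack([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_seq_len=5))
    # [[1, 2, 3, 4, 5], [6, 7, 8, 9]]

Each resulting pack is treated as a single training sequence; the per-sample boundaries are kept so that the document masking described below can be applied.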

torchtune will automatically handle document masking and relative position IDs when sample packing is enabled
to prevent unrelated samples in the same pack from cross-attending. This is done via PyTorch's `Flex Attention <https://pytorch.org/blog/flexattention/#document-maskingjagged-sequences>`_,
which enables the use of flash attention with non-causal masks. If your hardware does not support Flex Attention
(for CUDA devices, it must be Turing or above), standard SDPA with memory-efficient attention will be used as a fallback,
while retaining the document masking and relative position IDs.
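
For intuition, the document-masking pattern from the linked Flex Attention blog post can be sketched as follows. This is not torchtune's internal code: the ``document_id`` tensor (mapping each token position to the packed sample it came from) and the tensor shapes are assumptions for illustration, and running it requires PyTorch 2.5+ on a GPU with Flex Attention support.

.. code-block:: python

    import torch
    from torch.nn.attention.flex_attention import create_block_mask, flex_attention

    seq_len, n_heads, head_dim = 8192, 8, 64
    # document_id[i] = index of the packed sample that token i belongs to (assumed precomputed)
    document_id = torch.arange(4, device="cuda").repeat_interleave(seq_len // 4)

    def document_causal_mask(b, h, q_idx, kv_idx):
        # Causal attention, restricted to tokens from the same packed sample
        return (q_idx >= kv_idx) & (document_id[q_idx] == document_id[kv_idx])

    block_mask = create_block_mask(
        document_causal_mask, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len
    )
    q = k = v = torch.randn(1, n_heads, seq_len, head_dim, device="cuda", dtype=torch.bfloat16)
    out = flex_attention(q, k, v, block_mask=block_mask)  # [1, n_heads, seq_len, head_dim]
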
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -131,6 +131,7 @@ torchtune tutorials.
basics/message_transforms
basics/tokenizers
basics/prompt_templates
basics/packing

.. toctree::
:glob:
@@ -144,7 +145,6 @@ torchtune tutorials.
tutorials/qlora_finetune
tutorials/qat_finetune
tutorials/e2e_flow
tutorials/datasets
tutorials/memory_optimizations
tutorials/llama_kd_tutorial

1 change: 0 additions & 1 deletion docs/source/recipes/lora_finetune_single_device.rst
@@ -51,7 +51,6 @@ Interested in seeing this recipe in action? Check out some of our tutorials to s

* :ref:`Finetuning Llama2 with LoRA<lora_finetune_label>`
* :ref:`Finetuning Llama2 with QLoRA<qlora_finetune_label>`
* :ref:`End-to-End Workflow with torchtune<dataset_tutorial_label>`
* :ref:`Fine-tuning Llama3 with Chat Data<chat_tutorial_label>`
* :ref:`Meta Llama3 in torchtune<llama3_label>`
* :ref:`Fine-Tune Your First LLM<finetune_llama_label>`
2 changes: 1 addition & 1 deletion docs/source/tutorials/chat.rst
@@ -18,7 +18,7 @@ custom chat dataset for fine-tuning Llama3 Instruct.

.. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites

* Be familiar with :ref:`configuring datasets<dataset_tutorial_label>`
* Be familiar with :ref:`configuring datasets<chat_dataset_usage_label>`
* Know how to :ref:`download Llama3 Instruct weights <llama3_label>`

