Commit d3039da

Remove ChatFormat, InstructTemplate, old message converters (#1895)

1 parent 2c948c6 · commit d3039da

18 files changed · +82 −1,016 lines changed

docs/source/api_ref_data.rst

Lines changed: 0 additions & 16 deletions

@@ -6,8 +6,6 @@ torchtune.data
 
 .. currentmodule:: torchtune.data
 
-.. _chat_formats:
-
 Text templates
 --------------
 
@@ -18,14 +16,12 @@ and models.
     :toctree: generated/
     :nosignatures:
 
-    InstructTemplate
     GrammarErrorCorrectionTemplate
     SummarizeTemplate
     QuestionAnswerTemplate
     PromptTemplate
     PromptTemplateInterface
     ChatMLTemplate
-    ChatFormat
 
 Types
 -----
@@ -37,18 +33,6 @@ Types
     Message
     Role
 
-Converters
-----------
-
-Converts data from common JSON formats into a torchtune :class:`Message`.
-
-.. autosummary::
-    :toctree: generated/
-    :nosignatures:
-
-    get_sharegpt_messages
-    get_openai_messages
-
 .. _message_transforms_ref:
 
 Message transforms

docs/source/api_ref_datasets.rst

Lines changed: 2 additions & 2 deletions

@@ -6,11 +6,11 @@ torchtune.datasets
 
 .. currentmodule:: torchtune.datasets
 
-For a detailed general usage guide, please see our :ref:`datasets tutorial <dataset_tutorial_label>`.
+For a detailed general usage guide, please see :ref:`datasets_overview`.
 
 
 Text datasets
-------------------
+-------------
 
 torchtune supports several widely used text-only datasets to help quickly bootstrap your fine-tuning.

docs/source/basics/packing.rst

Lines changed: 54 additions & 0 deletions

@@ -0,0 +1,54 @@
+.. _packing_usage_label:
+
+==============
+Sample packing
+==============
+
+Sample packing involves concatenating multiple samples from your dataset into a single sequence, up to a maximum
+sequence length. This requires some pre-processing of the dataset, which may
+slow down time-to-first-batch, but can introduce significant training speedups
+depending on the dataset. In torchtune, sample packing is done by iterating through your dataset and performing
+greedy packing upon dataset initialization. You can use sample packing with any of the single dataset builders by passing in
+:code:`packed=True`.
+
+To set the max sequence length to pack to, make sure to define ``max_seq_len`` on your tokenizer.
+
+.. code-block:: python
+
+    from torchtune.datasets import alpaca_dataset, PackedDataset
+    from torchtune.models.llama3 import llama3_tokenizer
+
+    # Load in tokenizer
+    tokenizer = llama3_tokenizer(
+        path="/tmp/Llama-3.2-1B-Instruct/original/tokenizer.model",
+        max_seq_len=8192,
+    )
+    dataset = alpaca_dataset(
+        tokenizer=tokenizer,
+        packed=True,
+    )
+    print(isinstance(dataset, PackedDataset))  # True
+
+.. code-block:: yaml
+
+    # YAML config
+    tokenizer:
+      _component_: torchtune.models.llama3.llama3_tokenizer
+      path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
+      max_seq_len: 8192
+
+    dataset:
+      _component_: torchtune.datasets.alpaca_dataset
+      packed: True
+
+.. code-block:: bash
+
+    # Command line
+    tune run full_finetune_single_device --config llama3_2/1B_full_single_device \
+    dataset.packed=True tokenizer.max_seq_len=8192
+
+torchtune will automatically handle document masking and relative position IDs when sample packing is enabled
+to prevent unrelated samples from cross-attending. This is done via PyTorch's `Flex Attention <https://pytorch.org/blog/flexattention/#document-maskingjagged-sequences>`_,
+which enables the use of flash attention with non-causal masks. If your hardware does not support Flex Attention
+(for CUDA devices, it must be Turing or above), standard SDPA with memory-efficient attention will be used as a fallback,
+while retaining the document masking and relative position IDs.
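As a reading aid for the new packing.rst page above: the greedy packing it describes can be pictured with a short standalone sketch. This is not torchtune's PackedDataset; the pack_greedy helper and its return format are hypothetical, shown only to illustrate concatenating samples up to max_seq_len while tracking a per-token document ID for later masking.

    # Illustrative sketch only (not torchtune's PackedDataset): greedy packing of
    # tokenized samples up to max_seq_len, tracking a per-token document ID so
    # packed samples can later be masked from attending to each other.
    from typing import Dict, Iterable, List

    def pack_greedy(samples: Iterable[List[int]], max_seq_len: int) -> List[Dict[str, List[int]]]:
        packs: List[Dict[str, List[int]]] = []
        tokens: List[int] = []
        doc_ids: List[int] = []
        doc = 0
        for sample in samples:
            sample = sample[:max_seq_len]  # truncate samples longer than one pack
            if len(tokens) + len(sample) > max_seq_len:
                packs.append({"tokens": tokens, "document_ids": doc_ids})
                tokens, doc_ids, doc = [], [], 0  # start a new pack
            tokens.extend(sample)
            doc_ids.extend([doc] * len(sample))
            doc += 1
        if tokens:
            packs.append({"tokens": tokens, "document_ids": doc_ids})
        return packs

    # Two short samples share a pack; the third no longer fits and opens a new one.
    packs = pack_greedy([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_seq_len=6)
    print(len(packs))                # 2
    print(packs[0]["document_ids"])  # [0, 0, 0, 1, 1]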
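The document masking mentioned in the last paragraph of that page amounts to intersecting the usual causal mask with a "same document" mask built from those per-token document IDs. Below is a minimal plain-tensor PyTorch sketch of that logic; it is not the Flex Attention BlockMask that torchtune actually constructs.

    # Illustrative sketch only: a causal, per-document attention mask for one
    # packed sequence. torchtune builds an equivalent mask via Flex Attention
    # (or falls back to SDPA); this version just shows the masking logic.
    import torch

    document_ids = torch.tensor([0, 0, 0, 1, 1, 2])  # three samples packed together
    seq_len = document_ids.numel()

    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_doc = document_ids.unsqueeze(0) == document_ids.unsqueeze(1)
    mask = causal & same_doc  # token i attends to token j only if j <= i and both tokens are in the same sample

    print(mask.int())  # block-diagonal lower-triangular: no attention crosses sample boundaries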

docs/source/index.rst

Lines changed: 1 addition & 1 deletion

@@ -131,6 +131,7 @@ torchtune tutorials.
    basics/message_transforms
    basics/tokenizers
    basics/prompt_templates
+   basics/packing
 
 .. toctree::
    :glob:
@@ -144,7 +145,6 @@ torchtune tutorials.
    tutorials/qlora_finetune
    tutorials/qat_finetune
    tutorials/e2e_flow
-   tutorials/datasets
    tutorials/memory_optimizations
    tutorials/llama_kd_tutorial

docs/source/recipes/lora_finetune_single_device.rst

Lines changed: 0 additions & 1 deletion

@@ -51,7 +51,6 @@ Interested in seeing this recipe in action? Check out some of our tutorials to s
 
 * :ref:`Finetuning Llama2 with LoRA<lora_finetune_label>`
 * :ref:`Finetuning Llama2 with QLoRA<qlora_finetune_label>`
-* :ref:`End-to-End Workflow with torchtune<dataset_tutorial_label>`
 * :ref:`Fine-tuning Llama3 with Chat Data<chat_tutorial_label>`
 * :ref:`Meta Llama3 in torchtune<llama3_label>`
 * :ref:`Fine-Tune Your First LLM<finetune_llama_label>`

docs/source/tutorials/chat.rst

Lines changed: 1 addition & 1 deletion

@@ -18,7 +18,7 @@ custom chat dataset for fine-tuning Llama3 Instruct.
 
 .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
 
-   * Be familiar with :ref:`configuring datasets<dataset_tutorial_label>`
+   * Be familiar with :ref:`configuring datasets<chat_dataset_usage_label>`
    * Know how to :ref:`download Llama3 Instruct weights <llama3_label>`