.. _packing_usage_label:

==============
Sample packing
==============

Sample packing involves concatenating multiple samples from your dataset into a single sequence, up to a maximum
sequence length. This requires some pre-processing of the dataset, which may
slow down time-to-first-batch, but can yield significant training speedups
depending on the dataset. In torchtune, sample packing is done by iterating through your dataset and performing
greedy packing upon dataset initialization. You can use sample packing with any of the single dataset builders by passing in
:code:`packed=True`.
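
For intuition, the snippet below is a simplified sketch of the greedy packing idea (it is not torchtune's actual
``PackedDataset`` implementation, and it ignores details such as samples longer than the maximum length): tokenized
samples are appended to the current pack until the next sample would overflow ``max_seq_len``, at which point a new
pack is started.

.. code-block:: python

    # Illustration only: a toy greedy packer, not torchtune's PackedDataset
    def greedy_pack(tokenized_samples, max_seq_len):
        packs, current = [], []
        for tokens in tokenized_samples:
            # If this sample would overflow the current pack, start a new one
            if current and len(current) + len(tokens) > max_seq_len:
                packs.append(current)
                current = []
            current.extend(tokens)
        if current:
            packs.append(current)
        return packs

    # Three toy "samples" packed to a max length of 8
    print(greedy_pack([[1, 2, 3], [4, 5, 6, 7], [8, 9]], max_seq_len=8))
    # [[1, 2, 3, 4, 5, 6, 7], [8, 9]]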

To set the max sequence length to pack to, make sure to define ``max_seq_len`` on your tokenizer.

.. code-block:: python

    from torchtune.datasets import alpaca_dataset, PackedDataset
    from torchtune.models.llama3 import llama3_tokenizer

    # Load in tokenizer
    tokenizer = llama3_tokenizer(
        path="/tmp/Llama-3.2-1B-Instruct/original/tokenizer.model",
        max_seq_len=8192,
    )
    dataset = alpaca_dataset(
        tokenizer=tokenizer,
        packed=True,
    )
    print(isinstance(dataset, PackedDataset))  # True

.. code-block:: yaml

    # YAML config
    tokenizer:
      _component_: torchtune.models.llama3.llama3_tokenizer
      path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
      max_seq_len: 8192

    dataset:
      _component_: torchtune.datasets.alpaca_dataset
      packed: True

.. code-block:: bash

    # Command line
    tune run full_finetune_single_device --config llama3_2/1B_full_single_device \
        dataset.packed=True tokenizer.max_seq_len=8192

torchtune will automatically handle document masking and relative position IDs when sample packing is enabled,
so that tokens from different samples in a pack cannot attend to each other. This is done via PyTorch's `Flex Attention <https://pytorch.org/blog/flexattention/#document-maskingjagged-sequences>`_,
which enables the use of flash attention with non-causal masks. If your hardware does not support Flex Attention
(for CUDA devices, it must be Turing or above), standard SDPA with memory-efficient attention will be used as a fallback,
while retaining the document masking and relative position IDs.
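
To illustrate the mechanism, the snippet below uses PyTorch's Flex Attention API directly to build a combined
causal and document mask over a packed sequence. It is a standalone sketch, not torchtune's internal code: the
sample lengths, tensor shapes, and the ``document_causal_mask`` helper are made up for illustration, and it assumes
a CUDA device that supports Flex Attention.

.. code-block:: python

    import torch
    from torch.nn.attention.flex_attention import create_block_mask, flex_attention

    # Per-token document IDs for a packed sequence: three hypothetical samples
    # of lengths 100, 120, and 36 packed into one 256-token sequence.
    doc_lengths = torch.tensor([100, 120, 36])
    document_ids = torch.repeat_interleave(
        torch.arange(len(doc_lengths)), doc_lengths
    ).to("cuda")
    seq_len = document_ids.numel()  # 256

    def document_causal_mask(b, h, q_idx, kv_idx):
        # A token may only attend to earlier tokens from the same document
        same_doc = document_ids[q_idx] == document_ids[kv_idx]
        causal = q_idx >= kv_idx
        return same_doc & causal

    block_mask = create_block_mask(
        document_causal_mask, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len
    )

    # Toy attention inputs: batch=1, heads=2, seq_len=256, head_dim=64
    q, k, v = (torch.randn(1, 2, seq_len, 64, device="cuda") for _ in range(3))
    out = flex_attention(q, k, v, block_mask=block_mask)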