Messages and message transforms docs #1574

Merged
merged 7 commits on Sep 21, 2024
2 changes: 2 additions & 0 deletions docs/source/basics/chat_datasets.rst
@@ -22,6 +22,8 @@ The primary entry point for fine-tuning with chat datasets in torchtune is the :
builder. This lets you specify a local or Hugging Face dataset that follows the chat data format
directly from the config and train your LLM on it.

.. _example_chat:

Example chat dataset
--------------------

2 changes: 2 additions & 0 deletions docs/source/basics/instruct_datasets.rst
@@ -14,6 +14,8 @@ The primary entry point for fine-tuning with instruct datasets in torchtune is t
builder. This lets you specify a local or Hugging Face dataset that follows the instruct data format
directly from the config and train your LLM on it.

.. _example_instruct:

Example instruct dataset
------------------------

103 changes: 103 additions & 0 deletions docs/source/basics/message_transforms.rst
@@ -0,0 +1,103 @@
.. _message_transform_usage_label:

==================
Message Transforms
==================

Message transforms convert raw sample dictionaries from your dataset into torchtune's
:class:`~torchtune.data.Message` structure. Once your data is represented as Messages, torchtune handles
tokenization and prepares it for the model.

.. TODO (rafiayub): place an image here to depict overall pipeline


Configuring message transforms
------------------------------
Most of our built-in message transforms contain parameters for controlling input masking (``train_on_input``),
adding a system prompt (``new_system_prompt``), and changing the expected column names (``column_map``).
These are exposed in our dataset builders :func:`~torchtune.datasets.instruct_dataset` and :func:`~torchtune.datasets.chat_dataset`
so you don't have to worry about the message transform itself and can configure this directly from the config.
You can see :ref:`example_instruct` or :ref:`example_chat` for more details.
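For example, a config entry using :func:`~torchtune.datasets.instruct_dataset` might look like the following sketch. The ``data_files`` path, prompt text, and column names here are placeholders for your own dataset, not real values:

```yaml
dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: json
  data_files: data/my_data.json  # placeholder path
  split: train
  train_on_input: False
  new_system_prompt: You are a helpful assistant.
  # Map the expected "input"/"output" columns to your dataset's column names
  column_map:
    input: prompt
    output: response
```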


Custom message transforms
-------------------------
If our built-in message transforms do not fit your particular dataset well,
you can create your own class with full flexibility. Simply inherit from the :class:`~torchtune.modules.transforms.Transform`
class and add your code in the ``__call__`` method.

A simple contrived example would be to take one column from the dataset as the user message and another
column as the model response. Indeed, this is quite similar to :class:`~torchtune.data.InputOutputToMessages`.

.. code-block:: python

from torchtune.modules.transforms import Transform
from torchtune.data import Message
from typing import Any, Mapping

class MessageTransform(Transform):
def __call__(self, sample: Mapping[str, Any]) -> list[Message]:
return [
Message(
role="user",
content=sample["input"],
masked=True,
eot=True,
),
Message(
role="assistant",
content=sample["output"],
masked=False,
eot=True,
),
]

sample = {"input": "hello world", "output": "bye world"}
transform = MessageTransform()
messages = transform(sample)
print(messages)
# [<torchtune.data._messages.Message at 0x7fb0a10094e0>,
# <torchtune.data._messages.Message at 0x7fb0a100a290>]
for msg in messages:
print(msg.role, msg.text_content)
# user hello world
# assistant bye world

See :ref:`creating_messages` for more details on how to manipulate :class:`~torchtune.data.Message` objects.

To use this for your dataset, you must create a custom dataset builder that uses the underlying
dataset class, :class:`~torchtune.datasets.SFTDataset`.

.. code-block:: python

# In data/dataset.py
from torchtune.datasets import SFTDataset

def custom_dataset(tokenizer, **load_dataset_kwargs) -> SFTDataset:
message_transform = MessageTransform()
return SFTDataset(
source="json",
data_files="data/my_data.json",
split="train",
message_transform=message_transform,
model_transform=tokenizer,
**load_dataset_kwargs,
)

This can be used directly from the config.

.. code-block:: yaml

dataset:
_component_: data.dataset.custom_dataset


Example message transforms
--------------------------
- Instruct
- :class:`~torchtune.data.InputOutputToMessages`
- Chat
- :class:`~torchtune.data.ShareGPTToMessages`
- :class:`~torchtune.data.JSONToMessages`
- Preference
- :class:`~torchtune.data.ChosenRejectedToMessages`

240 changes: 240 additions & 0 deletions docs/source/basics/messages.rst
@@ -0,0 +1,240 @@
.. _messages_usage_label:

========
Messages
========

Messages are a core component in torchtune that governs how text and multimodal content is tokenized. They serve as the common interface
for all tokenizer and dataset APIs to operate on. Messages contain information about the text content, which role is sending the text
content, and other information relevant for special tokens in model tokenizers. For more information about the individual parameters
for Messages, see the API ref for :class:`~torchtune.data.Message`.

.. _creating_messages:

Creating Messages
-----------------

Messages can be created via the standard class constructor or directly from a dictionary.

.. code-block:: python

from torchtune.data import Message

msg = Message(
role="user",
content="Hello world!",
masked=True,
eot=True,
ipython=False,
)
# This is identical
msg = Message.from_dict(
{
"role": "user",
"content": "Hello world!",
"masked": True,
"eot": True,
"ipython": False,
},
)
print(msg.content)
# [{'type': 'text', 'content': 'Hello world!'}]

Content is formatted as a list of dictionaries. This is because Messages can also contain multimodal content, such as images.

Images in Messages
^^^^^^^^^^^^^^^^^^

For multimodal datasets, you need to add the image as a :class:`~PIL.Image.Image` to the corresponding :class:`~torchtune.data.Message`.
To add it to the beginning of the message, simply prepend it to the content list.

.. code-block:: python

import PIL
from torchtune.data import Message

img_msg = Message(
role="user",
content=[
{
"type": "image",
# Place your image here
"content": PIL.Image.new(mode="RGB", size=(4, 4)),
},
{"type": "text", "content": "What's in this image?"},
],
)

This indicates to the model tokenizer where to add the image special token, and the image will be processed
appropriately by the model transform.

In many cases, you will have an image path instead of a raw :class:`~PIL.Image.Image`. You can use the :func:`~torchtune.data.load_image`
utility for both local paths and remote paths.

.. code-block:: python

import PIL
from torchtune.data import Message, load_image

image_path = "path/to/image.jpg"
img_msg = Message(
role="user",
content=[
{
"type": "image",
# Place your image here
"content": load_image(image_path),
},
{"type": "text", "content": "What's in this image?"},
],
)

If your dataset contains image tags, or placeholder text indicating where in the text the image should be inserted,
you can use the :func:`~torchtune.data.format_content_with_images` utility to split the text into the correct content list
that you can pass into the content field of the Message.

.. code-block:: python

import PIL
from torchtune.data import format_content_with_images

content = format_content_with_images(
"<|image|>hello <|image|>world",
image_tag="<|image|>",
images=[PIL.Image.new(mode="RGB", size=(4, 4)), PIL.Image.new(mode="RGB", size=(4, 4))]
)
print(content)
# [
# {"type": "image", "content": <PIL.Image.Image>},
# {"type": "text", "content": "hello "},
# {"type": "image", "content": <PIL.Image.Image>},
# {"type": "text", "content": "world"}
# ]

Message transforms
^^^^^^^^^^^^^^^^^^
Message transforms are convenient utilities to format raw data into a list of torchtune :class:`~torchtune.data.Message`
objects.

.. code-block:: python

from torchtune.data import InputOutputToMessages

sample = {
"input": "What is your name?",
"output": "I am an AI assistant, I don't have a name."
}
transform = InputOutputToMessages()
output = transform(sample)
for message in output["messages"]:
print(message.role, message.text_content)
# user What is your name?
# assistant I am an AI assistant, I don't have a name.

See :ref:`message_transform_usage_label` for more discussion.


Formatting messages with prompt templates
-----------------------------------------

Prompt templates provide a way to format messages into a structured text template. You can simply call any class that inherits
from :class:`~torchtune.data.PromptTemplateInterface` on a list of Messages and it will add the appropriate text to the content
list.

.. code-block:: python

from torchtune.models.mistral import MistralChatTemplate
from torchtune.data import Message

msg = Message(
role="user",
content="Hello world!",
masked=True,
eot=True,
ipython=False,
)
template = MistralChatTemplate()
templated_msg = template([msg])
print(templated_msg[0].content)
# [{'type': 'text', 'content': '[INST] '},
# {'type': 'text', 'content': 'Hello world!'},
# {'type': 'text', 'content': ' [/INST] '}]

Accessing text content in messages
----------------------------------
The ``text_content`` property concatenates all text items in a message's content list into a single string.

.. code-block:: python

from torchtune.models.mistral import MistralChatTemplate
from torchtune.data import Message

msg = Message(
role="user",
content="Hello world!",
masked=True,
eot=True,
ipython=False,
)
template = MistralChatTemplate()
templated_msg = template([msg])
print(templated_msg[0].text_content)
# [INST] Hello world! [/INST]

Accessing images in messages
----------------------------
If a message contains images, ``contains_media`` will return True and ``get_media()`` will return the list of images in the content.

.. code-block:: python

from torchtune.data import Message
import PIL

msg = Message(
role="user",
content=[
{
"type": "image",
# Place your image here
"content": PIL.Image.new(mode="RGB", size=(4, 4)),
},
{"type": "text", "content": "What's in this image?"},
],
)
if msg.contains_media:
print(msg.get_media())
# [<PIL.Image.Image image mode=RGB size=4x4 at 0x7F8D27E72740>]

Tokenizing messages
-------------------
All model tokenizers have a ``tokenize_messages`` method that converts a list of
:class:`~torchtune.data.Message` objects into token IDs and a loss mask.

.. code-block:: python

from torchtune.models.mistral import mistral_tokenizer
from torchtune.data import Message

m_tokenizer = mistral_tokenizer(
path="/tmp/Mistral-7B-v0.1/tokenizer.model",
prompt_template="torchtune.models.mistral.MistralChatTemplate",
max_seq_len=8192,
)
msgs = [
Message(
role="user",
content="Hello world!",
masked=True,
eot=True,
ipython=False,
),
Message(
role="assistant",
content="Hi, I am an AI assistant.",
masked=False,
eot=True,
ipython=False,
)
]
tokens, mask = m_tokenizer.tokenize_messages(msgs)
print(tokens)
# [1, 733, 16289, 28793, 22557, 1526, 28808, 28705, 733, 28748, 16289, 28793, 15359, 28725, 315, 837, 396, 16107, 13892, 28723, 2]
print(mask) # User message is masked from the loss
# [True, True, True, True, True, True, True, True, True, True, True, True, False, False, False, False, False, False, False, False, False]
print(m_tokenizer.decode(tokens))
# [INST] Hello world! [/INST] Hi, I am an AI assistant.
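To sketch how such a loss mask is typically consumed downstream, masked positions can be replaced with an ignore index so they do not contribute to the loss. This is an illustration, not torchtune's exact implementation; ``-100`` is the common PyTorch cross-entropy convention, and the helper name is hypothetical:

```python
# Illustrative sketch: turn a boolean loss mask into training labels.
# Positions where the mask is True (the user prompt) are replaced with an
# ignore index so the loss skips them. -100 matches the default
# ignore_index of PyTorch's cross entropy; torchtune's constant may differ.
IGNORE_INDEX = -100

def mask_labels(tokens: list[int], mask: list[bool]) -> list[int]:
    # Replace masked token positions with IGNORE_INDEX, keep the rest.
    return [IGNORE_INDEX if masked else tok for tok, masked in zip(tokens, mask)]

tokens = [1, 733, 16289, 28793, 15359, 2]
mask = [True, True, True, True, False, False]
print(mask_labels(tokens, mask))
# [-100, -100, -100, -100, 15359, 2]
```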
2 changes: 2 additions & 0 deletions docs/source/index.rst
@@ -113,6 +113,8 @@ torchtune tutorials.
:caption: Basics
:hidden:

basics/messages
basics/message_transforms
basics/instruct_datasets
basics/chat_datasets
basics/tokenizers