-
Notifications
You must be signed in to change notification settings - Fork 368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Messages and message transforms docs #1574
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1574
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 5be46fc with merge base 9a863c8 (): This comment was automatically generated by Dr. CI and updates every 15 minutes. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1574 +/- ##
===========================================
- Coverage 70.96% 26.23% -44.73%
===========================================
Files 295 295
Lines 15080 15080
===========================================
- Hits 10701 3956 -6745
- Misses 4379 11124 +6745
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Rafi! It is well written, but like in the other PR review, i think the user can benefit from seeing more code, end2end, and less paragraphs.
The end2end, IMO, is important, showing the creation of a dummy dataset, etc. I think that an image with a mental model would be super helpful too
|
||
Messages are a core component in torchtune that govern how text and multimodal content is tokenized. It serves as the common interface | ||
for all tokenizer and datasets APIs to operate on. Messages contain information about the text content, which role is sending the text | ||
content, and other information relevant for special tokens in model tokenizers. For more information about the individual parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comments as before. IMO this should be 1 or 2 lines + maybe image with pipeline + code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally think this is pretty concise as is. Will add a TODO to add an image. Planning to make an image for the whole data pipeline and I'll place parts of that where relevant in a follow up
|
||
.. code-block:: python | ||
|
||
from torchtune.data import Message |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great. Maybe giving a bit of context in the code, showing which class creates or receives a message, i think it would make it easier for the user to understand how things connect end2end.
Content is formatted as a list of dictionaries. This is because Messages can also contain multimodal content, such as images. | ||
|
||
Images in Messages | ||
^^^^^^^^^^^^^^^^^^ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is great, but i think we need a bit of end2end, showing how to create the dataset and where messages are used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was actually thinking of a section "Tokenizing Messages" or similar to show how they are consumed
docs/source/basics/messages.rst
Outdated
Message transforms | ||
^^^^^^^^^^^^^^^^^^ | ||
Message transforms are convenient utilities to format raw data into a list of torchtune :class:`~torchtune.data.Message` | ||
objects. See :ref:`message_transform_usage_label` for more discussion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Message transforms and "Formatting messages with prompt templates" feels a bit out of place. Maybe it should be part of the previous sections
- :class:`~torchtune.data.ShareGPTToMessages` | ||
- :class:`~torchtune.data.JSONToMessages` | ||
- Preference | ||
- :class:`~torchtune.data.ChosenRejectedToMessages` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you also add the stack exchange paired transform?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually I didn't expose that class since it's specific to stack exchange paired. I think in the future it might be nice to generalize that class
We should TODO to write something about custom preference dataset transforms. |
Context
What is the purpose of this PR? Is it to
Describe a fundamental component,
Message
, and how to create, format, and tokenize messages. and the concept of message transformsTest plan
doc build