Multimodal dataset builder + docs #1667
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1667
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit dc7204a with merge base c3ff864.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```yaml
dataset:
  _component_: torchtune.datasets.multimodal.multimodal_chat_dataset
  source: json
  data_files: data/my_data.json
```
can we add a dummy 2-example file of what my_data.json should look like?
that's shown above?
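For reference, here is a minimal two-record sketch of what `data/my_data.json` could look like, assuming ShareGPT-style records whose keys match the column mapping in the config that follows (`conversations` for the dialogue turns, `image` for the image path relative to `image_dir`); the filenames and text are hypothetical:

```json
[
  {
    "image": "images/0001.jpg",
    "conversations": [
      {"from": "human", "value": "<image>\nWhat is shown in this photo?"},
      {"from": "gpt", "value": "A dog catching a frisbee in a park."}
    ]
  },
  {
    "image": "images/0002.jpg",
    "conversations": [
      {"from": "human", "value": "<image>\nDescribe the scene."},
      {"from": "gpt", "value": "A crowded farmers market on a sunny morning."}
    ]
  }
]
```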
```yaml
  dialogue: conversations
  image_path: image
  image_dir: /home/user/dataset/
  image_tag: "<image>"
```
yes, it will lead to unexpected behaviors; the burden is on the user to set this correctly. I don't know if there's an easy way to check for those edge cases.
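One possible mitigation (a hypothetical sketch, not part of this PR) is to fail fast by validating the joined paths before training starts; the function name and keys below are illustrative:

```python
import json
from pathlib import Path

def validate_image_paths(data_file: str, image_dir: str, image_key: str = "image") -> None:
    """Fail fast if any sample references an image missing from image_dir."""
    records = json.loads(Path(data_file).read_text())
    missing = [r[image_key] for r in records
               if image_key in r and not (Path(image_dir) / r[image_key]).is_file()]
    if missing:
        raise FileNotFoundError(
            f"{len(missing)} image(s) not found under {image_dir}, e.g. {missing[:3]}"
        )

# Example: validate_image_paths("data/my_data.json", "/home/user/dataset/")
```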
```diff
@@ -335,14 +336,22 @@ class ShareGPTToMessages(Transform):
         ]

     Args:
-        train_on_input (bool): whether the prompt should remain unmasked. Default: False
+        train_on_input (bool): whether the prompt should remain unmasked. For multimodal datasets, ``train_on_input``
```
can we add why this is always false?
I'm not sure really, I think it's because it's not functionally useful? @pbontrager
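For background, `train_on_input` conventionally controls whether prompt tokens contribute to the loss. A simplified sketch of the usual masking pattern (the ignore value and function are illustrative, not torchtune's actual code):

```python
IGNORE_INDEX = -100  # a common cross-entropy ignore value; the name is illustrative

def build_labels(tokens: list[int], is_prompt: list[bool], train_on_input: bool) -> list[int]:
    """With train_on_input=False, prompt (user input) tokens are masked out of
    the loss, so the model is only trained on the assistant's responses."""
    if train_on_input:
        return list(tokens)  # compute loss on every token, prompt included
    return [IGNORE_INDEX if p else t for t, p in zip(tokens, is_prompt)]
```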
```python
        return {"messages": messages}
```
```python
# TODO: point to Flamingo model transform as an example
```
do we need to address this TODO?
after certain events have transpired
```python
    **load_dataset_kwargs: Dict[str, Any],
) -> SFTDataset:
    """
    Configure a text+image dataset with conversations between user and model assistant.
```
nit: It bothers me a bit that it's called multimodal, but it actually means text+image. Not asking for changes, just sharing.
I concur.
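For illustration, here is a sketch of how the builder might be invoked directly in Python, assuming the YAML keys above map one-to-one onto builder kwargs (the model transform is a placeholder, since the concrete class depends on the model being fine-tuned):

```python
from torchtune.datasets.multimodal import multimodal_chat_dataset

# Placeholder: in a recipe this is the model's tokenizer/transform.
model_transform = ...

ds = multimodal_chat_dataset(
    model_transform=model_transform,
    source="json",
    data_files="data/my_data.json",
    dialogue="conversations",
    image_path="image",
    image_dir="/home/user/dataset/",
    image_tag="<image>",
)
```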
very nice! thanks for putting this up. I didn't have time to finish reviewing `multimodal_chat_dataset`. Approving to unblock.
Context
What is the purpose of this PR?
There is currently no way to load a custom multimodal dataset. The current workflow is that users write their own transform and builder - which is already a big lift - but on top of that, our custom component imports don't even work (#1540). So basically, it's impossible to use anything other than llava instruct and cauldron. This sucks.
I added a quick `multimodal_chat_dataset` builder. It assumes conversational data with images, much like llava instruct and ShareGPT4V. To keep changes minimal, I only support ShareGPT format. We can expand to other chat formats or to instruct data later.

This unfortunately required me to generalize LlavaInstructToMessages (cc @joecummings, I agree we shouldn't generalize too early, but in this case I think it's warranted). Again, to keep changes minimal, I simply merged LlavaInstructToMessages with ShareGPTToMessages. They were identical in handling text; the only difference was the image loading logic.

With this builder, I finally have something to point to and discuss in the multimodal dataset docs, and users have an avenue to fine-tune multimodal models on their own data. A bit of extra work, but worthwhile imo.
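To make the merge concrete, here is a rough, framework-agnostic sketch of what the unified transform does - ShareGPT text handling plus optional image loading - with illustrative keys and fields; it is not the PR's actual code:

```python
from pathlib import Path
from PIL import Image

ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_messages(sample: dict, image_dir: str | None = None) -> dict:
    """Convert one ShareGPT-style record to a list of role/content messages,
    loading the referenced image (if any) for the first user turn."""
    messages, image_attached = [], False
    for turn in sample["conversations"]:
        message = {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        # Image loading is the only multimodal-specific step: attach the PIL
        # image to the first user turn when the sample references one.
        if (image_dir is not None and "image" in sample
                and message["role"] == "user" and not image_attached):
            message["image"] = Image.open(Path(image_dir) / sample["image"])
            image_attached = True
        messages.append(message)
    return {"messages": messages}
```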
Changelog
What are the changes made in this PR?
- `multimodal_chat_dataset` builder

Test plan
Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.
- pre-commit hooks (`pre-commit install`)
- `pytest tests`
- `pytest tests -m integration_test`
UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example and a tutorial example.