Documentation dataset format #2020
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Really awesome docs @qgallouedec - thanks for bringing some order to the chaos of dataset formats ❤️ !
Everything LGTM, with one main question about what is meant by "standard dataset format". What I'm wondering in particular is whether we expect users to preformat their datasets for each trainer, or whether we accept some formats, like a `messages` column, that are automatically formatted in our scripts.
Apart from this, feel free to merge with the nits!
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Yes... "standard" is in fact "non-conversational". But it's weird to define something by something it's not.
Currently we expect users to preformat their datasets. But I think supporting everything in the trainers can make sense:

```python
from datasets import Dataset

standard_dataset = Dataset.from_dict(
    {
        "prompt": ["The sky is", "The sun is"],
        "completion": [" blue.", " in the sky."],
    }
)
AnyTrlTrainer(..., train_dataset=standard_dataset)  # ok

conversational_dataset = Dataset.from_dict(
    {
        "prompt": [
            [{"role": "user", "content": "What color is the sky?"}],
            [{"role": "user", "content": "Where is the sun?"}],
        ],
        "completion": [
            [{"role": "assistant", "content": "It is blue."}],
            [{"role": "assistant", "content": "In the sky."}],
        ],
    }
)
AnyTrlTrainer(..., train_dataset=conversational_dataset)  # currently not ok, but we can support it in the future
```
Ah, that makes sense - thanks for the clarification. One thing I'm wondering is whether there's any real advantage to supporting both standard and conversational formats, since technically every standard dataset can be converted to a conversational one by wrapping the prompt, completion, etc. in the messages format. Given that most models are chat models, one approach would be to support conversational datasets natively in the trainers, but also allow users to provide a preprocessed / tokenized dataset if they want more flexibility. This is also related to #1646
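To make the "wrapping" idea concrete, here is a minimal sketch of converting a standard prompt/completion example into the conversational layout used earlier in this thread. The helper name `to_conversational` is hypothetical (it is not a TRL function), and mapping the prompt to a `user` turn and the completion to an `assistant` turn is an assumption about how such a conversion would be done.

```python
def to_conversational(example):
    """Wrap a standard prompt/completion pair in the messages format.

    Hypothetical helper for illustration only: it assumes the prompt
    becomes a single user turn and the completion a single assistant turn.
    """
    return {
        "prompt": [{"role": "user", "content": example["prompt"]}],
        "completion": [{"role": "assistant", "content": example["completion"]}],
    }


# Same toy data as the standard dataset above, as plain Python dicts.
standard_examples = [
    {"prompt": "The sky is", "completion": " blue."},
    {"prompt": "The sun is", "completion": " in the sky."},
]

conversational_examples = [to_conversational(ex) for ex in standard_examples]
```

With a `datasets.Dataset`, the same function could presumably be passed to `Dataset.map`. Note that the wrapping keeps the text untouched, so a completion-style prompt like "The sky is" does not become a natural chat question on its own.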
What does this PR do?
Fixes # (issue)
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
Help for review
New section in the doc
New data utils
`maybe_apply_chat_template`, `maybe_extract_prompt`, `maybe_unpair_preference_dataset`, `apply_chat_template`, `extract_prompt` and `unpair_preference_dataset`.
Update dataset example files
As explained in this section, we provide scripts for converting datasets to the TRL style.
The goal is for every dataset in `trl-lib` to have its corresponding script in `examples/datasets`.
Update `examples/datasets` to closely match the documented format.
Start to update some example scripts
We have to update all scripts to make sure they comply with the new style.
I prefer to dedicate a further PR to updating them all.
Update `SIMPLE_QUERY_CHAT_TEMPLATE` to include a prompt generation
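For reviewers, a rough pure-Python sketch of the idea behind two of the data utils listed under "New data utils": pulling the shared prompt out of a preference pair, and unpairing a preference example into labelled rows. The function names `extract_shared_prompt` and `unpair_preferences` are hypothetical stand-ins, not the TRL functions, and the real implementations may differ in details such as row ordering and batching.

```python
def extract_shared_prompt(chosen, rejected):
    """Split two conversations into their shared leading messages and the rest.

    Hypothetical illustration: the prompt is taken to be the longest run of
    identical messages at the start of both conversations.
    """
    idx = 0
    while idx < min(len(chosen), len(rejected)) and chosen[idx] == rejected[idx]:
        idx += 1
    return {"prompt": chosen[:idx], "chosen": chosen[idx:], "rejected": rejected[idx:]}


def unpair_preferences(examples):
    """Turn prompt/chosen/rejected examples into prompt/completion/label rows."""
    unpaired = []
    for ex in examples:
        unpaired.append({"prompt": ex["prompt"], "completion": ex["chosen"], "label": True})
        unpaired.append({"prompt": ex["prompt"], "completion": ex["rejected"], "label": False})
    return unpaired


question = {"role": "user", "content": "What color is the sky?"}
chosen = [question, {"role": "assistant", "content": "It is blue."}]
rejected = [question, {"role": "assistant", "content": "It is green."}]

# First recover the explicit prompt, then split the pair into two labelled rows.
preference_example = extract_shared_prompt(chosen, rejected)
unpaired_rows = unpair_preferences([preference_example])
```

Here `preference_example["prompt"]` holds only the shared user turn, and `unpaired_rows` contains one row labelled `True` for the chosen answer and one labelled `False` for the rejected answer.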