Chat dataset + SlimOrca refactor + more templates #576

Merged
RdoubleA merged 6 commits into main from rafiayub/chat_dataset on Mar 28, 2024

Conversation

@RdoubleA (Contributor) commented on Mar 24, 2024:

Context

Chat and conversational data are among the most common dataset types that OSS users want to fine-tune on. Providing tools and abstractions that let users quickly configure their own chat dataset, without the overhead of writing data preprocessing from scratch, can be immensely valuable.

The challenge is designing an API that is general enough to apply to many chat datasets but not so rigid that it adds friction to the developer workflow. This is what I would primarily like early feedback on. You can see an example of how it generalizes with the slimorca_dataset builder.

Challenge: Conversational data can take many different formats and it's difficult to anticipate most or all of them

This is the biggest hurdle, but a well-designed solution here would make users' lives significantly easier, or at least provide strong guidelines for customizing to their own dataset. The approach we take is to define a few lightweight abstractions:

from typing import List, Literal, TypedDict

Role = Literal["system", "user", "assistant"]

class Message(TypedDict):
    role: Role
    content: str

Dialogue = List[Message]

These are not new ideas; they were taken straight from Meta's llama inference repo, and Axolotl does something similar. We need to enforce a particular format so that other components can be designed around this assumption, and it's not unreasonable to place the burden on users to format their data this way. This tradeoff is preferable to trying to support ANY conversation format, which would mean multiple branching if-else statements.
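
For example, a single sample in this format is just a list of role-tagged messages (a small sketch using the types above):

example_dialogue: Dialogue = [
    Message(role="system", content="You are a helpful assistant."),
    Message(role="user", content="What is the capital of France?"),
    Message(role="assistant", content="The capital of France is Paris."),
]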

The user converts their data into this format via convert_to_dialogue, a mandatory Callable parameter. The contract is pretty clear: process a Sample and return a Dialogue. You can see an example in the sharegpt_to_llama_dialogue transform. Users typically want to transform their data anyway as a preprocessing step before templating and tokenization; this parameter simply takes the place of that step.
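
For illustration, a minimal convert_to_dialogue callable for ShareGPT-style data might look like the sketch below. This assumes the common ShareGPT layout ({"conversations": [{"from": ..., "value": ...}]}); the function name is hypothetical, and the actual sharegpt_to_llama_dialogue transform in this PR may differ in details.

from typing import Any, Mapping

def sharegpt_sample_to_dialogue(sample: Mapping[str, Any]) -> Dialogue:
    # Map ShareGPT speaker tags onto the Role literals defined above
    role_map = {"system": "system", "human": "user", "gpt": "assistant"}
    return [
        Message(role=role_map[turn["from"]], content=turn["value"])
        for turn in sample["conversations"]
    ]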

Challenge: Multi-turn conversations

Handling multiple turns requires templating each turn individually while simultaneously respecting the max sequence length, which can easily lead to a convoluted for loop. I think the approach here ended up being relatively straightforward, but I need feedback to see if I missed any edge cases.
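
Roughly, the per-turn logic looks something like the sketch below. All names and signatures here are illustrative assumptions, not the exact code in this PR; the real implementation also has to handle BOS/EOS placement and prompt masking more carefully.

CROSS_ENTROPY_IGNORE_IDX = -100

def tokenize_dialogue(dialogue, template, tokenizer, max_seq_len):
    tokens, labels = [], []
    for message in dialogue:
        # Apply the chat template to this turn, then tokenize it
        formatted = template.format(message)
        turn_tokens = tokenizer.encode(formatted, add_bos=False, add_eos=False)
        # Only compute loss on assistant turns; mask everything else
        turn_labels = (
            turn_tokens
            if message["role"] == "assistant"
            else [CROSS_ENTROPY_IGNORE_IDX] * len(turn_tokens)
        )
        tokens.extend(turn_tokens)
        labels.extend(turn_labels)
        # Truncate once we hit the max sequence length
        if len(tokens) >= max_seq_len:
            tokens, labels = tokens[:max_seq_len], labels[:max_seq_len]
            break
    return tokens, labels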

Challenge: Interop with sample packing

This is still something I'm working through, so it is TBD.

Changelog

  • Added ChatDataset abstraction and unit tests
  • Refactored SlimOrcaDataset -> slimorca_dataset builder
  • Added new chat templates: Llama2ChatTemplate, MistralChatTemplate, ChatMLTemplate (a sketch of the ChatML-style formatting follows this list)
  • Refactored some data utilities: tokenize_prompt_and_response, truncate_if_necessary
  • Moved all non-dataset related files under torchtune/data/
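
As a reference point for the new templates, ChatML-style formatting wraps each turn in special tags. A minimal sketch of that idea (illustrative only, not the exact ChatMLTemplate class added in this PR):

def format_chatml_turn(message: Message) -> str:
    # ChatML wraps each message as <|im_start|>{role}\n{content}<|im_end|>
    return f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"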

Test plan

All unit tests and integration tests pass:
pytest tests --with-integration

E2E test with a recipe: TODO


@RdoubleA RdoubleA changed the title [WIP] Chat dataset + SlimOrca refactor Chat dataset + SlimOrca refactor + more templates Mar 25, 2024
@RdoubleA RdoubleA requested a review from winglian March 25, 2024 22:53
Review thread on the template-string validation (code from the diff):

except InstantiationError:
    # Verify that string can be used as a template, should have variable
    # placeholders
    pattern = r"\{.+?\}"
Contributor commented:

Is this the most robust validation? E.g. I think \{hello\} will pass but is not a valid template. Not to mention that we are not validating # of args or anything like that. Not a huge deal cause I know config validation is hard, but just wanna be realistic about how much we can accomplish with this.

Contributor Author (@RdoubleA) replied:

why is {hello} not a valid template? It technically is with the variable placeholder hello

Contributor replied:

>>> hi = "\{hello\}"
>>> hi.format(hello='a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'hello\\'

Contributor Author (@RdoubleA) replied:

ok I think I just needed to tighten the regex:
[screenshot of the updated regex]
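
The exact updated pattern is in the screenshot and not reproduced here. As an illustration of the idea (an assumption, not necessarily the change that was made), restricting placeholder names to identifier characters is enough to reject escaped braces:

import re

pattern = r"\{\w+\}"
assert re.search(pattern, "hello {name}") is not None  # valid template string
assert re.search(pattern, r"\{hello\}") is None         # escaped braces no longer match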

Review thread on resolving the template string (code from the diff):

ValueError: if the template is not a PromptTemplate class or a proper
    template string
"""
path = "torchtune.data." + template
Contributor commented:

I'm confused, isn't this different from our usual instantiate logic? Why the change here?

Contributor Author (@RdoubleA) replied on Mar 26, 2024:

this is different from instantiate because it is working with the string directly instead of a DictConfig

Contributor replied:

I see. Personally I find that a little bit confusing, but I guess we don't expose this in configs anyways, right? (At least in the current form)

Contributor Author (@RdoubleA) replied:

no - this is strictly for the dataset builders. this method was originally in datasets/utils but I moved it to config since it was more akin to config functionality
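
To make the distinction concrete, the builder-facing path works on a plain string rather than a DictConfig. A rough sketch of the idea (hypothetical helper name, not the exact function in this PR):

import importlib

def get_template_class(template: str):
    # Resolve e.g. "Llama2ChatTemplate" against torchtune.data directly from the string
    module_path, _, class_name = ("torchtune.data." + template).rpartition(".")
    return getattr(importlib.import_module(module_path), class_name)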

@ebsmothers (Contributor) left a review:
Great to see the improved chat dataset support! Left a bunch of comments but no major concerns from my side.

@RdoubleA merged commit 8245523 into main on Mar 28, 2024
20 checks passed
@RdoubleA deleted the rafiayub/chat_dataset branch on March 28, 2024 at 19:52
tcapelle pushed a commit to tcapelle/torchtune that referenced this pull request Apr 5, 2024