Chat dataset + SlimOrca refactor + more templates #576

Merged
RdoubleA merged 6 commits into main from rafiayub/chat_dataset on Mar 28, 2024

Conversation

@RdoubleA (Contributor) commented on Mar 24, 2024:

Context

Chat and conversational data are among the most common dataset types that OSS users want to fine-tune on. Providing tools and abstractions that let users quickly configure their own chat dataset, without the overhead of writing data preprocessing from scratch, can be immensely valuable.

The challenge is designing an API that is general enough to apply to many chat datasets but not so rigid that it adds friction to the developer workflow. This is what I would primarily like early feedback on. You can see an example of how it generalizes with the slimorca_dataset builder.

Challenge: Conversational data can take many different formats and it's difficult to anticipate most or all of them

This is the biggest hurdle, but a well-designed solution here would make users' lives significantly easier, or at least provide strong guidelines for customizing to their own dataset. The approach we take is to define a few lightweight abstractions:

from typing import List, Literal, TypedDict

Role = Literal["system", "user", "assistant"]

class Message(TypedDict):
    role: Role
    content: str

Dialogue = List[Message]

These are not new ideas; they were taken straight from Meta's llama inference repo, and Axolotl does something similar. We need to enforce a particular format so that other components can be designed around this assumption, and it's not unreasonable to place the burden on users to format their data this way. This tradeoff is preferable to trying to support ANY conversation format, which would mean multiple branching if-else statements.
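
For example, a single sample in this format is just a list of role-tagged messages (a small sketch using the types above):

example_dialogue: Dialogue = [
    Message(role="system", content="You are a helpful assistant."),
    Message(role="user", content="What is the capital of France?"),
    Message(role="assistant", content="The capital of France is Paris."),
]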

The user converts their data into this format via convert_to_dialogue, a mandatory Callable parameter. The contract is pretty clear: process a Sample and return a Dialogue. You can see an example in the sharegpt_to_llama_dialogue transform. Users typically want to transform their data anyway as a preprocessing step before templating and tokenization; this parameter simply takes the place of that step.
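
For illustration, a minimal convert_to_dialogue callable for ShareGPT-style data might look like the sketch below. This assumes the common ShareGPT layout ({"conversations": [{"from": ..., "value": ...}]}); the function name is hypothetical, and the actual sharegpt_to_llama_dialogue transform in this PR may differ in details.

from typing import Any, Mapping

def sharegpt_sample_to_dialogue(sample: Mapping[str, Any]) -> Dialogue:
    # Map ShareGPT speaker tags onto the Role literals defined above
    role_map = {"system": "system", "human": "user", "gpt": "assistant"}
    return [
        Message(role=role_map[turn["from"]], content=turn["value"])
        for turn in sample["conversations"]
    ]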

Challenge: Multi-turn conversations

Handling multiple turns requires templating each turn individually while simultaneously respecting the max sequence length, which can easily lead to a convoluted for loop. I think the approach here ended up being relatively straightforward, but I need feedback to see if I missed any edge cases.
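
Roughly, the per-turn logic looks something like the sketch below. All names and signatures here are illustrative assumptions, not the exact code in this PR; the real implementation also has to handle BOS/EOS placement and prompt masking more carefully.

CROSS_ENTROPY_IGNORE_IDX = -100

def tokenize_dialogue(dialogue, template, tokenizer, max_seq_len):
    tokens, labels = [], []
    for message in dialogue:
        # Apply the chat template to this turn, then tokenize it
        formatted = template.format(message)
        turn_tokens = tokenizer.encode(formatted, add_bos=False, add_eos=False)
        # Only compute loss on assistant turns; mask everything else
        turn_labels = (
            turn_tokens
            if message["role"] == "assistant"
            else [CROSS_ENTROPY_IGNORE_IDX] * len(turn_tokens)
        )
        tokens.extend(turn_tokens)
        labels.extend(turn_labels)
        # Truncate once we hit the max sequence length
        if len(tokens) >= max_seq_len:
            tokens, labels = tokens[:max_seq_len], labels[:max_seq_len]
            break
    return tokens, labels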

Challenge: Interop with sample packing

This is still something I'm working through, so it is TBD.

Changelog

  • Added ChatDataset abstraction and unit tests
  • Refactored SlimOrcaDataset -> slimorca_dataset builder
  • Added new chat templates: Llama2ChatTemplate, MistralChatTemplate, ChatMLTemplate (a sketch of the ChatML-style formatting follows this list)
  • Refactored some data utilities: tokenize_prompt_and_response, truncate_if_necessary
  • Moved all non-dataset related files under torchtune/data/
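
As a reference point for the new templates, ChatML-style formatting wraps each turn in special tags. A minimal sketch of that idea (illustrative only, not the exact ChatMLTemplate class added in this PR):

def format_chatml_turn(message: Message) -> str:
    # ChatML wraps each message as <|im_start|>{role}\n{content}<|im_end|>
    return f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"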

Test plan

All unit tests and integration tests pass:
pytest tests --with-integration

E2E test with a recipe: TODO


@RdoubleA RdoubleA changed the title [WIP] Chat dataset + SlimOrca refactor Chat dataset + SlimOrca refactor + more templates Mar 25, 2024
@RdoubleA RdoubleA requested a review from winglian March 25, 2024 22:53
Review thread on the template-string validation (code from the diff):

except InstantiationError:
    # Verify that string can be used as a template, should have variable
    # placeholders
    pattern = r"\{.+?\}"
Contributor commented:

Is this the most robust validation? E.g. I think \{hello\} will pass but is not a valid template. Not to mention that we are not validating # of args or anything like that. Not a huge deal cause I know config validation is hard, but just wanna be realistic about how much we can accomplish with this.

Contributor Author (@RdoubleA) replied:

why is {hello} not a valid template? It technically is with the variable placeholder hello

Contributor replied:

>>> hi = "\{hello\}"
>>> hi.format(hello='a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'hello\\'

Contributor Author (@RdoubleA) replied:

ok I think I just needed to tighten the regex:
[screenshot of the updated regex]
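
The exact updated pattern is in the screenshot and not reproduced here. As an illustration of the idea (an assumption, not necessarily the change that was made), restricting placeholder names to identifier characters is enough to reject escaped braces:

import re

pattern = r"\{\w+\}"
assert re.search(pattern, "hello {name}") is not None  # valid template string
assert re.search(pattern, r"\{hello\}") is None         # escaped braces no longer match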

Review thread on resolving the template string (code from the diff):

ValueError: if the template is not a PromptTemplate class or a proper
    template string
"""
path = "torchtune.data." + template
Contributor commented:

I'm confused, isn't this different from our usual instantiate logic? Why the change here?

Contributor Author (@RdoubleA) replied on Mar 26, 2024:

this is different from instantiate because it is working with the string directly instead of a DictConfig

Contributor replied:

I see. Personally I find that a little bit confusing, but I guess we don't expose this in configs anyways, right? (At least in the current form)

Contributor Author (@RdoubleA) replied:

no - this is strictly for the dataset builders. this method was originally in datasets/utils but I moved it to config since it was more akin to config functionality
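
To make the distinction concrete, the builder-facing path works on a plain string rather than a DictConfig. A rough sketch of the idea (hypothetical helper name, not the exact function in this PR):

import importlib

def get_template_class(template: str):
    # Resolve e.g. "Llama2ChatTemplate" against torchtune.data directly from the string
    module_path, _, class_name = ("torchtune.data." + template).rpartition(".")
    return getattr(importlib.import_module(module_path), class_name)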

@ebsmothers (Contributor) left a review:
Great to see the improved chat dataset support! Left a bunch of comments but no major concerns from my side.

@RdoubleA merged commit 8245523 into main on Mar 28, 2024
20 checks passed
@RdoubleA deleted the rafiayub/chat_dataset branch on March 28, 2024 at 19:52
tcapelle pushed a commit to tcapelle/torchtune that referenced this pull request Apr 5, 2024