
[6/7] SFTDataset: revamp instruct/chat #1286

Merged · 13 commits · Aug 26, 2024

Conversation

Contributor

@RdoubleA commented Aug 7, 2024

Context

Repurposes the old instruct_dataset and chat_dataset builders as config-friendly builders for SFTDataset.

  • instruct_dataset
    • creates SFTDataset with InputOutputToMessages as the default message transform, since most instruct datasets follow this format
  • chat_dataset
    • creates SFTDataset with ShareGPTToMessages or JSONToMessages (depending on the selected conversation style) as the default message transform, since most chat datasets follow this format
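With these builders, a dataset can be swapped in straight from config. A hypothetical YAML snippet follows; the `_component_` key is torchtune's config convention, but the exact argument names shown here are illustrative, not guaranteed:

```yaml
dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: json
  data_files: path/to/my_data.json
  train_on_input: False
```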

Also adds a new decorator function, deprecated, the grim reaper coming to announce the expiration of your favorite classes. The following are currently on the chopping block:

  • ChatDataset (replaced by SFTDataset)
  • InstructDataset (replaced by SFTDataset)
  • get_openai_messages (replaced by JSONToMessages)
  • get_sharegpt_messages (replaced by ShareGPTToMessages)
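The warning wiring can be sketched roughly as below. This is a minimal illustration, not torchtune's exact implementation; it reuses the DummyClass/TotallyAwesomeClass names from the unit test and the once-per-class `functools.lru_cache` trick mentioned in the review discussion:

```python
import functools
import warnings


def deprecated(msg: str = ""):
    """Sketch of a deprecation decorator: emits a FutureWarning once per decorated object."""

    # lru_cache ensures warn_once runs at most once per distinct name.
    @functools.lru_cache(maxsize=None)
    def warn_once(name: str) -> None:
        warnings.warn(
            f"{name} is deprecated and will be removed in future versions. {msg}",
            category=FutureWarning,
            stacklevel=3,
        )

    def decorator(obj):
        @functools.wraps(obj)
        def wrapper(*args, **kwargs):
            warn_once(obj.__name__)
            return obj(*args, **kwargs)

        return wrapper

    return decorator


@deprecated(msg="Please use `TotallyAwesomeClass` instead.")
class DummyClass:
    pass
```

Instantiating DummyClass twice emits the FutureWarning only on the first construction.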

Rest In Power to the following:

  • Llama2ChatFormat (replaced by Llama2ChatTemplate)
  • MistralChatFormat (replaced by MistralChatTemplate)
  • ChatMLFormat (replaced by ChatMLTemplate)

AlpacaInstructTemplate will be removed by #1284 and StackExchangedPairedTemplate removed by #1276.

Other changes:

  • Centralize ASSETS in one location and refactor

Test plan

Updated tests for instruct_dataset and chat_dataset; these now load a tiny JSON dataset with the expected format.

Removed tests for chat formats.

Added a unit test for deprecated. You can also see how the logs look when you run the tests locally:

tests/torchtune/data/test_instruct_templates.py:43
  /data/users/rafiayub/torchtune-rafiayub/tests/torchtune/data/test_instruct_templates.py:43: FutureWarning: AlpacaInstructTemplate is deprecated and will be removed in future versions. 
    template = AlpacaInstructTemplate()

tests/torchtune/data/test_converters.py::TestShareGPTToLlama2Messages::test_conversion
  /data/users/rafiayub/torchtune-rafiayub/tests/torchtune/data/test_converters.py:35: FutureWarning: get_sharegpt_messages is deprecated and will be removed in future versions. Please use `torchtune.data.ShareGPTToMessages` with `torchtune.datasets.SFTDataset` instead.
    converted_messages = get_sharegpt_messages(self.samples)

tests/torchtune/data/test_converters.py::TestOpenAIToLlama2Messages::test_conversion_conversations_key
  /data/users/rafiayub/torchtune-rafiayub/tests/torchtune/data/test_converters.py:81: FutureWarning: get_openai_messages is deprecated and will be removed in future versions. Please use `torchtune.data.JSONToMessages` with `torchtune.datasets.SFTDataset` instead.
    converted_messages_1 = get_openai_messages(self.samples_1)


pytorch-bot bot commented Aug 7, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1286

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 1ea12f4 with merge base 3e29e6b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Aug 7, 2024. (This label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed.)
@@ -25,3 +29,34 @@ def get_logger(level: Optional[str] = None) -> logging.Logger:
level = getattr(logging, level.upper())
logger.setLevel(level)
return logger


def deprecated(msg: str = "") -> Callable[[T], T]:
Contributor

To avoid blowing up people's logs, is it possible to make sure that:

  1. This only logs once (not every time it is hit)
  2. This only logs on rank0

Contributor

Nice-to-have: include an explicit version number when we will remove support for the API

Contributor

Also this is kinda orthogonal, but I don't like that our utils directory == trainer utils. Because it takes deps on pretty much every other directory, which means we may run into circular import issues if we try to use this in our lower-level components. I would think about whether there's a place further upstream in our dependency graph we can put this (if only we didn't already commandeer the utils folder name with our furthest downstream components..)

Contributor Author

  1. The lru cache ensures it's logged only once per class; I also verified this in the unit test
  2. Good point, will add this

Contributor Author

> there's a place further upstream in our dependency graph we can put this

Not entirely sure what this would mean in practice; do you mean a different file?

Contributor

Probably a separate directory. Because otherwise anywhere we add from utils.file_name import SomeClass will go through utils/__init__.py, right? And that will import all our APIs with all their upstream dependencies in data, datasets, modules, models. So we'll get stuck with a circular dependency unless we move this out of utils entirely


@@ -37,6 +39,7 @@ def format(
pass


@deprecated()
Contributor

Why no message for this one?

Contributor Author

Re: our discussion on the Alpaca PR, the template will just be absorbed into the message transform. This class will be removed outright instead of deprecated.

torchtune/data/_converters.py (outdated; resolved)


@deprecated(msg="Please use `torchtune.datasets.SFTDataset` for custom chat data.")
Collaborator

I think @ebsmothers already raised this, but worth pointing the user to a concrete example of a replacement?
If not, this says "chat" instead of instruct btw

@RdoubleA RdoubleA changed the title [6/7] SFTDataset: deprecate instruct/chat [6/7] SFTDataset: revamp instruct/chat Aug 20, 2024
Contributor

@ebsmothers left a comment

Can you update the PR summary? I think it's not actually clearly emphasizing the latest set of changes you've made

torchtune/utils/logging.py (outdated; resolved)
torchtune/datasets/_stack_exchange_paired.py (outdated; resolved)
@@ -36,7 +36,7 @@ def test_label_no_masking(self, load_dataset, tokenizer):
]
)

- grammar_ds = grammar_dataset(model_transform=tokenizer, train_on_input=True)
+ grammar_ds = grammar_dataset(tokenizer=tokenizer, train_on_input=True)
Contributor

didn't we just change this lol. Not opposed to changing it back, just curious why we changed our mind here

Contributor Author

I'm honestly on the fence here, but discussed with @pbontrager and agreed that we'll take a look at these holistically at the end and make a call once we work on multimodal recipes. For now I've been following text datasets = tokenizer, SFTDataset and multimodal = model transform

from torchtune.data._utils import deprecated


def test_deprecated():
Contributor

nit: put in a test class?

@@ -73,3 +77,34 @@ def validate_messages(
f"System message at index {i} in messages, but system messages must come first"
)
last_turn = message.role


def deprecated(msg: str = "") -> Callable[[T], T]:
Contributor

Sorry I will never stop harping on this, but this time I come with an actual proposal.

How about a utils/_internal directory with its own __init__.py file. Then we just do from torchtune.utils._internal import deprecated, pretty sure any such usage will not trigger the imports in the parent directory's __init__.py. Admittedly a little bit confusing but better than what we have now and I think it should work. Please let me know if I am fundamentally misunderstanding how Python imports work (always a possibility)

Contributor Author

I can try this

Contributor Author

Unfortunately, any import from torchtune.utils requires hitting the __init__.py file, so this does not work.
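The reason is a basic property of Python's import system: importing a subpackage always executes the parent package's `__init__.py` first. A throwaway demonstration (all package names here are made up for illustration; `fakepkg` stands in for `torchtune/utils`):

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package on disk:
#   fakepkg/__init__.py            <- stands in for torchtune/utils/__init__.py
#   fakepkg/_internal/__init__.py  <- where `deprecated` would live
tmpdir = tempfile.mkdtemp()
internal = os.path.join(tmpdir, "fakepkg", "_internal")
os.makedirs(internal)
with open(os.path.join(tmpdir, "fakepkg", "__init__.py"), "w") as f:
    f.write("PARENT_INIT_RAN = True\n")  # heavy downstream imports would live here
with open(os.path.join(internal, "__init__.py"), "w") as f:
    f.write("def deprecated(msg=''):\n    return lambda obj: obj\n")

sys.path.insert(0, tmpdir)
importlib.import_module("fakepkg._internal")

# Importing the subpackage imported the parent first, running its __init__.py:
parent_ran = getattr(sys.modules["fakepkg"], "PARENT_INIT_RAN", False)
```

So `from torchtune.utils._internal import deprecated` would still pull in everything `torchtune/utils/__init__.py` imports, which is why the `_internal` proposal cannot break the cycle.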

Contributor

ugh should've known it was too good to be true..

torchtune/datasets/_grammar.py (resolved)

with pytest.warns(
FutureWarning,
match="DummyClass is deprecated and will be removed in future versions. Please use `TotallyAwesomeClass` instead.",
Contributor

This is a nit, but it's a bit awkward to me that we split the log message across the utility and the arg we pass. Passing msg="Please use TotallyAwesomeClass instead" assumes that you know exactly what the first half of the message is, which is somewhat annoying to go check every time.

Contributor Author

Yeah, I could just pass in the full message every time, but that gets tedious. I suppose I could make it a required argument and do that for simplicity.

Comment on lines +237 to +239
message_transform = InputOutputToMessages(
train_on_input=train_on_input, column_map=column_map
)
Contributor

Did we discuss renaming this or did I imagine that? Also I'm a bit out of the loop on #1366 but why do we use a single static system prompt in InputOutputToMessages now? Is that sufficient? (Sorry if you guys already hashed this all out)

Contributor Author

Don't recall that discussion but @joecummings was interested in potentially renaming this.

As for the system prompt: yes, a single prompt is sufficient to add a system message to every sample conversation.
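The idea can be sketched as below. This is a hedged illustration of the transform's role, not the actual torchtune.data.InputOutputToMessages API; field names like "masked" and the function signature are assumptions for the sake of the example:

```python
from typing import Any, Dict, List, Optional


def input_output_to_messages(
    sample: Dict[str, Any],
    train_on_input: bool = False,
    new_system_prompt: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Map an {"input": ..., "output": ...} sample to a list of chat messages."""
    messages = []
    if new_system_prompt is not None:
        # The same static prompt is prepended to every sample conversation.
        messages.append({"role": "system", "content": new_system_prompt, "masked": True})
    messages.append(
        # The user turn is masked from the loss unless train_on_input is set.
        {"role": "user", "content": sample["input"], "masked": not train_on_input}
    )
    messages.append({"role": "assistant", "content": sample["output"], "masked": False})
    return messages


msgs = input_output_to_messages(
    {"input": "Fix my grammar: he go home", "output": "He goes home."},
    new_system_prompt="You are a helpful grammar assistant.",
)
```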

torchtune/data/_prompt_templates.py (resolved)
tests/torchtune/datasets/test_instruct_dataset.py (outdated; resolved)
Collaborator

@SalmanMohammadi left a comment

Looks great @RdoubleA. Thanks for this. Only 14.285714285% of the work left!

@RdoubleA RdoubleA merged commit 7e084d9 into pytorch:main Aug 26, 2024
20 checks passed
@RdoubleA RdoubleA deleted the deprecate_instruct_chat branch August 26, 2024 19:20