
Remove ChatFormat, InstructTemplate, old message converters #1895

Merged 5 commits into pytorch:main on Oct 28, 2024

Conversation

RdoubleA
Contributor

@RdoubleA RdoubleA commented Oct 24, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Closes #1839. Closes #1849.

Changelog

What are the changes made in this PR?

  • Delete ChatFormat, InstructTemplate, and all references
  • Delete torchtune/data/_converters.py. These are replaced by the transforms in _messages.py
  • Delete the old dataset tutorial. All of its topics are already covered in the Basics section of the documentation, and the tutorial refers to outdated APIs. Only sample packing was not yet covered elsewhere, so I moved it to its own section under Basics.
  • Update the generate recipe to assume a single prompt string and no chat format or instruct template, since the prompt template is now defined on the tokenizer.
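As a rough illustration of the transform pattern that replaces the deleted converters: a to-messages transform is just a callable that maps a raw dataset sample to structured messages. The class and field names below are simplified stand-ins, not torchtune's exact API (the real Message carries extra fields such as masking flags):

```python
from dataclasses import dataclass
from typing import Any, Mapping

@dataclass
class Message:
    # Minimal stand-in for torchtune's Message class.
    role: str
    content: str

class ShareGPTToMessages:
    """Illustrative message transform: maps a ShareGPT-style sample
    ({"conversations": [{"from": ..., "value": ...}]}) to Message objects."""

    ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

    def __call__(self, sample: Mapping[str, Any]) -> dict:
        messages = [
            Message(role=self.ROLE_MAP[turn["from"]], content=turn["value"])
            for turn in sample["conversations"]
        ]
        return {"messages": messages}

sample = {"conversations": [
    {"from": "human", "value": "Tell me a joke."},
    {"from": "gpt", "value": "Why did the chicken cross the road?"},
]}
out = ShareGPTToMessages()(sample)
print([(m.role, m.content) for m in out["messages"]])
```

Because the transform is a plain callable on one sample, it composes with any dataset builder, which is what lets the old format-specific converter functions be deleted.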

Test plan

Ran tune run generate --config generation. The output on main is just not good:

INFO:torchtune.utils._logging:Tell me a joke?
One of the most important things to do when doing stand up comedy is to tell a joke. The joke doesn’t have to be the best one ever told, but it must be memorable. The joke should be funny and it should be something that people can relate to.
The most important thing to remember is that you are not alone when you tell a joke. You are not the only person who has ever told a joke. You are not the only person who has ever told a joke. You are not the only person who has ever told a joke. You are not the only person who has ever told a joke.
The joke is not the only thing that matters. It is the story that makes it funny. The story is what makes it funny. The story is what makes it funny. The story is what makes it funny. The story is what makes it funny. The story is what makes it funny. The story is what makes it funny. The story is what makes it funny.
The joke is what makes it funny. The story is what makes it funny. The story is what makes it funny. The story is what makes it funny. The story is what makes it funny. The story is what makes it funny. The story is what makes it funny. The story is what makes it funny. The story is what

After changes, it is somewhat more logical, although it still doesn't know how to stop:

INFO:torchtune.utils._logging:Tell me a joke.
One guy walks into a bar and orders a beer.
The bartender says, "Dang, you look like you need another one."
The guy says, "You know what? You're right."
Another guy walks into the bar and orders a beer.
The bartender says, "You look like you need another one."
The guy says, "You know what? You're right."
So a third guy walks into the bar and orders a beer.
The bartender says, "You look like you need another one."
The guy says, "You know what? You're right."
Then a fourth guy walks in and orders a beer.
The bartender says, "You look like you need another one."
The guy says, "You know what? You're right."
So the bartender says, "I'm not sure why, but you look like you need another one."
The guy says, "You know what? You're right."
So a fifth guy walks in and orders a beer.
The bartender says, "You look like you need another one."
The guy says, "You know what? You're right."
So the bartender pours him a beer and says, "Look,


pytorch-bot bot commented Oct 24, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1895

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 81b6abe with merge base 2c948c6:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 24, 2024
@@ -0,0 +1,52 @@
.. _packing_usage_label:
Collaborator

you snuck this in here you sneaky lil man

i love it

Contributor

heh nice

Contributor Author

ok I just now realized the joke

chat_format = _get_component_from_path(chat_format)
messages = chat_format.format(messages)
return self._tokenizer.tokenize_messages(messages)[0]
messages = [
Collaborator

noobq: Is this identical to free-form generation when a prompt template isn't provided?

Contributor

No, this assumes instruct-based finetuned models I believe.

Contributor

@joecummings joecummings left a comment

No big complaints - thx!


get_sharegpt_messages
get_openai_messages

.. _message_transforms_ref:

Message transforms
Contributor

nit: Can we call these ToMessage transforms to convey immediately that they convert data to message format?

Contributor Author

I've used "Message transforms" throughout the docs, so I'll leave updating all those references for a future PR and keep this as is

@@ -0,0 +1,52 @@
.. _packing_usage_label:
Contributor

heh nice

@@ -27,11 +27,10 @@ tokenizer:
_component_: torchtune.models.llama2.llama2_tokenizer
path: /tmp/Llama-2-7b-hf/tokenizer.model
max_seq_len: null
prompt_template: null

# Generation arguments; defaults taken from gpt-fast
prompt: "Tell me a joke?"
Contributor

I'm wondering if we should just adopt a format like generation_v2 here, where the default makes the roles explicit:

prompt:
  user: Tell me a joke.

Contributor Author

yeah that works. I was just trying to make the minimal changes needed since we are migrating to generation_v2
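To make the config change above concrete: the recipe now passes a single prompt string, and any templating happens on the tokenizer side, so a null prompt_template means the string is tokenized as-is. This is only a sketch of that flow under assumed names (PromptTemplate, tokenize_prompt, and the toy tokenizer are illustrative, not torchtune's exact API):

```python
from typing import List, Optional

class PromptTemplate:
    # Hypothetical template: wraps the raw prompt string.
    # A None format (prompt_template: null in the config) leaves it untouched.
    def __init__(self, fmt: Optional[str] = None):
        self.fmt = fmt

    def apply(self, prompt: str) -> str:
        return prompt if self.fmt is None else self.fmt.format(prompt=prompt)

class Tokenizer:
    def __init__(self, prompt_template: PromptTemplate):
        self.prompt_template = prompt_template

    def encode(self, text: str) -> List[int]:
        # Toy whitespace "tokenizer" standing in for a real SentencePiece model.
        return [hash(tok) % 32000 for tok in text.split()]

    def tokenize_prompt(self, prompt: str) -> List[int]:
        # The recipe only hands over a plain string; the template is
        # the tokenizer's responsibility, not the recipe's.
        return self.encode(self.prompt_template.apply(prompt))

tok = Tokenizer(PromptTemplate(None))  # prompt_template: null
tokens = tok.tokenize_prompt("Tell me a joke?")
```

With a non-null template such as "User: {prompt}\nAssistant:", the same recipe code produces instruct-formatted input without any chat-format logic in the recipe itself.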

@RdoubleA RdoubleA merged commit d3039da into pytorch:main Oct 28, 2024
17 checks passed
@RdoubleA RdoubleA deleted the deprecate_converters branch October 28, 2024 18:30
Issues closed by this PR: “Delete chat_formats.py and instruct_formats.py” and “Remove deprecated _converters.py file”.