Update sharegpt.py #976

Closed
noobmaster29 wants to merge 1 commit

noobmaster29

Modified register_conv_template to resolve the double EOS token issue at the end of prompts when using the ChatML template with sharegpt.py.

This issue was further discussed in the following issue:
#922 (comment)

@mhenrichsen
Collaborator

mhenrichsen commented Dec 18, 2023

Assuming this fix is correct, what should the bugged chatml order of tokens look like? Seems like the comment references an extra newline.

The correct template looks like this:

<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
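
For reference, a minimal sketch of how a single-turn prompt in the template above could be rendered. The function name `render_chatml` is hypothetical and is not part of axolotl or FastChat; it only mirrors the layout shown in this comment.

```python
def render_chatml(system_message: str, prompt: str) -> str:
    """Render a single-turn ChatML prompt ending with an open assistant turn.

    Hypothetical helper for illustration; it follows the template layout
    quoted in the comment above and nothing else.
    """
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


text = render_chatml("You are a helpful assistant.", "Hello")
print(text)
```

Each completed turn is closed by exactly one `<|im_end|>`, and the prompt ends with an unclosed assistant header for the model to continue.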

@noobmaster29
Author

noobmaster29 commented Dec 24, 2023

The issue is that when shareGPT is run with chatml, the tokenizer adds the default BOS and EOS tokens. In the following example, the config is:

special_tokens:
  bos_token: ""
  eos_token: ""
  unk_token: ""
tokens:
  - "<|im_start|>"
  - "<|im_end|>"

Tokenization results:

[Screenshot: 20231223_225412]

If the config is set up as follows:

special_tokens:
  bos_token: "<|im_start|>"
  eos_token: "<|im_end|>"
  unk_token: ""
tokens:
  - "<|im_start|>"
  - "<|im_end|>"

You end up with the following tokenization (notice that im_start and im_end are doubled at the beginning and end):

[Screenshot: 20231223_225421]

I'm not sure if this is actually an issue, though. However, if conversation is not set to chatml, the tokenization is as follows (with single BOS and EOS tokens):

[Screenshot: 20231223_222809]
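
The doubling described in this comment can be mimicked with a toy model. This is only a sketch: `toy_encode` is a hypothetical stand-in, not the Hugging Face tokenizer or axolotl code, and `<bos>`/`<eos>` are placeholder strings for whatever the tokenizer's defaults are. It assumes the tokenizer wraps every prompt in BOS/EOS regardless of what the template already contains.

```python
import re

IM_START, IM_END = "<|im_start|>", "<|im_end|>"


def toy_encode(text, bos_token, eos_token, special_tokens=(IM_START, IM_END)):
    """Toy stand-in for a tokenizer that always wraps its input in BOS/EOS.

    Special tokens are split out as single pieces; everything else is kept
    as raw text chunks (a real tokenizer would split those further).
    """
    pattern = "(" + "|".join(re.escape(tok) for tok in special_tokens) + ")"
    pieces = [piece for piece in re.split(pattern, text) if piece]
    return [bos_token, *pieces, eos_token]


# The ChatML template already places im_start/im_end around the turn.
prompt = f"{IM_START}user\nHi{IM_END}"

separate = toy_encode(prompt, "<bos>", "<eos>")  # distinct BOS/EOS (placeholders)
reused = toy_encode(prompt, IM_START, IM_END)    # BOS/EOS set to the ChatML tokens

print(separate)
print(reused)
```

With distinct BOS/EOS, the ChatML markers appear once each inside the wrapper; when BOS/EOS are set to the ChatML markers themselves, the first and last pieces are repeated.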

@noobmaster29 noobmaster29 reopened this Dec 24, 2023
Resolves the double EOS token issue at the end of prompts when using Chatml template with shareGPT.py. 

axolotl-ai-cloud#922 (comment)
@@ -12,8 +12,8 @@
system_template="<|im_start|>system\n{system_message}",
system_message="You are a helpful assistant.",
roles=["<|im_start|>user", "<|im_start|>assistant"],
sep_style=SeparatorStyle.CHATML,

By removing this, the Conversation defaults to sep_style=SeparatorStyle.ADD_COLON_SINGLE, which results in colons being added between the role and the text.
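
To illustrate the difference, here is a simplified sketch of the two separator styles. It is modeled on my understanding of how FastChat's `Conversation.get_prompt` joins turns, but it is a standalone approximation: `build_prompt` and the local `SeparatorStyle` enum are hypothetical, and `sep="<|im_end|>"` is an assumption since the separator is not visible in the hunk above.

```python
from enum import Enum, auto


class SeparatorStyle(Enum):
    # Local stand-in for the two styles discussed here, not FastChat's enum.
    ADD_COLON_SINGLE = auto()
    CHATML = auto()


def build_prompt(system, turns, sep, style):
    """Join a system prompt and (role, message) turns under a separator style."""
    if style is SeparatorStyle.ADD_COLON_SINGLE:
        out = system + sep
        for role, message in turns:
            # A colon is inserted between the role and the text.
            out += f"{role}: {message}{sep}" if message else f"{role}:"
        return out
    if style is SeparatorStyle.CHATML:
        out = system + sep + "\n" if system else ""
        for role, message in turns:
            # The role and the text are separated by a newline only.
            out += f"{role}\n{message}{sep}\n" if message else f"{role}\n"
        return out
    raise ValueError(f"unsupported style: {style}")


system = "<|im_start|>system\nYou are a helpful assistant."
turns = [("<|im_start|>user", "Hi"), ("<|im_start|>assistant", None)]
sep = "<|im_end|>"  # assumed separator

chatml = build_prompt(system, turns, sep, SeparatorStyle.CHATML)
colon = build_prompt(system, turns, sep, SeparatorStyle.ADD_COLON_SINGLE)
print(chatml)
print(colon)
```

Under the colon style the user turn comes out as `<|im_start|>user: Hi`, which no longer matches the ChatML layout.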

@winglian
Collaborator

winglian commented Jan 9, 2024

Fixed w #1054

@winglian winglian closed this Jan 9, 2024