@twaka twaka commented Dec 22, 2023

Hi, this is a fix for the text_offset of characters that span multiple tokens.
For example, "おはようございます" is tokenized as follows:
[image: tokenization of "おはようございます"]
Here the single character "ざ" is composed of 3 tokens.
In the non-stream case, text_offset is calculated from the length of each token_str, which is zero if the token is not valid Unicode on its own. As a result, "ざ" is ignored and text_offset is no longer aligned with the text.
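The misalignment can be reproduced in isolation. The sketch below is not the library's code: it assumes the three tokens for "ざ" are its three UTF-8 bytes (byte-fallback tokens) and that each token is decoded with `errors="ignore"`. The helper name `per_token_offsets` is hypothetical.

```python
def per_token_offsets(prompt: str, token_bytes: list[bytes]) -> list[int]:
    """Accumulate offsets from the decoded length of each token on its own."""
    offsets = []
    offset = len(prompt)
    for tok in token_bytes:
        offsets.append(offset)
        # A lone byte of a multi-byte character decodes to "" and adds nothing.
        offset += len(tok.decode("utf-8", errors="ignore"))
    return offsets


# Assumption: "ざ" is split into its three UTF-8 bytes, one token per byte.
za = "ざ".encode("utf-8")
tokens = ["ご".encode("utf-8"), za[0:1], za[1:2], za[2:3], "い".encode("utf-8")]

print([t.decode("utf-8", errors="ignore") for t in tokens])
# ['ご', '', '', '', 'い']
print(per_token_offsets("おはよう", tokens))
# [4, 5, 5, 5, 5]  ("い" lands on 5, the position of "ざ", instead of 6)
```

Because none of the three bytes contributes any length, every offset after "ざ" ends up one character too small.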

This PR fixes it the same way the stream case already works: completion_tokens is detokenized as a whole, rather than token by token, and the length of the resulting text is used.
It also fixes a bug in the stream case, which used the length in bytes instead of the length in characters.
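A minimal sketch of the idea, under the same assumptions as before (byte-fallback tokens, decoding with `errors="ignore"`). Here `b"".join(...)` stands in for the real detokenizer, and `prefix_offsets` is a hypothetical name rather than a function from the PR.

```python
def prefix_offsets(prompt: str, token_bytes: list[bytes]) -> list[int]:
    """Offset of token i = len(prompt) + characters in the decoded tokens before i."""
    offsets = []
    for i in range(len(token_bytes)):
        # Stand-in for detokenizing all preceding completion tokens together.
        prefix = b"".join(token_bytes[:i])
        # Count characters of the decoded text, not bytes of the raw buffer.
        offsets.append(len(prompt) + len(prefix.decode("utf-8", errors="ignore")))
    return offsets


za = "ざ".encode("utf-8")
pieces = ["ご", za[0:1], za[1:2], za[2:3], "い", "ま", "す", "!", "\n", "\n",
          "I", "’", "m", " back", " from", " my"]
tokens = [p if isinstance(p, bytes) else p.encode("utf-8") for p in pieces]

print(prefix_offsets("おはよう", tokens))
# [4, 5, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 25]

# Bytes versus characters: the two lengths differ for non-ASCII text.
s = "ございます"
print(len(s.encode("utf-8")), len(s))
# 15 5
```

Decoding the joined prefix lets the three bytes of "ざ" count as one character once they are all present, while an incomplete trailing sequence is dropped and adds nothing.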

before

"prompt": "おはよう"
"text":"ございます!\n\nI’m back from my"
"text_offset":[4,5,5,5,5,6,7,8,9,10,11,12,13,14,19,24]
"tokens":["ご","","","","い","ま","す","!","\n","\n","I","’","m"," back"," from"," my"]

after

"prompt": "おはよう"
"text":"ございます!\n\nI’m back from my"
"text_offset":[4,5,5,5,6,7,8,9,10,11,12,13,14,15,20,25]
"tokens":["ご","","","","い","ま","す","!","\n","\n","I","’","m"," back"," from"," my"]

The example uses TheBloke/Mistral-7B-v0.1-GGUF/mistral-7b-v0.1.Q4_K_M.gguf with logprobs=1 and temperature=0.
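One way to sanity-check the two outputs above is to slice prompt + text at each offset and compare the slice with the token string. The helper `is_aligned` is hypothetical and only uses the values quoted in this description.

```python
def is_aligned(prompt: str, text: str, tokens: list[str], offsets: list[int]) -> bool:
    """True if every token string is found at its offset in prompt + text."""
    full = prompt + text
    # Empty token strings (partial characters) match trivially.
    return all(full[off:off + len(tok)] == tok for tok, off in zip(tokens, offsets))


prompt = "おはよう"
text = "ございます!\n\nI’m back from my"
tokens = ["ご", "", "", "", "い", "ま", "す", "!", "\n", "\n",
          "I", "’", "m", " back", " from", " my"]
before = [4, 5, 5, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 19, 24]
after = [4, 5, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 25]

print(is_aligned(prompt, text, tokens, before))  # False
print(is_aligned(prompt, text, tokens, after))   # True
```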

@abetlen abetlen merged commit 2f03fb0 into abetlen:main Dec 22, 2023
abetlen commented Dec 22, 2023

@twaka thank you!

@twaka twaka deleted the twaka-patch-1 branch December 22, 2023 07:22