fix text_offset of multi-token characters #1037

twaka · 2023-12-22T03:04:23Z

Hi, this is a fix for the text_offset of multi-token characters.
For example, "おはようございます" will be tokenized to

and a character "ざ" is composed of 3 tokens.
In the non-streaming case, text_offset is calculated from the length of token_str, which is zero if the token is not valid Unicode on its own, so "ざ" is ignored and text_offset is no longer aligned with the text.
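
To illustrate why the per-token length comes out as zero, here is a minimal sketch. It assumes that the three tokens making up "ざ" correspond to its three UTF-8 bytes and that each token is decoded separately with `errors="ignore"`; the variable names are hypothetical and this is not the library's actual code.

```python
# Assumption: each of the three tokens of "ざ" carries one byte of its
# UTF-8 encoding (E3 81 96), as happens with byte-fallback tokenization.
char_bytes = "ざ".encode("utf-8")
fragments = [char_bytes[i:i + 1] for i in range(len(char_bytes))]

# Decoding each fragment on its own drops the invalid bytes, so every
# fragment contributes a length of 0 to the running offset.
per_token_lengths = [
    len(fragment.decode("utf-8", errors="ignore")) for fragment in fragments
]

# Decoding the fragments together recovers the single character.
joined_length = len(b"".join(fragments).decode("utf-8", errors="ignore"))
```

Summing `per_token_lengths` gives 0, while `joined_length` is 1, which is the one-character drift visible in the "before" output.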

This PR fixes it by using the same approach as the streaming case, which detokenizes the entire completion_tokens to calculate the text's length.
It also fixes a bug in the streaming case, which calculated the length in bytes instead of the length in characters.
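
The idea can be sketched as follows. This is an illustration under stated assumptions, not the code from this PR: `text_offsets` is a hypothetical helper, the tokens are represented directly as their raw bytes, and the byte-level split of "ざ" is assumed as above.

```python
from typing import List


def text_offsets(prompt: str, token_bytes: List[bytes]) -> List[int]:
    """Return the character offset at which each token starts (hypothetical helper)."""
    offsets = []
    for i in range(len(token_bytes)):
        # Decode everything generated *before* token i as one byte string,
        # so partial multi-byte characters are only counted once complete.
        prefix = b"".join(token_bytes[:i])
        # Count characters of the decoded text, not bytes of the prefix.
        decoded = prefix.decode("utf-8", errors="ignore")
        offsets.append(len(prompt) + len(decoded))
    return offsets


# Tokens from the example in this PR; the three middle entries are the
# assumed byte-level tokens of "ざ".
tokens = ["ご", "ざ", "い", "ま", "す", "！", "\n", "\n",
          "I", "’", "m", " back", " from", " my"]
token_bytes: List[bytes] = []
for token in tokens:
    encoded = token.encode("utf-8")
    if token == "ざ":
        token_bytes.extend(encoded[i:i + 1] for i in range(len(encoded)))
    else:
        token_bytes.append(encoded)

offsets = text_offsets("おはよう", token_bytes)
```

With these inputs, `offsets` matches the "after" text_offset list shown in this PR. Using `len(prefix)` instead of the decoded length would count bytes, which is the streaming-case bug described above.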

before

"prompt": "おはよう"
"text":"ございます！\n\nI’m back from my"
"text_offset":[4,5,5,5,5,6,7,8,9,10,11,12,13,14,19,24]
"tokens":["ご","","","","い","ま","す","！","\n","\n","I","’","m"," back"," from"," my"]

after

"prompt": "おはよう"
"text":"ございます！\n\nI’m back from my"
"text_offset":[4,5,5,5,6,7,8,9,10,11,12,13,14,15,20,25]
"tokens":["ご","","","","い","ま","す","！","\n","\n","I","’","m"," back"," from"," my"]

The example uses TheBloke/Mistral-7B-v0.1-GGUF/mistral-7b-v0.1.Q4_K_M.gguf with logprobs=1 and temperature=0.

abetlen · 2023-12-22T05:03:36Z

@twaka thank you!

twaka added 2 commits December 22, 2023 11:04

fix text_offsets for bytes tokens

68b69da

fix

a583bf2

abetlen merged commit 2f03fb0 into abetlen:main Dec 22, 2023

twaka deleted the twaka-patch-1 branch December 22, 2023 07:22
