Unicode characters break tokenizer #24
Comments
Good catch! Will update it soon.
I think I got it working by fixing the tokenizer conversion in
We went for
I just read that "pybind11 will encode the Python string to UTF-8".
We have to fix the tokenizer to iterate over graphemes rather than characters.
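For context on the grapheme-vs-character distinction: a single user-visible character can span several code points and even more UTF-8 bytes. A quick illustration (not project code; it assumes the third-party `regex` package, whose `\X` pattern matches grapheme clusters):

```python
# Illustration only: one "visible character" can be several code points/bytes.
import regex  # third-party package; \X matches an extended grapheme cluster

s = "e\u0301👍🇯🇵"                   # 'é' (e + combining accent), emoji, flag
print(len(s))                        # 5 code points
print(len(s.encode("utf-8")))        # 15 UTF-8 bytes
print(regex.findall(r"\X", s))       # 3 grapheme clusters: ['é', '👍', '🇯🇵']
```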
You can try the Unicode branch and see if your Unicode string works.
I checked it on my laptop. The reason for the gibberish is very clear to me, and I did talk to @PotatoSpudowski about it.
It doesn't seem to be a missing character. I tried the converter from alpaca.cpp (https://github.com/antimatter15/alpaca.cpp/blob/master/convert-pth-to-ggml.py#L102) and the model/tokenizer worked fine, but when I tried the same model with fastLLaMa, it crashed right away.
OK, I think I checked the wrong model (LLaMA 7B). Could you tell me which model you are using?
I think I have an idea of what's happening. The stream token function might be sending invalid Unicode to pybind11.
I had the same issue, and it seems that some tokens are actually only single bytes from multi-byte Unicode characters, and Python really doesn't like that.

```python
streamed_char_bytes = b''

def stream_token(x: bytes) -> None:
    """
    This function is called by the llama library to stream tokens.
    """
    if len(x) == 0:
        return
    global streamed_char_bytes
    # Top bit set: this byte belongs to a multi-byte UTF-8 sequence.
    bit = x[0] >> 7
    if bit > 0:
        # Accumulate bytes until they decode as valid UTF-8.
        streamed_char_bytes = streamed_char_bytes + x
        try:
            print(streamed_char_bytes.decode("utf-8"), end='', flush=True)
            streamed_char_bytes = b''
        except UnicodeDecodeError:
            pass
    else:
        # Plain ASCII byte: flush any dangling partial sequence first.
        if len(streamed_char_bytes) > 0:
            print('�')
            streamed_char_bytes = b''
        try:
            c = x.decode("utf-8")
            print(c, end='', flush=True)
        except UnicodeDecodeError:
            print('�')
```

I also edited the
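As a quick check of the failure mode that buffering works around: a fragment of a multi-byte UTF-8 sequence cannot be decoded on its own.

```python
# A token carrying only part of a multi-byte character fails to decode.
data = "日".encode("utf-8")          # b'\xe6\x97\xa5', three bytes
try:
    data[:2].decode("utf-8")         # only the first two bytes
except UnicodeDecodeError as err:
    print(err)                       # unexpected end of data
print(data.decode("utf-8"))          # fine once all three bytes arrive
```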
The fix I'm thinking of requires me to fix the buffer inside the
Try fix/unicode now; I fixed it. My approach was simple: I took the last invalid/partial
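The actual fix lives on the fix/unicode branch. As a rough sketch of the general "hold back the partial bytes and carry them into the next token" idea (an assumption about the approach, not the branch's code; the helper name is made up):

```python
def split_complete_utf8(buf: bytes) -> tuple[bytes, bytes]:
    """Split buf into (complete, partial), where `partial` is a trailing
    incomplete UTF-8 sequence (at most 3 bytes) and `complete` decodes cleanly."""
    for cut in range(1, min(4, len(buf)) + 1):
        byte = buf[-cut]
        if byte >> 6 != 0b10:                  # lead byte or plain ASCII
            expected = (1 if byte < 0x80 else
                        2 if byte >> 5 == 0b110 else
                        3 if byte >> 4 == 0b1110 else 4)
            if cut < expected:                 # sequence not finished yet
                return buf[:-cut], buf[-cut:]
            break                              # trailing sequence is complete
    return buf, b""

# Usage sketch: prepend whatever was held back to the next chunk of token bytes.
pending = b""
def emit(token_bytes: bytes) -> None:
    global pending
    out, pending = split_complete_utf8(pending + token_bytes)
    if out:
        print(out.decode("utf-8", errors="replace"), end="", flush=True)
```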
OK, your modifications to convert-pth-to-ggml.py:

```diff
diff --git a/convert-pth-to-ggml.py b/convert-pth-to-ggml.py
index fd934e7..8e2c556 100644
--- a/convert-pth-to-ggml.py
+++ b/convert-pth-to-ggml.py
@@ -112,11 +112,27 @@ for p in range(n_parts):
     fout.write(struct.pack("i", ftype))
 
     # Is this correct??
-    for i in range(32000):
-        # TODO: this is probably wrong - not sure how this tokenizer works
-        text = tokenizer.decode([29889, i]).encode('utf-8')
-        # remove the first byte (it's always '.')
-        text = text[1:]
+    for i in range(tokenizer.vocab_size()):
+        # # TODO: this is probably wrong - not sure how this tokenizer works
+        # text = tokenizer.decode([29889, i]).encode('utf-8')
+        # # remove the first byte (it's always '.')
+        # text = text[1:]
+        # fout.write(struct.pack("i", len(text)))
+        # fout.write(text)
+
+        if tokenizer.is_unknown(i):
+            text = " \u2047 ".encode("utf-8")
+        elif tokenizer.is_control(i):
+            text = b""
+        elif tokenizer.is_byte(i):
+            piece = tokenizer.id_to_piece(i)
+            if len(piece) != 6:
+                print(f"Invalid token: {piece}")
+                sys.exit(1)
+            byte_value = int(piece[3:-1], 16)
+            text = struct.pack("B", byte_value)
+        else:
+            text = tokenizer.id_to_piece(i).replace("\u2581", " ").encode("utf-8")
         fout.write(struct.pack("i", len(text)))
         fout.write(text)
```
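For anyone puzzled by the `is_byte` branch above: SentencePiece spells byte-fallback tokens like `<0xE6>`, six characters with the hex value at positions `[3:-1]`, which is what the slice and `struct.pack("B", ...)` rely on. A quick check of that parsing:

```python
import struct

piece = "<0xE6>"                      # how SentencePiece renders a byte token
assert len(piece) == 6
byte_value = int(piece[3:-1], 16)     # "E6" -> 230
print(struct.pack("B", byte_value))   # b'\xe6', the raw byte written to the model file
```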
Mega work @amitsingh19975, thank you for picking this up! Keeping this issue open for now. @stduhpf I will test the changes you suggested and add them accordingly!
@stduhpf The tokenizer has been updated, so this can be closed!
When ingesting a model with multi-byte Unicode characters, it prints

```
failed to tokenize string!
```

and seems to ignore all tokens prior to said characters. That's bad not only for emoji support, but also for languages like Japanese and Chinese. I haven't tried it yet, but I think implementing the fix proposed in this PR for llama.cpp could solve this issue.
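For reference, "multi-byte" here means each visible character expands to several UTF-8 bytes, which is exactly where a byte-level tokenizer path can trip up:

```python
# Every one of these single characters is 3-4 bytes in UTF-8.
for ch in ("😀", "日", "本", "語"):
    print(ch, ch.encode("utf-8"))
# 😀 b'\xf0\x9f\x98\x80'
# 日 b'\xe6\x97\xa5'
# 本 b'\xe6\x9c\xac'
# 語 b'\xe8\xaa\x9e'
```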