
Unicode characters break tokenizer #24

Closed
stduhpf opened this issue Mar 26, 2023 · 18 comments · Fixed by #30

Comments

@stduhpf
Contributor

stduhpf commented Mar 26, 2023

When ingesting a prompt with multi-byte Unicode characters, it prints failed to tokenize string! and seems to ignore all tokens prior to those characters. That's bad not only for emoji support, but also for languages like Japanese and Chinese.

I haven't tried it yet, but I think implementing the fix proposed in this PR for llama.cpp could solve this issue.
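The failure mode is easy to reproduce in plain Python (a minimal illustration, not fastLLaMa code): a multi-byte UTF-8 character split across token boundaries cannot be decoded piece by piece.

```python
# A multi-byte UTF-8 character split across token boundaries
# cannot be decoded byte by byte.
data = "日本語".encode("utf-8")  # 9 bytes, 3 bytes per character

# Decoding the full sequence works.
assert data.decode("utf-8") == "日本語"

# Decoding a partial prefix raises UnicodeDecodeError.
try:
    data[:4].decode("utf-8")  # one full char plus a stray lead byte
except UnicodeDecodeError:
    print("partial sequence cannot be decoded")
```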

@PotatoSpudowski
Owner

Good catch!

Will update it soon.

@stduhpf
Contributor Author

stduhpf commented Mar 26, 2023

I think I got it working by fixing the tokenizer conversion in convert-pth-to-ggml and making streaming_fn expect py::bytes instead of std::string. Should I make a PR for it?

@amitsingh19975
Collaborator

We went for ASCII for the initial approach because we wanted a working solution for stop words. Therefore, support for Unicode has to work in tandem with stop words. If you can also add support for Unicode stop words, it would be much appreciated.
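One way Unicode stop words could work alongside the streaming fix (a hedged sketch with a hypothetical helper, not fastLLaMa's actual stop-word code): if both the streamed output and the stop words are compared as UTF-8 byte sequences, Unicode stop words fall out of the same mechanism as ASCII ones.

```python
# Sketch: stop-word detection on raw UTF-8 bytes, so Unicode stop
# words need no special casing. `ends_with_stop_word` is a
# hypothetical helper for illustration only.
def ends_with_stop_word(stream: bytes, stop_words: list) -> bool:
    encoded = [w.encode("utf-8") for w in stop_words]
    return any(stream.endswith(w) for w in encoded)

buf = "これで終わり".encode("utf-8")
print(ends_with_stop_word(buf, ["終わり", "end"]))  # True
```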

@amitsingh19975
Collaborator

I just read the pybind11 docs and found that they convert a Python Unicode string to a UTF-8 string. I think we have to handle it properly in C++:

"pybind11 will encode the Python string to UTF-8"

@amitsingh19975
Collaborator

We have to fix the tokenizer to iterate over graphemes rather than characters.
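For context (a minimal illustration): iterating by Unicode codepoint is still not the same as iterating by grapheme, since one user-perceived character can span several codepoints, each of which can span several UTF-8 bytes.

```python
# One user-perceived character can be several codepoints,
# and each codepoint several UTF-8 bytes.
s = "e\u0301"  # "é" as 'e' + combining acute accent (U+0301)
assert len(s) == 2                   # two codepoints
assert len(s.encode("utf-8")) == 3   # three UTF-8 bytes
# A grapheme-aware iterator would treat this as one unit.
print(s)
```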

@amitsingh19975
Collaborator

You can try the Unicode branch and see if your Unicode string works.

@gaoxiao

gaoxiao commented Mar 28, 2023

> You can try the Unicode branch and see if your Unicode string works.

This branch still has issues with Chinese:
[screenshot]

@amitsingh19975
Collaborator

I checked it on my laptop. The reason for the gibberish is clear to me, and I did talk to @PotatoSpudowski about it.
It all comes down to the vocabulary: there can be cases where the Unicode character does not exist in the model vocab.

@gaoxiao

gaoxiao commented Mar 28, 2023

It doesn't seem to be a missing character. I tried the converter from alpaca.cpp (https://github.com/antimatter15/alpaca.cpp/blob/master/convert-pth-to-ggml.py#L102), and the model/tokenizer worked fine:
[screenshot]

But when I tried the same model with fastLLaMa, it crashed right away:
[screenshot]

@amitsingh19975
Collaborator

amitsingh19975 commented Mar 28, 2023

OK, I think I checked the wrong model (LLaMA 7B). Could you tell me which model you are using?

@amitsingh19975
Collaborator

I think I have an idea of what's happening. The stream token function might be sending invalid Unicode to pybind11.

@stduhpf
Contributor Author

stduhpf commented Mar 28, 2023

I had the same issue; it seems that some tokens are actually single bytes from multi-byte Unicode characters, and Python really doesn't like that.
My fix was to turn the streaming callback into this monstrosity:

streamed_char_bytes = b''

def stream_token(x: bytes) -> None:
    """Called by the llama library to stream tokens."""
    global streamed_char_bytes
    if len(x) == 0:
        return
    if x[0] >> 7:  # high bit set: part of a multi-byte UTF-8 sequence
        streamed_char_bytes += x
        try:
            print(streamed_char_bytes.decode("utf-8"), end='', flush=True)
            streamed_char_bytes = b''
        except UnicodeDecodeError:
            pass  # sequence still incomplete, keep buffering
    else:  # ASCII byte
        if len(streamed_char_bytes) > 0:
            # Buffered bytes never formed a valid character.
            print('�')
        streamed_char_bytes = b''
        try:
            print(x.decode("utf-8"), end='', flush=True)
        except UnicodeDecodeError:
            print('�')

I also edited the bridge.cpp file accordingly.

@amitsingh19975
Collaborator

The fix I'm thinking of requires me to add a buffer inside bridge.cpp. The buffer will wait for the character to become valid. That should make this monstrosity obsolete and keep the Python side much simpler. I'll try to fix it by tomorrow.

@amitsingh19975
Collaborator

amitsingh19975 commented Mar 29, 2023

Try fix/unicode now; I fixed it. My approach was simple: I took the last invalid/partial UTF-8 code point and prepended it to the next token, and so on. It might look like the stream token is running slowly, but it's not. This approach introduces an interdependence between consecutive tokens, which makes the output wait a little longer inside the buffer before it becomes valid.
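The carry-forward buffering described above can be sketched in Python (an illustrative sketch of the idea, not the actual bridge.cpp code; `emit` is a hypothetical helper):

```python
# Sketch of the carry-forward fix: keep any trailing partial UTF-8
# sequence and prepend it to the next token.
carry = b""

def emit(token: bytes) -> str:
    """Return the decodable prefix of carry+token; buffer the partial tail."""
    global carry
    data = carry + token
    # Try the longest decodable prefix; a UTF-8 sequence is at most
    # 4 bytes, so at most the last 4 bytes can be a partial tail.
    for cut in range(len(data), max(len(data) - 4, -1), -1):
        try:
            text = data[:cut].decode("utf-8")
            carry = data[cut:]
            return text
        except UnicodeDecodeError:
            continue
    carry = data
    return ""

# "日" (3 bytes) split across two tokens decodes once complete.
print(emit(b"\xe6\x97"))  # ""  (partial, buffered)
print(emit(b"\xa5!"))     # "日!"
```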

@stduhpf
Contributor Author

stduhpf commented Mar 29, 2023

OK, your modifications to bridge.cpp seem to work fine, but I couldn't get it to output emojis properly without changing the tokenizer conversion in convert-pth-to-ggml.py (and then converting the models again).

diff --git a/convert-pth-to-ggml.py b/convert-pth-to-ggml.py
index fd934e7..8e2c556 100644
--- a/convert-pth-to-ggml.py
+++ b/convert-pth-to-ggml.py
@@ -112,11 +112,27 @@ for p in range(n_parts):
     fout.write(struct.pack("i", ftype))
 
     # Is this correct??
-    for i in range(32000):
-        # TODO: this is probably wrong - not sure how this tokenizer works
-        text = tokenizer.decode([29889, i]).encode('utf-8')
-        # remove the first byte (it's always '.')
-        text = text[1:]
+    for i in range(tokenizer.vocab_size()):
+        # # TODO: this is probably wrong - not sure how this tokenizer works
+        # text = tokenizer.decode([29889, i]).encode('utf-8')
+        # # remove the first byte (it's always '.')
+        # text = text[1:]
+        # fout.write(struct.pack("i", len(text)))
+        # fout.write(text)
+
+        if tokenizer.is_unknown(i):
+            text = " \u2047 ".encode("utf-8")
+        elif tokenizer.is_control(i):
+            text = b""
+        elif tokenizer.is_byte(i):
+            piece = tokenizer.id_to_piece(i)
+            if len(piece) != 6:
+                print(f"Invalid token: {piece}")
+                sys.exit(1)
+            byte_value = int(piece[3:-1], 16)
+            text = struct.pack("B", byte_value)
+        else:
+            text = tokenizer.id_to_piece(i).replace("\u2581", " ").encode("utf-8")
         fout.write(struct.pack("i", len(text)))
         fout.write(text)
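For reference, the byte-token branch in the patch above parses SentencePiece byte pieces of the form `<0xNN>`; that parsing step can be checked in isolation (a standalone sketch, no sentencepiece install needed):

```python
import struct

# SentencePiece represents raw-byte tokens as pieces like "<0xE6>".
# The patch extracts the byte value with int(piece[3:-1], 16).
piece = "<0xE6>"           # example byte piece
assert len(piece) == 6     # "<0x" + two hex digits + ">"
byte_value = int(piece[3:-1], 16)
text = struct.pack("B", byte_value)
print(byte_value, text)    # 230 b'\xe6'
```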

@PotatoSpudowski
Owner

Mega work @amitsingh19975

Thank you for picking this up!

Keeping this issue open for now! @stduhpf I will test the changes you suggested and add them accordingly!

@PotatoSpudowski
Owner

@stduhpf the tokenizer has been updated, so this can be closed!
