
Unicode characters break tokenizer #24

Closed
stduhpf opened this issue Mar 26, 2023 · 18 comments · Fixed by #30

Comments

@stduhpf
Contributor

stduhpf commented Mar 26, 2023

When ingesting a prompt with multi-byte Unicode characters, it prints failed to tokenize string! and seems to ignore all tokens prior to those characters. That's bad not only for emoji support, but also for languages like Japanese and Chinese.

I haven't tried it yet, but I think implementing the fix proposed in this PR for llama.cpp could solve this issue.
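The failure mode is easy to reproduce in plain Python (a minimal illustration, not fastLLaMa code): a multi-byte UTF-8 character split across token boundaries cannot be decoded piece by piece.

```python
# A multi-byte UTF-8 character split across token boundaries
# cannot be decoded byte by byte.
data = "日本語".encode("utf-8")  # 9 bytes, 3 bytes per character

# Decoding the full sequence works.
assert data.decode("utf-8") == "日本語"

# Decoding a partial prefix raises UnicodeDecodeError.
try:
    data[:4].decode("utf-8")  # one full char plus a stray lead byte
except UnicodeDecodeError:
    print("partial sequence cannot be decoded")
```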

@PotatoSpudowski
Owner

Good catch!

Will update it soon.

@stduhpf
Contributor Author

stduhpf commented Mar 26, 2023

I think I got it working by fixing the tokenizer conversion in convert-pth-to-ggml and making streaming_fn expect py::bytes instead of std::string. Should I make a PR for it?

@amitsingh19975
Collaborator

We went for ASCII for the initial approach because we wanted a working solution for stop words. Therefore, support for Unicode has to work in tandem with stop words. If you can also add support for Unicode stop words, it would be much appreciated.
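One way Unicode stop words could work alongside the streaming fix (a hedged sketch with a hypothetical helper, not fastLLaMa's actual stop-word code): if both the streamed output and the stop words are compared as UTF-8 byte sequences, Unicode stop words fall out of the same mechanism as ASCII ones.

```python
# Sketch: stop-word detection on raw UTF-8 bytes, so Unicode stop
# words need no special casing. `ends_with_stop_word` is a
# hypothetical helper for illustration only.
def ends_with_stop_word(stream: bytes, stop_words: list) -> bool:
    encoded = [w.encode("utf-8") for w in stop_words]
    return any(stream.endswith(w) for w in encoded)

buf = "これで終わり".encode("utf-8")
print(ends_with_stop_word(buf, ["終わり", "end"]))  # True
```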

@amitsingh19975
Collaborator

I just read the pybind11 docs and found that they convert a Python Unicode string to a UTF-8 string. I think we have to handle it properly in C++:

"pybind11 will encode the Python string to UTF-8"

@amitsingh19975
Collaborator

We have to fix the tokenizer to iterate over graphemes rather than characters.
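For context (a minimal illustration): iterating by Unicode codepoint is still not the same as iterating by grapheme, since one user-perceived character can span several codepoints, each of which can span several UTF-8 bytes.

```python
# One user-perceived character can be several codepoints,
# and each codepoint several UTF-8 bytes.
s = "e\u0301"  # "é" as 'e' + combining acute accent (U+0301)
assert len(s) == 2                   # two codepoints
assert len(s.encode("utf-8")) == 3   # three UTF-8 bytes
# A grapheme-aware iterator would treat this as one unit.
print(s)
```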

@amitsingh19975
Collaborator

You can try the Unicode branch and see if your Unicode string works.

@gaoxiao

gaoxiao commented Mar 28, 2023

> You can try the Unicode branch and see if your Unicode string works.

This branch still has issues with Chinese:
[screenshot]

@amitsingh19975
Collaborator

I checked it on my laptop. The reason for the gibberish is clear to me, and I did talk to @PotatoSpudowski about it.
It all comes down to the vocabulary: there can be cases where the Unicode character does not exist in the model vocab.

@gaoxiao

gaoxiao commented Mar 28, 2023

It doesn't seem to be a missing character. I tried the converter from alpaca.cpp (https://github.com/antimatter15/alpaca.cpp/blob/master/convert-pth-to-ggml.py#L102), and the model/tokenizer worked fine:
[screenshot]

But when I tried the same model with fastLLaMa, it crashed right away:
[screenshot]

@amitsingh19975
Collaborator

amitsingh19975 commented Mar 28, 2023

OK, I think I checked the wrong model (LLaMA 7B). Could you tell me which model you are using?

@amitsingh19975
Collaborator

I think I have an idea of what's happening. The stream token function might be sending invalid Unicode to pybind11.

@stduhpf
Contributor Author

stduhpf commented Mar 28, 2023

I had the same issue; it seems that some tokens are actually single bytes from multi-byte Unicode characters, and Python really doesn't like that.
My fix was to turn the streaming callback into this monstrosity:

streamed_char_bytes = b''

def stream_token(x: bytes) -> None:
    """Called by the llama library to stream tokens."""
    global streamed_char_bytes
    if len(x) == 0:
        return
    if x[0] >> 7:  # high bit set: part of a multi-byte UTF-8 sequence
        streamed_char_bytes += x
        try:
            print(streamed_char_bytes.decode("utf-8"), end='', flush=True)
            streamed_char_bytes = b''
        except UnicodeDecodeError:
            pass  # sequence still incomplete, keep buffering
    else:  # ASCII byte
        if len(streamed_char_bytes) > 0:
            # Buffered bytes never formed a valid character.
            print('�')
        streamed_char_bytes = b''
        try:
            print(x.decode("utf-8"), end='', flush=True)
        except UnicodeDecodeError:
            print('�')

I also edited the bridge.cpp file accordingly.

@amitsingh19975
Collaborator

The fix I'm thinking of requires me to add a buffer inside bridge.cpp. The buffer will wait for the character to become valid. That should make this monstrosity obsolete and keep the Python side much simpler. I'll try to fix it by tomorrow.

@amitsingh19975
Collaborator

amitsingh19975 commented Mar 29, 2023

Try fix/unicode now; I fixed it. My approach was simple: I took the last invalid/partial UTF-8 code point and prepended it to the next token, and so on. It might look like the stream token is running slowly, but it's not. This approach introduces an interdependence between consecutive tokens, which makes the output wait a little longer inside the buffer before it becomes valid.
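The carry-forward buffering described above can be sketched in Python (an illustrative sketch of the idea, not the actual bridge.cpp code; `emit` is a hypothetical helper):

```python
# Sketch of the carry-forward fix: keep any trailing partial UTF-8
# sequence and prepend it to the next token.
carry = b""

def emit(token: bytes) -> str:
    """Return the decodable prefix of carry+token; buffer the partial tail."""
    global carry
    data = carry + token
    # Try the longest decodable prefix; a UTF-8 sequence is at most
    # 4 bytes, so at most the last 4 bytes can be a partial tail.
    for cut in range(len(data), max(len(data) - 4, -1), -1):
        try:
            text = data[:cut].decode("utf-8")
            carry = data[cut:]
            return text
        except UnicodeDecodeError:
            continue
    carry = data
    return ""

# "日" (3 bytes) split across two tokens decodes once complete.
print(emit(b"\xe6\x97"))  # ""  (partial, buffered)
print(emit(b"\xa5!"))     # "日!"
```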

@stduhpf
Contributor Author

stduhpf commented Mar 29, 2023

OK, your modifications to bridge.cpp seem to work fine, but I couldn't get it to output emojis properly without changing the tokenizer conversion in convert-pth-to-ggml.py (and then converting the models again).

diff --git a/convert-pth-to-ggml.py b/convert-pth-to-ggml.py
index fd934e7..8e2c556 100644
--- a/convert-pth-to-ggml.py
+++ b/convert-pth-to-ggml.py
@@ -112,11 +112,27 @@ for p in range(n_parts):
     fout.write(struct.pack("i", ftype))
 
     # Is this correct??
-    for i in range(32000):
-        # TODO: this is probably wrong - not sure how this tokenizer works
-        text = tokenizer.decode([29889, i]).encode('utf-8')
-        # remove the first byte (it's always '.')
-        text = text[1:]
+    for i in range(tokenizer.vocab_size()):
+        # # TODO: this is probably wrong - not sure how this tokenizer works
+        # text = tokenizer.decode([29889, i]).encode('utf-8')
+        # # remove the first byte (it's always '.')
+        # text = text[1:]
+        # fout.write(struct.pack("i", len(text)))
+        # fout.write(text)
+
+        if tokenizer.is_unknown(i):
+            text = " \u2047 ".encode("utf-8")
+        elif tokenizer.is_control(i):
+            text = b""
+        elif tokenizer.is_byte(i):
+            piece = tokenizer.id_to_piece(i)
+            if len(piece) != 6:
+                print(f"Invalid token: {piece}")
+                sys.exit(1)
+            byte_value = int(piece[3:-1], 16)
+            text = struct.pack("B", byte_value)
+        else:
+            text = tokenizer.id_to_piece(i).replace("\u2581", " ").encode("utf-8")
         fout.write(struct.pack("i", len(text)))
         fout.write(text)
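For reference, the byte-token branch in the patch above parses SentencePiece byte pieces of the form `<0xNN>`; that parsing step can be checked in isolation (a standalone sketch, no sentencepiece install needed):

```python
import struct

# SentencePiece represents raw-byte tokens as pieces like "<0xE6>".
# The patch extracts the byte value with int(piece[3:-1], 16).
piece = "<0xE6>"           # example byte piece
assert len(piece) == 6     # "<0x" + two hex digits + ">"
byte_value = int(piece[3:-1], 16)
text = struct.pack("B", byte_value)
print(byte_value, text)    # 230 b'\xe6'
```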

@PotatoSpudowski
Owner

Mega work @amitsingh19975

Thank you for picking this up!

Keeping this issue open for now! @stduhpf I will test the changes you suggested and add them accordingly!

@PotatoSpudowski
Owner

@stduhpf the tokenizer has been updated, so this can be closed!
