
Warning: Bad token in vocab at index xxx #11

Closed
CheatCod opened this issue Mar 15, 2023 · 10 comments
Labels
issue:bug Something isn't working

Comments

@CheatCod

Running cargo run --release -- -m ~/dev/llama.cpp/models/7B/ggml-model-f16.bin -f prompt prints a series of "Warning: Bad token in vocab at index..." messages.

The path points to a GGML-converted LLaMA model, which I have verified works with llama.cpp.

@setzer22
Collaborator

setzer22 commented Mar 15, 2023

Thanks for reporting @CheatCod! We are aware of the issue. Apparently, some models have tokens in the embedded vocabulary that are not valid UTF-8. This is an error, but not an irrecoverable one: when we detect it, we print the warning and replace the broken tokens with the Unicode replacement character. The C++ code, on the other hand, simply ignores the issue (C++ strings are just byte arrays, while Rust strings are required to hold valid UTF-8), but I thought it would be a good idea to at least warn users.
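
For the curious, here is a minimal sketch of the behaviour described above (this is not the actual loader code; the function name and message format are only illustrative):

// Sketch: how a vocab entry whose bytes are not valid UTF-8 might be handled.
// Rust strings must hold valid UTF-8, so the raw bytes cannot be stored as-is;
// from_utf8_lossy substitutes U+FFFD (the replacement character) for every
// invalid sequence, which is the behaviour described above.
fn decode_vocab_entry(index: usize, bytes: Vec<u8>) -> String {
    match String::from_utf8(bytes) {
        Ok(token) => token,
        Err(err) => {
            println!("Warning: Bad token in vocab at index {index}");
            String::from_utf8_lossy(err.as_bytes()).into_owned()
        }
    }
}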

There is some discussion about this at #3. Some of us don't see the errors, while others do. If I had to guess, this is something in the convert-pth-to-ggml.py script, which in some cases happily writes the invalid byte sequences, and in other cases (like mine, and others who reported the same) substitutes the invalid characters with the replacement character, so my model never had the broken UTF-8 in the first place.

This could be due to a combination of the environment, OS, Python version, version of the conversion script, and who knows what else 🤔

@mwbryant
Contributor

Hopefully this will be less intrusive after #10 merges and we have a better logging crate. Maybe it shouldn't even be a warn! but an info! (or shouldn't print at all), because it really doesn't hurt anything, according to @setzer22's digging into the actual tokens.

@setzer22
Collaborator

Yup, I'd say turning it into info would be nice to avoid log spam, since the issue seems to be quite common.

@philpax
Collaborator

philpax commented Mar 16, 2023

I've downgraded it to info in #10 in the CLI (which now controls the logs).

@setzer22
Collaborator

Not sure what to do about this. We often get reports about this issue, so leaving it open sounds like a good idea to avoid duplicates.

But also, there's very little we can do about those non-UTF-8 tokens in the vocabulary; they are most certainly an error, and they affect lots of users.

So, what should we do? One option is to downgrade the message even further, to trace level. Another is to simply leave things as they are, since people can use the RUST_LOG environment variable to hide all info messages. But people will probably still want to see the other loading messages.
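
For anyone following along, here is a rough sketch of how that filtering behaves, assuming the CLI wires the log facade up to env_logger (the exact setup from #10 may differ):

// Sketch only: assumes a binary that depends on the log and env_logger crates.
// env_logger reads the RUST_LOG environment variable, so running with
// RUST_LOG=warn hides the info!-level vocab message, while RUST_LOG=info
// (or RUST_LOG=trace) shows it alongside the other loading messages.
fn main() {
    env_logger::init();

    log::info!("Warning: Bad token in vocab at index 123"); // hidden under RUST_LOG=warn
    log::warn!("something genuinely actionable");           // still shown under RUST_LOG=warn
}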

@philpax
Collaborator

philpax commented Mar 20, 2023

I think leaving it as-is is fine: we can hopefully figure out what's generating those invalid tokens when we get to #21.

@mwbryant
Contributor

I think it's probably safe to remove the print entirely, or to make it a one-time message. From your research, we are already replacing the tokens with the replacement character, and the C++ version of this code doesn't handle them in any special way (I've also noticed the C++ code sometimes segfaults for me, while this library never does). The problem seems to originate upstream of llama-rs, and we've already handled it; the only remaining reports relate to the printouts.

@philpax philpax added the issue:bug Something isn't working label Mar 24, 2023
@philpax
Collaborator

philpax commented Mar 24, 2023

I'd like to investigate what's causing this. My feeling is that while the tokens themselves are not valid UTF-8, their use in the generated output is valid (e.g. two tokens combine to form a valid string). I'm also curious whether the newest llama.cpp tokeniser addresses this.
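
A toy illustration of that theory with plain Rust slices (not actual model output; the character is arbitrary):

// A multi-byte UTF-8 character split across two "tokens" is invalid on its
// own, but becomes valid again once the pieces are rejoined.
fn main() {
    let kanji = "試".as_bytes(); // three bytes: [0xE8, 0xA9, 0xA6]
    let (first, second) = kanji.split_at(1);

    assert!(std::str::from_utf8(first).is_err());  // not valid UTF-8 by itself
    assert!(std::str::from_utf8(second).is_err()); // neither is the remainder

    let rejoined = [first, second].concat();
    assert_eq!(std::str::from_utf8(&rejoined).unwrap(), "試");
}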

@philpax
Collaborator

philpax commented Apr 6, 2023

I checked my theory and it appears to be correct - the tokens are not guaranteed to be valid UTF-8 by themselves.

With invalid tokens enabled:

llama-rs # cargo run --release --bin llama-cli -- infer -m ../llama-models/v0/ggml-alpaca-7b-q4.bin -f examples/alpaca_prompt.txt -p "How do I say 'This is a complex sentence that will require some Unicode' in Japanese kanji?" --seed 943589183
[...]

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:

How do I say 'This is a complex sentence that will require some Unicode' in Japanese kanji?

### Response:
この������は、一つでも文字化けな������を持たせる必要があります。

With raw bytes:

llama-rs # cargo run --release --bin llama-cli -- infer -m ../llama-models/v0/ggml-alpaca-7b-q4.bin -f examples/alpaca_prompt.txt -p "How do I say 'This is a complex sentence that will require some Unicode' in Japanese kanji?" --seed 943589183
[...]

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:

How do I say 'This is a complex sentence that will require some Unicode' in Japanese kanji?

### Response:
この試験は、一つでも文字化けな困險を持たせる必要があります。

I'm thinking we should return the raw byte slices, but offer a helper to coalesce tokens until they form valid UTF-8.
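
Something along these lines, perhaps (a hypothetical sketch; the TokenBuffer name and API are invented here, and a real helper would also need a policy for byte sequences that can never become valid UTF-8):

// Push raw byte-tokens in; get a String back only once the accumulated bytes
// form valid UTF-8.
#[derive(Default)]
struct TokenBuffer {
    bytes: Vec<u8>,
}

impl TokenBuffer {
    fn push(&mut self, token: &[u8]) -> Option<String> {
        self.bytes.extend_from_slice(token);
        match String::from_utf8(std::mem::take(&mut self.bytes)) {
            // The buffered bytes now form valid UTF-8: emit them and reset.
            Ok(text) => Some(text),
            // Still mid-codepoint: put the bytes back and keep buffering.
            Err(err) => {
                self.bytes = err.into_bytes();
                None
            }
        }
    }
}

fn main() {
    let mut buf = TokenBuffer::default();
    // "試" (three UTF-8 bytes) split across two byte-tokens.
    assert_eq!(buf.push(&[0xE8]), None);
    assert_eq!(buf.push(&[0xA9, 0xA6]), Some("試".to_owned()));
}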

@philpax philpax mentioned this issue Apr 6, 2023
@philpax philpax self-assigned this Apr 7, 2023
@philpax
Collaborator

philpax commented Apr 7, 2023

After some discussion on the Discord, we came to the conclusion that the current behaviour is in fact a bug. The LLM infers over byte-tokens, not UTF-8 tokens.

That being said, we'd still like to make using the library easy, so I'm going to switch everything over to use &[u8] except for inference_with_token, which will buffer tokens until they form valid UTF-8. I'll also expose the adapter it uses for this, so that users can do something similar when they fetch the byte-tokens themselves.
