Infill Incorrect Tokenization #3503
Comments
Could you please try this and see if the issue is fixed there? Would be great if you could also test server with infill. PS for some reason I don't seem to be getting an additional space, or could it be a newline that is also deleted in the paper? |
Here are the additional tests:
codellama:
vvhg1/llama.cpp infill.cpp:
vvhg1/llama.cpp server.cpp:
ggerganov/llama.cpp server.cpp:
Apart from the additional 29871, your fix seems correct. I think it is a space, since the prefix string ends in a space and this is the same token as the one before the suffix starts (32008). |
Here are some sample results using
{ "\t" , { 29871, 12, }, },
{ "\n" , { 29871, 13, }, },
{ "\n return" , { 29871, 13, 736, }, },
{ "\t\n" , { 29871, 12, 13, }, },
{ "Hello world" , { 15043, 3186, }, },
{ " Hello world" , { 29871, 15043, 3186, }, },
{ "Hello World" , { 15043, 2787, }, },
{ " Hello World" , { 29871, 15043, 2787, }, },
{ " Hello World!" , { 29871, 15043, 2787, 29991, }, },
{ "Hello, world!" , { 15043, 29892, 3186, 29991, }, },
{ " Hello, world!" , { 29871, 15043, 29892, 3186, 29991, }, },
It looks like codellama requires some extra logic for removing the leading spaces when they occur. |
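The pattern in these samples can be checked mechanically. The following is a minimal Python sketch, not part of llama.cpp: it copies the sample table above into a dict and checks whether token 29871 leads exactly when the text starts with whitespace. The names `SAMPLES`, `SPACE_TOKEN`, and `mismatches` are made up for illustration.

```python
# Sample results copied from the table above: text -> token ids.
SAMPLES = {
    "\t": [29871, 12],
    "\n": [29871, 13],
    "\n return": [29871, 13, 736],
    "\t\n": [29871, 12, 13],
    "Hello world": [15043, 3186],
    " Hello world": [29871, 15043, 3186],
    "Hello World": [15043, 2787],
    " Hello World": [29871, 15043, 2787],
    " Hello World!": [29871, 15043, 2787, 29991],
    "Hello, world!": [15043, 29892, 3186, 29991],
    " Hello, world!": [29871, 15043, 29892, 3186, 29991],
}

SPACE_TOKEN = 29871  # shown as ' ' in the verbose prompt dump in this thread


def mismatches(samples):
    """Return texts where 'starts with whitespace' and 'leads with 29871' disagree."""
    return [
        text
        for text, tokens in samples.items()
        if text[:1].isspace() != (bool(tokens) and tokens[0] == SPACE_TOKEN)
    ]


print(mismatches(SAMPLES))  # prints []
```

For these eleven samples the two conditions always agree: when the text already begins with whitespace, the implicit space shows up as its own token; otherwise it merges into the first word piece (for example 15043 for "Hello").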
Agreed, I'm on it. I guess we limit the cleaning to leading spaces, since it seems newlines are not removed in the original? |
I think the additional space gets introduced by the llama.cpp tokenizer, a quick look suggests those lines are responsible: Lines 5220 to 5221 in 9ca79d5
The code probably relies on the assumption that a prompt always consists of a single part, which doesn't hold here. |
for some reason I get different tokens for newlines:
I only see the 29871 when I add leading whitespace to the suffix (which we want to clean in any case I guess) |
I used |
Thanks, the -e now gives me identical results. What I do now is:
|
I think cleaning any leading spaces would be wrong, since they might legitimately be part of the suffix. Codellama uses this little method:
def encode_infilling(self, s: str) -> List[int]:
    """Encode a string without an implicit leading space."""
    return self.sp_model.encode("☺" + s)[2:]
With an explanation here:
The issue also points to the solution in the transformers library, which I think is the relevant part: I'm not too familiar with the tokenization, but might there be situations, where |
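To make the sentinel trick concrete, here is a small self-contained Python sketch. It does not use SentencePiece; `toy_encode` and `VOCAB` are hypothetical stand-ins that only mimic the dummy-prefix behaviour (prepend a space, then split greedily), which is an assumption about how the real tokenizer behaves on these inputs.

```python
from typing import List

# Hypothetical toy vocabulary; a real SentencePiece model has far more pieces.
VOCAB = ["▁return", "▁result", "▁", "☺", "\n", "return", "result"]


def toy_encode(s: str) -> List[str]:
    """Mimic the dummy prefix: prepend a space, map ' ' to '▁',
    then split greedily by longest match (unknown characters pass through)."""
    text = "▁" + s.replace(" ", "▁")
    by_length = sorted(VOCAB, key=len, reverse=True)
    pieces, i = [], 0
    while i < len(text):
        piece = next((p for p in by_length if text.startswith(p, i)), text[i])
        pieces.append(piece)
        i += len(piece)
    return pieces


def encode_infilling(s: str) -> List[str]:
    """Encode without the implicit leading space: the sentinel absorbs it,
    then the two leading pieces ('▁' and '☺') are dropped."""
    return toy_encode("☺" + s)[2:]


print(toy_encode("\n return result\n"))        # ['▁', '\n', '▁return', '▁result', '\n']
print(encode_infilling("\n return result\n"))  # ['\n', '▁return', '▁result', '\n']
print(encode_infilling(" return"))             # ['▁return']
```

In this toy setting a literal leading space in the suffix survives (`" return"` still yields `▁return`), while the space the tokenizer adds on its own is discarded together with the sentinel.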
Yes, I think the topic is slightly more complicated than I initially thought. What we probably need to do is first look at how we encode, because only -e adds a space (at least it seems that way to me right now). Then, in the -e case, we either cut one space from the beginning of the string if it starts with a space (the tokenizer will add it back), or, if there is none, we remove the single space token after tokenization. The rationale is that runs of multiple spaces are tokens in their own right, and I guess it would be more complicated (at least to maintain) to translate an n-space token at the beginning of the string into an (n-1)-space token. |
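The two cases described in this comment could look roughly like the following Python sketch. It is not the llama.cpp implementation: `tokenize_suffix` and `fake_tokenize` are hypothetical names, and `fake_tokenize` is a stand-in that only knows the few pieces needed for the demo, using token ids that appear in this thread.

```python
from typing import Callable, List

SPACE_TOKEN = 29871  # standalone space piece, as seen in this thread's dumps


def tokenize_suffix(suffix: str, tokenize: Callable[[str], List[int]]) -> List[int]:
    """Undo the tokenizer's implicit leading space for an infill suffix.

    `tokenize` is assumed to prepend one space to its input before encoding.
    """
    if suffix.startswith(" "):
        # Case 1: drop one literal space; the tokenizer adds it back.
        return tokenize(suffix[1:])
    # Case 2: no leading space, so remove the lone space token afterwards.
    tokens = tokenize(suffix)
    if tokens and tokens[0] == SPACE_TOKEN:
        tokens = tokens[1:]
    return tokens


def fake_tokenize(text: str) -> List[int]:
    """Hypothetical tokenizer with a dummy prefix; covers demo inputs only."""
    table = {"▁return": 736, "▁result": 1121, "▁": SPACE_TOKEN, "\n": 13}
    text = "▁" + text.replace(" ", "▁")
    ids, i = [], 0
    while i < len(text):
        piece = next(p for p in table if text.startswith(p, i))
        ids.append(table[piece])
        i += len(piece)
    return ids


print(tokenize_suffix("\n return result\n", fake_tokenize))  # [13, 736, 1121, 13]
print(tokenize_suffix(" return", fake_tokenize))             # [736]
```

One limitation of this approach: when the suffix starts with a word, the implicit space may merge into the first piece (as with "Hello world" giving 15043 in the samples earlier in the thread), so there is no separate space token to strip in case 2.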
What I do now: case 1: and the suffix string starts with a space
case 2: no leading space in original suffix string
Please test thoroughly |
Since SPM started inserting a space unconditionally (#2810), I have to use a hack similar to Code Llama's. I've found that in the models I use, the newline (LF) character does not stick to any other tokens, and its identifier is easily queryable with |
Seems to work now. One minor thing I noticed that I don't understand: I ran the original test from this issue:
make -j && ./infill -t 4 -m models/codellama-7b/ggml-model-f16.gguf -c 4096 --temp 0.0 --repeat_penalty 1.1 --in-prefix 'def remove_non_ascii(s: str) -> str:\n """ ' --in-suffix '\n return result\n' --verbose-prompt -e -ngl 1
ggml_metal_add_buffer: allocated 'data ' buffer, size = 12853.98 MB, (12854.61 / 147456.00)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 2050.00 MB, (14904.61 / 147456.00)
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 288.02 MB, (15192.62 / 147456.00)
system_info: n_threads = 4 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
main: prompt: ''
main: number of tokens in prompt = 26
1 -> ''
32007 -> ' <PRE>'
822 -> ' def'
3349 -> ' remove'
29918 -> '_'
5464 -> 'non'
29918 -> '_'
294 -> 'as'
18869 -> 'cii'
29898 -> '('
29879 -> 's'
29901 -> ':'
851 -> ' str'
29897 -> ')'
1599 -> ' ->'
851 -> ' str'
29901 -> ':'
13 -> '
'
9995 -> ' """'
29871 -> ' '
32008 -> ' <SUF>'
13 -> '
'
736 -> ' return'
1121 -> ' result'
13 -> '
'
32009 -> ' <MID>'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 0
##### Infill mode #####
<PRE> def remove_non_ascii(s: str) -> str:
""" <SUF>
return result
<MID>Remove non-ASCII characters from a string.
Args:
s (str): The input string.
Returns:
str: The output string.
"""
import re
pattern = re.compile(r'[^\x00-\x7F]+', re.UNICODE)
result = pattern.sub('', s) <EOT> <EOT>
llama_print_timings: load time = 720.32 ms
llama_print_timings: sample time = 50.62 ms / 81 runs ( 0.62 ms per token, 1600.19 tokens per second)
llama_print_timings: prompt eval time = 54.77 ms / 26 tokens ( 2.11 ms per token, 474.72 tokens per second)
llama_print_timings: eval time = 1935.91 ms / 80 runs ( 24.20 ms per token, 41.32 tokens per second)
llama_print_timings: total time = 2089.48 ms
I don't have the token 1678 in the suffix tokenization:
32008 13 736 1121 13
In contrast, OP gets:
32008 13 1678 736 1121 13
So not sure why you guys got the extra 1678 there in your tests. |
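When comparing token dumps like the two suffix segments quoted in this comment, a sequence diff pinpoints where they diverge. Below is a minimal Python sketch using the standard library's `difflib`; the helper name `extra_tokens` is made up for illustration, and the two lists are copied from this comment.

```python
from difflib import SequenceMatcher

mine = [32008, 13, 736, 1121, 13]      # suffix segment from the run in this comment
op = [32008, 13, 1678, 736, 1121, 13]  # suffix segment reported by OP


def extra_tokens(reference, other):
    """Return (index_in_other, tokens) for runs that appear only in `other`."""
    matcher = SequenceMatcher(None, reference, other, autojunk=False)
    return [
        (j1, other[j1:j2])
        for tag, _i1, _i2, j1, j2 in matcher.get_opcodes()
        if tag in ("insert", "replace")
    ]


print(extra_tokens(mine, op))  # [(2, [1678])]
```

The only difference is the single token 1678 between the newline (13) and 736. What 1678 decodes to is not shown in this thread, so checking the exact suffix string each run was given would be a reasonable next step.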
I think it should be ok since it is the same behaviour here and in Python. |
I am comparing the tokenization of the codellama repository with the infill example of this repository.
The first example prompt from the codellama repository consists of the strings:
Comparing the tokenization of both implementations results in:
There are two differences:
- […] prefix_id and bos […] I think)
- […] bos token again after the suffix_id token and an additional 29871 (is this a space?)
I believe the latter is definitely wrong, as the paper states on page 4: