-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[feature] Make space insertion optional in SentencePiece tokenizer #3664
Comments
Option 1 might actually be a good idea |
Unless there's a bug ( and I don't think there is, I tested it extensively, though it's not impossible ), it's the other way around. Tokenizing with "special=false" will only add space at the begining of the entire input string, and special tokens will be naively tokenized as plaintext. That is in line with old tokenizer behaviour. Tokenizing with "special=true" will not add any spaces, neither at the begining of the input string nor to the substrings between special tokens. EDIT: Option 1 would leave you with stray space if special tokens are enabled and add_bos is enabled. Example: Current behavior with special=false, same as old behavior: Current (new) behaviour with special=true: Tying space prefix to add bos when special=true would leave you with: |
@staviq I just tested it. request: As you can see, there are no spaces at all in the request, but tokenizer uses tokens that start with space. That is, it inserts space into each text fragment. |
Oh, I see, llama.cpp/examples/server/server.cpp Lines 269 to 278 in 4e82b2e
I thought you mean new special token handling broke this, but you meant it didn't fix an already existing problem in |
Yes, the compulsory space is affecting server's behavior ever since it was introduced in #2810. Maybe we can call it a problem, but I see it as a missing feature. In many cases, inserting space is desirable, but it takes control over it away from the client. If my option 1 is chosen, it may be that nothing needs to be changed in the server code. But it should be documented in which cases the client needs to add spaces. If my option 2 is chosen, the server can be made to behave like in option 1 by default and have optional parameters that control insertion of spaces. Again, with appropriate documentation. |
Spaces aren't opportunistically inserted within the text, so doing it only when A little extra info: The OP links to a hack that prepends a "🙂". Pretty smart! I've been prepending a newline, but am not sure if it will reliably yield a unique newline token for all models in all cases. If tokenizers are really unpredictable, clients that edit the context should be retokenizing at arbitrary boundaries (I think this is called "token healing"), which is also complicated by space insertion. |
@grencez I am also using newline (LF) for working around the space insertion. I'm curious to see a model that uses SentencePiece tokenizer and has multiple tokens that contain newline, which could make this usage of newline problematic. One of things that I also do is splitting text on token boundaries, so this will be important even when space insertion will be controllable. |
@shibe2 The CausalLM GGUF models on HuggingFace do this sometimes. " \n\n" becomes 2 tokens (" \n" and "\n"). Those quantizations are still experimental and seem buggy (e.g., |
Given that, BOS and initial space no longer seem to be nicely coupled. Therefore, I prefer my option 2 – adding a new argument to tokenization functions. |
Something definitely should happen with this. The current behavior destroys the quality of some models. See #4081 (comment) The way we token also clearly differs from the official Python tokenizing stuff. (Example in the comment I linked.) |
It seems the discussion is happening in parallel, here and in #4081 I looks to me like a single So my proposal is to add This way, we won't break llama.cpp bindings when adding function arguments ( we/I did accidentally break llama-cpp-python by adding |
How about doing that but make it take a struct with options? Sort of like how the quantize stuff has |
I was about to say that, but I thought it would be to radical of a change But this is a good idea, Vulkan for example uses that to solve it's problem of ungodly amount of parameters in function calls. |
I added it as option 3. I think, after adding that function we would still need to decide what to do with the existing function, because its API is not in a good shape given current needs. |
It could call the advanced function with the options for special and add_bos set appropriately. Possibly also deprecate it if it really just doesn't have a use case anymore. I guess if we were going to say that then we should probably just break the API and replace the existing function with the proposed advanced one. |
Once HF supports add_prefix_space for SPM (#3538 (comment)), YiTokenizer should no longer be needed upstream (AFAIK) and we can store the value of add_prefix_space in the GGUF for both SPM and GPT2 tokenizers. |
Would anybody be averse to me implementing Option 1? My code does incremental tokenization/decoding, and I'm getting weird spaces where they should not be appearing. I'd rather fix this now than work around it. |
I believe the original problem discussed in this issue has been resolved: Lines 7096 to 7100 in ddb008d
Can you give a specific example and how you use |
Changing this to
I expect I could trim prefix spaces from the added partial text, or throw away the first 29871 token from the result. However, this behavior is surprising and complicates the client. I think it would improve the API to not add a space in this case. |
This tokenization matches what I get when I use the SentencePiece implementation via Python and LLaMA tokenizer: from sentencepiece import SentencePieceProcessor
tokenizer = SentencePieceProcessor(dir_tokenizer + '/tokenizer.model')
print(tokenizer.encode('I saw', add_bos=True))
# [1, 306, 4446]
print(tokenizer.encode(' Sam'))
# [29871, 3685] I think it is important to match the reference implementation API so the leading For example, in this Python example, how would you use the SentencePiece tokenizer to achieve your goal? If you can provide an example to do that, we could decide on approach to extend |
You have to set I do think the feature is important though, especially when dealing with prompt formats where there's a special token followed by non-space characters. For example, a user's input in the ChatML format starts with a |
Here's a relevant issue on SentencePiece: google/sentencepiece#282. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Not stale - spaces are still inserted by default with special=False, and this is undesirable when joining multiple tokenized texts. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
I believe, this is still an issue. Although for my builds, I disable this space insertion altogether. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Current Behavior
Since #2810, space is inserted into any non-empty text. This breaks multiple use cases:
Recently, space insertion was disabled in the case when text representation of special tokens is recognized:
llama.cpp/llama.cpp
Line 6729 in 1a15955
main
example when escape processing is enabled, but leaves other scenarios broken. In particular, when special token identifiers are added to the prompt by client and passed toserver
(which is a proper way to handle added tokens), it inserts space into each piece of text between special tokens.Space insertion was made to match original Python implementation, but that behavior is itself not optimal, as evidenced by people having to hack around it.
Proposals
Option 1
Insert space only when BOS token is also inserted. There is a clear intersection between cases where BOS and space need to be inserted: when the whole prompt is one chunk of text. All broken cases that I listed here involve splitting and recombining the prompt.
The argument
bos
can be optionally renamed to something likefull_prompt
, or inversepartial
, with meaning that would encompass the behaviors controlled by it.Option 2
Add a separate argument that controls insertion of space. It would be used by /tokenize in server and in other places.
Option 3
Add another tokenization function with options that would control many aspects of the process. Suggested by @staviq.
The text was updated successfully, but these errors were encountered: