Attempt to pipe logit_bias to sampler's embedding_bias #1279
Conversation
I think I'm passing an array of arrays of logit biases, so that will probably need to be changed.
Please rebase this branch to current main, then I will build and test this. I also want to get LMQL working with vLLM, because performance is not good with any of the other backends.
Sure thing. I'll rebase tomorrow.
No idea how this got closed. Reopened and rebased. Finally got it built too!
Looks good, added minor comments.
The client must know the vocabulary and the vocab_size in order to pass a logit_bias array that works with the loaded model. Right now the client has to load the tokenizer corresponding to the model loaded into the vLLM server to achieve this. This makes clients more complex, and there is a risk of using the wrong tokenizer for the model, ending up with errors.
I suggest exposing two new REST API calls:
- vocabulary: Returns the vocabulary (array of strings). The vocab_size is the length of this array. It may also return the special tokens, stop tokens, start/end of conversation tokens, etc.
- tokenize: Accepts text and returns an array of integers (token values).
This way no tokenizer needs to be built on the client side and there is no way to use the wrong vocabulary. The client can build the logit_bias based on the information returned (the client only needs to fetch the vocabulary once on initialization).
Having these would allow for a pretty straightforward test case and easier integration with LMQL.
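A minimal sketch of what these two endpoints could look like, assuming a FastAPI-style server and a Hugging Face tokenizer; the paths, response fields, and example model name are illustrative, not part of vLLM's actual API:

```python
# Hypothetical sketch of the proposed endpoints. Paths and response fields are
# assumptions; a real implementation would reuse the tokenizer the vLLM server
# has already loaded instead of loading one here.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example model


class TokenizeRequest(BaseModel):
    text: str


@app.get("/vocabulary")
def vocabulary() -> dict:
    # Returned as a token -> id map here; vocab_size is simply its length.
    vocab = tokenizer.get_vocab()
    return {
        "vocab": vocab,
        "vocab_size": len(vocab),
        "eos_token_id": tokenizer.eos_token_id,
    }


@app.post("/tokenize")
def tokenize(req: TokenizeRequest) -> dict:
    # Plain token ids, without special tokens.
    return {"token_ids": tokenizer.encode(req.text, add_special_tokens=False)}
```

With these two calls a client could fetch the vocabulary once at startup and build a logit_bias array locally, without bundling a tokenizer of its own.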
stop_token_ids: Optional[List[int]] = None,
ignore_eos: bool = False,
max_tokens: int = 16,
logit_bias: float = [],
Proper type is: Optional[List[float]]
self.max_tokens = max_tokens
self.logprobs = logprobs
self.prompt_logprobs = prompt_logprobs
self.logit_bias = logit_bias
Define the type here as well.
Also add self.logit_bias to __repr__ below.
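Taken together, the type and __repr__ comments could look roughly like this; a sketch under the assumption that None (rather than a shared mutable []) is an acceptable default, not the actual vLLM class:

```python
from typing import List, Optional


class SamplingParams:
    # Sketch of the reviewed changes only: typed parameter, typed attribute,
    # and logit_bias included in __repr__.
    def __init__(
        self,
        max_tokens: int = 16,
        logit_bias: Optional[List[float]] = None,
    ) -> None:
        self.max_tokens = max_tokens
        self.logit_bias: Optional[List[float]] = logit_bias

    def __repr__(self) -> str:
        return (
            f"SamplingParams(max_tokens={self.max_tokens}, "
            f"logit_bias={self.logit_bias})"
        )
```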
logit_biases: any = []
for seq_group in input_metadata.seq_groups:
    set_ids, sampling_params = seq_group
    logit_biases += [sampling_params.logit_bias]
Add validation of the size of the logit_bias array received from sampling_params. Having an explicit error with a clear explanation here is better than ending up with a cryptic PyTorch error message while trying to add arrays of mismatching sizes later.
(It cannot be validated inside SamplingParams, because the right size is not known there.)
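A sketch of the kind of check meant here; the helper name is mine, and it assumes the sampler knows vocab_size:

```python
from typing import List, Optional


def validate_logit_bias(logit_bias: Optional[List[float]], vocab_size: int) -> None:
    """Fail with a clear message instead of a cryptic PyTorch size-mismatch error."""
    if logit_bias and len(logit_bias) != vocab_size:
        raise ValueError(
            f"logit_bias has {len(logit_bias)} entries, but the model's "
            f"vocabulary size is {vocab_size}"
        )


# Usage inside the loop shown in the diff above (sketch):
# for seq_group in input_metadata.seq_groups:
#     seq_ids, sampling_params = seq_group
#     validate_logit_bias(sampling_params.logit_bias, vocab_size)
#     logit_biases += [sampling_params.logit_bias]
```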
@viktor-ferenczi I've been trying to test this locally, but it doesn't seem like the logit_bias parameter is actually working :( Were you able to see the logit_bias taking effect?
@viktor-ferenczi I may be missing something, but why do they need to know the vocab size? Isn't logit_bias a sparse mapping of token ids to bias to apply?
I did not know it is a sparse mapping; I expected it to be an array of floats with vocab_size elements.

Anyway, the client must be able to retrieve the vocabulary of the currently loaded model for the solution to be fully usable. Otherwise the client must instantiate the tokenizer of the exact same model loaded into the vLLM server just to get the vocabulary, which means a lot more dependencies and room for error, not to mention the added loading time.
Looks like on OAI's API it's a map of token_ids to bias values: https://platform.openai.com/docs/api-reference/chat/create#chat-create-logit_bias
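For comparison, the OpenAI-style logit_bias is a sparse {token_id: bias} map; if this PR keeps a dense per-vocabulary array internally, the conversion could look like this (a sketch, not code from the PR; the example ids and biases are arbitrary):

```python
from typing import Dict, List


def sparse_to_dense_bias(logit_bias: Dict[int, float], vocab_size: int) -> List[float]:
    """Expand an OpenAI-style {token_id: bias} map into a dense per-token array."""
    dense = [0.0] * vocab_size
    for token_id, bias in logit_bias.items():
        if not 0 <= token_id < vocab_size:
            raise ValueError(f"token id {token_id} is outside the vocabulary")
        dense[token_id] = bias
    return dense


# Example: strongly discourage token 50256 (GPT-2's EOS id), mildly favor token 11.
dense_bias = sparse_to_dense_bias({50256: -100.0, 11: 2.0}, vocab_size=50257)
```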
@viktor-ferenczi @benbot did you guys come to a consensus on how to go about implementing this? Loading the tokenizer vocab will be tricky, but then again this feature is meant for rather advanced use cases, and we could just leave it to the API consumer to figure out the mapping between token_ids and tokens (that's what OpenAI did anyway).
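If the mapping is left to the API consumer, the client-side step could look like this; a sketch that assumes the client has transformers installed and knows which model the server is running (the model name and bias values are examples):

```python
# Client-side sketch: build an OpenAI-style {token_id: bias} map using the same
# tokenizer as the model served by vLLM.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # replace with the served model

# Discourage " sorry" and encourage " yes"; the leading space matters for many
# BPE tokenizers, since words are usually tokenized together with it.
logit_bias = {}
for text, bias in [(" sorry", -100.0), (" yes", 5.0)]:
    for token_id in tokenizer.encode(text, add_special_tokens=False):
        logit_bias[token_id] = bias
```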
There is a
Depending on your use case, using a grammar would be a much cleaner alternative than trying to manipulate logits; it also moves the problem of handling tokens entirely into the server. See #2105: Add grammars.
Covered by #3027
logit_bias is an important feature of the OpenAI API that vLLM seems to have implemented but not exposed in the actual API. This is me taking a crack at exposing that functionality.
For the life of me, I can't get my CUDA versions to all agree to build this locally, so while I try to do that I'm opening the PR for others to try out.
Should resolve: #379