🚀 The feature, motivation and pitch
Currently the online API for embeddings allows you to pass a parameter to control truncation:
```python
class EmbeddingCompletionRequest(OpenAIBaseModel):
    ...
    truncate_prompt_tokens: Optional[Annotated[int, Field(ge=1)]] = None
```
This parameter, if given, must respect the constraint `0 < truncate_prompt_tokens <= max_seq_len`, where `max_seq_len` is the maximum prompt length the model supports. This forces clients to call `/v1/models` first to find out the maximum model length, so they don't exceed the limit and get a 400 error. In practice, the client has two options:
- Call `/v1/models` once and store the result somewhere, which requires the client to be stateful
- Call `/v1/models` for every embedding operation and pay the price of two network round trips (see the sketch below)
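For illustration, a rough sketch of option 2 using plain `requests` (the base URL, the model index, and the `max_model_len` field on the `/v1/models` response are assumptions for the example, not guaranteed API contracts):

```python
# Sketch of option 2: two network round trips per embedding request.
import requests

BASE_URL = "http://localhost:8000"  # assumed vLLM server address

# Round trip 1: discover the maximum model length.
models = requests.get(f"{BASE_URL}/v1/models").json()
model_card = models["data"][0]
max_model_len = model_card["max_model_len"]  # assuming this field is exposed

# Round trip 2: embed, clamping truncation to the discovered limit.
resp = requests.post(
    f"{BASE_URL}/v1/embeddings",
    json={
        "model": model_card["id"],
        "input": "a very long document ...",
        "truncate_prompt_tokens": max_model_len,
    },
)
embedding = resp.json()["data"][0]["embedding"]
```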
Other inference frameworks, such as caikit, allow users to specify -1 to truncate at `max_seq_len` automatically.
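A minimal sketch of what the proposed -1 semantics could look like (names are illustrative, not actual vLLM internals):

```python
# Illustrative only: map -1 to the model's maximum length, otherwise enforce the
# existing constraint 0 < truncate_prompt_tokens <= max_seq_len.
from typing import Optional


def resolve_truncation(truncate_prompt_tokens: Optional[int], max_seq_len: int) -> Optional[int]:
    if truncate_prompt_tokens is None:
        return None  # no truncation requested
    if truncate_prompt_tokens == -1:
        return max_seq_len  # truncate at the model's maximum automatically
    if not 0 < truncate_prompt_tokens <= max_seq_len:
        raise ValueError(
            f"truncate_prompt_tokens must be -1 or in (0, {max_seq_len}], "
            f"got {truncate_prompt_tokens}"
        )
    return truncate_prompt_tokens
```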
For the offline API, usability is even worse: the `embed` method doesn't even have a `truncate_prompt_tokens` parameter, forcing the developer to tokenize and truncate the inputs first:
```python
def embed(
    self,
    prompts: Union[PromptType, Sequence[PromptType]],
    /,
    *,
    use_tqdm: bool = True,
    lora_request: Optional[Union[List[LoRARequest], LoRARequest]] = None,
    prompt_adapter_request: Optional[PromptAdapterRequest] = None,
) -> List[EmbeddingRequestOutput]:
```
The same applies to the scoring and reranking functions.
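For reference, the current workaround looks roughly like this (a sketch only; the `task` value, model name, and the `max_model_len` attribute path are assumptions for the example rather than documented guarantees):

```python
# Sketch of today's workaround: tokenize and truncate before calling embed().
from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")  # assumed embedding model/task
tokenizer = llm.get_tokenizer()
max_len = llm.llm_engine.model_config.max_model_len  # assumed location of the limit

texts = ["a very long document ..."]
token_ids = [
    tokenizer(t, truncation=True, max_length=max_len)["input_ids"] for t in texts
]

# Pass the pre-truncated token IDs as TokensPrompt dicts.
outputs = llm.embed([{"prompt_token_ids": ids} for ids in token_ids])
```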
Alternatives
No response
Additional context
FYI, @gmarinho2 and I are planning to implement the suggestions in this issue.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.