vLLM Backend: Tokenizer Configuration Mismatch
Description
The vLLM backend maintains two separate tokenizers that can become misaligned whenever tokenizer-specific configuration is needed. This causes issues with features such as tool calling, which require model-specific tokenizer modes.
Originally noticed with the vLLM test, which uses Mistral (that test does emit a warning about not using the Mistral tokenizer).
Details
The backend initializes two tokenizers:
- vLLM's internal tokenizer - created with all engine arguments, including `tokenizer_mode`, `tokenizer_revision`, etc., in `AsyncLLMEngine.from_engine_args()`. This tokenizer is used to tokenize input before inference, handle output, and manage special tokens.
- The backend's separate tokenizer - created independently via `AutoTokenizer.from_pretrained()` without any of that configuration. It only receives the model id and is used to format prompts via `apply_chat_template()`, i.e. creating messages, adding tool definitions, etc. (see the sketch below).
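A minimal sketch of the current split (the model id, variable names, and structure are illustrative; the backend's actual code differs, and `tokenizer_mode="mistral"` assumes a recent vLLM version):

```python
from transformers import AutoTokenizer
from vllm import AsyncEngineArgs, AsyncLLMEngine

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # example model

# 1) vLLM's internal tokenizer: built from the full engine arguments,
#    so it sees tokenizer_mode, tokenizer_revision, etc.
engine_args = AsyncEngineArgs(model=model_id, tokenizer_mode="mistral")
engine = AsyncLLMEngine.from_engine_args(engine_args)

# 2) The backend's separate tokenizer: built from the model id alone,
#    so tokenizer_mode never reaches it and the default HF tokenizer is used.
chat_tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "What is the weather in Paris?"}]
prompt = chat_tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# `prompt` is then handed to the engine, which re-tokenizes it with its own,
# differently configured tokenizer.
```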
Impact
If we pass tokenizer configuration (e.g., `tokenizer_mode="mistral"`) to improve tool-calling reliability, only vLLM's internal tokenizer receives it; the backend's tokenizer keeps its default settings. The resulting mismatch can cause problems with special tokens and tool-calling tokens (e.g., with Mistral), potentially making things worse than not setting a tokenizer mode at all.
Fix (?)
I think we could use vLLM's tokenizer for both prompt formatting and generation, so the same configuration is applied throughout.
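A hedged sketch of that direction, assuming `AsyncLLMEngine.get_tokenizer()` (an async method in recent vLLM versions) returns the engine's configured tokenizer; with `tokenizer_mode="mistral"` the returned object is vLLM's Mistral wrapper, whose chat-template interface is not identical to the Hugging Face one, so the formatting call may need adjusting per tokenizer type:

```python
from vllm import AsyncEngineArgs, AsyncLLMEngine

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # example model
engine_args = AsyncEngineArgs(model=model_id, tokenizer_mode="mistral")
engine = AsyncLLMEngine.from_engine_args(engine_args)


async def format_prompt(messages: list[dict]) -> str:
    # Reuse the engine's own tokenizer instead of a second AutoTokenizer,
    # so prompt formatting and generation share one configuration.
    tokenizer = await engine.get_tokenizer()
    # Assumes an HF-style apply_chat_template(); vLLM's Mistral wrapper may
    # return token ids instead of a string, so this may need a branch.
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```

This keeps every tokenizer-specific flag in one place (the engine arguments), so anything passed there is automatically reflected in prompt formatting as well.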
References
Found when working on #416