
vLLM backend uses two different tokenizers #431

@planetf1


vLLM Backend: Tokenizer Configuration Mismatch

Description

The vLLM backend initializes two separate tokenizers that can become misaligned when tokenizer-specific configuration is required. This causes problems with features such as tool calling, which depends on model-specific tokenizer modes.

Originally noticed with the vLLM test that uses Mistral (the test emits a warning about not using the Mistral tokenizer).

Details

The backend initializes two tokenizers:

  1. vLLM's internal tokenizer - created in AsyncLLMEngine.from_engine_args() with all engine arguments, including tokenizer_mode, tokenizer_revision, etc. It is used to tokenize input before inference, handle output, and manage special tokens.
  2. The backend's separate tokenizer - created independently via AutoTokenizer.from_pretrained() with no configuration beyond the model id. It is used to format prompts via apply_chat_template(), build messages, add tool definitions, etc. (see the sketch below).
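
A minimal sketch of the two initialization paths, using illustrative names (model_id, engine_args, chat_tokenizer are not the backend's actual identifiers):

```python
from transformers import AutoTokenizer
from vllm import AsyncEngineArgs, AsyncLLMEngine

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # example model, not the test's exact config

# 1. vLLM's internal tokenizer: built from the full engine configuration.
engine_args = AsyncEngineArgs(
    model=model_id,
    tokenizer_mode="mistral",  # tokenizer-specific options only reach this path
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

# 2. The backend's separate tokenizer: only the model id, no tokenizer options.
chat_tokenizer = AutoTokenizer.from_pretrained(model_id)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Prompt formatting (messages, tool definitions, generation prompt) happens here,
# entirely outside the tokenizer that vLLM will actually use for inference.
prompt = chat_tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
    add_generation_prompt=True,
    tokenize=False,
)
```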

Impact

If we pass tokenizer configuration (e.g., tokenizer_mode="mistral") to improve tool-calling reliability, only vLLM's internal tokenizer receives it. The backend's tokenizer keeps its default settings, so the two can diverge in how they handle special tokens and tool-calling tokens (as with Mistral), potentially making results worse than not setting a special tokenizer mode at all.
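
Continuing the sketch above, a hypothetical diagnostic makes the divergence visible: with tokenizer_mode="mistral" the engine holds vLLM's Mistral tokenizer wrapper, while AutoTokenizer.from_pretrained() returns a standard Hugging Face tokenizer with its own chat template (AsyncLLMEngine.get_tokenizer() is async in current vLLM; the exact accessor may differ by version):

```python
import asyncio

from transformers import AutoTokenizer

async def show_tokenizer_mismatch(engine, model_id):
    engine_tok = await engine.get_tokenizer()           # what vLLM tokenizes/decodes with
    chat_tok = AutoTokenizer.from_pretrained(model_id)  # what the backend formats prompts with
    # Different classes imply different chat templates and special/tool-call tokens,
    # which is exactly where tool calling goes wrong.
    print(type(engine_tok).__name__, "vs", type(chat_tok).__name__)

# asyncio.run(show_tokenizer_mismatch(engine, model_id))
```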

Fix (?)

I think we could use vLLM's tokenizer for both prompt formatting and generation so the two stages stay consistent.
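
A minimal sketch of that direction, assuming the backend can reach the engine's tokenizer and that the returned object exposes apply_chat_template() (vLLM's tokenizer wrappers generally do, but this is worth verifying for the Mistral mode):

```python
async def format_prompt(engine, messages, tools=None):
    # Reuse the tokenizer vLLM already built from the engine arguments,
    # so prompt formatting and generation share one configuration.
    tokenizer = await engine.get_tokenizer()
    return tokenizer.apply_chat_template(
        messages,
        tools=tools,
        add_generation_prompt=True,
        tokenize=False,
    )
```

This would remove the second AutoTokenizer.from_pretrained() call entirely, so options like tokenizer_mode only need to be passed once.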

References

Found when working on #416
