Incorrect Tokenization Output for bge-large-zh-v1.5 Model #494

@gaohongkui

System Info

{"model_id":"bge-large-zh-v1.5/","model_sha":null,"model_dtype":"float16","model_type":{"embedding":{"pooling":"cls"}},"max_concurrent_requests":512,"max_input_length":512,"max_batch_tokens":16384,"max_batch_requests":null,"max_client_batch_size":32,"auto_truncate":true,"tokenization_workers":10,"version":"1.6.0","sha":"f0e491a290385ef06f0871d188b21c0308ba86d6","docker_label":"sha-f0e491a"}

Description
When using the /tokenize endpoint with the bge-large-zh-v1.5 model deployed via text-embeddings-inference, the returned text, start, and stop fields do not align with the actual token IDs. This behavior differs from the results produced by the transformers library.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Steps to Reproduce

  1. Deploy BAAI/bge-large-zh-v1.5 using text-embeddings-inference.
  2. Send a tokenization request:
curl localhost/tokenize -X POST -d '{"inputs":"北京天安门"}' -H 'Content-Type: application/json'
  3. Observe the response:
[[
  {"id":101,"text":"[CLS]","special":true,"start":null,"stop":null},
  {"id":1266,"text":" 北京天","special":false,"start":0,"stop":3},
  {"id":776,"text":"安门","special":false,"start":3,"stop":6},
  {"id":1921,"text":"","special":false,"start":6,"stop":9},
  {"id":2128,"text":"","special":false,"start":9,"stop":12},
  {"id":7305,"text":"","special":false,"start":12,"stop":15},
  {"id":102,"text":"[SEP]","special":true,"start":null,"stop":null}
]]

Expected behavior

The correct tokenization (verified with transformers) should be:

101 -> [CLS]
1266 -> 北
776 -> 京
1921 -> 天
2128 -> 安
7305 -> 门
102 -> [SEP]
  • Each token ID should map to a single character in the input text.
  • start/stop offsets should align with character boundaries (e.g., 1266 corresponds to 北 at positions 0-1).

Actual Behavior

  • Token 1266 incorrectly maps to 北京天 (positions 0-3) instead of 北 (positions 0-1).
  • Tokens 1921, 2128, 7305 return empty text values despite valid IDs.
  • Offset positions (e.g., start:6, stop:9 for token 1921) do not match the expected single-character alignment.

Additional Context

  • transformers code showing correct behavior:
    from transformers import AutoTokenizer

    text = "北京天安门"
    tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-zh-v1.5")
    result = tokenizer(text, return_offsets_mapping=True)
    for token_id, (start, stop) in zip(result["input_ids"], result["offset_mapping"]):
        print(f"{token_id} -> {text[start:stop]}")
  • Suspected issue: Incorrect handling of offset mappings or token-to-text alignment in the tokenization output formatting.
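One observation supporting the offset-handling suspicion: the spans in the actual response (0-3, 3-6, 6-9, 9-12, 12-15) exactly match the UTF-8 byte widths of the five characters, so the server may be emitting byte offsets that are then applied as character indices when building the text field. A quick sketch of this hypothesis (not confirmed against the text-embeddings-inference source):

```python
text = "北京天安门"

# Compute the UTF-8 byte span of each character (each CJK char here is 3 bytes).
byte_offsets = []
pos = 0
for ch in text:
    width = len(ch.encode("utf-8"))
    byte_offsets.append((pos, pos + width))
    pos += width

print(byte_offsets)  # [(0, 3), (3, 6), (6, 9), (9, 12), (12, 15)] — the spans in the buggy response

# Slicing the *character* string with those byte spans reproduces every
# incorrect `text` value in the response:
print(text[0:3])  # 北京天 — matches token 1266
print(text[3:6])  # 安门 — matches token 776
print(text[6:9])  # '' — matches the empty text for tokens 1921, 2128, 7305
```

If this holds, converting the byte offsets back to character offsets (or slicing the UTF-8 bytes instead of the character string) before formatting the response would fix both the wrong text values and the misaligned start/stop fields.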

Environment

  • text-embeddings-inference version: 1.6.0 (sha f0e491a, per System Info above)
  • Deployment method: Docker
  • Model: BAAI/bge-large-zh-v1.5

Let me know if you need further details to investigate this! 🙌
