System Info
{"model_id":"bge-large-zh-v1.5/","model_sha":null,"model_dtype":"float16","model_type":{"embedding":{"pooling":"cls"}},"max_concurrent_requests":512,"max_input_length":512,"max_batch_tokens":16384,"max_batch_requests":null,"max_client_batch_size":32,"auto_truncate":true,"tokenization_workers":10,"version":"1.6.0","sha":"f0e491a290385ef06f0871d188b21c0308ba86d6","docker_label":"sha-f0e491a"}
Description
When using the `/tokenize` endpoint with the bge-large-zh-v1.5 model deployed via text-embeddings-inference, the returned `text`, `start`, and `stop` fields do not align with the actual token IDs. Tokenizing the same input with the transformers library produces correct, per-character alignments.
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
Steps to Reproduce
- Deploy `BAAI/bge-large-zh-v1.5` using text-embeddings-inference.
- Send a tokenization request:

```shell
curl localhost/tokenize -X POST -d '{"inputs":"北京天安门"}' -H 'Content-Type: application/json'
```

- Observe the response:

```json
[[
{"id":101,"text":"[CLS]","special":true,"start":null,"stop":null},
{"id":1266,"text":" 北京天","special":false,"start":0,"stop":3},
{"id":776,"text":"安门","special":false,"start":3,"stop":6},
{"id":1921,"text":"","special":false,"start":6,"stop":9},
{"id":2128,"text":"","special":false,"start":9,"stop":12},
{"id":7305,"text":"","special":false,"start":12,"stop":15},
{"id":102,"text":"[SEP]","special":true,"start":null,"stop":null}
```
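The same request can also be reproduced from Python. A minimal sketch, assuming the `requests` package is installed and the server listens on port 80 of localhost as in the curl command above:

```python
import requests

# Assumes a text-embeddings-inference server hosting bge-large-zh-v1.5
# on localhost:80, matching the curl command above.
resp = requests.post(
    "http://localhost/tokenize",
    json={"inputs": "北京天安门"},
    timeout=10,
)
resp.raise_for_status()

# The endpoint returns one token list per input; print each token's fields.
for token in resp.json()[0]:
    print(token["id"], repr(token["text"]), token["start"], token["stop"])
```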
Expected Behavior
The correct tokenization (verified via transformers) should produce:
```
101 -> [CLS]
1266 -> 北
776 -> 京
1921 -> 天
2128 -> 安
7305 -> 门
102 -> [SEP]
```
- Each token ID should map to a single character in the input text.
- `start`/`stop` offsets should align with character boundaries (e.g., `1266` corresponds to `北` at positions 0-1).
Actual Behavior
- Token `1266` incorrectly maps to `北京天` (positions 0-3) instead of `北` (positions 0-1).
- Tokens `1921`, `2128`, and `7305` return empty `text` values despite valid IDs.
- Offset positions (e.g., `start: 6, stop: 9` for token `1921`) do not match the expected single-character alignment.
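One pattern worth noting (an observation, not a confirmed root cause): the reported offsets are consistent with UTF-8 byte offsets rather than character offsets. Each of the five Chinese characters occupies 3 bytes in UTF-8, and slicing the 5-character input string *by character position* using those byte-based values reproduces the truncated and empty `text` fields in the response above:

```python
# Each character of "北京天安门" is 3 bytes in UTF-8, so byte offsets for
# the five tokens would be (0,3), (3,6), (6,9), (9,12), (12,15) --
# exactly the start/stop pairs returned by /tokenize.
text = "北京天安门"
byte_offsets = [(0, 3), (3, 6), (6, 9), (9, 12), (12, 15)]

for start, stop in byte_offsets:
    # Slicing a 5-character string by characters with byte-based indices
    # yields "北京天", "安门", and then empty strings -- matching the
    # `text` values in the observed response.
    print(start, stop, repr(text[start:stop]))
```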
Additional Context
transformers code showing correct behavior:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-zh-v1.5")
text = "北京天安门"
result = tokenizer(text, return_offsets_mapping=True)
for token_id, (start, stop) in zip(result["input_ids"], result["offset_mapping"]):
    print(f"{token_id} -> {text[start:stop]}")
```
- Suspected issue: Incorrect handling of offset mappings or token-to-text alignment in the tokenization output formatting.
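To help narrow this down, the tokenizer itself appears to produce sensible per-character offsets through the Python bindings of the tokenizers library. A sketch, assuming network access to the Hub; note that, if I read the bindings correctly, the Python side reports character-based offsets while the underlying Rust crate (which text-embeddings-inference uses) stores byte offsets, which would fit the byte-offset pattern described above:

```python
from tokenizers import Tokenizer

# Load the same tokenizer.json that text-embeddings-inference serves.
tokenizer = Tokenizer.from_pretrained("BAAI/bge-large-zh-v1.5")
encoding = tokenizer.encode("北京天安门")

# In the Python bindings, offsets are character-based: one character per token.
for token_id, token, (start, stop) in zip(
    encoding.ids, encoding.tokens, encoding.offsets
):
    print(token_id, token, start, stop)
```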
Environment
- text-embeddings-inference version: 1.6.0 (sha `f0e491a`, per the System Info above)
- Deployment method: Docker (`docker_label: sha-f0e491a`)
- Model: `BAAI/bge-large-zh-v1.5`
Let me know if you need further details to investigate this! 🙌