Incorrect Tokenization Output for bge-large-zh-v1.5 Model #494

@gaohongkui

System Info

{"model_id":"bge-large-zh-v1.5/","model_sha":null,"model_dtype":"float16","model_type":{"embedding":{"pooling":"cls"}},"max_concurrent_requests":512,"max_input_length":512,"max_batch_tokens":16384,"max_batch_requests":null,"max_client_batch_size":32,"auto_truncate":true,"tokenization_workers":10,"version":"1.6.0","sha":"f0e491a290385ef06f0871d188b21c0308ba86d6","docker_label":"sha-f0e491a"}

Description
When using the /tokenize endpoint with the bge-large-zh-v1.5 model deployed via text-embeddings-inference, the returned text, start, and stop fields do not align with the actual token IDs. This behavior differs from the results produced by the transformers library.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Steps to Reproduce

  1. Deploy BAAI/bge-large-zh-v1.5 using text-embeddings-inference.
  2. Send a tokenization request:
curl localhost/tokenize -X POST -d '{"inputs":"北京天安门"}' -H 'Content-Type: application/json'
  3. Observe the response:
[[
  {"id":101,"text":"[CLS]","special":true,"start":null,"stop":null},
  {"id":1266,"text":" 北京天","special":false,"start":0,"stop":3},
  {"id":776,"text":"安门","special":false,"start":3,"stop":6},
  {"id":1921,"text":"","special":false,"start":6,"stop":9},
  {"id":2128,"text":"","special":false,"start":9,"stop":12},
  {"id":7305,"text":"","special":false,"start":12,"stop":15},
  {"id":102,"text":"[SEP]","special":true,"start":null,"stop":null}
]]

Expected behavior

The correct tokenization (verified with transformers) should be:

101 -> [CLS]
1266 -> 北
776 -> 京
1921 -> 天
2128 -> 安
7305 -> 门
102 -> [SEP]
  • Each token ID should map to a single character in the input text.
  • start/stop offsets should align with character boundaries (e.g., 1266 corresponds to 北 at positions 0-1).

Actual Behavior

  • Token 1266 incorrectly maps to 北京天 (positions 0-3) instead of 北 (positions 0-1).
  • Tokens 1921, 2128, 7305 return empty text values despite valid IDs.
  • Offset positions (e.g., start:6, stop:9 for token 1921) do not match the expected single-character alignment.

Additional Context

  • transformers code showing correct behavior:
    from transformers import AutoTokenizer

    text = "北京天安门"
    tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-zh-v1.5")
    result = tokenizer(text, return_offsets_mapping=True)
    for token_id, (start, stop) in zip(result["input_ids"], result["offset_mapping"]):
        print(f"{token_id} -> {text[start:stop]}")
  • Suspected issue: Incorrect handling of offset mappings or token-to-text alignment in the tokenization output formatting.
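One observation supporting the offset-handling suspicion: the spans in the actual response (0-3, 3-6, 6-9, 9-12, 12-15) exactly match the UTF-8 byte widths of the five characters, so the server may be emitting byte offsets that are then applied as character indices when building the text field. A quick sketch of this hypothesis (not confirmed against the text-embeddings-inference source):

```python
text = "北京天安门"

# Compute the UTF-8 byte span of each character (each CJK char here is 3 bytes).
byte_offsets = []
pos = 0
for ch in text:
    width = len(ch.encode("utf-8"))
    byte_offsets.append((pos, pos + width))
    pos += width

print(byte_offsets)  # [(0, 3), (3, 6), (6, 9), (9, 12), (12, 15)] — the spans in the buggy response

# Slicing the *character* string with those byte spans reproduces every
# incorrect `text` value in the response:
print(text[0:3])  # 北京天 — matches token 1266
print(text[3:6])  # 安门 — matches token 776
print(text[6:9])  # '' — matches the empty text for tokens 1921, 2128, 7305
```

If this holds, converting the byte offsets back to character offsets (or slicing the UTF-8 bytes instead of the character string) before formatting the response would fix both the wrong text values and the misaligned start/stop fields.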

Environment

  • text-embeddings-inference version: 1.6.0 (sha f0e491a, per System Info above)
  • Deployment method: Docker
  • Model: BAAI/bge-large-zh-v1.5

Let me know if you need further details to investigate this! 🙌
