Description
System Info
Offline and air-gapped environment
OS version: RHEL 8.19
Model: bge-m3
Hardware: NVIDIA T4 GPU
Deployment: Kubernetes (KServe)
Current version: turing-1.6
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
1. When I serve the BGE-M3 model using KServe with the turing-1.6 image and pytorch_model.bin, it works normally.
Image: turing-1.6
Weights: pytorch_model.bin
YAML:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: gembed
  namespace: kserve
spec:
  predictor:
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
    containers:
      - args:
          - '--model-id'
          - /data
        env:
          - name: HUGGINGFACE_HUB_CACHE
            value: /data
        image: ghcr.io/huggingface/text-embeddings-inference:turing-latest
        imagePullPolicy: IfNotPresent
        name: kserve-container
        ports:
          - containerPort: 8080
            protocol: TCP
        resources:
          limits:
            cpu: '1'
            memory: 4Gi
            nvidia.com/gpu: '1'
          requests:
            cpu: '1'
            memory: 1Gi
            nvidia.com/gpu: '1'
        volumeMounts:
          - name: gembed-onnx-volume
            mountPath: /data
    maxReplicas: 1
    minReplicas: 1
    volumes:
      - name: gembed-onnx-volume
        persistentVolumeClaim:
          claimName: gembed-onnx-pv-claim
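For completeness, the manifest is applied in the usual way (a minimal sketch, assuming it is saved as gembed.yaml):
kubectl apply -f gembed.yaml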
2. However, when I switch to the ONNX model, I get the following error:
k logs -f gembed-predictor-00002-deployment-7d4b8d6f67-8g9xf -n kserve
2025-03-27T11:58:40.694775Z INFO text_embeddings_router: router/src/main.rs:185: Args { model_id: "/****", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hf_token: None, hostname: "gembed-predictor-00002-deployment-7d4b8d6f67-8g9xf", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
2025-03-27T11:58:40.698252Z WARN text_embeddings_router: router/src/lib.rs:403: The --pooling arg is not set and we could not find a pooling configuration (1_Pooling/config.json) for this model but the model is a BERT variant. Defaulting to CLS pooling.
2025-03-27T11:58:41.365892Z WARN text_embeddings_router: router/src/lib.rs:188: Could not find a Sentence Transformers config
2025-03-27T11:58:41.365911Z INFO text_embeddings_router: router/src/lib.rs:192: Maximum number of tokens per request: 8192
2025-03-27T11:58:41.366116Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:28: Starting 1 tokenization workers
2025-03-27T11:58:41.864311Z INFO text_embeddings_router: router/src/lib.rs:234: Starting model backend
2025-03-27T11:58:41.865066Z ERROR text_embeddings_backend: backends/src/lib.rs:388: Could not start Candle backend: Could not start backend: No such file or directory (os error 2)
Error: Could not create backend
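Since the Candle backend reports "No such file or directory", the contents of the mounted model directory can be checked directly from the failing pod (a minimal sketch, reusing the pod name from the logs command above):
kubectl exec -n kserve gembed-predictor-00002-deployment-7d4b8d6f67-8g9xf -- ls -la /data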
3. When I change the image to cpu-1.6 and test it, it works normally (the manifest change is shown after the log below):
2025-03-27T14:11:33.231208Z INFO text_embeddings_router: router/src/main.rs:175: Args { model_id: "/****", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: "gembed-predictor-00001-deployment-56ccb599cf-gzjp8", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
2025-03-27T14:11:33.237362Z WARN text_embeddings_router: router/src/lib.rs:392: The --pooling arg is not set and we could not find a pooling configuration (1_Pooling/config.json) for this model but the model is a BERT variant. Defaulting to CLS pooling.
2025-03-27T14:11:33.897769Z WARN text_embeddings_router: router/src/lib.rs:184: Could not find a Sentence Transformers config
2025-03-27T14:11:33.897784Z INFO text_embeddings_router: router/src/lib.rs:188: Maximum number of tokens per request: 8192
2025-03-27T14:11:33.898748Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:28: Starting 1 tokenization workers
2025-03-27T14:11:34.405665Z INFO text_embeddings_router: router/src/lib.rs:230: Starting model backend
2025-03-27T14:11:40.755400Z WARN text_embeddings_router: router/src/lib.rs:258: Backend does not support a batch size > 8
2025-03-27T14:11:40.755416Z WARN text_embeddings_router: router/src/lib.rs:259: forcing max_batch_requests=8
2025-03-27T14:11:40.755519Z WARN text_embeddings_router: router/src/lib.rs:310: Invalid hostname, defaulting to 0.0.0.0
2025-03-27T14:11:40.757444Z INFO text_embeddings_router::http::server: router/src/http/server.rs:1812: Starting HTTP server: 0.0.0.0:8080
2025-03-27T14:11:40.757456Z INFO text_embeddings_router::http::server: router/src/http/server.rs:1813: Ready
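For this test, the only change relative to the manifest above is the predictor image (a sketch; the PVC and the ONNX weights stay the same):
        image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.6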
4. When I test with the turing-latest image, I get the same error as with turing-1.6.
Expected behavior
I expected the ONNX model to start on the turing-1.6 image the same way it does on cpu-1.6. I'm not sure whether the issue is with the Turing image or with my configuration.
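To help isolate whether the Turing image or the Kubernetes configuration is at fault, the same model directory could also be run with Docker directly, outside KServe (a minimal sketch; /path/to/model is a placeholder for the directory backing the PVC):
docker run --rm --gpus all -v /path/to/model:/data \
  ghcr.io/huggingface/text-embeddings-inference:turing-1.6 --model-id /data
If this reproduces the same "Could not start Candle backend" error, the problem is independent of the KServe manifest.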