Description
System Info
Offline and air-gapped environment
OS version: RHEL 8.19
Model: bge-m3
Hardware: NVIDIA T4 GPU
Deployment: Kubernetes (KServe)
Current version: turing-1.6
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
1. When I serve the BGE-M3 model using KServe with the turing-1.6 image and pytorch_model.bin, it works normally.
Image: turing-1.6
Weights: pytorch_model.bin
YAML:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: gembed
  namespace: kserve
spec:
  predictor:
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
    containers:
      - args:
          - '--model-id'
          - /data
        env:
          - name: HUGGINGFACE_HUB_CACHE
            value: /data
        image: ghcr.io/huggingface/text-embeddings-inference:turing-latest
        imagePullPolicy: IfNotPresent
        name: kserve-container
        ports:
          - containerPort: 8080
            protocol: TCP
        resources:
          limits:
            cpu: '1'
            memory: 4Gi
            nvidia.com/gpu: '1'
          requests:
            cpu: '1'
            memory: 1Gi
            nvidia.com/gpu: '1'
        volumeMounts:
          - name: gembed-onnx-volume
            mountPath: /data
    maxReplicas: 1
    minReplicas: 1
    volumes:
      - name: gembed-onnx-volume
        persistentVolumeClaim:
          claimName: gembed-onnx-pv-claim
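For completeness, the manifest is applied in the usual way (a minimal sketch, assuming it is saved as gembed.yaml):
kubectl apply -f gembed.yaml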
2. However, when I switch to the ONNX model, I get the following error:
k logs -f gembed-predictor-00002-deployment-7d4b8d6f67-8g9xf -n kserve
2025-03-27T11:58:40.694775Z INFO text_embeddings_router: router/src/main.rs:185: Args { model_id: "/****", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hf_token: None, hostname: "gembed-predictor-00002-deployment-7d4b8d6f67-8g9xf", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
2025-03-27T11:58:40.698252Z WARN text_embeddings_router: router/src/lib.rs:403: The --pooling arg is not set and we could not find a pooling configuration (1_Pooling/config.json) for this model but the model is a BERT variant. Defaulting to CLS pooling.
2025-03-27T11:58:41.365892Z WARN text_embeddings_router: router/src/lib.rs:188: Could not find a Sentence Transformers config
2025-03-27T11:58:41.365911Z INFO text_embeddings_router: router/src/lib.rs:192: Maximum number of tokens per request: 8192
2025-03-27T11:58:41.366116Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:28: Starting 1 tokenization workers
2025-03-27T11:58:41.864311Z INFO text_embeddings_router: router/src/lib.rs:234: Starting model backend
2025-03-27T11:58:41.865066Z ERROR text_embeddings_backend: backends/src/lib.rs:388: Could not start Candle backend: Could not start backend: No such file or directory (os error 2)
Error: Could not create backend
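Since the Candle backend reports "No such file or directory", the contents of the mounted model directory can be checked directly from the failing pod (a minimal sketch, reusing the pod name from the logs command above):
kubectl exec -n kserve gembed-predictor-00002-deployment-7d4b8d6f67-8g9xf -- ls -la /data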
3. When I change the image to cpu-1.6 and test it, it works normally (the manifest change is shown after the log below):
2025-03-27T14:11:33.231208Z INFO text_embeddings_router: router/src/main.rs:175: Args { model_id: "/****", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: "gembed-predictor-00001-deployment-56ccb599cf-gzjp8", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
2025-03-27T14:11:33.237362Z WARN text_embeddings_router: router/src/lib.rs:392: The --pooling arg is not set and we could not find a pooling configuration (1_Pooling/config.json) for this model but the model is a BERT variant. Defaulting to CLS pooling.
2025-03-27T14:11:33.897769Z WARN text_embeddings_router: router/src/lib.rs:184: Could not find a Sentence Transformers config
2025-03-27T14:11:33.897784Z INFO text_embeddings_router: router/src/lib.rs:188: Maximum number of tokens per request: 8192
2025-03-27T14:11:33.898748Z INFO text_embeddings_core::tokenization: core/src/tokenization.rs:28: Starting 1 tokenization workers
2025-03-27T14:11:34.405665Z INFO text_embeddings_router: router/src/lib.rs:230: Starting model backend
2025-03-27T14:11:40.755400Z WARN text_embeddings_router: router/src/lib.rs:258: Backend does not support a batch size > 8
2025-03-27T14:11:40.755416Z WARN text_embeddings_router: router/src/lib.rs:259: forcing max_batch_requests=8
2025-03-27T14:11:40.755519Z WARN text_embeddings_router: router/src/lib.rs:310: Invalid hostname, defaulting to 0.0.0.0
2025-03-27T14:11:40.757444Z INFO text_embeddings_router::http::server: router/src/http/server.rs:1812: Starting HTTP server: 0.0.0.0:8080
2025-03-27T14:11:40.757456Z INFO text_embeddings_router::http::server: router/src/http/server.rs:1813: Ready
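For this test, the only change relative to the manifest above is the predictor image (a sketch; the PVC and the ONNX weights stay the same):
        image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.6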
4. When I test with the turing-latest image, I get the same error as with turing-1.6.
Expected behavior
I expected the ONNX model to start on the turing-1.6 image the same way it does on cpu-1.6. I'm not sure whether the issue is with the Turing image or with my configuration.
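To help isolate whether the Turing image or the Kubernetes configuration is at fault, the same model directory could also be run with Docker directly, outside KServe (a minimal sketch; /path/to/model is a placeholder for the directory backing the PVC):
docker run --rm --gpus all -v /path/to/model:/data \
  ghcr.io/huggingface/text-embeddings-inference:turing-1.6 --model-id /data
If this reproduces the same "Could not start Candle backend" error, the problem is independent of the KServe manifest.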