OpenVINO Version
2024.3
Operating System
Windows System
Device used for inference
Intel UHD Graphics GPU
Framework
None
Model used
meta-llama/Llama-3.2-3B-Instruct
Issue description
I deployed the Llama 3.2 3B model using the image openvino/model_server:latest-gpu, following the documentation here:
https://docs.openvino.ai/2024/openvino-workflow/model-server/ovms_demos_continuous_batching.html
and the folder structure for the OpenVINO IR model described here:
https://github.com/openvinotoolkit/model_server/blob/main/docs/models_repository.md
The command in my docker-compose is:
command: --model_path /workspace/Llama-3.2-3B-Instruct --model_name meta-llama/Llama-3.2-3B-Instruct --port 9001 --rest_port 8001 --target_device GPU
From the logs in the container I see that the server loads the model and starts correctly. Indeed, if I call the API http://localhost:8001/v1/config I obtain:
{
  "meta-llama/Llama-3.2-3B-Instruct": {
    "model_version_status": [
      {
        "version": "1",
        "state": "AVAILABLE",
        "status": {
          "error_code": "OK",
          "error_message": "OK"
        }
      }
    ]
  }
}
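For reference, a minimal Python sketch of the same check (the port and model name are taken from the compose command above):

import requests

# Query the model server's REST config endpoint to confirm the model is loaded.
resp = requests.get("http://localhost:8001/v1/config")
resp.raise_for_status()
config = resp.json()

# The served model should report version 1 in state AVAILABLE, as shown above.
status = config["meta-llama/Llama-3.2-3B-Instruct"]["model_version_status"][0]
print(status["state"])  # "AVAILABLE"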
However, when I call the completions endpoint I get a 404:
{
  "error": "Model with requested name is not found"
}
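A minimal sketch of such a request (assuming the /v3/chat/completions path used in the continuous batching demo linked above; the payload follows the OpenAI API schema):

import requests

payload = {
    # The model field must match the --model_name the server was started with.
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32,
}
resp = requests.post("http://localhost:8001/v3/chat/completions", json=payload)
print(resp.status_code)  # 404
print(resp.json())       # {"error": "Model with requested name is not found"}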
Step-by-step reproduction
No response
Relevant log output
No response
Issue submission checklist
Please note that this demo was officially validated on 4th and 5th Gen Intel® Xeon® processors and on Intel discrete GPUs (Arc and Flex) on Ubuntu 22/24 and Red Hat 8/9. Other OS/hardware combinations might work, but issues are to be expected.

@Iffa-Intel thanks for the reply.
Actually, the GPU is detected correctly from the Docker container running on WSL2 Ubuntu 22.
The model also runs correctly locally on Windows with the OVModelForCausalLM class from optimum-intel in Python:

from optimum.intel import OVModelForCausalLM

model_id = "Fede90/llama-3.2-3b-instruct-INT4"
model = OVModelForCausalLM.from_pretrained(model_id, device="GPU.0", trust_remote_code=True)

@fedecompa we'll further investigate and clarify this and get back to you. This is probably related to the architecture of WSL2 on Windows vs. native Ubuntu, which affects how the OpenVINO library functions.
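A quick way to compare what OpenVINO sees in each environment (a sketch using the standard openvino Python API; run it both inside the WSL2 container and natively on Windows):

import openvino as ov

core = ov.Core()
# List the devices OpenVINO can see, e.g. ['CPU', 'GPU'] or ['CPU', 'GPU.0'].
print(core.available_devices)
for device in core.available_devices:
    # FULL_DEVICE_NAME tells an integrated UHD GPU apart from a discrete Arc/Flex card.
    print(device, core.get_property(device, "FULL_DEVICE_NAME"))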