[Bug]: docker image openvino/model_server:latest-gpu does not serve the model correctly #27541

Open
3 tasks done
fedecompa opened this issue Nov 13, 2024 · 3 comments
Assignees
Labels
bug Something isn't working PSE support_request

Comments

@fedecompa

fedecompa commented Nov 13, 2024

OpenVINO Version

2024.3

Operating System

Windows System

Device used for inference

Intel UHD Graphics GPU

Framework

None

Model used

meta-llama/Llama-3.2-3B-Instruct

Issue description

I deployed the Llama 3.2 3B model using the image openvino/model_server:latest-gpu, following the documentation here:

https://docs.openvino.ai/2024/openvino-workflow/model-server/ovms_demos_continuous_batching.html

and the folder structure for the OpenVINO IR model described here:

https://github.com/openvinotoolkit/model_server/blob/main/docs/models_repository.md

The command in my docker-compose file is:
command: --model_path /workspace/Llama-3.2-3B-Instruct --model_name meta-llama/Llama-3.2-3B-Instruct --port 9001 --rest_port 8001 --target_device GPU

From the container logs, I can see that the server loads the model and starts correctly. Indeed, if I call the API at http://localhost:8001/v1/config, I get:

{
  "meta-llama/Llama-3.2-3B-Instruct": {
    "model_version_status": [
      {
        "version": "1",
        "state": "AVAILABLE",
        "status": {
          "error_code": "OK",
          "error_message": "OK"
        }
      }
    ]
  }
}
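
For anyone reproducing this, the config response can be checked programmatically. The following is a minimal sketch in Python that parses a response of the shape shown in this report and lists the models with at least one version in state AVAILABLE. The function name `available_models` and the `sample` variable are illustrative, not part of any OpenVINO API.

```python
import json


def available_models(config_json: str) -> list[str]:
    """Return names of models that report at least one AVAILABLE version."""
    config = json.loads(config_json)
    names = []
    for name, info in config.items():
        statuses = info.get("model_version_status", [])
        if any(s.get("state") == "AVAILABLE" for s in statuses):
            names.append(name)
    return names


# Sample payload copied from the /v1/config response in this report.
sample = """
{
  "meta-llama/Llama-3.2-3B-Instruct": {
    "model_version_status": [
      {
        "version": "1",
        "state": "AVAILABLE",
        "status": {"error_code": "OK", "error_message": "OK"}
      }
    ]
  }
}
"""

print(available_models(sample))
```

With the payload above, the model name from the `--model_name` argument is listed, which matches the observation that the server itself considers the model loaded.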

However, when I call the completions endpoint, I get a 404:

{
  "error": "Model with requested name is not found"
}
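
Since no reproduction steps were provided, here is a hedged sketch of how the failing call might be reproduced using only the Python standard library. The host and port come from the `--rest_port 8001` argument above. The path `/v3/completions` and the request fields are assumptions based on the linked continuous batching demo, because the report does not say which URL or payload was used. The helper names are illustrative.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8001"  # rest_port from the docker-compose command
MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"  # value passed to --model_name
COMPLETIONS_PATH = "/v3/completions"  # assumption: not stated in the report


def build_completion_request(prompt: str, max_tokens: int = 32) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style completions request."""
    body = json.dumps(
        {"model": MODEL_NAME, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + COMPLETIONS_PATH,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> tuple[int, str]:
    """Send the request and return (status, body), including for HTTP errors."""
    try:
        with urllib.request.urlopen(request, timeout=30) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # A 404 such as the one reported here ends up in this branch.
        return err.code, err.read().decode("utf-8")


# Usage against a running server:
# status, text = send(build_completion_request("Hello"))
# print(status, text)
```

Capturing the status code and the response body together with the exact URL makes it easier for maintainers to tell a wrong path apart from a model that is not registered for that endpoint.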

Step-by-step reproduction

No response

Relevant log output

No response

Issue submission checklist

  • I'm reporting an issue. It's not a question.
  • I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
  • There is reproducer code and related data files such as images, videos, models, etc.
@Iffa-Intel

@fedecompa I also ran into several issues on Windows when attempting the steps in the guide you shared, "How to serve LLM models with Continuous Batching via OpenAI API".

Please note that this demo was officially validated on Intel® Xeon® processors Gen4 and Gen5 and on Intel dGPU ARC and Flex models, running Ubuntu22/24 and RedHat8/9. Other operating systems and hardware might work, but issues are to be expected.

@fedecompa
Author

@Iffa-Intel thanks for the reply.
Actually, the GPU is detected correctly from the Docker container running on WSL2 Ubuntu22.
The model also runs correctly with the OVModelForCausalLM class in Python, locally on Windows:

from optimum.intel import OVModelForCausalLM
model_id = "Fede90/llama-3.2-3b-instruct-INT4"
# Load the model onto the first GPU device
model = OVModelForCausalLM.from_pretrained(model_id, device="GPU.0", trust_remote_code=True)

So it is actually very strange...
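
To make the "GPU is detected" claim easy to verify in both environments, a small check like the following could be run inside the container and on the Windows host. This is only a sketch: `gpu_devices` and `detected_devices` are illustrative helper names, and the OpenVINO import is kept inside a function so the filtering logic can be used even where the `openvino` package is not installed.

```python
def gpu_devices(devices: list[str]) -> list[str]:
    """Keep only GPU entries, e.g. "GPU", "GPU.0", "GPU.1"."""
    return [d for d in devices if d == "GPU" or d.startswith("GPU.")]


def detected_devices() -> list[str]:
    """Ask the OpenVINO runtime which devices it can see."""
    import openvino as ov  # imported lazily; requires the openvino package

    return list(ov.Core().available_devices)


# Usage where OpenVINO is installed:
# print(gpu_devices(detected_devices()))
```

Comparing the output from the container with the output from the host shows whether both environments expose the same GPU device names to the runtime.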

@Iffa-Intel

Iffa-Intel commented Nov 19, 2024

@fedecompa we'll investigate this further and get back to you. It is probably related to architectural differences between WSL2 on Windows and native Ubuntu, which may affect how the OpenVINO library behaves.

Iffa-Intel added the PSE label Nov 19, 2024

2 participants