
Cannot run Continuous batching Demo with GPU #2770

Open
jsapede opened this issue Oct 29, 2024 · 4 comments
Labels
bug Something isn't working

Comments

jsapede commented Oct 29, 2024

Describe the bug
Following the continuous batching demo on GPU, the model server starts but the served model never becomes available: /v1/config returns an empty response, the V3 chat completions endpoint answers "Model with requested name is not found", and the server stops responding after a few minutes.

To Reproduce

The installation is on a Proxmox homelab, in a Debian 12 LXC with GPU passthrough for OpenVINO:

(screenshot of the LXC GPU passthrough configuration)

GPU passthrough usually works well with many other services I use: Immich / Frigate / Jellyfin ...

Steps to reproduce the behavior: as specified in the demo!

First I pulled the Docker image:

docker pull openvino/model_server:latest-gpu

then installed the tools needed to prepare the model:

export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
pip3 install optimum-intel@git+https://github.com/huggingface/optimum-intel.git  openvino-tokenizers[transformers]==2024.4.* openvino==2024.4.* nncf>=2.11.0 "transformers<4.45"

prepared the folders:

root@openai:~# cd /
root@openai:/# mkdir workspace
root@openai:/# cd workspace/

installed huggingface-cli and logged in:

root@openai:/workspace# git config --global credential.helper store
root@openai:/workspace# huggingface-cli login

    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: read).
The token `test4` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `test4`

then converted the tokenizer:

root@openai:/workspace# convert_tokenizer -o Meta-Llama-3-8B-Instruct --utf8_replace_mode replace --with-detokenizer --skip-special-tokens --streaming-detokenizer --not-add-special-tokens meta-llama/Meta-Llama-3-8B-Instruct
Loading Huggingface Tokenizer...
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 51.0k/51.0k [00:00<00:00, 503kB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.09M/9.09M [00:01<00:00, 8.81MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 73.0/73.0 [00:00<00:00, 617kB/s]
Converting Huggingface Tokenizer to OpenVINO...
Saved OpenVINO Tokenizer: Meta-Llama-3-8B-Instruct/openvino_tokenizer.xml, Meta-Llama-3-8B-Instruct/openvino_tokenizer.bin
Saved OpenVINO Detokenizer: Meta-Llama-3-8B-Instruct/openvino_detokenizer.xml, Meta-Llama-3-8B-Instruct/openvino_detokenizer.bin

then ran the optimum-cli export, which produced this first warning of sorts:

root@openai:/workspace# optimum-cli export openvino --disable-convert-tokenizer --model meta-llama/Meta-Llama-3-8B-Instruct --weight-format fp16 Meta-Llama-3-8B-Instruct
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 654/654 [00:00<00:00, 4.85MB/s]
model.safetensors.index.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 23.9k/23.9k [00:00<00:00, 51.8MB/s]
model-00001-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████| 4.98G/4.98G [04:16<00:00, 19.4MB/s]
model-00002-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████| 5.00G/5.00G [04:14<00:00, 19.6MB/s]
model-00003-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████| 4.92G/4.92G [04:18<00:00, 19.0MB/s]
model-00004-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████| 1.17G/1.17G [00:58<00:00, 19.9MB/s]
Downloading shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [13:49<00:00, 207.49s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:26<00:00,  6.59s/it]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 187/187 [00:00<00:00, 1.72MB/s]
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
/usr/local/lib/python3.11/dist-packages/optimum/exporters/openvino/model_patcher.py:497: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if sequence_length != 1:

then created the graph.pbtxt using the GPU template from the demo:

root@openai:/workspace# ls
Meta-Llama-3-8B-Instruct
root@openai:/workspace# cd Meta-Llama-3-8B-Instruct/
root@openai:/workspace/Meta-Llama-3-8B-Instruct# nano graph.pbtxt

then added config.json at the workspace root:

root@openai:/workspace# nano config.json
root@openai:/workspace# 

then ran the container with GPU passthrough:

root@openai:/workspace# docker run --rm --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -p 8000:8000 -v /workspace:/workspace:ro openvino/model_server:latest-gpu --port 9000 --rest_port 8000 --config_path /workspace/config.json
[2024-10-29 07:30:11.317][1][serving][info][server.cpp:75] OpenVINO Model Server 2024.4.28219825c
[2024-10-29 07:30:11.317][1][serving][info][server.cpp:76] OpenVINO backend c3152d32c9c7
[2024-10-29 07:30:11.317][1][serving][info][pythoninterpretermodule.cpp:35] PythonInterpreterModule starting
[2024-10-29 07:30:11.511][1][serving][info][pythoninterpretermodule.cpp:46] PythonInterpreterModule started
[2024-10-29 07:30:11.891][1][modelmanager][info][modelmanager.cpp:125] Available devices for Open VINO: CPU, GPU
[2024-10-29 07:30:11.895][1][serving][info][grpcservermodule.cpp:122] GRPCServerModule starting
[2024-10-29 07:30:11.899][1][serving][info][grpcservermodule.cpp:191] GRPCServerModule started
[2024-10-29 07:30:11.899][1][serving][info][grpcservermodule.cpp:192] Started gRPC server on port 9000
[2024-10-29 07:30:11.899][1][serving][info][httpservermodule.cpp:33] HTTPServerModule starting
[2024-10-29 07:30:11.899][1][serving][info][httpservermodule.cpp:37] Will start 16 REST workers
[evhttp_server.cc : 253] NET_LOG: Entering the event loop ...
[2024-10-29 07:30:11.903][1][serving][info][http_server.cpp:269] REST server listening on port 8000 with 16 threads
[2024-10-29 07:30:11.903][1][serving][info][httpservermodule.cpp:47] HTTPServerModule started
[2024-10-29 07:30:11.903][1][serving][info][httpservermodule.cpp:48] Started REST server at 0.0.0.0:8000
[2024-10-29 07:30:11.903][1][serving][info][servablemanagermodule.cpp:51] ServableManagerModule starting
[2024-10-29 07:30:11.904][1][modelmanager][info][modelmanager.cpp:536] Configuration file doesn't have custom node libraries property.
[2024-10-29 07:30:11.904][1][modelmanager][info][modelmanager.cpp:579] Configuration file doesn't have pipelines property.
[2024-10-29 07:30:11.909][1][serving][info][mediapipegraphdefinition.cpp:419] MediapipeGraphDefinition initializing graph nodes

then tried the V1 API from another machine:

root@proxnok:~# curl http://192.168.0.246:8000/v1/config
{}root@proxnok:~# 

but got an empty response!

The container seems to be working, as there is some heavy activity:

(screenshot showing GPU activity)

but it collapses after a few minutes:

root@proxnok:~# curl http://192.168.0.246:8000/v1/config
curl: (7) Failed to connect to 192.168.0.246 port 8000 after 0 ms: Couldn't connect to server

then tested the V3 API:

root@openai:/workspace# curl http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "max_tokens":30,
    "stream":false,
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is OpenVINO?"
      }
    ]
  }'| jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   345  100    51  100   294   1306   7529 --:--:-- --:--:-- --:--:--  9078
{
  "error": "Model with requested name is not found"

Expected behavior
Run the demo successfully.

Logs
Logs from OVMS, ideally with --log_level DEBUG. Logs from client.
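Only the startup output above was captured. Rerunning the server with debug logging should surface the actual graph-loading error; a sketch, reusing the exact docker run command from above with OVMS's --log_level option appended:

docker run --rm --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -p 8000:8000 -v /workspace:/workspace:ro openvino/model_server:latest-gpu --port 9000 --rest_port 8000 --config_path /workspace/config.json --log_level DEBUG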

Configuration
OVMS version: latest-gpu

OVMS config.json:

{
    "model_config_list": [],
    "mediapipe_config_list": [
        {
            "name": "meta-llama/Meta-Llama-3-8B-Instruct",
            "base_path": "Meta-Llama-3-8B-Instruct"
        }
    ]
}

CPU, accelerator's versions if applicable: OpenVINO with GPU passthrough

Model repository directory structure:

root@openai:/workspace/Meta-Llama-3-8B-Instruct# ls -altr
total 15703012
drwxr-xr-x 3 root root        4096 Oct 29 07:53 ..
-rw-r--r-- 1 root root       27363 Oct 29 07:53 openvino_tokenizer.xml
-rw-r--r-- 1 root root     5697662 Oct 29 07:53 openvino_tokenizer.bin
-rw-r--r-- 1 root root        9369 Oct 29 07:53 openvino_detokenizer.xml
-rw-r--r-- 1 root root     1586526 Oct 29 07:53 openvino_detokenizer.bin
-rw-r--r-- 1 root root         194 Oct 29 08:08 generation_config.json
-rw-r--r-- 1 root root         727 Oct 29 08:08 config.json
-rw-r--r-- 1 root root       50977 Oct 29 08:08 tokenizer_config.json
-rw-r--r-- 1 root root         296 Oct 29 08:08 special_tokens_map.json
-rw-r--r-- 1 root root     9085698 Oct 29 08:08 tokenizer.json
-rw-r--r-- 1 root root 16060522904 Oct 29 08:22 openvino_model.bin
-rw-r--r-- 1 root root     2852847 Oct 29 08:22 openvino_model.xml
-rw-r--r-- 1 root root        1010 Oct 29 08:25 graph.pbtxt
drwxr-xr-x 2 root root        4096 Oct 29 08:25 .

Additional context
graph.pbtxt content:

input_stream: "HTTP_REQUEST_PAYLOAD:input"
output_stream: "HTTP_RESPONSE_PAYLOAD:output"

node: {
  name: "LLMExecutor"
  calculator: "HttpLLMCalculator"
  input_stream: "LOOPBACK:loopback"
  input_stream: "HTTP_REQUEST_PAYLOAD:input"
  input_side_packet: "LLM_NODE_RESOURCES:llm"
  output_stream: "LOOPBACK:loopback"
  output_stream: "HTTP_RESPONSE_PAYLOAD:output"
  input_stream_info: {
    tag_index: 'LOOPBACK:0',
    back_edge: true
  }
  node_options: {
      [type.googleapis.com / mediapipe.LLMCalculatorOptions]: {
          models_path: "./",
          plugin_config: '{}',
          block_size: 16,
          dynamic_split_fuse: false,
          max_num_seqs: 256,
          max_num_batched_tokens:8192,
          cache_size: 8,
          device: "GPU"
      }
  }
  input_stream_handler {
    input_stream_handler: "SyncSetInputStreamHandler",
    options {
      [mediapipe.SyncSetInputStreamHandlerOptions.ext] {
        sync_set {
          tag_index: "LOOPBACK:0"
        }
      }
    }
  }
}
jsapede added the bug label on Oct 29, 2024
dtrawins (Collaborator) commented

@jsapede What kind of GPU do you use on your host? The copied logs from OVMS don't include the error message, but I suspect the model size is too big for the GPU. Try converting the model with lower quantization like int4 and use a smaller cache in graph.pbtxt, e.g. 2 GB.
for example
optimum-cli export openvino --disable-convert-tokenizer --model meta-llama/Meta-Llama-3-8B-Instruct --weight-format int4 Meta-Llama-3-8B-Instruct
and
cache_size: 2,
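
For reference, a sketch of how the node_options block from the graph.pbtxt posted above would look with that change (only cache_size differs; everything else stays as posted):

  node_options: {
      [type.googleapis.com / mediapipe.LLMCalculatorOptions]: {
          models_path: "./",
          plugin_config: '{}',
          block_size: 16,
          dynamic_split_fuse: false,
          max_num_seqs: 256,
          max_num_batched_tokens: 8192,
          cache_size: 2,
          device: "GPU"
      }
  }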


jsapede commented Oct 29, 2024

I'm on an old i5 gen 7.
It works in int4! Thanks!
BTW, I'm a total noob with this, but is the graph common to all available models? I.e., if I want to try a Mistral one, for example?

dtrawins (Collaborator) commented

@jsapede yes, the same graph can be shared with all models.
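
As an illustration (the Mistral model name and folder here are hypothetical, not taken from this thread), the other model would be exported the same way, the same graph.pbtxt copied into its folder, and a second entry added to config.json:

{
    "model_config_list": [],
    "mediapipe_config_list": [
        {
            "name": "meta-llama/Meta-Llama-3-8B-Instruct",
            "base_path": "Meta-Llama-3-8B-Instruct"
        },
        {
            "name": "mistralai/Mistral-7B-Instruct",
            "base_path": "Mistral-7B-Instruct"
        }
    ]
}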


jsapede commented Oct 29, 2024

Well, it looks like I spoke a little too fast ... it workED.
The first tests were OK and I got replies, but since then it's been impossible to get it back to work. I changed nothing ... I tried to rebuild the model from the beginning, but nothing works anymore.

Quite lost, I reinstalled the LXC container from scratch and redid the whole procedure from the beginning, including the int4 corrections, but it's impossible to get the Docker container to work again ...

[EDIT] It seems the container got stuck in a GPU loop; I had to power the whole Proxmox host off and on. I will do further tests tomorrow, thanks for the help!
