Llama3.2 Vision Model: Guides and Issues #8826

simon-mo · 2024-09-25T22:50:46Z

Running the server (using the vLLM CLI or our docker image):

vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --enforce-eager --max-num-seqs 16
vllm serve meta-llama/Llama-3.2-90B-Vision-Instruct --enforce-eager --max-num-seqs 32 --tensor-parallel-size 8

Currently:

Only one leading image is supported. Support for multiple images and interleaving images are work in progress.
Text only inference is supported.
Only NVIDIA GPUs are supported.
Performance is acceptable but to be optimized! We aim at first release to be functionality correct. We will work on making it fast 🏎️

Please see the next steps for better supporting this model on vLLM.

cc @heheda12345 @ywang96

The text was updated successfully, but these errors were encountered:

shermansiu · 2024-09-25T23:30:41Z

What is the ETA for adding support for multiple images? For a project I'm working on, we're trying to see if it's viable to use vLLM for Llama 3.2 in the next few days.

pseudotensor · 2024-09-26T01:54:22Z

What is optimal way to use H100s? I'd guess 8 are not required.

I can get 4*H100's to start up like this:

docker pull vllm/vllm-openai:v0.6.2
docker stop llama31-70b ; docker remove llama31-70b
docker run -d --restart=always \
    --runtime=nvidia \
    --gpus '"device=1,2,3,7"' \
    --shm-size=10.24gb \
    -p 5020:5020 \
        -e NCCL_IGNORE_DISABLED_P2P=1 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:$HOME/.cache/ -v "${HOME}"/.config:$HOME/.config/   -v "${HOME}"/.triton:$HOME/.triton/  \
    --network host \
    --name llama31-70b \
     vllm/vllm-openai:v0.6.2 \
        --port=5020 \
        --host=0.0.0.0 \
        --model=meta-llama/Llama-3.2-90B-Vision-Instruct \
        --seed 1234 \
        --tensor-parallel-size=4 \
        --max-model-len=81920 \
        --max-num-batched-tokens=81920 --max-log-len=100 \
        --limit_mm_per_prompt 'image=1' \
        --enforce-eager --max-num-seqs 8 \
        --gpu-memory-utilization 0.99 \
        --served-model-name meta-llama/Llama-3.2-90B-Vision-Instruct meta-llama/Meta-Llama-3.1-70B-Instruct \
        --download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.llama31_70b.txt

But it fails when used like this. Just for a test TEXT only query, no images. Seems very short queries work, but around 10k input fails.

WARNING 09-25 18:52:22 preprocess.py:86] Falling back on <BOS> for decoder start token id because decoder start token id is not available.
INFO 09-25 18:52:22 engine.py:288] Added request cmpl-7f02d431762140e784491182a7121772-0.
ERROR 09-25 18:52:23 engine.py:157] AssertionError()
ERROR 09-25 18:52:23 engine.py:157] Traceback (most recent call last):
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 155, in start
ERROR 09-25 18:52:23 engine.py:157]     self.run_engine_loop()
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 218, in run_engine_loop
ERROR 09-25 18:52:23 engine.py:157]     request_outputs = self.engine_step()
ERROR 09-25 18:52:23 engine.py:157]                       ^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 236, in engine_step
ERROR 09-25 18:52:23 engine.py:157]     raise e
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 227, in engine_step
ERROR 09-25 18:52:23 engine.py:157]     return self.engine.step()
ERROR 09-25 18:52:23 engine.py:157]            ^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1264, in step
ERROR 09-25 18:52:23 engine.py:157]     outputs = self.model_executor.execute_model(
ERROR 09-25 18:52:23 engine.py:157]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 78, in execute_model
ERROR 09-25 18:52:23 engine.py:157]     driver_outputs = self._driver_execute_model(execute_model_req)
ERROR 09-25 18:52:23 engine.py:157]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 148, in _driver_execute_model
ERROR 09-25 18:52:23 engine.py:157]     return self.driver_worker.execute_model(execute_model_req)
ERROR 09-25 18:52:23 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 09-25 18:52:23 engine.py:157]     output = self.model_runner.execute_model(
ERROR 09-25 18:52:23 engine.py:157]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 09-25 18:52:23 engine.py:157]     return func(*args, **kwargs)
ERROR 09-25 18:52:23 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 223, in execute_model
ERROR 09-25 18:52:23 engine.py:157]     output: SamplerOutput = self.model.sample(
ERROR 09-25 18:52:23 engine.py:157]                             ^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 940, in sample
ERROR 09-25 18:52:23 engine.py:157]     next_tokens = self.sampler(logits, sampling_metadata)
ERROR 09-25 18:52:23 engine.py:157]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-25 18:52:23 engine.py:157]     return self._call_impl(*args, **kwargs)
ERROR 09-25 18:52:23 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-25 18:52:23 engine.py:157]     return forward_call(*args, **kwargs)
ERROR 09-25 18:52:23 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sampler.py", line 274, in forward
ERROR 09-25 18:52:23 engine.py:157]     maybe_deferred_sample_results, maybe_sampled_tokens_tensor = _sample(
ERROR 09-25 18:52:23 engine.py:157]                                                                  ^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sampler.py", line 879, in _sample
ERROR 09-25 18:52:23 engine.py:157]     return _sample_with_torch(
ERROR 09-25 18:52:23 engine.py:157]            ^^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sampler.py", line 819, in _sample_with_torch
ERROR 09-25 18:52:23 engine.py:157]     multinomial_samples[sampling_type] = _multinomial(
ERROR 09-25 18:52:23 engine.py:157]                                          ^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sampler.py", line 627, in _multinomial
ERROR 09-25 18:52:23 engine.py:157]     assert seq_group.generator is not None
ERROR 09-25 18:52:23 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157] AssertionError
(VllmWorkerProcess pid=209) INFO 09-25 18:52:23 multiproc_worker_utils.py:244] Worker exiting
(VllmWorkerProcess pid=211) INFO 09-25 18:52:23 multiproc_worker_utils.py:244] Worker exiting
(VllmWorkerProcess pid=210) INFO 09-25 18:52:23 multiproc_worker_utils.py:244] Worker exiting
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 257, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 253, in wrap
    await func()
  File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 230, in listen_for_disconnect
    message = await receive()
              ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 555, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.12/asyncio/locks.py", line 212, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7ff530597740

During handling of the above exception, another exception occurred:

  + Exception Group Traceback (most recent call last):
  |   File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
  |     result = await app(  # type: ignore[func-returns-value]
  |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
  |     return await self.app(scope, receive, send)
  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1054, in __call__
  |     await super().__call__(scope, receive, send)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 113, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 187, in __call__
  |     raise exc
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 165, in __call__
  |     await self.app(scope, receive, _send)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/cors.py", line 85, in __call__
  |     await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
  |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 62, in wrapped_app
  |     raise exc
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 51, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 715, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 735, in app
  |     await route.handle(scope, receive, send)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 288, in handle
  |     await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 76, in app
  |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 62, in wrapped_app
  |     raise exc
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 51, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 74, in app
  |     await response(scope, receive, send)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 250, in __call__
  |     async with anyio.create_task_group() as task_group:
  |                ^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/usr/local/lib/python3.12/dist-packages/anyio/_backends/_asyncio.py", line 736, in __aexit__
  |     raise BaseExceptionGroup(
  | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 253, in wrap
    |     await func()
    |   File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 242, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 248, in completion_stream_generator
    |     async for prompt_idx, res in result_generator:
    |   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 490, in merge_async_iterators
    |     item = await d
    |            ^^^^^^^
    |   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 486, in _process_request
    |     raise request_output
    | AssertionError
    +------------------------------------
INFO 09-25 18:52:24 multiproc_worker_utils.py:124] Killing local vLLM worker processes
[rank0]:[W925 18:52:26.595501679 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
INFO:     172.16.0.87:35352 - "GET /health HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application

tcapelle · 2024-09-26T09:31:53Z

How can I run the vllm docker image against the meta provided checkpoint (not the huggingface one)?

groccy · 2024-09-26T12:36:25Z

What is the ETA for adding support for multiple images? For a project I'm working on, we're trying to see if it's viable to use vLLM for Llama 3.2 in the next few days.

Same here. If we can know the ETA for adding support for multiple images and interleaving images, it will be truly appreciated. We also have a project that depends on this kind of SOTA VLMs.

Thank you so much for your hard work and great contributions!

heheda12345 · 2024-09-26T17:39:26Z

How can I run the vllm docker image against the meta provided checkpoint (not the huggingface one)?

You can add the logic of loading meta provided checkpoint here. The main work will be changing some names of the parameter. And you still need huggingface's configs.

tcapelle · 2024-09-26T20:21:54Z

I got a HF checkpoint thatnks to some EU friends, managed to serve the model half of today, but now it's dying

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb9eaf03f86 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fb9eaeb2d10 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fb9eafdef08 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fb9ec1fb3e6 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fb9ec200600 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fb9ec2072ba in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb9ec2096fc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd6df4 (0x7fba399addf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7fba3abc6609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7fba3ad00353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb9eaf03f86 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fb9eaeb2d10 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fb9eafdef08 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fb9ec1fb3e6 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fb9ec200600 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fb9ec2072ba in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb9ec2096fc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd6df4 (0x7fba399addf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7fba3abc6609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7fba3ad00353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb9eaf03f86 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7fb9ebe92a84 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7fba399addf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7fba3abc6609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7fba3ad00353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [1]
/usr/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown

sfc-gh-zhwang · 2024-09-26T23:10:19Z

What is the general memory requirement for the model. Seems that 90b model is not able to run on an 8x40GB A100 machine.

varoudis · 2024-09-27T10:39:45Z

What is the general memory requirement for the model. Seems that 90b model is not able to run on an 8x40GB A100 machine.

pif... imagine I tried to run it on 8 A30 :(

DarkLight1337 · 2024-09-27T12:51:57Z

If you run into OOM issues, please reduce max_model_len and/or max_num_seqs, as described in the example script.

ywang96 · 2024-09-27T19:56:05Z

Thank you all for trying out Llama 3.2 vision model on vLLM!

As you may already know, multimodal Llama 3.2 is quite different from other LlaVA-style VLMs that we currently support on vLLM as it involves cross-attention layers in the language model, and below are the next steps and their priorities for features and optimizations related to this new architecture.

P0

Cross attention mask:
- To properly support multi/interleaved image inference, a custom cross attention mask is required but passing attention mask to attention module is not supported on vLLM today. Help wanted!
- [Model] Make llama3.2 support multiple and interleaved images #9095
Address KV cache waste issue:
- We need better memory management for cross attention/encoder-decoder models since the page size cannot be specified separately.
Optimize inference of mixed requests (text-only & text + image):
- Since the model forward pass changes dynamically to the input (text only or text + image), we needed to add additional dummy image + mask for text-only sequences so that mixed requests can be batched
- As a result, there's significant room for optimizing both compute & space efficiency.
- This is less of an issue for offline inference (where typically requests are either all text-only or all text + image), but more for the online serving case.

P1

Attention backends other than xformers for cross attention
Enabling cuda graph

P2

Image embeddings as input
Frontend changes for multi/interleave images
- [Frontend] Enable Online Multi-image Support for MLlama #9393
Add xformers implementation for vision encoder attention

pseudotensor · 2024-09-27T20:27:14Z

Any thoughts on my issue or should I post real separate issue?

ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sampler.py", line 819, in _sample_with_torch
ERROR 09-25 18:52:23 engine.py:157]     multinomial_samples[sampling_type] = _multinomial(
ERROR 09-25 18:52:23 engine.py:157]                                          ^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sampler.py", line 627, in _multinomial
ERROR 09-25 18:52:23 engine.py:157]     assert seq_group.generator is not None
ERROR 09-25 18:52:23 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157] AssertionError

#8826 (comment)

njhill · 2024-09-27T20:28:49Z

@pseudotensor fix is here #8870.

AmericanPresidentJimmyCarter · 2024-09-27T23:52:46Z

It's a good first attempt but as it currently is it's basically unusable. Without caching the cross attention projections it is insanely slow, probably about 1/20th the speed it should be.

heheda12345 · 2024-09-28T00:22:40Z

It's a good first attempt but as it currently is it's basically unusable. Without caching the cross attention projections it is insanely slow, probably about 1/20th the speed it should be.

Can you provide more information for us to reproduce your problem? e.g., your benchmark script & your data.

AmericanPresidentJimmyCarter · 2024-09-28T02:14:18Z

I don't think any of the implementations currently have the cross attention projection caches? But for inference, it looks like the outputs of the cross attention kv projections for attending to the image can be cached after the first token and reused for all subsequent inference steps without being recalculated each time (the image doesn't change).

vllm/vllm/model_executor/models/mllama.py

Lines 674 to 707 in bd429f2

    
           def forward( 
        
               self, 
        
               hidden_states: torch.Tensor, 
        
               attention_mask: Optional[torch.Tensor], 
        
               cross_attention_states: Optional[torch.Tensor], 
        
               kv_cache: torch.Tensor, 
        
               attn_metadata: AttentionMetadata, 
        
           ) -> torch.Tensor: 
        
               qkv_dec, _ = self.qkv_proj(hidden_states) 
        
               q, _, _ = qkv_dec.split( 
        
                   [self.q_local_size, self.kv_local_size, self.kv_local_size], 
        
                   dim=-1) 
        
               if cross_attention_states is None: 
        
                   k = None 
        
                   v = None 
        
               else: 
        
                   qkv_enc, _ = self.qkv_proj(cross_attention_states) 
        
                   _, k, v = qkv_enc.split( 
        
                       [self.q_local_size, self.kv_local_size, self.kv_local_size], 
        
                       dim=-1) 
        
                   k = k.view(-1, self.num_local_key_value_heads, self.head_dim) 
        
                   v = v.view(-1, self.num_local_key_value_heads, self.head_dim) 
        
                   k = self.k_norm(k) 
        
               q = q.view(-1, self.num_local_heads, self.head_dim) 
        
               q = self.q_norm(q) 
        
               output = self.attn(q, 
        
                                  k, 
        
                                  v, 
        
                                  kv_cache, 
        
                                  attn_metadata, 
        
                                  attn_type=AttentionType.ENCODER_DECODER) 
        
               out, _ = self.o_proj(output) 
        
               return out

It's possible I am misunderstanding the code, but it looks like it repeats them each time even though the output should be the same?

ywang96 · 2024-09-28T02:24:04Z

I don't think any of the implementations currently have the cross attention projection caches? But for inference, it looks like the outputs of the cross attention kv projections for attending to the image can be cached after the first token and reused for all subsequent inference steps without being recalculated each time (the image doesn't change).

vllm/vllm/model_executor/models/mllama.py

Lines 674 to 707 in bd429f2

def forward(

self,

hidden_states: torch.Tensor,

attention_mask: Optional[torch.Tensor],

cross_attention_states: Optional[torch.Tensor],

kv_cache: torch.Tensor,

attn_metadata: AttentionMetadata,

) -> torch.Tensor:

qkv_dec, _ = self.qkv_proj(hidden_states)

q, _, _ = qkv_dec.split(

[self.q_local_size, self.kv_local_size, self.kv_local_size],

dim=-1)

if cross_attention_states is None:

k = None

v = None

else:

qkv_enc, _ = self.qkv_proj(cross_attention_states)

_, k, v = qkv_enc.split(

[self.q_local_size, self.kv_local_size, self.kv_local_size],

dim=-1)

k = k.view(-1, self.num_local_key_value_heads, self.head_dim)

v = v.view(-1, self.num_local_key_value_heads, self.head_dim)

k = self.k_norm(k)

q = q.view(-1, self.num_local_heads, self.head_dim)

q = self.q_norm(q)

output = self.attn(q,

k,

v,

kv_cache,

attn_metadata,

attn_type=AttentionType.ENCODER_DECODER)

out, _ = self.o_proj(output)

return out

It's possible I am misunderstanding the code, but it looks like it repeats them each time even though the output should be the same?

I suggest you take a look at the actual attention module implementation here to understand how attention with KV cache works.

vllm/vllm/attention/layer.py

Lines 15 to 27 in bd429f2

    
           class Attention(nn.Module): 
        
               """Attention layer. 
        
               This class takes query, key, and value tensors as input. The input tensors 
        
               can either contain prompt tokens or generation tokens. 
        
               The class does the following: 
        
               1. Store the input key and value tensors in the KV cache. 
        
               2. Perform (multi-head/multi-query/grouped-query) attention. 
        
               3. Return the output tensor. 
        
               """ 
        
               def __init__(

AmericanPresidentJimmyCarter · 2024-09-28T03:25:30Z

OK -- I misunderstood. I think my previous speed issues might have been a skill issue on my part and I dug through the other issues to try to figure out which command line arguments I was missing. Running

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --gpu-memory-utilization 0.95 --model=meta-llama/Llama-3.2-11B-Vision-Instruct --tokenizer=meta-llama/Llama-3.2-11B-Vision-Instruct --download-dir="/home/user/storage/hf_cache" --dtype=bfloat16 --device=cuda --host=127.0.0.1 --port=30000 --max-model-len=8192 --quantization="fp8" --enforce_eager --max_num_seqs=8

On 3x 3090s is giving me:

Average caption generation time: 40.10s over 166 captions
Recent captions per second: 1.30
Average caption generation time: 43.22s over 235 captions
Recent captions per second: 1.15
Average caption generation time: 44.14s over 308 captions
Recent captions per second: 1.22
Average caption generation time: 45.89s over 368 captions
Recent captions per second: 1.00

AmericanPresidentJimmyCarter · 2024-09-28T06:06:45Z

Found another weird thing with vllm and LLaMA 3.2 11b, or possibly another failure of myself to read the docs. If I have --enable_chunked-prefill=false on, the model will batch 4-5 requests as running at the same time. If I don't supply this argument, it will only process 1 request at a time and be significantly slower. With that argument I go from about 2.4 sec/caption to 2.1 sec/caption on a 3090.

Searcherr · 2024-09-28T11:33:43Z

Found another weird thing with vllm and LLaMA 3.2 11b, or possibly another failure of myself to read the docs. If I have --enable_chunked-prefill=false on, the model will batch 4-5 requests as running at the same time. If I don't supply this argument, it will only process 1 request at a time and be significantly slower. With that argument I go from about 2.4 sec/caption to 2.1 sec/caption on a 3090.

May I ask you to share vLLM argument for the model deployment that you used?

AmericanPresidentJimmyCarter · 2024-09-28T13:24:12Z

It's above.

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --gpu-memory-utilization 0.95 --model=meta-llama/Llama-3.2-11B-Vision-Instruct --tokenizer=meta-llama/Llama-3.2-11B-Vision-Instruct --download-dir="/home/user/storage/hf_cache" --dtype=bfloat16 --device=cuda --host=127.0.0.1 --port=30000 --max-model-len=8192 --quantization="fp8" --enforce_eager --max_num_seqs=8 --enable_chunked-prefill=false

AmericanPresidentJimmyCarter · 2024-09-28T13:25:36Z

I'm still having other issues too. vllm will hang randomly sometimes without errors and stop responding to requests after a few hours. :(

Searcherr · 2024-09-28T17:03:11Z

It's above.

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --gpu-memory-utilization 0.95 --model=meta-llama/Llama-3.2-11B-Vision-Instruct --tokenizer=meta-llama/Llama-3.2-11B-Vision-Instruct --download-dir="/home/user/storage/hf_cache" --dtype=bfloat16 --device=cuda --host=127.0.0.1 --port=30000 --max-model-len=8192 --quantization="fp8" --enforce_eager --max_num_seqs=8 --enable_chunked-prefill=false

Perfectly worked for me. Thank you so much!
Going to tests on Monday. Or tomorrow if the weather become terrible :)))

AmericanPresidentJimmyCarter · 2024-09-28T17:06:48Z

Perfectly worked for me. Thank you so much! Going to tests on Monday. Or tomorrow if the weather become terrible :)))

Unfortunately performance for me degrades over time and I don't know why. After about an hour of giving it continuous requests throughput falls to about half the original rate. It's very reproducible at least.

edit: After going through the logs, I wonder if it's another issue with the HTTP server.

INFO 09-28 17:21:02 metrics.py:351] Avg prompt throughput: 79.7 tokens/s, Avg generation throughput: 120.6 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 64.9%, CPU KV cache usage: 0.0%.

On the slow gpu, it stops receiving new requests ahead of time and Pending: remains at zero always. Two other gpus on the same system have pending stuck full as I'm constantly shipping new requests to them. So why one server gets stuck and doesn't seem to queue the new requests, I am not sure.

AmericanPresidentJimmyCarter · 2024-09-28T21:06:31Z

Ok, I give up. The one server (which is running on localhost) stops processing requests randomly. It results in long stretches of idle time where no work is being done, rendering the openai server unusable.

tcapelle · 2024-09-28T21:40:01Z

Just to report that I have been serving 90b from HF checkpoint without issues (besides when trying to use instructor on top that killed the server).
It's not super fast (~35tok/s) but can handle multiple requests in parallel.
Hardware 8xh100.
I have served more than 10M tokens already ;)

AmericanPresidentJimmyCarter · 2024-09-29T01:28:33Z

I tried offline inference mode with batch size 8 on a 3090, again it gets hung randomly and stops processing batches.

On ctrl+c it shows it is stuck in return len(self._output_token_ids) + len(self._prompt_token_ids).

Traceback (most recent call last):                                              
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()                                                                  
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run       
    self._target(*self._args, **self._kwargs)                                   
  File "/home/user/Programs/vllm/caption_vllm.py", line 147, in worker    
    outputs = llm.generate(inputs, sampling_params=sampling_params)             
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/utils.py", line 1047, in inner             
    return fn(*args, **kwargs)                                                  
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 388, in generate                                               
    outputs = self._run_engine(use_tqdm=use_tqdm)                               
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 877, in _run_engine
    step_outputs = self.llm_engine.step()                                   
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1217, in step                                                
    ) = self.scheduler[virtual_engine].schedule()
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/core/scheduler.py", line 1189, in schedule 
    scheduler_outputs = self._schedule()                                        
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/core/scheduler.py", line 1154, in _schedule                                              
    return self._schedule_default()                                             
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/core/scheduler.py", line 989, in _schedule_default                          
    prefills = self._schedule_prefills(budget,                                  
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/core/scheduler.py", line 903, in _schedule_prefills                                      
    can_allocate = self.block_manager.can_allocate(seq_group)                   
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/core/block_manager_v1.py", line 290, in can_allocate      
    self_num_required_blocks = self._get_seq_num_required_blocks(               
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/core/block_manager_v1.py", line 282, in _get_seq_num_required_blocks
    return 0 if seq is None else seq.n_blocks                                   
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/sequence.py", line 452, in n_blocks                                                      
    return (self.get_len() + self.block_size - 1) // self.block_size            
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/sequence.py", line 557, in get_len                                                       
    return self.data.get_len()                                                  
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/sequence.py", line 273, in get_len          
    return len(self._output_token_ids) + len(self._prompt_token_ids)            
KeyboardInterrupt

vllm was instantiated like:

    model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"
    llm = LLM(
        model=model_name,
        max_model_len=8192,  # Adjust as needed
        max_num_seqs=8,      # Adjust as needed
        enforce_eager=True,
        quantization="fp8",
        gpu_memory_utilization=0.98,
    )

It was also much slower (~7s/caption) versus the openai server.

samos123 · 2024-09-29T02:30:56Z

I am getting the following error:

ERROR 09-28 19:27:59 async_llm_engine.py:61] RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
Exception in callback functools.partial(<function _log_task_completion at 0x7c49e15f6e80>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7c49dda27aa0>>)
h

Full logs: https://gist.github.com/samos123/43db2724d7ac1bba44a379ce300b156b

I reduced memory requirements and afterwards got a new error: https://gist.github.com/samos123/ee858936496e1d314785719f9287230a

Update: I only get the issue when I had --kv-cache-dtype=fp8. After I removed that flag it started working.

ywang96 · 2024-09-29T06:54:48Z

I am getting the following error:
ERROR 09-28 19:27:59 async_llm_engine.py:61] RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
Exception in callback functools.partial(<function _log_task_completion at 0x7c49e15f6e80>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7c49dda27aa0>>)
h
Full logs: https://gist.github.com/samos123/43db2724d7ac1bba44a379ce300b156b

I reduced memory requirements and afterwards got a new error: https://gist.github.com/samos123/ee858936496e1d314785719f9287230a

Update: I only get the issue when I had --kv-cache-dtype=fp8. After I removed that flag it started working.

@samos123 Yea I'm almost sure fp8 for kv cache isn't working with cross attention yet.

FWIW, the command below works for me on 1xH100

 vllm serve neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic --enforce-eager --max-num-seqs 32

AmericanPresidentJimmyCarter · 2024-09-29T16:10:15Z

I did get it to run stably overnight at 5 seconds/caption on 3090s at --gpu-memory-utilization 0.9 using the openai server.

The problem I'm seeing is that performance degrades very quickly because vllm stops doing batches together. I'm not sure how that is determined (based on kv-cache occupancy?). So it will start out going approximately double the speed then degrade in about 20-30 minutes.

At startup:

INFO 09-29 16:03:38 metrics.py:351] Avg prompt throughput: 45.9 tokens/s, Avg generation throughput: 166.1 tokens/s, Running: 4 reqs, Swapped: 0 reqs, Pending: 11 reqs, GPU KV cache usage: 94.2%, CPU KV cache usage: 0.0%.

After some time:

INFO 09-29 16:06:33 metrics.py:351] Avg prompt throughput: 59.0 tokens/s, Avg generation throughput: 44.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 15 reqs, GPU KV cache usage: 94.8%, CPU KV cache usage: 0.0%.

Setting --gpu-memory-utilization higher seems to make vllm unstable. Turning off most of the logs seems to have helped maintain the Pending: requests at a high amount so they don't fall to zero anymore, but the performance degradation issue over time persists.

edit: I tried the same run command with L40s and it reliably does batch sizes of 8 at around 0.9 seconds/caption. So it seems to be an issue with lower VRAM cards.

The FastAPI does indeed have another problem, if you send too many requests at once they all seem to timeout. I was only able to get the L40s to uses a consistent batch size of 8 by carefully trickling in requests, which seems less than ideal. Maybe the FastAPI server needs more threads/processes to handle incoming requests efficiently.

heheda12345 · 2024-09-29T19:31:45Z

The problem I'm seeing is that performance degrades very quickly because vllm stops doing batches together. I'm not sure how that is determined (based on kv-cache occupancy?). So it will start out going approximately double the speed then degrade in about 20-30 minutes.

@AmericanPresidentJimmyCarter Yes, the scheduling is based on kv-cache occupancy. GPU KV cache usage: 94.8% so I guess that the reason it does not batch more request is that the kv cache does not have enough space.

AmericanPresidentJimmyCarter · 2024-09-29T19:56:33Z

That makes sense. It seems like a lot of the kv cache may stick around too long to be useful, as when you first start up the server it runs almost twice as fast (and does 3-4x simultaneous requests instead of 1). It would be nice if there was a way to occasionally purge the kv cache like an endpoint I could poll or an argument.

AmericanPresidentJimmyCarter · 2024-09-30T04:31:05Z

For anyone perplexed about this: you might need to tweak --max_num_seqs. For 3090/4090 I need to set it at 4. After countless hours this weekend it seems that this variable controls the number of simultaneous batches and also the quantity of the kv cache filled with each item added to the async engine for generate. Maybe a VLLM dev can chime in on exactly what this is and why it's critical. From reading the source code it seems like --max_num_seqs might be mean "maximum async engine batch size to run", but why that has any impact on kv cache I don't understand. The kv cache was blowing up before while the model was processing exactly the same batch sizes at --max_num_seqs 8 (even with it set this high, VLLM would only run a maximum of 4-5 items per batch anyway).

edit: Go here: #2492

tcapelle · 2024-09-30T08:05:06Z

Just FYI, on a 8xH100 node I am getting:

INFO 09-30 01:04:27 metrics.py:351] Avg prompt throughput: 209.3 tokens/s, Avg generation throughput│
: 44.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV

destino92 · 2024-09-30T17:47:04Z

How can i deploy this to aws it is for an experiment and i am really new to AI ops, can anyone assist me please?

varoudis · 2024-10-02T08:43:57Z

Just FYI, on a 8xH100 node I am getting:

INFO 09-30 01:04:27 metrics.py:351] Avg prompt throughput: 209.3 tokens/s, Avg generation throughput│
: 44.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV

whats the vram usage?

tcapelle · 2024-10-02T13:47:38Z

~60-70GB

joe-schwartz-certara · 2024-10-04T20:57:41Z

https://docs.vllm.ai/en/stable/models/vlm.html

Am I correct the implication of the above guide is that I can do online inference ONLY with the /v1/chat/completions endpoint and ONLY with a url for the image? Is there another way to supply an image to the model server besides a web url?

joe-schwartz-certara · 2024-10-04T21:00:45Z

Btw I am having no problems serving https://huggingface.co/neuralmagic/Llama-3.2-90B-Vision-Instruct-FP8-dynamic on 2 A100s. Text inference and text+single image inference both work great but is very slow. Looking forward to the optimized implementation :)

AmericanPresidentJimmyCarter · 2024-10-04T22:38:20Z

https://docs.vllm.ai/en/stable/models/vlm.html

Am I correct the implication of the above guide is that I can do online inference ONLY with the /v1/chat/completions endpoint and ONLY with a url for the image? Is there another way to supply an image to the model server besides a web url?

base64

Searcherr · 2024-10-05T04:51:02Z

Indeed, Here is a guide from OpenAI https://platform.openai.com/docs/guides/vision

…

On Sat, Oct 5, 2024, 00:38 39th president of the United States, probably < ***@***.***> wrote: https://docs.vllm.ai/en/stable/models/vlm.html Am I correct the implication of the above guide is that I can do online inference ONLY with the /v1/chat/completions endpoint and ONLY with a url for the image? Is there another way to supply an image to the model server besides a web url? base64 — Reply to this email directly, view it on GitHub <#8826 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQGGY3NPGVNR6EQVZNYR553ZZ4KHJAVCNFSM6AAAAABO3TDIGWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGOJUG42TIOBQGI> . You are receiving this because you commented.Message ID: ***@***.***>

javed828 · 2024-10-05T18:27:45Z

i m using

CUDA_VISIBLE_DEVICES=0,1 python3 -m vllm.entrypoints.openai.api_server
--gpu-memory-utilization 0.95 --model=meta-llama/Meta-Llama-3.1-8B-Instruct
--tokenizer=meta-llama/Meta-Llama-3.1-8B-Instruct
--download-dir="/home/user/storage/hf_cache" --dtype=float16 --device=cuda
--host=0.0.0.0 --port=30000 --max-model-len=8192
--enforce_eager --max_num_seqs=8
--enable_chunked-prefill=false --tensor_parallel_size=2

i m using 2 nvidia GPU when sending 1 request woking perfect. but when sending parellel 2 request then only 1 is processing and other is blocking till first process not complete. How can i process parellel request. Can anyone help on it. i m new in it

heheda12345 · 2024-10-05T20:04:40Z

What is the ETA for adding support for multiple images? For a project I'm working on, we're trying to see if it's viable to use vLLM for Llama 3.2 in the next few days.

We are working in this pr. #9095

tcapelle · 2024-10-07T06:34:17Z

We are working in this pr. #9095

I don't think that this is the issue, @javed828 is asking about parallel requests.

Are you sure you are sending them in parallel? I had no issues running up to 10 parallel requests (I was on 8xH100).

shermansiu · 2024-10-07T06:36:48Z

I don't think that was in reply to javed828's comment: that was in reply to the comment I made about 2 weeks ago at the beginning of the thread.

javed828 · 2024-10-07T08:52:50Z

@tcapelle yes i m hitting parellel and i m using nvidia T4

Archmilio · 2024-10-08T11:20:52Z

The GPU blocks count is set to be very small compared to the VRAM size. Does anyone have the same issue?
Block Size is 16
11B, 1 GPU (80GB * 1ea)
INFO 10-08 04:16:45 gpu_executor.py:122] # GPU blocks: 1264, # CPU blocks: 1638
90B, 4GPU (80GB * 4ea)
INFO 10-08 04:17:17 distributed_gpu_executor.py:57] # GPU blocks: 2327, # CPU blocks: 2621

sjuxax · 2024-10-13T23:08:23Z

docs.vllm.ai/en/stable/models/vlm.html

Am I correct the implication of the above guide is that I can do online inference ONLY with the /v1/chat/completions endpoint and ONLY with a url for the image? Is there another way to supply an image to the model server besides a web url?

It's pretty easy to hack in support for file:// URIs. I think it makes sense for vLLM not to upstream it for security purposes/related shenanigans, but I threw together a simple patch that worked well in my limited testing. I'll upload it in a sec.

Edit: commit here

keyword1983 · 2024-10-21T02:07:29Z

u can try to reduce your max_seq_num for setting parms . u can discover GPU blocks can increase when u decrease max seq num. i think a lot of memory be preserved for mm_tokens in each seq.

The GPU blocks count is set to be very small compared to the VRAM size. Does anyone have the same issue? Block Size is 16 11B, 1 GPU (80GB * 1ea) INFO 10-08 04:16:45 gpu_executor.py:122] # GPU blocks: 1264, # CPU blocks: 1638 90B, 4GPU (80GB * 4ea) INFO 10-08 04:17:17 distributed_gpu_executor.py:57] # GPU blocks: 2327, # CPU blocks: 2621

Cognitus-Stuti · 2024-10-28T16:19:45Z

How can we optimize llama-3.2-11b to run on 4 T4 GPUs,
vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --max-model-len 6000 --enforce-eager --max-num-seqs 5 --port 8000 --dtype half --tensor-parallel-size 4 --gpu-memory-utilization 0.97

Avg tokens per second is 2

Cognitus-Stuti · 2024-10-28T16:21:09Z

@tcapelle yes i m hitting parellel and i m using nvidia T4

@javed828 could you please share your serve command here

simon-mo added misc and removed misc labels Sep 25, 2024

simon-mo changed the title ~~Llama3.2 Vision Guides and Issues~~ Llama3.2 Vision Model: Guides and Issues Sep 25, 2024

ywang96 mentioned this issue Sep 27, 2024

[RFC]: Multi-modality Support Refactoring #4194

Open

69 tasks

This was referenced Oct 14, 2024

[Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage #9352

Merged

[Bug]: LLAMA 3.2 11B Vision Instruct Model not Running in VLLM 0.6.2 #9341

Closed

Llama3.2 Vision Model: Guides and Issues #8826

Llama3.2 Vision Model: Guides and Issues #8826

Comments

simon-mo commented Sep 25, 2024 • edited by ywang96 Loading

shermansiu commented Sep 25, 2024

pseudotensor commented Sep 26, 2024 • edited Loading

tcapelle commented Sep 26, 2024

groccy commented Sep 26, 2024

heheda12345 commented Sep 26, 2024

tcapelle commented Sep 26, 2024 • edited Loading

sfc-gh-zhwang commented Sep 26, 2024

varoudis commented Sep 27, 2024

DarkLight1337 commented Sep 27, 2024

ywang96 commented Sep 27, 2024 • edited Loading

pseudotensor commented Sep 27, 2024

njhill commented Sep 27, 2024 • edited Loading

AmericanPresidentJimmyCarter commented Sep 27, 2024

heheda12345 commented Sep 28, 2024

AmericanPresidentJimmyCarter commented Sep 28, 2024

ywang96 commented Sep 28, 2024

AmericanPresidentJimmyCarter commented Sep 28, 2024 • edited Loading

AmericanPresidentJimmyCarter commented Sep 28, 2024 • edited Loading

Searcherr commented Sep 28, 2024

AmericanPresidentJimmyCarter commented Sep 28, 2024 • edited Loading

AmericanPresidentJimmyCarter commented Sep 28, 2024

Searcherr commented Sep 28, 2024

AmericanPresidentJimmyCarter commented Sep 28, 2024 • edited Loading

AmericanPresidentJimmyCarter commented Sep 28, 2024

tcapelle commented Sep 28, 2024

AmericanPresidentJimmyCarter commented Sep 29, 2024 • edited Loading

samos123 commented Sep 29, 2024 • edited Loading

ywang96 commented Sep 29, 2024

AmericanPresidentJimmyCarter commented Sep 29, 2024 • edited Loading

heheda12345 commented Sep 29, 2024

AmericanPresidentJimmyCarter commented Sep 29, 2024

AmericanPresidentJimmyCarter commented Sep 30, 2024 • edited Loading

tcapelle commented Sep 30, 2024

destino92 commented Sep 30, 2024

varoudis commented Oct 2, 2024 • edited Loading

tcapelle commented Oct 2, 2024

joe-schwartz-certara commented Oct 4, 2024 • edited Loading

joe-schwartz-certara commented Oct 4, 2024

AmericanPresidentJimmyCarter commented Oct 4, 2024

Searcherr commented Oct 5, 2024 via email

javed828 commented Oct 5, 2024

heheda12345 commented Oct 5, 2024

tcapelle commented Oct 7, 2024 • edited Loading

shermansiu commented Oct 7, 2024

javed828 commented Oct 7, 2024 • edited Loading

Archmilio commented Oct 8, 2024 • edited Loading

sjuxax commented Oct 13, 2024 • edited Loading

keyword1983 commented Oct 21, 2024 • edited Loading

Cognitus-Stuti commented Oct 28, 2024

Cognitus-Stuti commented Oct 28, 2024

simon-mo commented Sep 25, 2024 •

edited by ywang96

Loading

pseudotensor commented Sep 26, 2024 •

edited

Loading

tcapelle commented Sep 26, 2024 •

edited

Loading

ywang96 commented Sep 27, 2024 •

edited

Loading

njhill commented Sep 27, 2024 •

edited

Loading

AmericanPresidentJimmyCarter commented Sep 28, 2024 •

edited

Loading

AmericanPresidentJimmyCarter commented Sep 28, 2024 •

edited

Loading

AmericanPresidentJimmyCarter commented Sep 28, 2024 •

edited

Loading

AmericanPresidentJimmyCarter commented Sep 28, 2024 •

edited

Loading

AmericanPresidentJimmyCarter commented Sep 29, 2024 •

edited

Loading

samos123 commented Sep 29, 2024 •

edited

Loading

AmericanPresidentJimmyCarter commented Sep 29, 2024 •

edited

Loading

AmericanPresidentJimmyCarter commented Sep 30, 2024 •

edited

Loading

varoudis commented Oct 2, 2024 •

edited

Loading

joe-schwartz-certara commented Oct 4, 2024 •

edited

Loading

tcapelle commented Oct 7, 2024 •

edited

Loading

javed828 commented Oct 7, 2024 •

edited

Loading

Archmilio commented Oct 8, 2024 •

edited

Loading

sjuxax commented Oct 13, 2024 •

edited

Loading

keyword1983 commented Oct 21, 2024 •

edited

Loading