Llama3.2 Vision Model: Guides and Issues #8826

Open
Tracked by #4194
simon-mo opened this issue Sep 25, 2024 · 50 comments

@simon-mo (Collaborator) commented Sep 25, 2024

Running the server (using the vLLM CLI or our docker image):

  • vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --enforce-eager --max-num-seqs 16
  • vllm serve meta-llama/Llama-3.2-90B-Vision-Instruct --enforce-eager --max-num-seqs 32 --tensor-parallel-size 8
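
Once the server is up, it can be queried through the OpenAI-compatible /v1/chat/completions endpoint. A minimal client-side sketch (assuming the default port 8000, the openai Python package, and a placeholder image URL):

    # Minimal sketch; localhost:8000 and the image URL are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="meta-llama/Llama-3.2-11B-Vision-Instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/sample.jpg"}},
            ],
        }],
        max_tokens=128,
    )
    print(response.choices[0].message.content)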

Currently:

  • Only one leading image is supported. Support for multiple images and interleaved images is a work in progress.
  • Text-only inference is supported.
  • Only NVIDIA GPUs are supported.
  • Performance is acceptable but still to be optimized. For the first release we aim for functional correctness; we will then work on making it fast 🏎️

Please see below for the next steps to better support this model in vLLM.

cc @heheda12345 @ywang96

@shermansiu

What is the ETA for adding support for multiple images? For a project I'm working on, we're trying to see if it's viable to use vLLM for Llama 3.2 in the next few days.

@pseudotensor commented Sep 26, 2024

What is the optimal way to use H100s? I'd guess 8 are not required.

I can get 4×H100s to start up like this:

docker pull vllm/vllm-openai:v0.6.2
docker stop llama31-70b ; docker remove llama31-70b
docker run -d --restart=always \
    --runtime=nvidia \
    --gpus '"device=1,2,3,7"' \
    --shm-size=10.24gb \
    -p 5020:5020 \
        -e NCCL_IGNORE_DISABLED_P2P=1 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:$HOME/.cache/ -v "${HOME}"/.config:$HOME/.config/   -v "${HOME}"/.triton:$HOME/.triton/  \
    --network host \
    --name llama31-70b \
     vllm/vllm-openai:v0.6.2 \
        --port=5020 \
        --host=0.0.0.0 \
        --model=meta-llama/Llama-3.2-90B-Vision-Instruct \
        --seed 1234 \
        --tensor-parallel-size=4 \
        --max-model-len=81920 \
        --max-num-batched-tokens=81920 --max-log-len=100 \
        --limit_mm_per_prompt 'image=1' \
        --enforce-eager --max-num-seqs 8 \
        --gpu-memory-utilization 0.99 \
        --served-model-name meta-llama/Llama-3.2-90B-Vision-Instruct meta-llama/Meta-Llama-3.1-70B-Instruct \
        --download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.llama31_70b.txt

But it fails when used like this, even for a test TEXT-only query with no images. Very short queries seem to work, but around 10k tokens of input fails.

WARNING 09-25 18:52:22 preprocess.py:86] Falling back on <BOS> for decoder start token id because decoder start token id is not available.
INFO 09-25 18:52:22 engine.py:288] Added request cmpl-7f02d431762140e784491182a7121772-0.
ERROR 09-25 18:52:23 engine.py:157] AssertionError()
ERROR 09-25 18:52:23 engine.py:157] Traceback (most recent call last):
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 155, in start
ERROR 09-25 18:52:23 engine.py:157]     self.run_engine_loop()
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 218, in run_engine_loop
ERROR 09-25 18:52:23 engine.py:157]     request_outputs = self.engine_step()
ERROR 09-25 18:52:23 engine.py:157]                       ^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 236, in engine_step
ERROR 09-25 18:52:23 engine.py:157]     raise e
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 227, in engine_step
ERROR 09-25 18:52:23 engine.py:157]     return self.engine.step()
ERROR 09-25 18:52:23 engine.py:157]            ^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1264, in step
ERROR 09-25 18:52:23 engine.py:157]     outputs = self.model_executor.execute_model(
ERROR 09-25 18:52:23 engine.py:157]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 78, in execute_model
ERROR 09-25 18:52:23 engine.py:157]     driver_outputs = self._driver_execute_model(execute_model_req)
ERROR 09-25 18:52:23 engine.py:157]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 148, in _driver_execute_model
ERROR 09-25 18:52:23 engine.py:157]     return self.driver_worker.execute_model(execute_model_req)
ERROR 09-25 18:52:23 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 09-25 18:52:23 engine.py:157]     output = self.model_runner.execute_model(
ERROR 09-25 18:52:23 engine.py:157]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 09-25 18:52:23 engine.py:157]     return func(*args, **kwargs)
ERROR 09-25 18:52:23 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 223, in execute_model
ERROR 09-25 18:52:23 engine.py:157]     output: SamplerOutput = self.model.sample(
ERROR 09-25 18:52:23 engine.py:157]                             ^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama.py", line 940, in sample
ERROR 09-25 18:52:23 engine.py:157]     next_tokens = self.sampler(logits, sampling_metadata)
ERROR 09-25 18:52:23 engine.py:157]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-25 18:52:23 engine.py:157]     return self._call_impl(*args, **kwargs)
ERROR 09-25 18:52:23 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-25 18:52:23 engine.py:157]     return forward_call(*args, **kwargs)
ERROR 09-25 18:52:23 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sampler.py", line 274, in forward
ERROR 09-25 18:52:23 engine.py:157]     maybe_deferred_sample_results, maybe_sampled_tokens_tensor = _sample(
ERROR 09-25 18:52:23 engine.py:157]                                                                  ^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sampler.py", line 879, in _sample
ERROR 09-25 18:52:23 engine.py:157]     return _sample_with_torch(
ERROR 09-25 18:52:23 engine.py:157]            ^^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sampler.py", line 819, in _sample_with_torch
ERROR 09-25 18:52:23 engine.py:157]     multinomial_samples[sampling_type] = _multinomial(
ERROR 09-25 18:52:23 engine.py:157]                                          ^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sampler.py", line 627, in _multinomial
ERROR 09-25 18:52:23 engine.py:157]     assert seq_group.generator is not None
ERROR 09-25 18:52:23 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157] AssertionError
(VllmWorkerProcess pid=209) INFO 09-25 18:52:23 multiproc_worker_utils.py:244] Worker exiting
(VllmWorkerProcess pid=211) INFO 09-25 18:52:23 multiproc_worker_utils.py:244] Worker exiting
(VllmWorkerProcess pid=210) INFO 09-25 18:52:23 multiproc_worker_utils.py:244] Worker exiting
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 257, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 253, in wrap
    await func()
  File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 230, in listen_for_disconnect
    message = await receive()
              ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 555, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.12/asyncio/locks.py", line 212, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7ff530597740

During handling of the above exception, another exception occurred:

  + Exception Group Traceback (most recent call last):
  |   File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
  |     result = await app(  # type: ignore[func-returns-value]
  |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
  |     return await self.app(scope, receive, send)
  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1054, in __call__
  |     await super().__call__(scope, receive, send)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 113, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 187, in __call__
  |     raise exc
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 165, in __call__
  |     await self.app(scope, receive, _send)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/cors.py", line 85, in __call__
  |     await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
  |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 62, in wrapped_app
  |     raise exc
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 51, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 715, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 735, in app
  |     await route.handle(scope, receive, send)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 288, in handle
  |     await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 76, in app
  |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 62, in wrapped_app
  |     raise exc
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 51, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 74, in app
  |     await response(scope, receive, send)
  |   File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 250, in __call__
  |     async with anyio.create_task_group() as task_group:
  |                ^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/usr/local/lib/python3.12/dist-packages/anyio/_backends/_asyncio.py", line 736, in __aexit__
  |     raise BaseExceptionGroup(
  | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 253, in wrap
    |     await func()
    |   File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 242, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 248, in completion_stream_generator
    |     async for prompt_idx, res in result_generator:
    |   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 490, in merge_async_iterators
    |     item = await d
    |            ^^^^^^^
    |   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 486, in _process_request
    |     raise request_output
    | AssertionError
    +------------------------------------
INFO 09-25 18:52:24 multiproc_worker_utils.py:124] Killing local vLLM worker processes
[rank0]:[W925 18:52:26.595501679 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
INFO:     172.16.0.87:35352 - "GET /health HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application

@tcapelle

How can I run the vllm docker image against the meta provided checkpoint (not the huggingface one)?

@groccy commented Sep 26, 2024

What is the ETA for adding support for multiple images? For a project I'm working on, we're trying to see if it's viable to use vLLM for Llama 3.2 in the next few days.

Same here. If we can know the ETA for adding support for multiple and interleaved images, it would be greatly appreciated. We also have a project that depends on this kind of SOTA VLM.

Thank you so much for your hard work and great contributions!

@heheda12345 (Collaborator)

How can I run the vllm docker image against the meta provided checkpoint (not the huggingface one)?

You can add the logic for loading the Meta-provided checkpoint here. The main work is remapping some parameter names, and you still need Hugging Face's configs.
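
As a rough illustration of the kind of renaming involved (a hypothetical sketch only; the patterns below are placeholders, not vLLM's actual loader hooks or the real checkpoint key names):

    # Hypothetical sketch: remap Meta-style parameter names to the HF-style
    # names the vLLM model definition expects. The patterns are illustrative
    # placeholders, not the real key names.
    import re

    RENAME_RULES = [
        (r"^layers\.(\d+)\.attention\.wq\.weight$",
         r"model.layers.\1.self_attn.q_proj.weight"),
        (r"^tok_embeddings\.weight$", r"model.embed_tokens.weight"),
    ]

    def remap_meta_state_dict(meta_state_dict: dict) -> dict:
        remapped = {}
        for name, tensor in meta_state_dict.items():
            new_name = name
            for pattern, replacement in RENAME_RULES:
                new_name, count = re.subn(pattern, replacement, new_name)
                if count:
                    break
            remapped[new_name] = tensor
        return remapped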

@tcapelle commented Sep 26, 2024

I got an HF checkpoint thanks to some EU friends and managed to serve the model for half of today, but now it's dying:

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb9eaf03f86 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fb9eaeb2d10 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fb9eafdef08 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fb9ec1fb3e6 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fb9ec200600 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fb9ec2072ba in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb9ec2096fc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd6df4 (0x7fba399addf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7fba3abc6609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7fba3ad00353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb9eaf03f86 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fb9eaeb2d10 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fb9eafdef08 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fb9ec1fb3e6 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fb9ec200600 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fb9ec2072ba in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb9ec2096fc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd6df4 (0x7fba399addf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7fba3abc6609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7fba3ad00353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb9eaf03f86 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7fb9ebe92a84 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7fba399addf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7fba3abc6609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7fba3ad00353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [1]
/usr/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown

@sfc-gh-zhwang (Contributor)

What is the general memory requirement for the model? It seems the 90B model is not able to run on an 8×40GB A100 machine.

@varoudis

What is the general memory requirement for the model? It seems the 90B model is not able to run on an 8×40GB A100 machine.

pif... imagine I tried to run it on 8 A30 :(

@DarkLight1337 (Member)

If you run into OOM issues, please reduce max_model_len and/or max_num_seqs, as described in the example script.
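
For example, something along these lines for offline use (illustrative values only; tune them for your GPUs):

    # Illustrative values; lower max_model_len / max_num_seqs further if OOM persists.
    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-3.2-11B-Vision-Instruct",
        max_model_len=4096,  # shorter context -> smaller KV cache reservation
        max_num_seqs=8,      # fewer concurrent sequences -> less memory pressure
        enforce_eager=True,
    )

The same two knobs are available as --max-model-len and --max-num-seqs on vllm serve.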

@ywang96 (Member) commented Sep 27, 2024

Thank you all for trying out Llama 3.2 vision model on vLLM!

As you may already know, multimodal Llama 3.2 is quite different from the other LLaVA-style VLMs we currently support on vLLM, as it involves cross-attention layers in the language model. Below are the next steps and their priorities for features and optimizations related to this new architecture.

P0

  • Cross-attention mask
  • Address the KV cache waste issue:
    • We need better memory management for cross-attention/encoder-decoder models, since the page size cannot be specified separately.
  • Optimize inference of mixed requests (text-only & text + image):
    • Since the model forward pass changes dynamically with the input (text-only or text + image), we need to add a dummy image + mask for text-only sequences so that mixed requests can be batched.
    • As a result, there is significant room for optimizing both compute and space efficiency.
    • This is less of an issue for offline inference (where requests are typically either all text-only or all text + image), but more so for the online serving case.

P1

  • Attention backends other than xformers for cross attention
  • Enabling cuda graph

P2

@pseudotensor

Any thoughts on my issue, or should I post a separate issue?

ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sampler.py", line 819, in _sample_with_torch
ERROR 09-25 18:52:23 engine.py:157]     multinomial_samples[sampling_type] = _multinomial(
ERROR 09-25 18:52:23 engine.py:157]                                          ^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sampler.py", line 627, in _multinomial
ERROR 09-25 18:52:23 engine.py:157]     assert seq_group.generator is not None
ERROR 09-25 18:52:23 engine.py:157]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-25 18:52:23 engine.py:157] AssertionError

#8826 (comment)

@njhill (Member) commented Sep 27, 2024

@pseudotensor fix is here #8870.

@AmericanPresidentJimmyCarter

It's a good first attempt but as it currently is it's basically unusable. Without caching the cross attention projections it is insanely slow, probably about 1/20th the speed it should be.

@heheda12345 (Collaborator)

It's a good first attempt but as it currently is it's basically unusable. Without caching the cross attention projections it is insanely slow, probably about 1/20th the speed it should be.

Can you provide more information so we can reproduce your problem, e.g., your benchmark script and your data?

@AmericanPresidentJimmyCarter

I don't think any of the implementations currently cache the cross-attention projections. But for inference, it looks like the outputs of the cross-attention KV projections for attending to the image could be cached after the first token and reused for all subsequent inference steps without being recalculated each time (the image doesn't change).

def forward(
    self,
    hidden_states: torch.Tensor,
    attention_mask: Optional[torch.Tensor],
    cross_attention_states: Optional[torch.Tensor],
    kv_cache: torch.Tensor,
    attn_metadata: AttentionMetadata,
) -> torch.Tensor:
    qkv_dec, _ = self.qkv_proj(hidden_states)
    q, _, _ = qkv_dec.split(
        [self.q_local_size, self.kv_local_size, self.kv_local_size],
        dim=-1)
    if cross_attention_states is None:
        k = None
        v = None
    else:
        qkv_enc, _ = self.qkv_proj(cross_attention_states)
        _, k, v = qkv_enc.split(
            [self.q_local_size, self.kv_local_size, self.kv_local_size],
            dim=-1)
        k = k.view(-1, self.num_local_key_value_heads, self.head_dim)
        v = v.view(-1, self.num_local_key_value_heads, self.head_dim)
        k = self.k_norm(k)
    q = q.view(-1, self.num_local_heads, self.head_dim)
    q = self.q_norm(q)
    output = self.attn(q,
                       k,
                       v,
                       kv_cache,
                       attn_metadata,
                       attn_type=AttentionType.ENCODER_DECODER)
    out, _ = self.o_proj(output)
    return out

It's possible I am misunderstanding the code, but it looks like it recomputes them each time even though the output should be the same?

@ywang96 (Member) commented Sep 28, 2024

I don't think any of the implementations currently have the cross attention projection caches? But for inference, it looks like the outputs of the cross attention kv projections for attending to the image can be cached after the first token and reused for all subsequent inference steps without being recalculated each time (the image doesn't change).


I suggest you take a look at the actual attention module implementation here to understand how attention with KV cache works.

class Attention(nn.Module):
    """Attention layer.

    This class takes query, key, and value tensors as input. The input tensors
    can either contain prompt tokens or generation tokens.
    The class does the following:

    1. Store the input key and value tensors in the KV cache.
    2. Perform (multi-head/multi-query/grouped-query) attention.
    3. Return the output tensor.
    """

    def __init__(

@AmericanPresidentJimmyCarter commented Sep 28, 2024

OK -- I misunderstood. I think my previous speed issues might have been a skill issue on my part and I dug through the other issues to try to figure out which command line arguments I was missing. Running

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --gpu-memory-utilization 0.95 --model=meta-llama/Llama-3.2-11B-Vision-Instruct --tokenizer=meta-llama/Llama-3.2-11B-Vision-Instruct --download-dir="/home/user/storage/hf_cache" --dtype=bfloat16 --device=cuda --host=127.0.0.1 --port=30000 --max-model-len=8192 --quantization="fp8" --enforce_eager --max_num_seqs=8

On 3× 3090s it is giving me:

Average caption generation time: 40.10s over 166 captions
Recent captions per second: 1.30
Average caption generation time: 43.22s over 235 captions
Recent captions per second: 1.15
Average caption generation time: 44.14s over 308 captions
Recent captions per second: 1.22
Average caption generation time: 45.89s over 368 captions
Recent captions per second: 1.00

@AmericanPresidentJimmyCarter commented Sep 28, 2024

Found another weird thing with vLLM and Llama 3.2 11B, or possibly another failure on my part to read the docs. If I set --enable_chunked-prefill=false, the model will batch 4-5 requests running at the same time. If I don't supply this argument, it will only process 1 request at a time and be significantly slower. With that argument I go from about 2.4 sec/caption to 2.1 sec/caption on a 3090.

@Searcherr

Found another weird thing with vLLM and Llama 3.2 11B, or possibly another failure on my part to read the docs. If I set --enable_chunked-prefill=false, the model will batch 4-5 requests running at the same time. If I don't supply this argument, it will only process 1 request at a time and be significantly slower. With that argument I go from about 2.4 sec/caption to 2.1 sec/caption on a 3090.

May I ask you to share the vLLM arguments you used for the model deployment?

@AmericanPresidentJimmyCarter commented Sep 28, 2024

It's above.

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --gpu-memory-utilization 0.95 --model=meta-llama/Llama-3.2-11B-Vision-Instruct --tokenizer=meta-llama/Llama-3.2-11B-Vision-Instruct --download-dir="/home/user/storage/hf_cache" --dtype=bfloat16 --device=cuda --host=127.0.0.1 --port=30000 --max-model-len=8192 --quantization="fp8" --enforce_eager --max_num_seqs=8 --enable_chunked-prefill=false

@AmericanPresidentJimmyCarter

I'm still having other issues too. vLLM will sometimes hang randomly without errors and stop responding to requests after a few hours. :(

@Searcherr

It's above.

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --gpu-memory-utilization 0.95 --model=meta-llama/Llama-3.2-11B-Vision-Instruct --tokenizer=meta-llama/Llama-3.2-11B-Vision-Instruct --download-dir="/home/user/storage/hf_cache" --dtype=bfloat16 --device=cuda --host=127.0.0.1 --port=30000 --max-model-len=8192 --quantization="fp8" --enforce_eager --max_num_seqs=8 --enable_chunked-prefill=false

Worked perfectly for me. Thank you so much!
Going to test on Monday, or tomorrow if the weather turns terrible :)))

@AmericanPresidentJimmyCarter commented Sep 28, 2024

Perfectly worked for me. Thank you so much! Going to tests on Monday. Or tomorrow if the weather become terrible :)))

Unfortunately, performance degrades over time for me and I don't know why. After about an hour of continuous requests, throughput falls to about half the original rate. At least it's very reproducible.

edit: After going through the logs, I wonder if it's another issue with the HTTP server.

INFO 09-28 17:21:02 metrics.py:351] Avg prompt throughput: 79.7 tokens/s, Avg generation throughput: 120.6 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 64.9%, CPU KV cache usage: 0.0%.

On the slow GPU, it stops receiving new requests ahead of time and Pending: always remains at zero. Two other GPUs on the same system have Pending stuck full, as I'm constantly shipping new requests to them. So I'm not sure why one server gets stuck and doesn't seem to queue the new requests.

@AmericanPresidentJimmyCarter

OK, I give up. The one server (which is running on localhost) stops processing requests randomly. This results in long stretches of idle time where no work is being done, rendering the OpenAI server unusable.

@tcapelle

Just to report that I have been serving the 90B from the HF checkpoint without issues (besides when trying to use Instructor on top, which killed the server).
It's not super fast (~35 tok/s) but it can handle multiple requests in parallel.
Hardware: 8×H100.
I have served more than 10M tokens already ;)

@AmericanPresidentJimmyCarter commented Sep 29, 2024

I tried offline inference mode with batch size 8 on a 3090; again it hangs randomly and stops processing batches.

On ctrl+c it shows it is stuck in return len(self._output_token_ids) + len(self._prompt_token_ids).

Traceback (most recent call last):                                              
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()                                                                  
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run       
    self._target(*self._args, **self._kwargs)                                   
  File "/home/user/Programs/vllm/caption_vllm.py", line 147, in worker    
    outputs = llm.generate(inputs, sampling_params=sampling_params)             
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/utils.py", line 1047, in inner             
    return fn(*args, **kwargs)                                                  
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 388, in generate                                               
    outputs = self._run_engine(use_tqdm=use_tqdm)                               
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 877, in _run_engine
    step_outputs = self.llm_engine.step()                                   
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1217, in step                                                
    ) = self.scheduler[virtual_engine].schedule()
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/core/scheduler.py", line 1189, in schedule 
    scheduler_outputs = self._schedule()                                        
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/core/scheduler.py", line 1154, in _schedule                                              
    return self._schedule_default()                                             
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/core/scheduler.py", line 989, in _schedule_default                          
    prefills = self._schedule_prefills(budget,                                  
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/core/scheduler.py", line 903, in _schedule_prefills                                      
    can_allocate = self.block_manager.can_allocate(seq_group)                   
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/core/block_manager_v1.py", line 290, in can_allocate      
    self_num_required_blocks = self._get_seq_num_required_blocks(               
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/core/block_manager_v1.py", line 282, in _get_seq_num_required_blocks
    return 0 if seq is None else seq.n_blocks                                   
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/sequence.py", line 452, in n_blocks                                                      
    return (self.get_len() + self.block_size - 1) // self.block_size            
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/sequence.py", line 557, in get_len                                                       
    return self.data.get_len()                                                  
  File "/home/user/Programs/vllm/env/lib/python3.10/site-packages/vllm/sequence.py", line 273, in get_len          
    return len(self._output_token_ids) + len(self._prompt_token_ids)            
KeyboardInterrupt  

vLLM was instantiated like:

    model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"
    llm = LLM(
        model=model_name,
        max_model_len=8192,  # Adjust as needed
        max_num_seqs=8,      # Adjust as needed
        enforce_eager=True,
        quantization="fp8",
        gpu_memory_utilization=0.98,
    )

It was also much slower (~7 s/caption) than the OpenAI server.
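
For reference, a single-image offline request with an LLM instance like the one above looks roughly like the sketch below (assuming vLLM's multi_modal_data interface and the <|image|> placeholder used in its multimodal examples; the image path is a placeholder):

    # Rough sketch of a single-image offline request with the LLM instance above.
    from PIL import Image
    from vllm import SamplingParams

    image = Image.open("example.jpg").convert("RGB")
    prompt = "<|image|><|begin_of_text|>Describe this image briefly."

    sampling_params = SamplingParams(temperature=0.2, max_tokens=128)
    outputs = llm.generate(
        {"prompt": prompt, "multi_modal_data": {"image": image}},
        sampling_params=sampling_params,
    )
    print(outputs[0].outputs[0].text)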

@samos123 (Contributor) commented Sep 29, 2024

I am getting the following error:

ERROR 09-28 19:27:59 async_llm_engine.py:61] RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
Exception in callback functools.partial(<function _log_task_completion at 0x7c49e15f6e80>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7c49dda27aa0>>)

Full logs: https://gist.github.com/samos123/43db2724d7ac1bba44a379ce300b156b

I reduced memory requirements and afterwards got a new error: https://gist.github.com/samos123/ee858936496e1d314785719f9287230a

Update: I only get the issue when I had --kv-cache-dtype=fp8. After I removed that flag it started working.

@ywang96 (Member) commented Sep 29, 2024


@samos123 Yeah, I'm almost sure fp8 for the KV cache isn't working with cross attention yet.

FWIW, the command below works for me on 1xH100

 vllm serve neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic --enforce-eager --max-num-seqs 32

@AmericanPresidentJimmyCarter commented Sep 29, 2024

I did get it to run stably overnight at 5 seconds/caption on 3090s with --gpu-memory-utilization 0.9 using the OpenAI server.

The problem I'm seeing is that performance degrades very quickly because vLLM stops batching requests together. I'm not sure how that is determined (based on KV cache occupancy?). So it will start out going approximately double the speed, then degrade over about 20-30 minutes.

At startup:

INFO 09-29 16:03:38 metrics.py:351] Avg prompt throughput: 45.9 tokens/s, Avg generation throughput: 166.1 tokens/s, Running: 4 reqs, Swapped: 0 reqs, Pending: 11 reqs, GPU KV cache usage: 94.2%, CPU KV cache usage: 0.0%.

After some time:

INFO 09-29 16:06:33 metrics.py:351] Avg prompt throughput: 59.0 tokens/s, Avg generation throughput: 44.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 15 reqs, GPU KV cache usage: 94.8%, CPU KV cache usage: 0.0%.

Setting --gpu-memory-utilization higher seems to make vLLM unstable. Turning off most of the logs seems to have helped keep the Pending requests at a high count so they don't fall to zero anymore, but the performance degradation over time persists.

edit: I tried the same run command with L40S GPUs and it reliably does batch sizes of 8 at around 0.9 seconds/caption. So it seems to be an issue with lower-VRAM cards.

The FastAPI server does indeed have another problem: if you send too many requests at once, they all seem to time out. I was only able to get the L40S to use a consistent batch size of 8 by carefully trickling in requests, which seems less than ideal. Maybe the FastAPI server needs more threads/processes to handle incoming requests efficiently.

@heheda12345 (Collaborator)

The problem I'm seeing is that performance degrades very quickly because vllm stops doing batches together. I'm not sure how that is determined (based on kv-cache occupancy?). So it will start out going approximately double the speed then degrade in about 20-30 minutes.

@AmericanPresidentJimmyCarter Yes, the scheduling is based on KV cache occupancy. GPU KV cache usage is at 94.8%, so I guess the reason it does not batch more requests is that the KV cache does not have enough space.

@AmericanPresidentJimmyCarter

That makes sense. It seems like a lot of the KV cache may stick around too long to be useful, as when you first start up the server it runs almost twice as fast (and handles 3-4 simultaneous requests instead of 1). It would be nice if there were a way to occasionally purge the KV cache, like an endpoint I could hit or an argument.

@AmericanPresidentJimmyCarter commented Sep 30, 2024

For anyone perplexed about this: you might need to tweak --max_num_seqs. For a 3090/4090 I need to set it to 4. After countless hours this weekend, it seems that this variable controls the number of simultaneous batches and also how much of the KV cache is filled with each item added to the async engine for generate. Maybe a vLLM dev can chime in on exactly what this is and why it's critical. From reading the source code it seems like --max_num_seqs might mean "maximum async engine batch size to run", but why that has any impact on the KV cache I don't understand. The KV cache was blowing up before while the model was processing exactly the same batch sizes at --max_num_seqs 8 (even with it set this high, vLLM would only run a maximum of 4-5 items per batch anyway).

edit: Go here: #2492

@tcapelle

Just FYI, on an 8×H100 node I am getting:

INFO 09-30 01:04:27 metrics.py:351] Avg prompt throughput: 209.3 tokens/s, Avg generation throughput: 44.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV

@destino92

How can I deploy this to AWS? It is for an experiment and I am really new to AI ops. Can anyone assist me, please?

@varoudis commented Oct 2, 2024

Just FYI, on an 8×H100 node I am getting:

INFO 09-30 01:04:27 metrics.py:351] Avg prompt throughput: 209.3 tokens/s, Avg generation throughput: 44.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV

What's the VRAM usage?

@tcapelle commented Oct 2, 2024

~60-70GB

@joe-schwartz-certara commented Oct 4, 2024

https://docs.vllm.ai/en/stable/models/vlm.html

Am I correct the implication of the above guide is that I can do online inference ONLY with the /v1/chat/completions endpoint and ONLY with a url for the image? Is there another way to supply an image to the model server besides a web url?

@joe-schwartz-certara

Btw, I am having no problems serving https://huggingface.co/neuralmagic/Llama-3.2-90B-Vision-Instruct-FP8-dynamic on 2 A100s. Text inference and text + single-image inference both work great but are very slow. Looking forward to the optimized implementation :)

@AmericanPresidentJimmyCarter

https://docs.vllm.ai/en/stable/models/vlm.html

Am I correct the implication of the above guide is that I can do online inference ONLY with the /v1/chat/completions endpoint and ONLY with a url for the image? Is there another way to supply an image to the model server besides a web url?

base64
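
For example, a locally stored image can be sent as a base64 data URL in the same image_url field. A minimal sketch (the file name and server address are placeholders):

    # Minimal sketch: send a local image as a base64 data URL to the
    # OpenAI-compatible /v1/chat/completions endpoint.
    import base64
    from openai import OpenAI

    with open("local_image.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.2-11B-Vision-Instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=128,
    )
    print(response.choices[0].message.content)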

@Searcherr commented Oct 5, 2024 via email

@javed828 commented Oct 5, 2024

I'm using:

CUDA_VISIBLE_DEVICES=0,1 python3 -m vllm.entrypoints.openai.api_server \
    --gpu-memory-utilization 0.95 --model=meta-llama/Meta-Llama-3.1-8B-Instruct \
    --tokenizer=meta-llama/Meta-Llama-3.1-8B-Instruct \
    --download-dir="/home/user/storage/hf_cache" --dtype=float16 --device=cuda \
    --host=0.0.0.0 --port=30000 --max-model-len=8192 \
    --enforce_eager --max_num_seqs=8 \
    --enable_chunked-prefill=false --tensor_parallel_size=2

I'm using 2 NVIDIA GPUs. When sending 1 request it works perfectly, but when sending 2 requests in parallel, only 1 is processed and the other is blocked until the first completes. How can I process parallel requests? Can anyone help with this? I'm new to it.

@heheda12345 (Collaborator)

What is the ETA for adding support for multiple images? For a project I'm working on, we're trying to see if it's viable to use vLLM for Llama 3.2 in the next few days.

We are working on this in PR #9095.

@tcapelle commented Oct 7, 2024

We are working on this in PR #9095.

I don't think that addresses the issue; @javed828 is asking about parallel requests.

Are you sure you are sending them in parallel? I had no issues running up to 10 parallel requests (I was on 8×H100).

@shermansiu

I don't think that was in reply to @javed828's comment; it was in reply to the comment I made about two weeks ago at the beginning of the thread.

@javed828 commented Oct 7, 2024

@tcapelle yes, I am hitting it in parallel, and I am using NVIDIA T4s.

@Archmilio commented Oct 8, 2024

The GPU block count is set to be very small compared to the VRAM size. Does anyone have the same issue?
Block size is 16.
11B, 1 GPU (80 GB × 1):
INFO 10-08 04:16:45 gpu_executor.py:122] # GPU blocks: 1264, # CPU blocks: 1638
90B, 4 GPUs (80 GB × 4):
INFO 10-08 04:17:17 distributed_gpu_executor.py:57] # GPU blocks: 2327, # CPU blocks: 2621

@sjuxax commented Oct 13, 2024

docs.vllm.ai/en/stable/models/vlm.html

Am I correct the implication of the above guide is that I can do online inference ONLY with the /v1/chat/completions endpoint and ONLY with a url for the image? Is there another way to supply an image to the model server besides a web url?

It's pretty easy to hack in support for file:// URIs. I think it makes sense for vLLM not to upstream it for security purposes/related shenanigans, but I threw together a simple patch that worked well in my limited testing. I'll upload it in a sec.

Edit: commit here

@keyword1983 commented Oct 21, 2024

You can try reducing your max_num_seqs setting. You'll see that the GPU block count increases when you decrease max_num_seqs; I think a lot of memory is reserved for mm_tokens in each sequence.

The GPU blocks count is set to be very small compared to the VRAM size. Does anyone have the same issue? Block Size is 16 11B, 1 GPU (80GB * 1ea) INFO 10-08 04:16:45 gpu_executor.py:122] # GPU blocks: 1264, # CPU blocks: 1638 90B, 4GPU (80GB * 4ea) INFO 10-08 04:17:17 distributed_gpu_executor.py:57] # GPU blocks: 2327, # CPU blocks: 2621

@Cognitus-Stuti

How can we optimize Llama-3.2-11B to run on 4 T4 GPUs?
vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --max-model-len 6000 --enforce-eager --max-num-seqs 5 --port 8000 --dtype half --tensor-parallel-size 4 --gpu-memory-utilization 0.97

Avg tokens per second is 2

@Cognitus-Stuti

@tcapelle yes i m hitting parellel and i m using nvidia T4

@javed828 could you please share your serve command here?
