diff --git a/OpenGVLab/InternVL3.md b/OpenGVLab/InternVL3.md
new file mode 100644
index 0000000..aa8f081
--- /dev/null
+++ b/OpenGVLab/InternVL3.md
@@ -0,0 +1,175 @@
+# InternVL3 Usage Guide
+
+This guide describes how to run the InternVL3 series on NVIDIA GPUs.
+
+[InternVL3](https://huggingface.co/collections/OpenGVLab/internvl3-67f7f690be79c2fe9d74fe9d) is a powerful multimodal model that combines vision and language understanding capabilities. This recipe provides step-by-step instructions for running InternVL3 with vLLM, optimized for various hardware configurations.
+
+## Deployment Steps
+
+### Installing vLLM
+
+```bash
+uv venv
+source .venv/bin/activate
+uv pip install -U vllm --torch-backend auto
+```
+
+### Weights
+[OpenGVLab/InternVL3-8B-hf](https://huggingface.co/OpenGVLab/InternVL3-8B-hf)
+
+### Running the InternVL3-8B-hf model on A100-SXM4-40GB GPUs (2 cards)
+
+Launch the online inference server using TP=2:
+```bash
+export CUDA_VISIBLE_DEVICES=0,1
+vllm serve OpenGVLab/InternVL3-8B-hf \
+  --host 0.0.0.0 \
+  --port 8000 \
+  --tensor-parallel-size 2 \
+  --data-parallel-size 1
+```
+
+## Configs and Parameters
+
+* You can set `--limit-mm-per-prompt` to limit the number of multimodal items allowed per prompt. This is useful if you want to control the incoming traffic of multimodal requests. E.g., `--limit-mm-per-prompt '{"image":2, "video":0}'`
+
+* You can set `--tensor-parallel-size` and `--data-parallel-size` to adjust the parallel strategy.
+
+## Validation & Expected Behavior
+
+### Basic Test
+Open another terminal and run the following command:
+```bash
+# need to start vLLM service first
+curl http://localhost:8000/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "prompt": "<|begin_of_text|><|system|>\nYou are a helpful AI assistant.\n<|user|>\nWhat is the capital of France?\n<|assistant|>",
+    "max_tokens": 100,
+    "temperature": 0.7
+  }'
+```
+
+The result should look like this:
+```json
+{
+  "id": "cmpl-1ed0df81b56448afa597215a8725c686",
+  "object": "text_completion",
+  "created": 1755739470,
+  "model": "OpenGVLab/InternVL3-8B-hf",
+  "choices": [
+    {
+      "index": 0,
+      "text": " The capital of France is Paris.",
+      "logprobs": null,
+      "finish_reason": "stop",
+      "stop_reason": null,
+      "prompt_logprobs": null
+    }
+  ],
+  "service_tier": null,
+  "system_fingerprint": null,
+  "usage": {
+    "prompt_tokens": 35,
+    "total_tokens": 43,
+    "completion_tokens": 8,
+    "prompt_tokens_details": null
+  },
+  "kv_transfer_params": null
+}
+```
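+
+Since InternVL3 is a vision-language model, you can also sanity-check image understanding through the OpenAI-compatible chat completions endpoint. The request below is a minimal sketch and not part of the original recipe: the image URL is a placeholder you should replace with a reachable image, and the question and `max_tokens` value are arbitrary.
+
+```bash
+# need to start vLLM service first; replace the image URL with your own
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "OpenGVLab/InternVL3-8B-hf",
+    "messages": [
+      {
+        "role": "user",
+        "content": [
+          {"type": "text", "text": "What is in this image?"},
+          {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}}
+        ]
+      }
+    ],
+    "max_tokens": 100
+  }'
+```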
+
+### Benchmarking Performance
+
+#### InternVL3-8B-hf on Multimodal Random Dataset
+
+Taking InternVL3-8B-hf as an example, benchmark it with the random multimodal dataset introduced in [this vLLM PR](https://github.com/vllm-project/vllm/pull/23119):
+
+```bash
+# need to start vLLM service first
+vllm bench serve \
+  --host 0.0.0.0 \
+  --port 8000 \
+  --model OpenGVLab/InternVL3-8B-hf \
+  --dataset-name random-mm \
+  --num-prompts 100 \
+  --max-concurrency 10 \
+  --random-prefix-len 25 \
+  --random-input-len 300 \
+  --random-output-len 40 \
+  --random-range-ratio 0.2 \
+  --random-mm-base-items-per-request 0 \
+  --random-mm-num-mm-items-range-ratio 0 \
+  --random-mm-limit-mm-per-prompt '{"image":3,"video":0}' \
+  --random-mm-bucket-config '{(256, 256, 1): 0.25, (720, 1280, 1): 0.75}' \
+  --request-rate inf \
+  --ignore-eos \
+  --endpoint-type openai-chat \
+  --endpoint "/v1/chat/completions" \
+  --seed 42
+```
+If it runs successfully, you will see output like the following.
+
+```
+============ Serving Benchmark Result ============
+Successful requests:                     100
+Maximum request concurrency:             10
+Benchmark duration (s):                  24.54
+Total input tokens:                      32805
+Total generated tokens:                  3982
+Request throughput (req/s):              4.07
+Output token throughput (tok/s):         162.25
+Total Token throughput (tok/s):          1498.91
+---------------Time to First Token----------------
+Mean TTFT (ms):                          198.18
+Median TTFT (ms):                        158.99
+P99 TTFT (ms):                           524.05
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms):                          55.56
+Median TPOT (ms):                        56.04
+P99 TPOT (ms):                           60.32
+---------------Inter-token Latency----------------
+Mean ITL (ms):                           54.22
+Median ITL (ms):                         47.02
+P99 ITL (ms):                            116.90
+==================================================
+```
+
+#### InternVL3-8B-hf on VisionArena-Chat Dataset
+
+```bash
+# need to start vLLM service first
+vllm bench serve \
+  --host 0.0.0.0 \
+  --port 8000 \
+  --endpoint /v1/chat/completions \
+  --endpoint-type openai-chat \
+  --model OpenGVLab/InternVL3-8B-hf \
+  --dataset-name hf \
+  --dataset-path lmarena-ai/VisionArena-Chat \
+  --num-prompts 1000
+```
+If it runs successfully, you will see output like the following.
+
+```
+============ Serving Benchmark Result ============
+Successful requests:                     1000
+Benchmark duration (s):                  597.45
+Total input tokens:                      109173
+Total generated tokens:                  109352
+Request throughput (req/s):              1.67
+Output token throughput (tok/s):         183.03
+Total Token throughput (tok/s):          365.76
+---------------Time to First Token----------------
+Mean TTFT (ms):                          280208.05
+Median TTFT (ms):                        270322.52
+P99 TTFT (ms):                           582602.60
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms):                          519.16
+Median TPOT (ms):                        539.03
+P99 TPOT (ms):                           596.74
+---------------Inter-token Latency----------------
+Mean ITL (ms):                           593.88
+Median ITL (ms):                         530.72
+P99 ITL (ms):                            4129.92
+==================================================
+```
diff --git a/README.md b/README.md
index 5b8175b..57a48a1 100644
--- a/README.md
+++ b/README.md
@@ -23,6 +23,9 @@ This repo intends to host community maintained common recipes to run vLLM
 ### OpenAI
 - [gpt-oss](OpenAI/GPT-OSS.md)
 
+### OpenGVLab
+- [InternVL3](OpenGVLab/InternVL3.md)
+
 ### Qwen
 - [Qwen2.5-VL](Qwen/Qwen2.5-VL.md)
 - [Qwen3-Coder-480B-A35B](Qwen/Qwen3-Coder-480B-A35B.md)