# InternVL3 Usage Guide

This guide describes how to run the InternVL3 series on NVIDIA GPUs.

[InternVL3](https://huggingface.co/collections/OpenGVLab/internvl3-67f7f690be79c2fe9d74fe9d) is a powerful family of multimodal models that combines vision and language understanding. This recipe provides step-by-step instructions for running InternVL3 with vLLM, optimized for various hardware configurations.

## Deployment Steps

### Installing vLLM

```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
```
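As a quick sanity check that the installation succeeded, you can confirm that vLLM imports cleanly and print its version:

```bash
# quick sanity check: vLLM should import and report its version
python -c "import vllm; print(vllm.__version__)"
```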
### Weights
[OpenGVLab/InternVL3-8B-hf](https://huggingface.co/OpenGVLab/InternVL3-8B-hf)
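By default, `vllm serve` downloads the weights from the Hugging Face Hub on first launch. If you prefer to fetch them ahead of time, one option (assuming the `huggingface-cli` tool from `huggingface_hub` is available in your environment) is:

```bash
# optional: pre-download the weights into the local Hugging Face cache
huggingface-cli download OpenGVLab/InternVL3-8B-hf
```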
### Running the InternVL3-8B-hf model on A100-SXM4-40GB GPUs (2 cards) in eager mode

Launch the online inference server using TP=2:
```bash
export CUDA_VISIBLE_DEVICES=0,1
vllm serve OpenGVLab/InternVL3-8B-hf --enforce-eager \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --data-parallel-size 1
```
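The first launch can take a while because it downloads the weights and initializes the engine. Once the server reports that it is listening, you can verify readiness from another terminal via the OpenAI-compatible endpoints:

```bash
# returns HTTP 200 once the server is ready to accept requests
curl http://localhost:8000/health
# lists the models served by this endpoint
curl http://localhost:8000/v1/models
```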
## Configs and Parameters

`--enforce-eager` disables CUDA Graph capture and runs the model in PyTorch eager mode; without it, serving fails with `torch._dynamo.exc.Unsupported: Data-dependent branching` during testing. For more information about CUDA Graphs, see [Accelerating PyTorch with CUDA Graphs](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/).

`--tensor-parallel-size` sets the tensor-parallel (TP) degree, i.e., the number of GPUs each model replica is sharded across.

`--data-parallel-size` sets the data-parallel (DP) degree, i.e., the number of independent model replicas that serve requests.
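As an alternative layout on the same two GPUs, you could run two data-parallel replicas with TP=1 instead of a single TP=2 instance. A minimal sketch, assuming the 8B weights fit comfortably on one 40 GB card:

```bash
# two independent replicas (DP=2), each on its own GPU (TP=1)
export CUDA_VISIBLE_DEVICES=0,1
vllm serve OpenGVLab/InternVL3-8B-hf --enforce-eager \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --data-parallel-size 2
```

Which layout performs better depends on your request mix: DP=2 serves independent requests in parallel, while TP=2 shards the weights and leaves more memory per replica for the KV cache.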
## Validation & Expected Behavior

### Basic Test
Open another terminal and run the following command:
```bash
# need to start vLLM service first
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "<|begin_of_text|><|system|>\nYou are a helpful AI assistant.\n<|user|>\nWhat is the capital of France?\n<|assistant|>",
    "max_tokens": 100,
    "temperature": 0.7
  }'
```
The result will look similar to this:
```json
{
  "id": "cmpl-1ed0df81b56448afa597215a8725c686",
  "object": "text_completion",
  "created": 1755739470,
  "model": "OpenGVLab/InternVL3-8B-hf",
  "choices": [
    {
      "index": 0,
      "text": " The capital of France is Paris.",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 35,
    "total_tokens": 43,
    "completion_tokens": 8,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
```
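The completion above only exercises the text path. Since InternVL3 is a multimodal model, you may also want to send an image through the chat-completions endpoint. A sketch, where the image URL is a placeholder to replace with any image reachable from the server:

```bash
# need to start vLLM service first; replace the image URL with a real one
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenGVLab/InternVL3-8B-hf",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
        {"type": "text", "text": "Describe this image in one sentence."}
      ]
    }],
    "max_tokens": 100
  }'
```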
### Benchmarking Performance

Take InternVL3-8B-hf as an example:

```bash
# need to start vLLM service first
vllm bench serve \
  --host 0.0.0.0 \
  --port 8000 \
  --model OpenGVLab/InternVL3-8B-hf \
  --dataset-name random \
  --random-input 2048 \
  --random-output 1024 \
  --max-concurrency 10 \
  --num-prompts 50 \
  --ignore-eos
```
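If you want to keep the measurements for later comparison, the benchmark can also write its results to a JSON file. A sketch, assuming your vLLM version supports the `--save-result` and `--result-filename` options:

```bash
# same benchmark, additionally persisting the metrics to a JSON file
# (the output filename is arbitrary)
vllm bench serve \
  --host 0.0.0.0 \
  --port 8000 \
  --model OpenGVLab/InternVL3-8B-hf \
  --dataset-name random \
  --random-input 2048 \
  --random-output 1024 \
  --max-concurrency 10 \
  --num-prompts 50 \
  --ignore-eos \
  --save-result \
  --result-filename internvl3_8b_bench.json
```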
If it works successfully, you will see output similar to the following (exact numbers will vary with hardware and configuration).
```
============ Serving Benchmark Result ============
Successful requests:                     497
Benchmark duration (s):                  229.42
Total input tokens:                      507680
Total generated tokens:                  62259
Request throughput (req/s):              2.17
Output token throughput (tok/s):         271.37
Total Token throughput (tok/s):          2484.22
---------------Time to First Token----------------
Mean TTFT (ms):                          102429.40
Median TTFT (ms):                        99644.38
P99 TTFT (ms):                           213820.81
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          664.26
Median TPOT (ms):                        776.39
P99 TPOT (ms):                           848.52
---------------Inter-token Latency----------------
Mean ITL (ms):                           661.73
Median ITL (ms):                         844.15
P99 ITL (ms):                            856.42
==================================================
```