@@ -42,11 +42,11 @@ Install `mlperf-inf-mm-vl2l` and the development tools with:

- On Bash
```bash
- pip install multimodal/vl2l/[dev]
+ pip install -e multimodal/vl2l/[dev]
```
- On Zsh
```zsh
- pip install multimodal/vl2l/"[dev]"
+ pip install -e multimodal/vl2l/"[dev]"
```

### Post VL2L benchmarking CLI installation
@@ -63,7 +63,8 @@ You can enable shell autocompletion for `mlperf-inf-mm-vl2l` with:
mlperf-inf-mm-vl2l --install-completion
```

- > NOTE: Shell auto-completion will take effect once you restart the terminal.
+ > [!NOTE]
+ > Shell auto-completion will take effect once you restart the terminal.

### Start an inference endpoint on your local host machine with vLLM

@@ -108,6 +109,12 @@ Accuracy only mode:
mlperf-inf-mm-vl2l benchmark endpoint --settings.test.scenario server --settings.test.mode accuracy_only
```

+ ### Evaluate the response quality
+
+ ```bash
+ mlperf-inf-mm-vl2l evaluate --filename output/mlperf_log_accuracy.json
+ ```
+
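+ As a rough sketch, a full accuracy pass against an already running endpoint can chain
+ the two commands above (this assumes the benchmark writes its LoadGen logs to the
+ `output/` directory used in the `evaluate` example):
+
+ ```bash
+ # Produce the accuracy log in the Offline scenario, then score the recorded responses.
+ mlperf-inf-mm-vl2l benchmark endpoint --settings.test.scenario offline --settings.test.mode accuracy_only
+ mlperf-inf-mm-vl2l evaluate --filename output/mlperf_log_accuracy.json
+ ```
+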
## Docker

[docker/](docker/) provides examples of Dockerfiles that install the VL2L benchmarking
@@ -117,6 +124,30 @@ for example, in a situation where you must use a GPU cluster managed by
[Slurm](https://slurm.schedmd.com/) with [enroot](https://github.com/nvidia/enroot) and
[pyxis](https://github.com/NVIDIA/pyxis).

+ As an illustrative example, assuming that you are at the root directory of the MLPerf
+ Inference repo:
+
+ 1. You can build a container image against vLLM's
+ `vllm/vllm-openai:v0.12.0` release with:
+
+ ```bash
+ docker build \
+     --build-arg BASE_IMAGE_URL=vllm/vllm-openai:v0.12.0 \
+     --build-arg MLPERF_INF_MM_VL2L_INSTALL_URL=multimodal/vl2l \
+     -f multimodal/vl2l/docker/vllm-cuda.Dockerfile \
+     -t mlperf-inf-mm-vl2l:vllm-openai-v0.12.0 \
+     .
+ ```
+ > [!NOTE]
+ > `MLPERF_INF_MM_VL2L_INSTALL_URL` can also point to a remote GitHub location, such as
+ > `git+https://github.com/mlcommons/inference.git#subdirectory=multimodal/vl2l/`.
+
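+ For instance, a hypothetical variant of the build command above that installs the CLI
+ from such a remote location instead of the local checkout (assuming network access is
+ available during the image build) could be:
+
+ ```bash
+ docker build \
+     --build-arg BASE_IMAGE_URL=vllm/vllm-openai:v0.12.0 \
+     --build-arg MLPERF_INF_MM_VL2L_INSTALL_URL=git+https://github.com/mlcommons/inference.git#subdirectory=multimodal/vl2l/ \
+     -f multimodal/vl2l/docker/vllm-cuda.Dockerfile \
+     -t mlperf-inf-mm-vl2l:vllm-openai-v0.12.0 \
+     .
+ ```
+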
+ 2. Afterwards, you can start the container in interactive mode with:
+
+ ```bash
+ docker run --rm -it --gpus all -v ~/.cache:/root/.cache --ipc=host mlperf-inf-mm-vl2l:vllm-openai-v0.12.0
+ ```
+
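+ Once inside the container, a quick sanity check (a sketch; the exact output depends on
+ your setup) is to confirm that the GPUs are visible and that the CLI is on the `PATH`:
+
+ ```bash
+ # List the GPUs exposed to the container.
+ nvidia-smi
+ # Print the available subcommands of the benchmarking CLI.
+ mlperf-inf-mm-vl2l --help
+ ```
+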
### Benchmark against vLLM inside the container

If you are running `mlperf-inf-mm-vl2l` inside a local environment that has access to
@@ -128,16 +159,27 @@ vLLM (such as inside a container that was created using the
2. Wait for the endpoint to be healthy.
3. Run the benchmark against that endpoint.

- For example, inside the container, you can run the Offline scenario Performance only
+ For example, inside the container, you can run the Offline scenario in Accuracy only
mode with:

```bash
mlperf-inf-mm-vl2l benchmark vllm \
-     --vllm.model.repo_id Qwen/Qwen3-VL-235B-A22B-Instruct \
-     --vllm.arg=--tensor-parallel-size=8 \
-     --vllm.arg=--limit-mm-per-prompt.video=0 \
    --settings.test.scenario offline \
-     --settings.test.mode performance_only
+     --settings.test.mode accuracy_only \
+     --dataset.token ... \
+     --vllm.cli=--async-scheduling \
+     --vllm.cli=--max-model-len=32768 \
+     --vllm.cli=--max-num-seqs=1024 \
+     --vllm.cli=--compilation-config='{
+       "cudagraph_capture_sizes": [
+         1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128,
+         136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248,
+         256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480,
+         496, 512, 1024, 1536, 2048, 3072, 4096, 6144, 8192, 12288, 16384, 24576, 32768
+       ]
+     }' \
+     --vllm.cli=--limit-mm-per-prompt.video=0 \
+     --vllm.cli=--tensor-parallel-size=8
```

## Developer Guide