[Doc] Reorganize user guide #18661
Merged: vllm-bot merged 9 commits into vllm-project:main from DarkLight1337:reorganize-user-guide on May 24, 2025.
Changes from all commits (9 commits):
- 106fdab [Doc] Reorganize user guide (DarkLight1337)
- ec0e0a3 Rename (DarkLight1337)
- fdc55a9 Merge branch 'main' into reorganize-user-guide (DarkLight1337)
- 07cd0f7 Adjust (DarkLight1337)
- 47ec947 Update (DarkLight1337)
- cd3135b Rename (DarkLight1337)
- 85d184c Rename (DarkLight1337)
- 630d844 Update (DarkLight1337)
- 6a7cd61 Move more pages (DarkLight1337)
```diff
@@ -1,3 +1,3 @@
 # Contributing to vLLM

-You may find information about contributing to vLLM on [docs.vllm.ai](https://docs.vllm.ai/en/latest/contributing/overview.html).
+You may find information about contributing to vLLM on [docs.vllm.ai](https://docs.vllm.ai/en/latest/contributing).
```
```
@@ -5,29 +5,35 @@ nav:
- getting_started/quickstart.md
- getting_started/installation
- Examples:
- LMCache: getting_started/examples/lmcache
- getting_started/examples/offline_inference
- getting_started/examples/online_serving
- getting_started/examples/other
- Offline Inference: getting_started/examples/offline_inference
- Online Serving: getting_started/examples/online_serving
- Others:
- LMCache: getting_started/examples/lmcache
```
Review comment on the LMCache entry: "Maybe not something for this PR, but we probably should just move the LM cache example into the other dir." Reply: "Yeah agreed."
```
- getting_started/examples/other/*
- Quick Links:
- User Guide: serving/offline_inference.md
- Developer Guide: contributing/overview.md
- User Guide: usage/README.md
- Developer Guide: contributing/README.md
- API Reference: api/README.md
- Timeline:
- Roadmap: https://roadmap.vllm.ai
- Releases: https://github.com/vllm-project/vllm/releases
- User Guide:
- usage/README.md
- General:
- usage/*
- Inference and Serving:
- serving/offline_inference.md
- serving/openai_compatible_server.md
- serving/*
- serving/integrations
- Training: training
- Deployment:
- deployment/*
- deployment/frameworks
- deployment/integrations
- Performance: performance
- Training: training
- Configuration:
- Summary: configuration/README.md
- configuration/*
- Models:
- models/supported_models.md
- models/generative_models.md

@@ -37,12 +43,11 @@ nav:
- features/compatibility_matrix.md
- features/*
- features/quantization
- Other:
- getting_started/*
- Developer Guide:
- contributing/overview.md
- glob: contributing/*
flatten_single_child_sections: true
- contributing/README.md
- General:
- glob: contributing/*
flatten_single_child_sections: true
- Model Implementation: contributing/model
- Design Documents:
- V0: design
```
@@ -0,0 +1,4 @@

# Configuration Options

This section lists the most common options for running the vLLM engine.
For a full list, refer to the [configuration][configuration] page.
@@ -0,0 +1,144 @@

# Conserving Memory

Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem.

## Tensor Parallelism (TP)

Tensor parallelism (`tensor_parallel_size` option) can be used to split the model across multiple GPUs.

The following code splits the model across 2 GPUs.

```python
from vllm import LLM

llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
          tensor_parallel_size=2)
```

!!! warning
    To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. [torch.cuda.set_device][])
    before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.

    To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.

!!! note
    With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).

    You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
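For reference, a minimal sketch of loading such a sharded checkpoint afterwards; the `load_format="sharded_state"` value and the local path are assumptions for illustration, not taken from this page:

```python
from vllm import LLM

# Hypothetical directory produced by examples/offline_inference/save_sharded_state.py
llm = LLM(model="/path/to/sharded/checkpoint",
          load_format="sharded_state",  # assumed option for loading pre-sharded weights
          tensor_parallel_size=2)
```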
## Quantization

Quantized models take less memory at the cost of lower precision.

Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Red Hat AI](https://huggingface.co/RedHatAI))
and used directly without extra configuration.

Dynamic quantization is also supported via the `quantization` option -- see [here][quantization-index] for more details.
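For example, a minimal sketch of requesting dynamic quantization; the `"fp8"` value is an assumption for illustration, and the supported methods depend on your hardware and vLLM version:

```python
from vllm import LLM

# Quantize the weights on the fly; "fp8" is one commonly supported method (assumed here).
llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
          quantization="fp8")
```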
## Context length and batch size

You can further reduce memory usage by limiting the context length of the model (`max_model_len` option)
and the maximum batch size (`max_num_seqs` option).

```python
from vllm import LLM

llm = LLM(model="adept/fuyu-8b",
          max_model_len=2048,
          max_num_seqs=2)
```

## Reduce CUDA Graphs

By default, we optimize model inference using CUDA graphs, which take up extra memory on the GPU.

!!! warning
    CUDA graph capture takes up more memory in V1 than in V0.

You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:

```python
from vllm import LLM
from vllm.config import CompilationConfig, CompilationLevel

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    compilation_config=CompilationConfig(
        level=CompilationLevel.PIECEWISE,
        # By default, it goes up to max_num_seqs
        cudagraph_capture_sizes=[1, 2, 4, 8, 16],
    ),
)
```

You can disable graph capturing completely via the `enforce_eager` flag:

```python
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
          enforce_eager=True)
```

## Adjust cache size

If you run out of CPU RAM, try the following options:

- (Multi-modal models only) you can set the size of the multi-modal input cache using the `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
- (CPU backend only) you can set the size of the KV cache using the `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
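As a small sketch, both variables can be set from Python before the engine starts; the values below are illustrative assumptions, not recommendations:

```python
import os

# Set these before the engine starts; setting them before importing vLLM is the safest.
os.environ["VLLM_MM_INPUT_CACHE_GIB"] = "2"   # multi-modal input cache (multi-modal models only)
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "8"    # KV cache size (CPU backend only)

from vllm import LLM  # imported after the variables are set
```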
## Multi-modal input limits

You can allow a smaller number of multi-modal items per prompt to reduce the memory footprint of the model:

```python
from vllm import LLM

# Accept up to 3 images and 1 video per prompt
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
          limit_mm_per_prompt={"image": 3, "video": 1})
```

You can go a step further and disable unused modalities completely by setting their limit to zero.
For example, if your application only accepts image input, there is no need to allocate any memory for videos.

```python
from vllm import LLM

# Accept any number of images but no videos
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
          limit_mm_per_prompt={"video": 0})
```

You can even run a multi-modal model for text-only inference:

```python
from vllm import LLM

# Don't accept images. Just text.
llm = LLM(model="google/gemma-3-27b-it",
          limit_mm_per_prompt={"image": 0})
```

## Multi-modal processor arguments

For certain models, you can adjust the multi-modal processor arguments to
reduce the size of the processed multi-modal inputs, which in turn saves memory.

Here are some examples:

```python
from vllm import LLM

# Available for Qwen2-VL series models
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
          mm_processor_kwargs={
              "max_pixels": 768 * 768,  # Default is 1280 * 28 * 28
          })

# Available for InternVL series models
llm = LLM(model="OpenGVLab/InternVL2-2B",
          mm_processor_kwargs={
              "max_dynamic_patch": 4,  # Default is 12
          })
```
File renamed without changes.
@@ -0,0 +1,23 @@

# Model Resolution

vLLM loads HuggingFace-compatible models by inspecting the `architectures` field in `config.json` of the model repository
and finding the corresponding implementation that is registered to vLLM.
Nevertheless, our model resolution may fail for the following reasons:

- The `config.json` of the model repository lacks the `architectures` field.
- Unofficial repositories refer to a model using alternative names which are not recorded in vLLM.
- The same architecture name is used for multiple models, creating ambiguity as to which model should be loaded.

To fix this, explicitly specify the model architecture by passing `config.json` overrides to the `hf_overrides` option.
For example:

```python
from vllm import LLM

model = LLM(
    model="cerebras/Cerebras-GPT-1.3B",
    hf_overrides={"architectures": ["GPT2LMHeadModel"]},  # GPT-2
)
```

Our [list of supported models][supported-models] shows the model architectures that are recognized by vLLM.
docs/performance/optimization.md → docs/configuration/optimization.md (5 changes: 1 addition & 4 deletions)
File renamed without changes.
File renamed without changes.
File renamed without changes.