diff --git a/docs/source/getting_started/installation/cpu.md b/docs/source/getting_started/installation/cpu.md
index 65af7b50bdc1..1b2ffd619994 100644
--- a/docs/source/getting_started/installation/cpu.md
+++ b/docs/source/getting_started/installation/cpu.md
@@ -193,7 +193,7 @@ vLLM CPU backend supports the following vLLM features:
 
 ## Related runtime environment variables
 
-- `VLLM_CPU_KVCACHE_SPACE`: specify the KV Cache size (e.g, `VLLM_CPU_KVCACHE_SPACE=40` means 40 GB space for KV cache), larger setting will allow vLLM running more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users.
+- `VLLM_CPU_KVCACHE_SPACE`: specify the KV Cache size (e.g, `VLLM_CPU_KVCACHE_SPACE=40` means 40 GiB space for KV cache), larger setting will allow vLLM running more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users.
 - `VLLM_CPU_OMP_THREADS_BIND`: specify the CPU cores dedicated to the OpenMP threads. For example, `VLLM_CPU_OMP_THREADS_BIND=0-31` means there will be 32 OpenMP threads bound on 0-31 CPU cores. `VLLM_CPU_OMP_THREADS_BIND=0-31|32-63` means there will be 2 tensor parallel processes, 32 OpenMP threads of rank0 are bound on 0-31 CPU cores, and the OpenMP threads of rank1 are bound on 32-63 CPU cores.
 - `VLLM_CPU_MOE_PREPACK`: whether to use prepack for MoE layer. This will be passed to `ipex.llm.modules.GatedMLPMOE`. Default is `1` (True). On unsupported CPUs, you might need to set this to `0` (False).
 
diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/source/getting_started/v1_user_guide.md
index 3e54022eebb2..26b28c04fe73 100644
--- a/docs/source/getting_started/v1_user_guide.md
+++ b/docs/source/getting_started/v1_user_guide.md
@@ -156,6 +156,9 @@ vLLM V1 is currently optimized for decoder-only transformers. Models requiring
 
 For a complete list of supported models, see the [list of supported models](https://docs.vllm.ai/en/latest/models/supported_models.html).
 
-## FAQ
+## Frequently Asked Questions
 
-TODO
+**I'm using vLLM V1 and I'm getting CUDA OOM errors. What should I do?**
+The default `max_num_seqs` has been raised from `256` in V0 to `1024` in V1. If you encounter CUDA OOM only when using V1 engine, try setting a lower value of `max_num_seqs` or `gpu_memory_utilization`.
+
+On the other hand, if you get an error about insufficient memory for the cache blocks, you should increase `gpu_memory_utilization` as this indicates that your GPU has sufficient memory but you're not allocating enough to vLLM for KV cache blocks.
diff --git a/docs/source/serving/engine_args.md b/docs/source/serving/engine_args.md
index f4587b94edea..e9943571a40a 100644
--- a/docs/source/serving/engine_args.md
+++ b/docs/source/serving/engine_args.md
@@ -2,7 +2,12 @@
 
 # Engine Arguments
 
-Below, you can find an explanation of every engine argument for vLLM:
+Engine arguments control the behavior of the vLLM engine.
+
+- For [offline inference](#offline-inference), they are part of the arguments to `LLM` class.
+- For [online serving](#openai-compatible-server), they are part of the arguments to `vllm serve`.
+
+Below, you can find an explanation of every engine argument:
 
 ```{eval-rst}
@@ -15,7 +20,7 @@ Below, you can find an explanation of every engine argument for vLLM:
 
 ## Async Engine Arguments
 
-Below are the additional arguments related to the asynchronous engine:
+Additional arguments are available to the asynchronous engine which is used for online serving:
 
 ```{eval-rst}
diff --git a/docs/source/serving/offline_inference.md b/docs/source/serving/offline_inference.md
index ded57500c5d0..7bf1c08828db 100644
--- a/docs/source/serving/offline_inference.md
+++ b/docs/source/serving/offline_inference.md
@@ -97,6 +97,13 @@ llm = LLM(model="adept/fuyu-8b", max_num_seqs=2)
 ```
 
+#### Adjust cache size
+
+If you run out of CPU RAM, try the following options:
+
+- (Multi-modal models only) you can set the size of multi-modal input cache using `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
+- (CPU backend only) you can set the size of KV cache using `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
+
 ### Performance optimization and tuning
 
 You can potentially improve the performance of vLLM by finetuning various options.
diff --git a/vllm/envs.py b/vllm/envs.py
index 829f9450fb77..7dc468f8773a 100644
--- a/vllm/envs.py
+++ b/vllm/envs.py
@@ -339,7 +339,7 @@ def maybe_convert_int(value: Optional[str]) -> Optional[int]:
     lambda: os.getenv("VLLM_PP_LAYER_PARTITION", None),
 
     # (CPU backend only) CPU key-value cache space.
-    # default is 4GB
+    # default is 4 GiB
     "VLLM_CPU_KVCACHE_SPACE":
     lambda: int(os.getenv("VLLM_CPU_KVCACHE_SPACE", "0")),
 
@@ -411,9 +411,9 @@ def maybe_convert_int(value: Optional[str]) -> Optional[int]:
     lambda: int(os.getenv("VLLM_AUDIO_FETCH_TIMEOUT", "10")),
 
     # Cache size (in GiB) for multimodal input cache
-    # Default is 8GiB
+    # Default is 4 GiB
     "VLLM_MM_INPUT_CACHE_GIB":
-    lambda: int(os.getenv("VLLM_MM_INPUT_CACHE_GIB", "8")),
+    lambda: int(os.getenv("VLLM_MM_INPUT_CACHE_GIB", "4")),
 
     # Path to the XLA persistent cache directory.
     # Only used for XLA devices such as TPUs.
diff --git a/vllm/platforms/cpu.py b/vllm/platforms/cpu.py
index 40eacfd080e1..4b10b298dce1 100644
--- a/vllm/platforms/cpu.py
+++ b/vllm/platforms/cpu.py
@@ -92,7 +92,7 @@ def check_and_update_config(cls, vllm_config: VllmConfig) -> None:
             if kv_cache_space == 0:
                 cache_config.cpu_kvcache_space_bytes = 4 * GiB_bytes  # type: ignore
                 logger.warning(
-                    "Environment variable VLLM_CPU_KVCACHE_SPACE (GB) "
+                    "Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) "
                     "for CPU backend is not set, using 4 by default.")
             else:
                 cache_config.cpu_kvcache_space_bytes = kv_cache_space * GiB_bytes  # type: ignore # noqa
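
Reviewer note: the GB to GiB wording change matches what the code already does, since the value is multiplied by `GiB_bytes`. The following is a minimal, standalone sketch of that resolution logic, not the vLLM implementation. The helper name `resolve_cpu_kvcache_bytes`, the local `GiB_bytes` constant, and the rejection of negative values are assumptions made for illustration; only the `"0"` fallback and the 4 GiB default come from the diff above.

```python
import os

# Assumption: 1 GiB = 2**30 bytes, defined locally so the sketch is standalone.
GiB_bytes = 1 << 30

# Default taken from the diff: 4 GiB when the variable is unset (parsed as 0).
DEFAULT_KVCACHE_GIB = 4


def resolve_cpu_kvcache_bytes(environ=os.environ) -> int:
    """Hypothetical helper: map VLLM_CPU_KVCACHE_SPACE (in GiB) to bytes."""
    # An unset variable is read as "0", as in vllm/envs.py in the diff.
    kv_cache_space = int(environ.get("VLLM_CPU_KVCACHE_SPACE", "0"))
    if kv_cache_space < 0:
        # Assumption: negative sizes are rejected; this is not shown in the diff.
        raise ValueError("VLLM_CPU_KVCACHE_SPACE must be non-negative")
    if kv_cache_space == 0:
        return DEFAULT_KVCACHE_GIB * GiB_bytes
    return kv_cache_space * GiB_bytes


if __name__ == "__main__":
    print(resolve_cpu_kvcache_bytes({}))  # 4294967296
    print(resolve_cpu_kvcache_bytes({"VLLM_CPU_KVCACHE_SPACE": "40"}))  # 42949672960
```

With `VLLM_CPU_KVCACHE_SPACE=40` the result is 42,949,672,960 bytes, roughly 7% more than 40 decimal gigabytes, which is why the unit in the docs and the warning message matters.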