2 changes: 1 addition & 1 deletion docker/Dockerfile
@@ -381,7 +381,7 @@ RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist
 # Install FlashInfer from source
 ARG FLASHINFER_GIT_REPO="https://github.com/flashinfer-ai/flashinfer.git"
 # Keep this in sync with "flashinfer" extra in setup.py
-ARG FLASHINFER_GIT_REF="v0.3.1"
+ARG FLASHINFER_GIT_REF="v0.4.0rc1"
 # Flag to control whether to compile FlashInfer AOT kernels
 # Set to "true" to enable AOT compilation:
 # docker build --build-arg FLASHINFER_AOT_COMPILE=true ...
4 changes: 2 additions & 2 deletions docker/Dockerfile.nightly_torch
@@ -246,15 +246,15 @@ RUN pip install setuptools==75.6.0 packaging==23.2 ninja==1.11.1.3 build==1.2.2.
 
 
 # build flashinfer for torch nightly from source around 10 mins
-# release version: v0.3.1
+# release version: v0.4.0rc1
 # todo(elainewy): cache flashinfer build result for faster build
 ENV CCACHE_DIR=/root/.cache/ccache
 RUN --mount=type=cache,target=/root/.cache/ccache \
     --mount=type=cache,target=/root/.cache/uv \
     echo "git clone flashinfer..." \
     && git clone --recursive https://github.com/flashinfer-ai/flashinfer.git \
     && cd flashinfer \
-    && git checkout v0.3.1 \
+    && git checkout v0.4.0rc1 \
     && git submodule update --init --recursive \
     && echo "finish git clone flashinfer..." \
     && rm -rf build \
2 changes: 1 addition & 1 deletion setup.py
@@ -662,7 +662,7 @@ def _read_requirements(filename: str) -> list[str]:
         "mistral_common[audio]"],  # Required for audio processing
     "video": [],  # Kept for backwards compatibility
     # FlashInfer should be updated together with the Dockerfile
-    "flashinfer": ["flashinfer-python==0.3.1"],
+    "flashinfer": ["flashinfer-python==0.4.0rc1"],
     # Optional deps for AMD FP4 quantization support
     "petit-kernel": ["petit-kernel"],
 },
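One background note on this pin, not part of the diff itself: 0.4.0rc1 is a pre-release, and Python packaging tools exclude pre-releases from ordinary version ranges unless they are pinned exactly or explicitly opted into. A small sketch using the `packaging` library (assumed installed; it implements the same version semantics pip uses) illustrates why the exact `==0.4.0rc1` pin is needed:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

v = Version("0.4.0rc1")

# rc1 marks this as a pre-release under PEP 440 versioning rules.
print(v.is_prerelease)                                   # True

# A plain range silently skips pre-releases...
print(SpecifierSet(">=0.3.1").contains(v))               # False

# ...but an exact pin matches them, as does explicit opt-in.
print(SpecifierSet("==0.4.0rc1").contains(v))            # True
print(SpecifierSet(">=0.3.1").contains(v, prereleases=True))  # True
```

This is why the extra in setup.py pins the exact release-candidate version rather than using a lower-bound specifier.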
5 changes: 4 additions & 1 deletion vllm/v1/attention/backends/flashinfer.py
@@ -1085,7 +1085,7 @@ def fast_plan_decode(
     qo_indptr_host = _get_range_buf(batch_size + 1, "cpu")
 
     try:
-        # Make sure we pass exactly 15 arguments for tensor core version
+        # Make sure we pass exactly 18 arguments for tensor core version
Contributor review comment (critical):
While you've correctly updated the internal _cached_module.plan call for the new flashinfer version, the corresponding public self.plan method call (at lines 1025-1043) seems to have been missed. This public method is used for the initial warm-up when CUDA graphs are enabled.

If the public plan API also changed (which is highly likely given the internal API change), this will cause a TypeError at runtime during the warm-up. Please update the call at lines 1025-1043 to include any new arguments. Based on the changes to the internal call, it's likely that arguments such as window_right and allow_fp16_qk_reduction need to be added.
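The failure mode this comment describes, a callee gaining extra positional parameters between releases so that a stale call site raises `TypeError` at runtime, can be caught early with a generic arity check. The sketch below is hedged illustration, not vLLM or FlashInfer code; `plan_old` and `plan_new` are hypothetical stand-ins for the 15- and 18-argument variants:

```python
import inspect


def accepts_positionally(fn, n: int) -> bool:
    """Return True if fn can be called with exactly n positional arguments."""
    params = [
        p for p in inspect.signature(fn).parameters.values()
        if p.kind in (p.POSITIONAL_ONLY, p.POSITIONAL_OR_KEYWORD)
    ]
    required = sum(1 for p in params if p.default is p.empty)
    return required <= n <= len(params)


# Hypothetical stand-ins for the old and new plan() signatures:
def plan_old(workspace, indptr, batch_size): ...


def plan_new(workspace, indptr, batch_size,
             window_left, window_right, allow_fp16_qk_reduction): ...
```

Running such a check once at startup (e.g. against the cached module's `plan` before the warm-up call) would turn the runtime `TypeError` into an explicit version-mismatch error when the library is upgraded without updating every call site.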

 self._plan_info = self._cached_module.plan(
     self._float_workspace_buffer,
     self._int_workspace_buffer,
@@ -1102,6 +1102,9 @@
     head_dim,
     head_dim,
     False,  # causal
+    window_left,
+    -1,
+    False,
 )
 except Exception as e:
     raise RuntimeError(f"Error in tensor core plan: {e}") from e