Conversation

@tvoas tvoas commented May 12, 2025

  • Enables the PP solution with full support for DeepSeek R1 execution with PP>1 (see the launch sketch after this list).
  • Requires 1.21.0 or newer. Does not support 1.20.1 or older.
  • Implementation mirrors "Implement Pipeline Parallelism support for HPU" (#1000) as closely as possible while ensuring DeepSeek R1 functions fully.
  • Adds a benchmark script for sweeping various configs automatically. This can be removed if you feel it shouldn't merge to the deepseek_r1 branch.
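
As a rough illustration of what the PP>1 enablement looks like from the user side, here is a minimal offline-inference sketch. It is not code from this PR; it assumes the fork exposes pipeline_parallel_size through the standard vLLM LLM entrypoint, and the model path and parallel sizes are placeholders.

```python
from vllm import LLM, SamplingParams

# Hypothetical launch sketch: split DeepSeek R1 across two pipeline stages,
# with tensor parallelism inside each stage. Sizes and model path are
# placeholders, not values taken from this PR.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1",
    tensor_parallel_size=8,
    pipeline_parallel_size=2,  # PP>1 path enabled by this PR
)

outputs = llm.generate(
    ["Explain pipeline parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```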

Additional validation is being done by yabai.hu@intel.com.

@czhu15 youlei.yang@intel.com please help start the review in the meantime.


if [ "$KV_CACHE_DTYPE" = "fp8_inc" ]; then
export VLLM_USE_FP8_MATMUL="true"
export VLLM_USE_SINGLE_TENSOR_CACHE="1"

Why do we need to set VLLM_USE_FP8_MATMUL from the application? Shouldn't INC handle it?
And what is VLLM_USE_SINGLE_TENSOR_CACHE for? I can't find it in the vLLM code...

Author

Good catch. This was a leftover from Liu, Yi's FP8 matmul implementation on the PRC local version of this branch. I see that the implementation provided on the deepseek_r1 branch is different, though, and no longer uses VLLM_USE_SINGLE_TENSOR_CACHE.

We still need VLLM_USE_FP8_MATMUL though, right? This environment variable should be set for improved performance according to @yiliu30.

@yiliu30 yiliu30 May 14, 2025

Yes, the VLLM_USE_SINGLE_TENSOR_CACHE flag should be removed, as we have been using one tensor for KVCache since #977.

We still need VLLM_USE_FP8_MATMUL for FP8 Q@K and FP8 A@V, since the current path does not use INC to replace Matmul with PatchedMatmul. Instead, it manually patches Q@K and A@V and uses 1.0 as the scale. Please refer to #977 for more details.

cc @xuechendi
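
For readers following along, a minimal sketch of the manual patching described above (not the actual HabanaAI/vllm-fork code; the function names are hypothetical): when VLLM_USE_FP8_MATMUL is set, Q@K^T and A@V are quantized to FP8 with a fixed scale of 1.0, rather than going through INC's PatchedMatmul.

```python
import os
import torch

# Minimal sketch of the behavior described above; not the fork's real code.
# When VLLM_USE_FP8_MATMUL is set, Q@K^T (and likewise A@V) runs through an
# FP8 matmul with a fixed scale of 1.0 instead of INC's PatchedMatmul.
USE_FP8_MATMUL = os.environ.get("VLLM_USE_FP8_MATMUL", "false").lower() in ("1", "true")

def fp8_matmul(a: torch.Tensor, b: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    # Quantize both operands to FP8 (E4M3) with the fixed scale, then emulate
    # the FP8 GEMM by upcasting; the real path would dispatch to an HPU FP8 kernel.
    a_q = (a * scale).to(torch.float8_e4m3fn)
    b_q = (b * scale).to(torch.float8_e4m3fn)
    out = torch.matmul(a_q.to(a.dtype), b_q.to(b.dtype))
    return out / (scale * scale)

def attention_scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    # Q@K^T takes the FP8 path only when the environment variable is set.
    if USE_FP8_MATMUL:
        return fp8_matmul(q, k.transpose(-1, -2))
    return torch.matmul(q, k.transpose(-1, -2))
```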

@tvoas tvoas force-pushed the enable_pp_g2d_global branch 2 times, most recently from 7cd244f to 2f9b8b1 Compare May 14, 2025 01:54
@tvoas tvoas requested review from czhu15 and xinyu-intel May 14, 2025 01:56
@tvoas tvoas force-pushed the enable_pp_g2d_global branch 3 times, most recently from d8ef146 to 05e298d Compare May 15, 2025 05:46
Co-authored-by: Hu, Yabai <yabai.hu@intel.com>
Co-authored-by: Ji, Kunshang <kunshang.ji@intel.com>
Co-authored-by: Sheng, Yi <yi.sheng@intel.com>
Co-authored-by: Chen, Xinyu <xinyu1.chen@intel.com>
Co-authored-by: Voas, Tanner <tanner.voas@intel.com>
Signed-off-by: Voas, Tanner <tanner.voas@intel.com>
@tvoas tvoas force-pushed the enable_pp_g2d_global branch from 05e298d to ba9bf97 Compare May 15, 2025 07:57
@czhu15 czhu15 merged commit 6767058 into HabanaAI:deepseek_r1 May 15, 2025
1 check failed
@tvoas tvoas deleted the enable_pp_g2d_global branch June 4, 2025 00:22