Add an inplace concat custom op based on CUDA VMM API (resubmitted) #9320

Open: lszxb (Contributor) wants to merge 24 commits into base: develop

@lszxb lszxb commented Oct 28, 2024

PR types

Performance optimization

PR changes

Others

Description

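This PR attempts to add inplace concat support, based on the CUDA VMM API, to the current large-model inference path (the principle is similar to vAttention), so that the entire KV cache no longer has to be copied at every decoding step.
For now only the custom op is implemented; related passes still need to be added in the future so that other models can be adapted automatically.
The PR currently applies this approach to the llama model, where it gives roughly a 10% improvement with 3072 input + 1024 output tokens.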
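
To make the motivation concrete, here is a rough counting sketch of how many cache elements get written over a whole generation when every step rebuilds the cache with concat, versus when only the new token is appended. The function names and the `per_token` parameter are hypothetical; this is a counting model only, not a measurement, and it does not by itself predict the end-to-end speedup quoted above.

```python
def copied_elements_concat(prompt_len: int, new_tokens: int, per_token: int = 1) -> int:
    """Elements written if each decode step allocates a new cache via concat."""
    total = 0
    for step in range(new_tokens):
        # concat copies the whole existing cache plus the one new token
        total += (prompt_len + step + 1) * per_token
    return total


def copied_elements_inplace(prompt_len: int, new_tokens: int, per_token: int = 1) -> int:
    """Elements written if each decode step only appends the new token."""
    # prompt_len is unused: the existing cache is never copied
    return new_tokens * per_token


concat_total = copied_elements_concat(3072, 1024)
inplace_total = copied_elements_inplace(3072, 1024)
```

In this model the concat path grows quadratically with the number of generated tokens, while the append path grows linearly.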

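The main idea at the moment is:

  • Use a special Tensor whose device memory is allocated through the VMM API. This Tensor uses a special phi::Allocation that reserves a large virtual address range at creation time and can allocate physical pages and map them into that range when needed.
  • To stay compatible with the remaining calls, the cache shape is batch x seq_len x num_head x head_dim, but because new states are appended at the tail of the cache, its memory layout should be seq_len x batch x num_head x head_dim.

The semantics of the vtensor_reserve_one_token custom op are roughly as follows:

  • If key_cache is not a VTensor, allocate a new VTensor and copy the data from the original key_cache into it. Then use the VTensor's growth mechanism to reserve space for one new token at the tail, and copy key_states into that space.
  • If key_cache is already a VTensor, directly use the VTensor's growth mechanism to reserve space for one new token at the tail, and copy key_states into that space.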
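
The two cases above can be sketched with a pure-Python toy model. Nothing here touches CUDA: `VTensorModel`, `append_token`, `reserve_one_token_model`, and the default sizes are all hypothetical names and values chosen for illustration, and a flat list stands in for the mapped device memory. It only mirrors the described behaviour (reserve once, map blocks on demand, append at the tail of a seq_len-major layout).

```python
from typing import List, Sequence, Union


class VTensorModel:
    """Toy stand-in for a VMM-backed tensor (hypothetical, no CUDA involved)."""

    def __init__(self, token_elems: int, virtual_elems: int = 1 << 20,
                 block_elems: int = 1 << 10) -> None:
        self.token_elems = token_elems      # batch * num_head * head_dim
        self.virtual_elems = virtual_elems  # address range reserved up front
        self.block_elems = block_elems      # granularity of physical mapping
        self.mapped_blocks = 0              # physical blocks mapped so far
        self.seq_len = 0
        self.data: List[float] = []         # models mapped memory, seq_len-major

    def append_token(self, state: Sequence[float]) -> None:
        if len(state) != self.token_elems:
            raise ValueError("state must hold exactly one token")
        needed = (self.seq_len + 1) * self.token_elems
        if needed > self.virtual_elems:
            raise MemoryError("reserved virtual range exhausted")
        # Map extra physical blocks only when the tail crosses a block boundary.
        while self.mapped_blocks * self.block_elems < needed:
            self.mapped_blocks += 1
        # seq_len is the outermost dimension, so the new token is one
        # contiguous write at the tail and existing data is never moved.
        self.data.extend(state)
        self.seq_len += 1


def reserve_one_token_model(
    cache: Union[VTensorModel, Sequence[Sequence[float]]],
    state: Sequence[float],
) -> VTensorModel:
    """Model of the described op semantics; `cache` is a list of per-token states
    when it is not yet a VTensorModel."""
    if not isinstance(cache, VTensorModel):
        # One-time conversion: copy the existing cache into a new VTensorModel.
        converted = VTensorModel(token_elems=len(state))
        for token in cache:
            converted.append_token(token)
        cache = converted
    cache.append_token(state)
    return cache


plain_cache = [[1.0, 2.0], [3.0, 4.0]]
vt = reserve_one_token_model(plain_cache, [5.0, 6.0])   # converts, then appends
vt_again = reserve_one_token_model(vt, [7.0, 8.0])      # appends in place
```

After the first call the cache is converted once; every later call reuses the same object, which is the property the real op depends on.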

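Possible issues at the moment:

  • Only appending space for 1 token at a time is supported.
  • The size of the reserved virtual address space and the block size are currently fixed (1 GiB and 32 MiB); would it be better to expose an API that lets users tune them?
  • The input and output key_cache share the same memory, which may cause conflicts in some cases.
  • The method relies on every step using the same Tensor for the KV cache. If some other operation changes the KV cache Tensor (for example, cloning it into another Tensor), the optimization stops working, so it also has to be combined with the optimization in this PR (the assign_out_ operation causes a copy).
  • Device memory allocated by this op cannot be managed uniformly by the existing Allocator.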
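
To get a feel for what the fixed 1 GiB / 32 MiB values imply, the arithmetic can be sketched as below. Only the two sizes come from the description; the helper names and the example shape (batch 1, 32 heads, head_dim 128, 2-byte elements) are illustrative assumptions, and the numbers are for a single cache tensor.

```python
GIB = 1 << 30
MIB = 1 << 20

VIRTUAL_BYTES = 1 * GIB   # reserved virtual address range (from the description)
BLOCK_BYTES = 32 * MIB    # physical block size (from the description)


def token_bytes(batch: int, num_head: int, head_dim: int, elem_bytes: int) -> int:
    """Bytes appended to one cache tensor per decoded token."""
    return batch * num_head * head_dim * elem_bytes


def max_tokens(virtual_bytes: int, per_token: int) -> int:
    """Longest sequence that fits in the reserved virtual range."""
    return virtual_bytes // per_token


def blocks_needed(seq_len: int, per_token: int, block_bytes: int) -> int:
    """Physical blocks that must be mapped to hold seq_len tokens (ceiling division)."""
    return -(-(seq_len * per_token) // block_bytes)


# Assumed example shape, not taken from the PR.
per_token = token_bytes(batch=1, num_head=32, head_dim=128, elem_bytes=2)
limit = max_tokens(VIRTUAL_BYTES, per_token)
blocks_at_4096 = blocks_needed(4096, per_token, BLOCK_BYTES)
blocks_at_4097 = blocks_needed(4097, per_token, BLOCK_BYTES)
```

Because the per-token size scales with the batch size, the sequence-length limit implied by a fixed virtual range shrinks as the batch grows, which is one reason a tunable setting might be useful.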

This PR also includes the content of the following two PRs:

paddle-bot bot commented Oct 28, 2024

Thanks for your contribution!

@@ -0,0 +1,71 @@
#include "paddle/extension.h"
Collaborator

Please add the copyright notice at the top of the newly added files.

Contributor Author

Added.

const paddle::Tensor& append_state,
bool transposed_input
) {
// std::cout << "vtensor_reserve_one_token 1 " << (uintptr_t)cache_transposed.data() << std::endl;
Collaborator

Commented-out debug lines like this one can all be removed.

Comment on lines 21 to 23
"./gpu/pass/remove_assign_out_pass.cc",
"./gpu/pass/apply_vtensor_concat_pass.cc",
"./gpu/vtensor.cu", # TODO: this haven't tested with hip
Collaborator

This file does not need to be changed; for now it is fine to use this on GPU only.

Contributor Author

OK, I have removed these lines.

Comment on lines 336 to 341
if is_paddlenlp_ops_available():
    import paddlenlp_ops
    inference_config.enable_custom_passes([
        "remove_assign_out_pass",  # remove the assign_out_ op at the end of while loop
        "apply_vtensor_concat_pass",  # replace concat op with vtensor implementation
    ])
Collaborator

Suggested change
- if is_paddlenlp_ops_available():
-     import paddlenlp_ops
-     inference_config.enable_custom_passes([
-         "remove_assign_out_pass",  # remove the assign_out_ op at the end of while loop
-         "apply_vtensor_concat_pass",  # replace concat op with vtensor implementation
-     ])
+ try:
+     from paddlenlp_ops import remove_assign_out_pass, apply_vtensor_concat_pass
+     inference_config.enable_custom_passes([
+         "remove_assign_out_pass",  # remove the assign_out_ op at the end of while loop
+         "apply_vtensor_concat_pass",  # replace concat op with vtensor implementation
+     ])
+ except ImportError:
+     pass

Let's change it like this.

Contributor Author

paddlenlp_ops does not contain any pass objects, so I switched the check to the newly added op vtensor_reserve_one_token.
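
One way such a check could look is sketched below. `has_custom_op` is a hypothetical helper, not code from this PR; it only assumes that the compiled custom-op module exposes each op as a callable attribute, which is how the reply above describes paddlenlp_ops.

```python
import importlib


def has_custom_op(module_name: str, op_name: str) -> bool:
    """Return True if `module_name` can be imported and exposes a callable `op_name`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Module not built or not installed: treat the op as unavailable.
        return False
    return callable(getattr(module, op_name, None))


# Intended use (assumption: the passes are registered when paddlenlp_ops loads):
# if has_custom_op("paddlenlp_ops", "vtensor_reserve_one_token"):
#     inference_config.enable_custom_passes([
#         "remove_assign_out_pass",
#         "apply_vtensor_concat_pass",
#     ])
```

Checking for the op rather than catching every exception keeps unrelated failures inside enable_custom_passes visible.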

codecov bot commented Oct 28, 2024

Codecov Report

Attention: Patch coverage is 28.57143% with 5 lines in your changes missing coverage. Please review.

Project coverage is 52.24%. Comparing base (81f5ab5) to head (d5a9393).
Report is 204 commits behind head on develop.

Files with missing lines                | Patch % | Lines
paddlenlp/generation/utils.py           | 0.00%   | 4 Missing ⚠️
paddlenlp/generation/logits_process.py  | 66.66%  | 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9320      +/-   ##
===========================================
- Coverage    52.92%   52.24%   -0.69%     
===========================================
  Files          661      671      +10     
  Lines       107069   109655    +2586     
===========================================
+ Hits         56670    57288     +618     
- Misses       50399    52367    +1968     

@yuanlehome yuanlehome requested review from ZHUI and DesmonDay October 28, 2024 05:29
This Pull Request is stale because it has been open for 60 days with no activity.

@github-actions github-actions bot added the stale label Dec 28, 2024
3 participants