[Performance] Avoid cuda sync in postprocess of LLM decoding #9011
### PR types

Performance optimization

### PR changes

Others

### Description
Currently, the postprocessing stage of every decoding step in LLM decoding runs a large number of small, fragmented operators, which incurs substantial kernel-launch overhead on the CPU side and leaves GPU utilization low. Normally, the asynchronous nature of CUDA kernel execution lets the CPU launch kernels ahead of the device and hide this overhead, but the current postprocessing implementation contains many operations that trigger CUDA synchronization, so the overhead cannot be hidden and performance degrades. As shown in the figure, the red box contains a large number of fragmented kernels and GPU utilization is low:
This PR rewrites the postprocessing code to avoid CUDA synchronization. In tests on the llama model it yields roughly a 3% performance improvement. As shown in the figure, the fragmented kernels in the red box now take far less time, since the CPU-side kernel-launch overhead is no longer exposed:
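To illustrate the kind of rewrite involved (a hypothetical Paddle-flavored sketch, not the actual diff — the variable names and surrounding logic are invented), replacing device tensors built from Python scalars with plain scalar arguments keeps the operation fully asynchronous:

```
# Before: to_tensor copies each Python scalar to the GPU, and the
# host-to-device transfer stalls the CPU instead of letting it run
# ahead and queue up the next kernels.
lo = paddle.to_tensor(0)
hi = paddle.to_tensor(vocab_size - 1)
next_ids = paddle.minimum(paddle.maximum(next_ids, lo), hi)

# After: paddle.clip accepts Python scalars, which are embedded in the
# kernel launch itself, so no transfer or synchronization is needed.
next_ids = paddle.clip(next_ids, min=0, max=vocab_size - 1)
```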
The synchronization-triggering operations this PR identifies and fixes include:

- Calling `full`-related operators in the following form triggers a blocking cudaMemcpy.
- The sampling operator caused a pageable cudaMemcpyAsync, which synchronizes; it is replaced with an equivalent implementation that samples via the Gumbel-Max trick.
- `to_tensor` caused synchronization; it is replaced with `clip`, and `top_k` is changed to a CPU Tensor.
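The Gumbel-Max trick behind the sampling rewrite works as follows: adding independent Gumbel(0, 1) noise to the logits and taking the argmax draws from the softmax distribution using only elementwise ops and a reduction, so no CDF inversion or host-side copy is required. A minimal framework-free sketch in plain Python (the helper name and toy distribution are invented for illustration):

```python
import math
import random

def gumbel_max_sample(logits, rng):
    # g = -log(-log(U)) with U ~ Uniform(0, 1) is a Gumbel(0, 1) draw;
    # argmax_i (logits_i + g_i) is distributed as softmax(logits).
    noisy = [l - math.log(-math.log(rng.random())) for l in logits]
    return max(range(len(noisy)), key=noisy.__getitem__)

# Sanity check: empirical frequencies should approach softmax(logits).
rng = random.Random(0)
probs = (0.1, 0.2, 0.7)
logits = [math.log(p) for p in probs]
counts = [0, 0, 0]
for _ in range(20000):
    counts[gumbel_max_sample(logits, rng)] += 1
freqs = [c / 20000 for c in counts]
```

On GPU, the same pattern (elementwise noise plus an on-device argmax) never leaves the device, which is what removes the synchronization.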