Refactor pytorch engine #2104
Conversation
# Copyright (c) OpenMMLab. All rights reserved.
import torch

from lmdeploy.pytorch.kernels.cuda import apply_rotary_pos_emb
Here the cuda backend directly uses `apply_rotary_pos_emb` from `lmdeploy.pytorch.kernels.cuda` instead of `apply_rotary_pos_emb` from `lmdeploy.pytorch.kernels`. Does that mean `apply_rotary_pos_emb` in `lmdeploy.pytorch.kernels` will no longer be used, so `lmdeploy/pytorch/kernels/apply_rotary_pos_emb.py` can be deleted? The same question applies to the other kernels.
Yes, that is expected.
@zhulinJulia24 Fixed issues 1 and 2. Unable to reproduce issue 3. Issue 4 is caused by awq_kernels.
@grimoire
@zhulinJulia24 This is expected: cudagraph requires more memory. Use a small
I set
Benchmark results for tp=2 and tp=4. Benchmark result for tp=1.
@@ -212,6 +212,8 @@ class PytorchEngineConfig:
     thread_safe: bool = False
     enable_prefix_caching: bool = False
     device_type: str = 'cuda'
+    eager_mode: bool = False
+    custom_module_map: str = None
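The two new fields above can be exercised through the engine config. Below is a minimal sketch of assumed usage: the model path and the module-map path are placeholders, and only `eager_mode` and `custom_module_map` are taken from the diff.

```python
# Minimal sketch (assumed usage) of the new PytorchEngineConfig options.
from lmdeploy import PytorchEngineConfig, pipeline

engine_config = PytorchEngineConfig(
    eager_mode=True,  # run without cudagraph capture, trading speed for memory
    custom_module_map='/path/to/module_map.py',  # hypothetical path to a map file
)

# 'model/path' is a placeholder for any model the pytorch engine supports.
pipe = pipeline('model/path', backend_config=engine_config)
print(pipe(['Hello, LMDeploy!']))
```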
Why is MODULE_MAP designed as `Dict[str, str]` here instead of `Dict[str, class]`?
`custom_module_map` is a path to a .py file that contains the map and the custom modules.
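To make the reply concrete, such a file could look roughly like the sketch below; the module names, the rewrite class, and the map entries are hypothetical and not taken from this PR.

```python
# my_module_map.py -- hypothetical custom module map file.
import torch


class MyCustomAttention(torch.nn.Module):
    """Hypothetical rewrite of an attention module."""

    def forward(self, hidden_states, *args, **kwargs):
        # a real rewrite would dispatch to the engine's fused kernels here
        return hidden_states


# Keys name the original modules; values point to rewrites defined in this file.
MODULE_MAP = {
    'transformers.models.llama.modeling_llama.LlamaAttention':
    'my_module_map.MyCustomAttention',
}
```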
I mean the internal MODULE_MAP dictionary. Using strings made sense when we were patching models, but now that we load weights and run inference ourselves, keeping strings doesn't have much point. Passing classes directly would be simpler and more direct.
Loading all the classes is slow.
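A small sketch of the trade-off being discussed: with string values the engine can resolve a rewrite class lazily, only when the matching model is actually loaded, instead of importing every class when the map is defined. The helper below is illustrative and not code from this PR.

```python
# Sketch: resolving a 'package.module.ClassName' string lazily via importlib.
import importlib


def load_class_from_string(path: str):
    """Import and return the class only when it is first needed."""
    module_name, _, class_name = path.rpartition('.')
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Nothing is imported until this call; a Dict[str, class] map would force
# every rewrite class to be imported up front.
ordered_dict_cls = load_class_from_string('collections.OrderedDict')
print(ordered_dict_cls)
```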
1. The custom Triton kernel allows us to incorporate new features, such as `paged_attention_fwd`.
2. Fused kernels offer superior performance compared to the pure PyTorch implementation.

class GemmaModelConfigBuilder(AutoModelConfigBuilder):
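For readers unfamiliar with the builder shown above, a fleshed-out version might look like the sketch below; the `condition`/`build` classmethods, the import paths, and the `ModelConfig` fields are assumptions about the interface rather than code copied from this PR.

```python
# Sketch of a model config builder, assuming a condition()/build() interface.
from lmdeploy.pytorch.config import ModelConfig  # assumed import path
from lmdeploy.pytorch.configurations.builder import AutoModelConfigBuilder  # assumed


class GemmaModelConfigBuilder(AutoModelConfigBuilder):

    @classmethod
    def condition(cls, hf_config):
        """Use this builder for gemma checkpoints."""
        return hf_config.model_type == 'gemma'

    @classmethod
    def build(cls, hf_config, model_path: str = None):
        """Translate the HF config into the engine's ModelConfig."""
        return ModelConfig(
            hidden_size=hf_config.hidden_size,
            num_layers=hf_config.num_hidden_layers,
            num_attention_heads=hf_config.num_attention_heads,
            num_key_value_heads=hf_config.num_key_value_heads,
            bos_token_id=hf_config.bos_token_id,
            eos_token_id=hf_config.eos_token_id,
            head_dim=hf_config.head_dim,
            vocab_size=hf_config.vocab_size,
        )
```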
Would it be better to keep the examples consistent? This one uses Gemma while the later one uses llama. Also, could the example end with a complete, runnable Python script that goes from registering the model to running one inference?
We don't have a config builder for llama.
The documentation will be improved in future PRs.
LGTM
Nice to see the PyTorch Engine being refactored. I am looking forward to the performance of the new PyTorch Engine when CUDA Graph is enabled.
Yes, we are working on v0.6.0.
Does the w8a8-triton implementation in lmdeploy have benchmark results showing actual inference speedups on real LLMs such as llama2 or qwen2?
It is hard to switch kernel implementations in the PyTorch Engine, and patching transformers models makes it difficult for us to carry out more aggressive optimizations. This PR refactors the pytorch engine: we add an operator abstraction layer so the engine can select the most suitable operator backend based on the current context.
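A rough sketch of what such an operator abstraction layer can look like is given below; the class and function names are illustrative only and are not the identifiers introduced by this PR.

```python
# Illustrative sketch of an operator abstraction layer that picks a backend
# (e.g. a CUDA/Triton kernel vs. a reference implementation) per context.
from abc import ABC, abstractmethod

import torch


class AttentionBackend(ABC):
    """Abstract op: each backend provides its own implementation."""

    @abstractmethod
    def forward(self, q: torch.Tensor, k: torch.Tensor,
                v: torch.Tensor) -> torch.Tensor:
        ...


class TorchAttention(AttentionBackend):
    """Reference backend built on plain PyTorch ops."""

    def forward(self, q, k, v):
        return torch.nn.functional.scaled_dot_product_attention(q, k, v)


def select_attention_backend(device_type: str) -> AttentionBackend:
    """Pick the most suitable backend for the current context.

    A real engine would return a Triton/CUDA paged-attention backend when
    device_type == 'cuda'; this sketch always falls back to plain PyTorch.
    """
    return TorchAttention()


backend = select_attention_backend('cuda')
q = k = v = torch.randn(1, 8, 16, 64)
print(backend.forward(q, k, v).shape)
```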