Refactor pytorch engine #2104
Conversation
# Copyright (c) OpenMMLab. All rights reserved.
import torch

from lmdeploy.pytorch.kernels.cuda import apply_rotary_pos_emb
Here the cuda backend directly uses `apply_rotary_pos_emb` from `lmdeploy.pytorch.kernels.cuda` instead of `apply_rotary_pos_emb` from `lmdeploy.pytorch.kernels`. Does that mean `apply_rotary_pos_emb` in `lmdeploy.pytorch.kernels` will no longer be used, so `lmdeploy/pytorch/kernels/apply_rotary_pos_emb.py` can be deleted? The same question applies to the other kernels.
Yes, that is expected.
@zhulinJulia24 Fixed issues 1 and 2. Unable to reproduce issue 3. Issue 4 is caused by awq_kernels.
@grimoire
@zhulinJulia24 This is expected: cudagraph requires more memory. Use a small
I set
Benchmark results for tp=2 and tp=4. Benchmark result for tp=1.
@@ -212,6 +212,8 @@ class PytorchEngineConfig:
     thread_safe: bool = False
     enable_prefix_caching: bool = False
     device_type: str = 'cuda'
+    eager_mode: bool = False
+    custom_module_map: str = None
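The two new fields above can be exercised through the engine config. Below is a minimal sketch of assumed usage: the model path and the module-map path are placeholders, and only `eager_mode` and `custom_module_map` are taken from the diff.

```python
# Minimal sketch (assumed usage) of the new PytorchEngineConfig options.
from lmdeploy import PytorchEngineConfig, pipeline

engine_config = PytorchEngineConfig(
    eager_mode=True,  # run without cudagraph capture, trading speed for memory
    custom_module_map='/path/to/module_map.py',  # hypothetical path to a map file
)

# 'model/path' is a placeholder for any model the pytorch engine supports.
pipe = pipeline('model/path', backend_config=engine_config)
print(pipe(['Hello, LMDeploy!']))
```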
Why is MODULE_MAP designed as `Dict[str, str]` here instead of `Dict[str, class]`?
`custom_module_map` is a path to a .py file that contains the map and the custom modules.
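To make the reply concrete, such a file could look roughly like the sketch below; the module names, the rewrite class, and the map entries are hypothetical and not taken from this PR.

```python
# my_module_map.py -- hypothetical custom module map file.
import torch


class MyCustomAttention(torch.nn.Module):
    """Hypothetical rewrite of an attention module."""

    def forward(self, hidden_states, *args, **kwargs):
        # a real rewrite would dispatch to the engine's fused kernels here
        return hidden_states


# Keys name the original modules; values point to rewrites defined in this file.
MODULE_MAP = {
    'transformers.models.llama.modeling_llama.LlamaAttention':
    'my_module_map.MyCustomAttention',
}
```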
I mean the internal MODULE_MAP dictionary. Using strings made sense when we were patching models, but now that we load weights and run inference ourselves, keeping strings doesn't have much point. Passing classes directly would be simpler and more direct.
Loading all the classes is slow.
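A small sketch of the trade-off being discussed: with string values the engine can resolve a rewrite class lazily, only when the matching model is actually loaded, instead of importing every class when the map is defined. The helper below is illustrative and not code from this PR.

```python
# Sketch: resolving a 'package.module.ClassName' string lazily via importlib.
import importlib


def load_class_from_string(path: str):
    """Import and return the class only when it is first needed."""
    module_name, _, class_name = path.rpartition('.')
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Nothing is imported until this call; a Dict[str, class] map would force
# every rewrite class to be imported up front.
ordered_dict_cls = load_class_from_string('collections.OrderedDict')
print(ordered_dict_cls)
```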
1. The custom Triton kernel allows us to incorporate new features, such as `paged_attention_fwd`.
2. Fused kernels offer superior performance compared to the pure PyTorch implementation.

class GemmaModelConfigBuilder(AutoModelConfigBuilder):
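For readers unfamiliar with the builder shown above, a fleshed-out version might look like the sketch below; the `condition`/`build` classmethods, the import paths, and the `ModelConfig` fields are assumptions about the interface rather than code copied from this PR.

```python
# Sketch of a model config builder, assuming a condition()/build() interface.
from lmdeploy.pytorch.config import ModelConfig  # assumed import path
from lmdeploy.pytorch.configurations.builder import AutoModelConfigBuilder  # assumed


class GemmaModelConfigBuilder(AutoModelConfigBuilder):

    @classmethod
    def condition(cls, hf_config):
        """Use this builder for gemma checkpoints."""
        return hf_config.model_type == 'gemma'

    @classmethod
    def build(cls, hf_config, model_path: str = None):
        """Translate the HF config into the engine's ModelConfig."""
        return ModelConfig(
            hidden_size=hf_config.hidden_size,
            num_layers=hf_config.num_hidden_layers,
            num_attention_heads=hf_config.num_attention_heads,
            num_key_value_heads=hf_config.num_key_value_heads,
            bos_token_id=hf_config.bos_token_id,
            eos_token_id=hf_config.eos_token_id,
            head_dim=hf_config.head_dim,
            vocab_size=hf_config.vocab_size,
        )
```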
Would it be better to keep the examples consistent? This one uses Gemma while the later one uses llama. Also, could the example end with a complete, runnable Python script that goes from registering the model to running one inference?
We don't have a config builder for llama.
The documentation will be improved in future PRs.
LGTM
Nice to see the PyTorch Engine being refactored. I am looking forward to the performance of the new PyTorch Engine when CUDA Graph is enabled.
Yes, we are working on v0.6.0.
Does the w8a8-triton implementation in lmdeploy have benchmark results showing actual inference speedups on real LLMs such as llama2 or qwen2?
It is hard to switch kernel implementations in the PyTorch Engine, and patching transformers models makes it difficult for us to carry out more aggressive optimizations. This PR refactors the pytorch engine: we add an operator abstraction layer so the engine can select the most suitable operator backend based on the current context.
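A rough sketch of what such an operator abstraction layer can look like is given below; the class and function names are illustrative only and are not the identifiers introduced by this PR.

```python
# Illustrative sketch of an operator abstraction layer that picks a backend
# (e.g. a CUDA/Triton kernel vs. a reference implementation) per context.
from abc import ABC, abstractmethod

import torch


class AttentionBackend(ABC):
    """Abstract op: each backend provides its own implementation."""

    @abstractmethod
    def forward(self, q: torch.Tensor, k: torch.Tensor,
                v: torch.Tensor) -> torch.Tensor:
        ...


class TorchAttention(AttentionBackend):
    """Reference backend built on plain PyTorch ops."""

    def forward(self, q, k, v):
        return torch.nn.functional.scaled_dot_product_attention(q, k, v)


def select_attention_backend(device_type: str) -> AttentionBackend:
    """Pick the most suitable backend for the current context.

    A real engine would return a Triton/CUDA paged-attention backend when
    device_type == 'cuda'; this sketch always falls back to plain PyTorch.
    """
    return TorchAttention()


backend = select_attention_backend('cuda')
q = k = v = torch.randn(1, 8, 16, 64)
print(backend.forward(q, k, v).shape)
```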