
[Bug]: When batching inputs for inference, two forward passes on the same input show very different latency #1138

Open
4 tasks done
CSammyfd opened this issue Dec 20, 2024 · 1 comment

Comments

@CSammyfd

Model Series

Qwen2.5

What are the models used?

Qwen2.5-0.5B-Instruct

What is the scenario where the problem happened?

inference with transformers

Is this a known issue?

  • I have followed the GitHub README.
  • I have checked the Qwen documentation and cannot find an answer there.
  • I have checked the documentation of the related framework and cannot find useful information.
  • I have searched the issues and there is not a similar one.

Information about environment

OS: Ubuntu 22.04
Python: Python 3.10.0
GPUs: 1 x NVIDIA A10
CUDA compiler: 12.2
PyTorch: 2.4.0

accelerate 1.2.1
aiohappyeyeballs 2.4.4
aiohttp 3.11.10
aiosignal 1.3.1
annotated-types 0.7.0
anyio 4.7.0
asttokens 3.0.0
async-timeout 5.0.1
attrs 24.2.0
auto_fp8 0.1.0 /juicefs-algorithm/workspace/acg/yuyang_chen/codebases/AutoFP8
certifi 2024.8.30
charset-normalizer 3.4.0
click 8.1.7
cloudpickle 3.1.0
datasets 3.2.0
decorator 5.1.1
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
einops 0.8.0
exceptiongroup 1.2.2
executing 2.1.0
fastapi 0.115.6
filelock 3.16.1
frozenlist 1.5.0
fsspec 2024.9.0
gguf 0.10.0
h11 0.14.0
httpcore 1.0.7
httptools 0.6.4
httpx 0.28.1
huggingface-hub 0.26.5
idna 3.10
importlib_metadata 8.5.0
interegular 0.3.3
ipdb 0.13.13
ipython 8.30.0
jedi 0.19.2
Jinja2 3.1.4
jiter 0.8.2
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
lark 1.2.2
llvmlite 0.43.0
lm-format-enforcer 0.10.6
MarkupSafe 3.0.2
matplotlib-inline 0.1.7
mistral_common 1.5.1
mpmath 1.3.0
msgpack 1.1.0
msgspec 0.18.6
multidict 6.1.0
multiprocess 0.70.16
nest-asyncio 1.6.0
networkx 3.4.2
numba 0.60.0
numpy 1.26.4
nvidia-cublas-cu11 11.10.3.66
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu11 8.5.0.96
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-ml-py 12.560.30
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.6.85
nvidia-nvtx-cu12 12.1.105
openai 1.57.2
outlines 0.0.46
packaging 24.2
pandas 2.2.3
parso 0.8.4
partial-json-parser 0.2.1.1.post4
pexpect 4.9.0
pillow 10.4.0
pip 24.3.1
prometheus_client 0.21.1
prometheus-fastapi-instrumentator 7.0.0
prompt_toolkit 3.0.48
propcache 0.2.1
protobuf 5.29.1
psutil 6.1.0
ptyprocess 0.7.0
pure_eval 0.2.3
py-cpuinfo 9.0.0
pyairports 2.1.1
pyarrow 18.1.0
pycountry 24.6.1
pydantic 2.10.3
pydantic_core 2.27.1
Pygments 2.18.0
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
pytz 2024.2
PyYAML 6.0.2
pyzmq 26.2.0
ray 2.40.0
referencing 0.35.1
regex 2024.11.6
requests 2.32.3
rpds-py 0.22.3
safetensors 0.4.5
sentencepiece 0.2.0
setuptools 75.6.0
six 1.17.0
sniffio 1.3.1
stack-data 0.6.3
starlette 0.41.3
sympy 1.13.3
tiktoken 0.7.0
tokenizers 0.15.2
tomli 2.2.1
torch 2.4.0
torchvision 0.19.0
tqdm 4.67.1
traitlets 5.14.3
transformers 4.39.2
triton 3.0.0
typing_extensions 4.12.2
tzdata 2024.2
urllib3 2.2.3
uvicorn 0.32.1
uvloop 0.21.0
vllm 0.6.2
watchfiles 1.0.3
wcwidth 0.2.13
websockets 14.1
wheel 0.45.1
xformers 0.0.27.post2
xxhash 3.5.0
yarl 1.18.3
zipp 3.21.0

Log output

Inference timing log with _prepare_4d_causal_attention_mask_for_sdpa NOT commented out (last for loop only)

0
preparing posid: 0.0002353191375732422
preparing attmask: 0.0007176399230957031                # -> fast
infer hiddenstates: 0.016635417938232422
1
preparing posid: 5.435943603515625e-05
preparing attmask: 2.227926254272461                   # -> very slow
infer hiddenstates: 0.01725482940673828
2
preparing posid: 8.034706115722656e-05
preparing attmask: 2.245957374572754
infer hiddenstates: 0.01728224754333496
3
preparing posid: 9.560585021972656e-05
preparing attmask: 2.264934778213501
infer hiddenstates: 0.01842355728149414
4
preparing posid: 0.00010800361633300781
preparing attmask: 2.2594733238220215
infer hiddenstates: 0.017111539840698242
5
preparing posid: 7.62939453125e-05
preparing attmask: 2.2598087787628174
infer hiddenstates: 0.016722917556762695

Inference timing log with _prepare_4d_causal_attention_mask_for_sdpa commented out (last for loop only)
0
preparing posid: 0.0001976490020751953
preparing attmask: 1.430511474609375e-06
infer hiddenstates: 0.01624274253845215
1
preparing posid: 5.269050598144531e-05
preparing attmask: 4.76837158203125e-07
infer hiddenstates: 1.80722975730896      # the long latency has moved into the inference step
2
preparing posid: 0.0023190975189208984
preparing attmask: 2.384185791015625e-07
infer hiddenstates: 2.2761189937591553
3
preparing posid: 0.0023317337036132812
preparing attmask: 2.384185791015625e-07
infer hiddenstates: 2.2766754627227783
4
preparing posid: 0.0023233890533447266
preparing attmask: 2.384185791015625e-07
infer hiddenstates: 2.27870512008667
5
preparing posid: 0.002339601516723633
preparing attmask: 2.384185791015625e-07
infer hiddenstates: 2.2783405780792236

Description

Background

The input text is fixed and only a single forward pass is needed to obtain the logits, but in production there will be many concurrent requests, so both latency and GPU memory usage have to be kept low.
Previously with vLLM, setting prompt_logprobs caused an out-of-memory error already at batch=2, so vLLM cannot be used for deployment for now and I switched to transformers.
However, the script shows some strange behavior:
1) With batch=20, the first forward pass takes only about 0.1 s, while from the second pass onward the time rises to roughly 2.5 s.
At that point, _prepare_4d_causal_attention_mask_for_sdpa accounts for nearly 2.2 s, while running the decoder layers is very cheap.
2) After commenting out that function and hard-coding attention_mask=None (it would have returned None anyway) for testing,
the picture changes: the decoder layers now take more than 2 s in total.
So the main question is why the first and the subsequent forward passes differ so much in latency.
The broader goal is to test whether batching requests gives a real speedup. If other users or the maintainers already have a definitive answer (e.g. this workload is GPU-utilization bound rather than memory bound, so the ~2 s from the second pass onward should be trusted rather than the 0.1 s), I am happy to accept that conclusion and skip the experiment.

Steps to reproduce

The inference script is attached:
batchinfer.txt
Timing instrumentation was also added to the Qwen modeling code in the transformers library, and use_cache was forced to False.
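
The original batchinfer.txt is not reproduced here; the following is only a minimal sketch of the kind of script described above. The prompt text, batch size, dtype, and device placement are illustrative assumptions:

```python
# Minimal sketch (not the original batchinfer.txt): one batched forward pass,
# repeated several times, with use_cache=False and naive wall-clock timing.
# The prompt, batch size, and dtype below are placeholders.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
).eval()

prompts = ["some fixed prompt text"] * 20            # batch of 20 identical inputs
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

for i in range(6):                                   # repeat the identical forward pass
    t0 = time.time()
    with torch.no_grad():
        out = model(**inputs, use_cache=False)       # only the logits are needed
    # NOTE: without torch.cuda.synchronize() this measures little more than the
    # asynchronous kernel launch, which is why the first iteration looks so fast.
    print(i, "forward:", time.time() - t0, "logits shape:", tuple(out.logits.shape))
```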

@jklj077
Collaborator

jklj077 commented Dec 20, 2024

Using time to measure CUDA running time is discouraged, since CUDA kernels run asynchronously. Please try using CUDA events, or synchronize manually if time must be used.
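
A sketch of what that measurement might look like, assuming `model` and `inputs` are the batched model and inputs from the reproduction script:

```python
# Sketch of the suggested measurement. `model` and `inputs` are assumed to be
# the batched model/inputs built in the reproduction script above.
import time
import torch

start_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)

start_evt.record()
with torch.no_grad():
    out = model(**inputs, use_cache=False)
end_evt.record()
torch.cuda.synchronize()                      # wait for the queued kernels to finish
print("forward (CUDA events):", start_evt.elapsed_time(end_evt) / 1000, "s")

# If time.time() must be used, synchronize on both sides of the measured region.
torch.cuda.synchronize()
t0 = time.time()
with torch.no_grad():
    out = model(**inputs, use_cache=False)
torch.cuda.synchronize()
print("forward (wall clock):", time.time() - t0, "s")
```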

If I understand correctly, you are not doing autoregressive generation. In that case, batching is generally preferred.
