Using `time` to measure CUDA running time is discouraged, since CUDA kernels run asynchronously. Please try using CUDA events, or manually synchronizing if `time` must be used.
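A minimal sketch of both approaches; the matmul is just a stand-in for a model forward pass:

```python
import time
import torch

x = torch.randn(4096, 4096, device="cuda")   # stand-in for the real workload

# Option 1: CUDA events measure elapsed time on the GPU timeline.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
y = x @ x
end.record()
torch.cuda.synchronize()                      # wait until both events have completed
print(f"matmul via events: {start.elapsed_time(end):.2f} ms")

# Option 2: if time.time() must be used, synchronize before reading the clock;
# otherwise the call returns while kernels are still running and the cost leaks
# into whatever operation happens to synchronize next.
torch.cuda.synchronize()
t0 = time.time()
y = x @ x
torch.cuda.synchronize()
print(f"matmul via time.time(): {(time.time() - t0) * 1000:.2f} ms")
```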
If I understand correctly, you are not doing autoregressive generation. In that case, batching is generally preferred.
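A sketch of what batched, single-forward logits extraction could look like with transformers (the prompts and dtype/device choices here are placeholders, not the script from this issue):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token  # padding is needed for batching
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda().eval()

prompts = ["text 1", "text 2", "text 3"]      # the fixed texts to score
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    # One forward pass over the whole batch; no generate(), no KV cache needed.
    logits = model(**batch, use_cache=False).logits   # [batch, seq_len, vocab]
```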
Model Series
Qwen2.5
What are the models used?
Qwen2.5-0.5B-Instruct
What is the scenario where the problem happened?
inference with transformers
Is this a known issue?
Information about environment
accelerate 1.2.1
aiohappyeyeballs 2.4.4
aiohttp 3.11.10
aiosignal 1.3.1
annotated-types 0.7.0
anyio 4.7.0
asttokens 3.0.0
async-timeout 5.0.1
attrs 24.2.0
auto_fp8 0.1.0 /juicefs-algorithm/workspace/acg/yuyang_chen/codebases/AutoFP8
certifi 2024.8.30
charset-normalizer 3.4.0
click 8.1.7
cloudpickle 3.1.0
datasets 3.2.0
decorator 5.1.1
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
einops 0.8.0
exceptiongroup 1.2.2
executing 2.1.0
fastapi 0.115.6
filelock 3.16.1
frozenlist 1.5.0
fsspec 2024.9.0
gguf 0.10.0
h11 0.14.0
httpcore 1.0.7
httptools 0.6.4
httpx 0.28.1
huggingface-hub 0.26.5
idna 3.10
importlib_metadata 8.5.0
interegular 0.3.3
ipdb 0.13.13
ipython 8.30.0
jedi 0.19.2
Jinja2 3.1.4
jiter 0.8.2
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
lark 1.2.2
llvmlite 0.43.0
lm-format-enforcer 0.10.6
MarkupSafe 3.0.2
matplotlib-inline 0.1.7
mistral_common 1.5.1
mpmath 1.3.0
msgpack 1.1.0
msgspec 0.18.6
multidict 6.1.0
multiprocess 0.70.16
nest-asyncio 1.6.0
networkx 3.4.2
numba 0.60.0
numpy 1.26.4
nvidia-cublas-cu11 11.10.3.66
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu11 8.5.0.96
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-ml-py 12.560.30
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.6.85
nvidia-nvtx-cu12 12.1.105
openai 1.57.2
outlines 0.0.46
packaging 24.2
pandas 2.2.3
parso 0.8.4
partial-json-parser 0.2.1.1.post4
pexpect 4.9.0
pillow 10.4.0
pip 24.3.1
prometheus_client 0.21.1
prometheus-fastapi-instrumentator 7.0.0
prompt_toolkit 3.0.48
propcache 0.2.1
protobuf 5.29.1
psutil 6.1.0
ptyprocess 0.7.0
pure_eval 0.2.3
py-cpuinfo 9.0.0
pyairports 2.1.1
pyarrow 18.1.0
pycountry 24.6.1
pydantic 2.10.3
pydantic_core 2.27.1
Pygments 2.18.0
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
pytz 2024.2
PyYAML 6.0.2
pyzmq 26.2.0
ray 2.40.0
referencing 0.35.1
regex 2024.11.6
requests 2.32.3
rpds-py 0.22.3
safetensors 0.4.5
sentencepiece 0.2.0
setuptools 75.6.0
six 1.17.0
sniffio 1.3.1
stack-data 0.6.3
starlette 0.41.3
sympy 1.13.3
tiktoken 0.7.0
tokenizers 0.15.2
tomli 2.2.1
torch 2.4.0
torchvision 0.19.0
tqdm 4.67.1
traitlets 5.14.3
transformers 4.39.2
triton 3.0.0
typing_extensions 4.12.2
tzdata 2024.2
urllib3 2.2.3
uvicorn 0.32.1
uvloop 0.21.0
vllm 0.6.2
watchfiles 1.0.3
wcwidth 0.2.13
websockets 14.1
wheel 0.45.1
xformers 0.0.27.post2
xxhash 3.5.0
yarl 1.18.3
zipp 3.21.0
Log output
Description
Background
The input texts are fixed and only a single forward pass is needed to obtain the logits, but in actual use there will be many concurrent requests, so both latency and GPU memory usage need to be kept low.
Previously, with vLLM, setting prompt_logprobs caused out-of-memory errors even at batch=2, so vLLM cannot be used for deployment for now, and I switched to transformers.
However, when running the script I noticed a strange behavior:
1) With batch=20 inference, the first run takes only about 0.1 s, but from the second run onward the latency rises to roughly 2.5 s.
In that case, _prepare_4d_causal_attention_mask_for_sdpa appears to account for nearly 2.2 s, while the decoder layers themselves take very little time.
2) Commenting out that function and setting attention_mask=None directly (it would have returned None anyway) for testing changes the picture: the decoder layers now account for more than 2 s.
So the main question is why the first run and the later runs differ so much in latency.
The broader goal is to test whether accumulating requests into batches actually yields a speedup. If other users or the maintainers already have a definitive conclusion (e.g. that this workload is bound by GPU utilization rather than memory, and that the ~2 s measured from the second run onward should be trusted rather than the 0.1 s), I am happy to accept that conclusion and skip the experiment.
Steps to reproduce
The inference script is attached:
batchinfer.txt
Timing statistics were also added to the Qwen modeling code in the transformers library, and use_cache = False was forced.
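Note that per-section timings collected with time.time() inside the modeling code are only meaningful if each measurement is bracketed by explicit synchronization; otherwise the cost of earlier asynchronous kernels gets billed to whichever operation synchronizes next. A hedged sketch of such instrumentation (the `timed` helper is hypothetical and not part of batchinfer.txt):

```python
import time
import torch

def timed(label, fn, *args, **kwargs):
    """Run fn and print its GPU-inclusive wall time. Illustrative helper only."""
    torch.cuda.synchronize()          # don't bill earlier async kernels to this section
    t0 = time.time()
    out = fn(*args, **kwargs)
    torch.cuda.synchronize()          # wait for this section's own kernels to finish
    print(f"{label}: {(time.time() - t0) * 1000:.2f} ms")
    return out

# Usage inside a forward pass (placeholder names):
# hidden_states = timed("decoder_layers", run_decoder_layers, hidden_states, attention_mask)
```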