[Performance]: Throughput and Latency degradation with a single LoRA adapter on A100 40 GB #10062
This is a very detailed and excellent description.
The latest version; see the container image in the YAML. Do you recommend a specific version to test with?
There are some similar issues, see #9496 and #9452. The main cause in those issues was enforce_eager=True, but I can't find this argument in your script. BTW, if I remember correctly, some versions had a CUDA graph bug, so I suggest you try profiling and confirm whether eager mode is disabled correctly; you can refer to: https://docs.vllm.ai/en/latest/dev/profiling/profiling_index.html#openai-server. PS: I will try to reproduce your results tomorrow in my local timezone, and #5036 provides some test results.
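For reference, capturing such a profile against the OpenAI-compatible server looks roughly like this (a sketch assuming the VLLM_TORCH_PROFILER_DIR workflow described in the linked docs; the trace directory and LoRA module name are placeholders, not values from this issue):

```bash
# Start the OpenAI-compatible server with the torch profiler enabled;
# traces are written to the directory given by VLLM_TORCH_PROFILER_DIR.
VLLM_TORCH_PROFILER_DIR=/tmp/vllm_traces \
  vllm serve meta-llama/Llama-2-7b-hf --enable-lora \
  --lora-modules tweetsumm=vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm

# Drive the server with benchmark_serving.py and --profile, which sends
# start_profile/stop_profile requests around the benchmark run.
python benchmark_serving.py \
  --model meta-llama/Llama-2-7b-hf \
  --dataset-name random --random-input-len 512 --random-output-len 128 \
  --num-prompts 24 --profile
```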
Thanks @jeejeelee, it will be great if you can reproduce!
We conducted testing on a local A800 (A800-SXM4-80GB).
vllm serve meta-llama/Llama-2-7b-chat-hf --gpu-memory-utilization 0.90 --served-model-name base --enable-lora --max-loras 3 --max-cpu-loras 15 --max-lora-rank 64 --lora-modules moss=xtuner/Llama-2-7b-qlora-moss-003-sft
python benchmark_serving.py --model base --tokenizer meta-llama/Llama-2-7b-chat-hf --dataset-name random --random-input-len 512 --random-output-len 128 --ignore-eos --num-prompts 24 --metric-percentiles 90 --request-rate 20 (request rate varied from 1 to 24)
Thanks @jeejeelee, this is very insightful, did you try the smaller adapter?
Not yet, I will tomorrow |
same question, any solution yet? |
I will also help reproduce this issue from my end this week. What I observe is around 20%-25% overhead, which is expected. It seems we need to standardize the LoRA workloads and benchmarks to better help users reproduce the results.
[Chart: throughput/latency vs. KV cache utilization]

I did some new benchmarks and noticed that max LoRA rank has a significant impact on performance; it's best to set it equal to the rank of the LoRA (or the rank of the largest LoRA if using multiple LoRAs). This is consistent with what is documented here. With rank = 16, the throughput hit is about 27% at 80% KV cache utilization. (tp-2 indicates tensor parallelism = 2, i.e. 2 GPUs were used.) I also enabled the vLLM profiler to get a more granular understanding of where the performance hit is coming from.

Performance Analysis: vLLM's profiler provides slice flamegraphs. The TweetSumm adapter (max rank 64) running online with 96 prompts revealed cudaMemcpyAsync as a major latency contributor: 47% of the total 35 seconds. The base model's slice flamegraph showed cudaMemcpyAsync using 40% of the 27.96 seconds (96 prompts). The 8-second difference between the base and LoRA models (same number of prompts) was largely due to this additional cudaMemcpyAsync time.
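To illustrate the max-lora-rank point above, a serve command like the following avoids padding the LoRA buffers to 64 (a sketch; the rank-16 value is an assumption standing in for the adapter's actual rank):

```bash
# Sketch: set --max-lora-rank to the adapter's real rank (assumed 16 here)
# instead of a larger value such as 64, which over-allocates LoRA buffers.
vllm serve meta-llama/Llama-2-7b-hf \
  --gpu-memory-utilization 0.90 \
  --enable-lora \
  --max-lora-rank 16 \
  --lora-modules tweetsumm=vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
```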
Proposal to improve performance
No response
Report of performance regression
No response
Misc discussion on performance
Setup Summary for vLLM Benchmarking with Llama-2 Model:
Hardware: A100 40 GB (a2-highgpu-2g) on Google Kubernetes Engine (GKE)
Model:
meta-llama/Llama-2-7b-hf
GPU Count: 1
Experiments:
1. meta-llama/Llama-2-7b-hf (base model only)
2. vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm (LoRA adapter, size 160 MB)
3. xtuner/Llama-2-7b-qlora-moss-003-sft (LoRA adapter, size 640 MB)

For all three experiments, we used the same input prompts (ShareGPT) and observed similar output lengths.
Settings:
Benchmark Metrics:
We measured:
You can view detailed results in the benchmark document: Benchmark 1 server - Sheet7.pdf.
Observations and Questions:
Deployment Command:
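(The full command is embedded in the YAML below; a representative invocation, with flags assumed from the setup described above rather than copied from the cluster, looks roughly like this:)

```bash
# Representative serve command (flags are assumptions based on the setup above;
# the exact values used in the GKE container may differ).
vllm serve meta-llama/Llama-2-7b-hf \
  --gpu-memory-utilization 0.90 \
  --enable-lora \
  --max-lora-rank 64 \
  --lora-modules tweetsumm=vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm \
                 moss=xtuner/Llama-2-7b-qlora-moss-003-sft
```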
Your current environment (if you think it is necessary)
Sample Query:
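(A hypothetical example of the kind of request sent during the benchmark, assuming the OpenAI-compatible /v1/completions endpoint and a LoRA module registered as tweetsumm; the real prompts came from ShareGPT:)

```bash
# Hypothetical completion request targeting the LoRA adapter by its module name.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "tweetsumm",
        "prompt": "Summarize the following customer support conversation: ...",
        "max_tokens": 128
      }'
```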
Deployment YAML Configuration:
This deployment configuration sets up the vLLM server with LoRA adapters on GKE, with health probes, GPU limits, and a volume configuration for adapter management.
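A minimal sketch of such a Deployment (image tag, probe paths, resource values, and volume layout are placeholders, not the exact configuration used here):

```yaml
# Minimal sketch of a vLLM Deployment with a LoRA adapter on GKE.
# Image tag, probe, resources, and volume are assumptions, not the exact config.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-lora
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama2-lora
  template:
    metadata:
      labels:
        app: vllm-llama2-lora
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest   # placeholder tag
        args:
        - --model=meta-llama/Llama-2-7b-hf
        - --enable-lora
        - --max-lora-rank=64
        - --lora-modules=tweetsumm=/adapters/tweetsumm
        ports:
        - containerPort: 8000
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: adapters
          mountPath: /adapters
      volumes:
      - name: adapters
        emptyDir: {}
```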