
[Bug]: Loading LoRA is super slow when using tensor parallel #6072

Closed
markovalexander opened this issue Jul 2, 2024 · 4 comments · Fixed by #6109 or #6802
Labels
bug Something isn't working

Comments

@markovalexander

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.4.17-2136.321.4.el7uek.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB

Nvidia driver version: 525.125.06
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   48 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          256
On-line CPU(s) list:             0-254
Off-line CPU(s) list:            255
Vendor ID:                       AuthenticAMD
Model name:                      AMD EPYC 7J13 64-Core Processor
CPU family:                      25
Model:                           1
Thread(s) per core:              2
Core(s) per socket:              64
Socket(s):                       2
Stepping:                        1
Frequency boost:                 enabled
CPU max MHz:                     2550.0000
CPU min MHz:                     0.0000
BogoMIPS:                        4899.55
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local nt_good clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
Virtualization:                  AMD-V
L1d cache:                       4 MiB (128 instances)
L1i cache:                       4 MiB (128 instances)
L2 cache:                        64 MiB (128 instances)
L3 cache:                        512 MiB (16 instances)
NUMA node(s):                    8
NUMA node0 CPU(s):               0-15,128-143
NUMA node1 CPU(s):               16-31,144-159
NUMA node2 CPU(s):               32-47,160-175
NUMA node3 CPU(s):               48-63,176-191
NUMA node4 CPU(s):               64-79,192-207
NUMA node5 CPU(s):               80-95,208-223
NUMA node6 CPU(s):               96-111,224-239
NUMA node7 CPU(s):               112-127,240-254
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[pip3] vllm-nccl-cu12==2.18.1.0.4.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	GPU1	GPU2	GPU3	NIC0	NIC1	NIC2	NIC3	NIC4	NIC5	NIC6	NIC7	NIC8	NIC9	NIC10	NIC11	NIC12	NIC13	NIC14	NIC15	NIC16	NIC17	CPU Affinity	NUMA Affinity
GPU0	 X 	NV12	NV12	NV12	SYS	SYS	SYS	SYS	SYS	PXB	PXB	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	48-63,176-191	3
GPU1	NV12	 X 	NV12	NV12	SYS	SYS	SYS	SYS	SYS	PXB	PXB	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	48-63,176-191	3
GPU2	NV12	NV12	 X 	NV12	SYS	PXB	PXB	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	16-31,144-159	1
GPU3	NV12	NV12	NV12	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	PXB	PXB	SYS	SYS	SYS	SYS	SYS	80-95,208-223	5
NIC0	SYS	SYS	SYS	SYS	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS
NIC1	SYS	SYS	PXB	SYS	SYS	 X 	PIX	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS
NIC2	SYS	SYS	PXB	SYS	SYS	PIX	 X 	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS
NIC3	SYS	SYS	PXB	SYS	SYS	PXB	PXB	 X 	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS
NIC4	SYS	SYS	PXB	SYS	SYS	PXB	PXB	PIX	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS
NIC5	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS
NIC6	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS
NIC7	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	 X 	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS
NIC8	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	PIX	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS
NIC9	SYS	SYS	SYS	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX	PXB	PXB	SYS	SYS	SYS	SYS	SYS
NIC10	SYS	SYS	SYS	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 	PXB	PXB	SYS	SYS	SYS	SYS	SYS
NIC11	SYS	SYS	SYS	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	 X 	PIX	SYS	SYS	SYS	SYS	SYS
NIC12	SYS	SYS	SYS	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	PIX	 X 	SYS	SYS	SYS	SYS	SYS
NIC13	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	SYS	SYS	SYS	SYS
NIC14	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX	PXB	PXB
NIC15	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 	PXB	PXB
NIC16	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	 X 	PIX
NIC17	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	PIX	 X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
  NIC10: mlx5_10
  NIC11: mlx5_11
  NIC12: mlx5_12
  NIC13: mlx5_13
  NIC14: mlx5_14
  NIC15: mlx5_15
  NIC16: mlx5_16
  NIC17: mlx5_17
 

🐛 Describe the bug

Built the OpenAI docker image from https://github.com/vllm-project/vllm/releases/tag/v0.5.0.post1

Started vLLM for the llama-2-70b model with LoRA support and tensor-parallel=4. The first LoRA request takes more than a minute. The problem is in this function: it is very fast on the first process and very slow (>40 seconds) on all the other processes. Below is a rough reproduction sketch, followed by the log output.
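The sketch uses vLLM's offline Python API purely for illustration (the report itself goes through the OpenAI-compatible server from the docker image; the model and adapter paths are placeholders):

```python
# Minimal sketch: LoRA + tensor parallelism, mirroring the reported setup.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # placeholder for the llama-2-70b checkpoint
    enable_lora=True,
    tensor_parallel_size=4,
)

# The first request that references a LoRA adapter is the one that triggers the slow load.
outputs = llm.generate(
    ["San Francisco is a"],
    SamplingParams(temperature=0.0, max_tokens=10),
    lora_request=LoRARequest("2", 1, "/adapters/2"),  # name, int id, local path (placeholders)
)
```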

INFO 07-02 14:44:43 async_llm_engine.py:564] Received request cmpl-007f58184aa54c3fb54155c3ad325b95-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1, 3087, 8970, 338, 263], lora_request: LoRARequest(lora_name='2', lora_int_id=1, lora_local_path='/adapters/2', long_lora_max_len=None).
INFO:     172.17.0.1:56466 - "POST /v1/completions HTTP/1.1" 200 OK
DEBUG 07-02 14:44:43 async_llm_engine.py:525] Waiting for new requests...
DEBUG 07-02 14:44:43 async_llm_engine.py:527] Got new requests!
WARNING 07-02 14:44:43 tokenizer.py:142] No tokenizer found in /adapters/2, using base model tokenizer instead. (Exception: /adapters/2 does not appear to have a file named config.json. Checkout 'https://huggingface.co//adapters/2/tree/None' for available files.)
INFO 07-02 14:44:43 worker_manager.py:165] Loading LoRA LoRARequest(lora_name='2', lora_int_id=1, lora_local_path='/adapters/2', long_lora_max_len=None)
(VllmWorkerProcess pid=14156) INFO 07-02 14:44:43 worker_manager.py:165] Loading LoRA LoRARequest(lora_name='2', lora_int_id=1, lora_local_path='/adapters/2', long_lora_max_len=None)
(VllmWorkerProcess pid=14157) INFO 07-02 14:44:43 worker_manager.py:165] Loading LoRA LoRARequest(lora_name='2', lora_int_id=1, lora_local_path='/adapters/2', long_lora_max_len=None)
(VllmWorkerProcess pid=14155) INFO 07-02 14:44:43 worker_manager.py:165] Loading LoRA LoRARequest(lora_name='2', lora_int_id=1, lora_local_path='/adapters/2', long_lora_max_len=None)
INFO 07-02 14:44:43 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 07-02 14:44:46 worker_manager.py:188] Loading LoRA LoRARequest(lora_name='2', lora_int_id=1, lora_local_path='/adapters/2', long_lora_max_len=None) finished!
DEBUG 07-02 14:44:46 models.py:749] Adding lora. Model id: 1, int id: 1, scaling factor: None
DEBUG 07-02 14:44:46 models.py:459] Activating LoRA. int id: 1, slot index: 0
INFO 07-02 14:44:53 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 07-02 14:45:03 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 07-02 14:45:13 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
(VllmWorkerProcess pid=14155) INFO 07-02 14:45:23 worker_manager.py:188] Loading LoRA LoRARequest(lora_name='2', lora_int_id=1, lora_local_path='/adapters/2', long_lora_max_len=None) finished!
(VllmWorkerProcess pid=14157) INFO 07-02 14:45:23 worker_manager.py:188] Loading LoRA LoRARequest(lora_name='2', lora_int_id=1, lora_local_path='/adapters/2', long_lora_max_len=None) finished!
(VllmWorkerProcess pid=14155) DEBUG 07-02 14:45:23 models.py:749] Adding lora. Model id: 1, int id: 1, scaling factor: None
(VllmWorkerProcess pid=14156) INFO 07-02 14:45:23 worker_manager.py:188] Loading LoRA LoRARequest(lora_name='2', lora_int_id=1, lora_local_path='/adapters/2', long_lora_max_len=None) finished!
(VllmWorkerProcess pid=14157) DEBUG 07-02 14:45:23 models.py:749] Adding lora. Model id: 1, int id: 1, scaling factor: None
(VllmWorkerProcess pid=14156) DEBUG 07-02 14:45:23 models.py:749] Adding lora. Model id: 1, int id: 1, scaling factor: None
(VllmWorkerProcess pid=14155) DEBUG 07-02 14:45:23 models.py:459] Activating LoRA. int id: 1, slot index: 0
(VllmWorkerProcess pid=14157) DEBUG 07-02 14:45:23 models.py:459] Activating LoRA. int id: 1, slot index: 0
(VllmWorkerProcess pid=14156) DEBUG 07-02 14:45:23 models.py:459] Activating LoRA. int id: 1, slot index: 0
INFO 07-02 14:45:23 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 07-02 14:45:25 async_llm_engine.py:133] Finished request cmpl-007f58184aa54c3fb54155c3ad325b95-0.
DEBUG 07-02 14:45:25 async_llm_engine.py:525] Waiting for new requests...

As you can see, all 4 processes start loading the LoRA at 14:44:43, but only the first one finishes at 14:44:46; the other 3 finish only at 14:45:23. What is the problem?

markovalexander added the bug (Something isn't working) label on Jul 2, 2024
@DarkLight1337
Member

You may want to refer to this troubleshooting guide.

@sampritipanda

sampritipanda commented Jul 3, 2024

I think you may be able to fix this by changing the following line to device=self.device:

device="cpu",

Be aware, though, that there is some instability: I was seeing crashes in the custom all-reduce CUDA kernels when trying to use a LoRA of rank=2, but it worked fine with a LoRA of rank=8.
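To make the idea concrete, here is a small self-contained illustration of the difference, using torch.load directly; the adapter filename is a placeholder and this is not the actual vLLM call site:

```python
import torch

adapter_file = "/adapters/2/adapter_model.bin"  # placeholder path

# Current behavior per the linked line: deserialize the adapter onto the CPU first,
# then move/shard it to the GPUs later.
cpu_weights = torch.load(adapter_file, map_location="cpu")

# The suggested change (device=self.device): deserialize straight onto this
# worker's own GPU, skipping the CPU staging step.
gpu_weights = torch.load(
    adapter_file, map_location=f"cuda:{torch.cuda.current_device()}"
)
```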

Another option is to disable memory pinning here:

https://github.com/vllm-project/vllm/blob/main/vllm/lora/models.py#L219

Just set this line to pin_memory = False.
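For context, memory pinning page-locks the CPU staging buffers so that later host-to-GPU copies can be asynchronous; the pinning step itself costs CPU time, which is what setting pin_memory = False avoids. A tiny self-contained illustration of the trade-off (the tensor size is arbitrary):

```python
import torch

t = torch.randn(4096, 4096)  # stand-in for a LoRA weight staged on the CPU

# With pinning: the page-locking call below does extra CPU work up front,
# but enables fast, asynchronous host-to-device copies afterwards.
pinned = t.pin_memory()
on_gpu_async = pinned.to("cuda", non_blocking=True)

# Without pinning (pin_memory = False): skip the page-locking work entirely
# and accept a plain synchronous copy from pageable memory.
on_gpu_sync = t.to("cuda")
```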

@markovalexander
Author

Thanks for your suggestions @sampritipanda! I tried them, but none of them helped in my case.

What did help, though, is adding device='cuda:0' in this line. I understand that this could cause a slight memory overhead on the first process, but I am not sure that is actually the case, since self.device is just cuda rather than cuda:PROCESS_IDX, meaning that later in the code the LoRA is still moved to device 0 and then sharded across all processes.

I also set the OMP_NUM_THREADS env variable to 15 in all processes (just an arbitrary number), which also slightly improved model and LoRA loading speed.

@tjohnson31415
Contributor

Hello. I have been investigating this behavior as well. I believe the root cause of the slowdown is CPU contention and throttling, particularly in an environment like Kubernetes with containers that have CPU limits.

Setting OMP_NUM_THREADS to a low number (like 1 or 2) was the solution I found; it reduced the load time for the LoRA adapter I was testing with from 2 minutes to 3.5 seconds with TP=4.

My test environment was Kubernetes on a node with 80 total cores, with CPU requests set to 8 and CPU limits set to 16. Without OMP_NUM_THREADS, I think each shard would spawn 80 threads and they would fight for the CPU. Watching the load with top, I noticed CPU time exceeding 16, which would result in throttling on top of context-switching costs.
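A minimal sketch of applying this when launching the OpenAI-compatible server; the model name, adapter path, and thread count are placeholders, and the important part is that OMP_NUM_THREADS is exported before vLLM spawns its tensor-parallel worker processes:

```python
import os
import subprocess

# Cap the OpenMP thread count per process so the TP shards don't oversubscribe
# the container's CPU limit and throttle each other.
env = dict(os.environ, OMP_NUM_THREADS="2")

subprocess.run(
    [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "meta-llama/Llama-2-70b-hf",  # placeholder model
        "--tensor-parallel-size", "4",
        "--enable-lora",
        "--lora-modules", "2=/adapters/2",       # placeholder adapter name/path
    ],
    env=env,
    check=True,
)
```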
