[Bug]: KeyError during loading of Mixtral 8x22B in FP8 #773

IowaSovereign · 2024-10-12T19:57:50Z

The output of `python collect_env.py`

PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.12.6 (main, Sep 10 2024, 00:05:17) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-6.5.0-35-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA L40S
GPU 1: NVIDIA L40S
GPU 2: NVIDIA L40S
GPU 3: NVIDIA L40S

Nvidia driver version: 550.54.15
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      43 bits physical, 48 bits virtual
CPU(s):                             256
On-line CPU(s) list:                0-255
Thread(s) per core:                 2
Core(s) per socket:                 64
Socket(s):                          2
NUMA node(s):                       2
Vendor ID:                          AuthenticAMD
CPU family:                         23
Model:                              49
Model name:                         AMD EPYC 7702 64-Core Processor
Stepping:                           0
Frequency boost:                    enabled
CPU MHz:                            1494.317
CPU max MHz:                        2183.5930
CPU min MHz:                        1500.0000
BogoMIPS:                           3992.65
Virtualization:                     AMD-V
L1d cache:                          4 MiB
L1i cache:                          4 MiB
L2 cache:                           64 MiB
L3 cache:                           512 MiB
NUMA node0 CPU(s):                  0-63,128-191
NUMA node1 CPU(s):                  64-127,192-255
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es

Versions of relevant libraries:
[pip3] flashinfer==0.1.6+cu121torch2.4
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.0.dev0
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.1.post2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     SYS     SYS     0-63,128-191    0               N/A
GPU1    SYS      X      SYS     SYS     0-63,128-191    0               N/A
GPU2    SYS     SYS      X      SYS     0-63,128-191    0               N/A
GPU3    SYS     SYS     SYS      X      0-63,128-191    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

After quantizing a Mixtral 8x22B model to FP8 Dynamic, the following error appears at the start of shard loading.

Trace log:

2024-10-04T08:51:22.547524280Z /usr/local/lib/python3.10/dist-packages/aphrodite/engine/aphrodite_engine.py:49: RuntimeWarning: Failed to read commit hash:
2024-10-04T08:51:22.547631197Z No module named 'aphrodite.commit_id'
2024-10-04T08:51:22.547638904Z   from aphrodite.version import __version__ as APHRODITE_VERSION
2024-10-04T08:51:31.621577132Z INFO:     Multiprocessing frontend to use 
2024-10-04T08:51:31.621643270Z ipc:///tmp/64b9d1e4-fab7-4210-9851-d4197a91a14d for RPC Path.
2024-10-04T08:51:31.628272413Z INFO:     Started engine process with PID 99
2024-10-04T08:51:37.855640272Z /usr/local/lib/python3.10/dist-packages/aphrodite/engine/aphrodite_engine.py:49: RuntimeWarning: Failed to read commit hash:
2024-10-04T08:51:37.855688823Z No module named 'aphrodite.commit_id'
2024-10-04T08:51:37.855694802Z   from aphrodite.version import __version__ as APHRODITE_VERSION
2024-10-04T08:51:46.360598845Z INFO:     Using fp8 data type to store kv cache. It reduces the GPU memory 
2024-10-04T08:51:46.360651537Z footprint and boosts the performance. Meanwhile, it may cause accuracy drop 
2024-10-04T08:51:46.360657443Z without a proper scaling factor
2024-10-04T08:51:46.403606888Z INFO:     Defaulting to use mp for distributed inference.
2024-10-04T08:51:46.411349626Z INFO:     
2024-10-04T08:51:46.411393306Z --------------------------------------------------------------------------------
2024-10-04T08:51:46.411399173Z -----
2024-10-04T08:51:46.413415817Z INFO:     Initializing Aphrodite Engine (v0.6.2) with the following config:
2024-10-04T08:51:46.414465667Z INFO:     Model = 'CalamitousFelicitousness/SorcererLM-8x22b-FP8-Dynamic'
2024-10-04T08:51:46.415676716Z INFO:     DataType = torch.bfloat16
2024-10-04T08:51:46.416621190Z INFO:     Tensor Parallel Size = 2
2024-10-04T08:51:46.417541676Z INFO:     Pipeline Parallel Size = 1
2024-10-04T08:51:46.418555584Z INFO:     Disable Custom All-Reduce = False
2024-10-04T08:51:46.419487572Z INFO:     Quantization Format = 'fp8'
2024-10-04T08:51:46.420087958Z INFO:     Context Length = 16384
2024-10-04T08:51:46.421058234Z INFO:     Enforce Eager Mode = True
2024-10-04T08:51:46.421972215Z INFO:     Prefix Caching = False
2024-10-04T08:51:46.423121723Z INFO:     KV Cache DataType = 'fp8'
2024-10-04T08:51:46.424269583Z INFO:     Device = device(type='cuda')
2024-10-04T08:51:46.425717513Z INFO:     Guided Decoding Backend = 
2024-10-04T08:51:46.425761460Z DecodingConfig(guided_decoding_backend='outlines')
2024-10-04T08:51:46.426842144Z INFO:     
2024-10-04T08:51:46.426875731Z --------------------------------------------------------------------------------
2024-10-04T08:51:46.426880877Z -----
2024-10-04T08:51:49.424241541Z WARNING:  Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary 
2024-10-04T08:51:49.424288280Z CPU contention. Set OMP_NUM_THREADS in the external environment to tune this 
2024-10-04T08:51:49.424294724Z value as needed.
2024-10-04T08:51:49.750953778Z INFO:     Cannot use FlashAttention-2 backend for FP8 KV cache.
2024-10-04T08:51:49.752124530Z INFO:     Using XFormers backend.
2024-10-04T08:51:49.807084102Z (AphroditeWorkerProcess pid=229) INFO:     Cannot use FlashAttention-2 backend for FP8 KV cache.
2024-10-04T08:51:49.808211285Z (AphroditeWorkerProcess pid=229) INFO:     Using XFormers backend.
2024-10-04T08:51:51.274009994Z /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
2024-10-04T08:51:51.274064538Z   @torch.library.impl_abstract("xformers_flash::flash_fwd")
2024-10-04T08:51:51.302890937Z (AphroditeWorkerProcess pid=229) /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
2024-10-04T08:51:51.302940787Z (AphroditeWorkerProcess pid=229)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
2024-10-04T08:51:51.775084365Z /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
2024-10-04T08:51:51.775163000Z   @torch.library.impl_abstract("xformers_flash::flash_bwd")
2024-10-04T08:51:51.801363085Z (AphroditeWorkerProcess pid=229) /usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
2024-10-04T08:51:51.801409745Z (AphroditeWorkerProcess pid=229)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
2024-10-04T08:51:54.392022592Z (AphroditeWorkerProcess pid=229) INFO:     Worker ready; awaiting tasks
2024-10-04T08:51:56.629474136Z INFO:     generating GPU P2P access cache in 
2024-10-04T08:51:56.629524099Z /root/.config/aphrodite/gpu_p2p_access_cache_for_0,1.json
2024-10-04T08:52:32.498716165Z INFO:     Loading model CalamitousFelicitousness/SorcererLM-8x22b-FP8-Dynamic...
2024-10-04T08:52:32.526941726Z WARNING:  Detected fp8 checkpoint. Please note that the format is experimental 
2024-10-04T08:52:32.526990026Z and subject to change.
2024-10-04T08:52:32.534555663Z (AphroditeWorkerProcess pid=229) WARNING:  Detected fp8 checkpoint. Please note that the format is experimental 
2024-10-04T08:52:32.534602502Z (AphroditeWorkerProcess pid=229) and subject to change.
2024-10-04T08:52:32.721690938Z INFO:     Cannot use FlashAttention-2 backend for FP8 KV cache.
2024-10-04T08:52:32.722631621Z INFO:     Using XFormers backend.
2024-10-04T08:52:32.744611966Z (AphroditeWorkerProcess pid=229) INFO:     Cannot use FlashAttention-2 backend for FP8 KV cache.
2024-10-04T08:52:32.745727579Z (AphroditeWorkerProcess pid=229) INFO:     Using XFormers backend.
2024-10-04T08:52:33.303019692Z INFO:     Using model weights format ['*.safetensors']
2024-10-04T08:56:37.029182935Z ⠴ Loading modules... ╸                                      50/3363   1% 0:00:01
2024-10-04T08:56:38.103234101Z Process SpawnProcess-1:
2024-10-04T08:56:38.106123484Z ERROR:    Worker AphroditeWorkerProcess pid 229 died, exit code: -15
2024-10-04T08:56:38.107375306Z INFO:     Killing local Aphrodite worker processes
2024-10-04T08:56:38.111608586Z Traceback (most recent call last):
2024-10-04T08:56:38.111639606Z   File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
2024-10-04T08:56:38.111646237Z     self.run()
2024-10-04T08:56:38.111652418Z   File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
2024-10-04T08:56:38.111658597Z     self._target(*self._args, **self._kwargs)
2024-10-04T08:56:38.111665385Z   File "/usr/local/lib/python3.10/dist-packages/aphrodite/endpoints/openai/rpc/server.py", line 209, in run_rpc_server
2024-10-04T08:56:38.111671956Z     server = AsyncEngineRPCServer(async_engine_args, rpc_path)
2024-10-04T08:56:38.111678183Z   File "/usr/local/lib/python3.10/dist-packages/aphrodite/endpoints/openai/rpc/server.py", line 24, in __init__
2024-10-04T08:56:38.111684812Z     self.engine = AsyncAphrodite.from_engine_args(async_engine_args)
2024-10-04T08:56:38.111690556Z   File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 601, in from_engine_args
2024-10-04T08:56:38.111696070Z     engine = cls(
2024-10-04T08:56:38.111701996Z   File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 510, in __init__
2024-10-04T08:56:38.111707616Z     self.engine = self._init_engine(*args, **kwargs)
2024-10-04T08:56:38.111713304Z   File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 694, in _init_engine
2024-10-04T08:56:38.111718802Z     return engine_class(*args, **kwargs)
2024-10-04T08:56:38.111725604Z   File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/aphrodite_engine.py", line 261, in __init__
2024-10-04T08:56:38.111733124Z     self.model_executor = executor_class(
2024-10-04T08:56:38.111738876Z   File "/usr/local/lib/python3.10/dist-packages/aphrodite/executor/multiproc_gpu_executor.py", line 212, in __init__
2024-10-04T08:56:38.111770455Z     super().__init__(*args, **kwargs)
2024-10-04T08:56:38.111819906Z   File "/usr/local/lib/python3.10/dist-packages/aphrodite/executor/distributed_gpu_executor.py", line 24, in __init__
2024-10-04T08:56:38.111855413Z     super().__init__(*args, **kwargs)
2024-10-04T08:56:38.111861081Z   File "/usr/local/lib/python3.10/dist-packages/aphrodite/executor/executor_base.py", line 45, in __init__
2024-10-04T08:56:38.111871601Z     self._init_executor()
2024-10-04T08:56:38.111876821Z   File "/usr/local/lib/python3.10/dist-packages/aphrodite/executor/multiproc_gpu_executor.py", line 137, in _init_executor
2024-10-04T08:56:38.111881228Z     self._run_workers("load_model",
2024-10-04T08:56:38.111885453Z   File "/usr/local/lib/python3.10/dist-packages/aphrodite/executor/multiproc_gpu_executor.py", line 189, in _run_workers
2024-10-04T08:56:38.111890357Z     driver_worker_output = driver_worker_method(*args, **kwargs)
2024-10-04T08:56:38.111895012Z   File "/usr/local/lib/python3.10/dist-packages/aphrodite/task_handler/worker.py", line 153, in load_model
2024-10-04T08:56:38.111900252Z     self.model_runner.load_model()
2024-10-04T08:56:38.111903740Z   File "/usr/local/lib/python3.10/dist-packages/aphrodite/task_handler/model_runner.py", line 888, in load_model
2024-10-04T08:56:38.111907604Z     self.model = get_model(model_config=self.model_config,
2024-10-04T08:56:38.111911552Z   File "/usr/local/lib/python3.10/dist-packages/aphrodite/modeling/model_loader/__init__.py", line 20, in get_model
2024-10-04T08:56:38.111917276Z     return loader.load_model(model_config=model_config,
2024-10-04T08:56:38.111921178Z   File "/usr/local/lib/python3.10/dist-packages/aphrodite/modeling/model_loader/loader.py", line 340, in load_model
2024-10-04T08:56:38.111926203Z     model.load_weights(
2024-10-04T08:56:38.111931460Z   File "/usr/local/lib/python3.10/dist-packages/aphrodite/modeling/models/mixtral.py", line 474, in load_weights
2024-10-04T08:56:38.111936419Z     param = params_dict[name]
2024-10-04T08:56:38.111940199Z KeyError: 'model.layers.0.block_sparse_moe.gate.weight_scale'
2024-10-04T08:56:39.576873513Z [rank0]:[W1004 08:56:39.463388442 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
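
For anyone triaging: the traceback ends at `param = params_dict[name]` with the name `model.layers.0.block_sparse_moe.gate.weight_scale`, which suggests the checkpoint carries a scale tensor for the MoE router (`gate`) that the loaded model has no matching parameter for. Below is a minimal sketch for listing such tensors without loading any weights. It assumes the checkpoint is sharded and ships a Hugging Face style `model.safetensors.index.json` containing a `weight_map`; the helper name `find_router_scale_keys` is hypothetical and not part of Aphrodite.

```python
import json
from pathlib import Path


def find_router_scale_keys(index_path, suffix=".block_sparse_moe.gate.weight_scale"):
    """Return the checkpoint tensor names that end in `suffix`, sorted.

    Assumption: `index_path` points at a sharded-checkpoint index file whose
    "weight_map" maps tensor names to shard filenames.
    """
    index = json.loads(Path(index_path).read_text())
    weight_map = index.get("weight_map", {})
    # Only the tensor names matter here; the shard filenames are ignored.
    return sorted(name for name in weight_map if name.endswith(suffix))
```

Calling `find_router_scale_keys("model.safetensors.index.json")` from the checkpoint directory returns a non-empty list if the router was quantized along with the rest of the model. That would be consistent with the `KeyError` above, though it does not by itself confirm the root cause.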

IowaSovereign added the bug Something isn't working label Oct 12, 2024

IowaSovereign changed the title ~~[Bug]: Mixtral FP8 quantization KeyError~~ [Bug]: KeyError during loading of Mixtral 8x22B in FP8 Oct 12, 2024
