[BUG] ValueError: XFormers does not support attention logits soft capping. #696
Comments
@daegonYu This looks like a vLLM error.
@daegonYu
Yes, this issue is related to vLLM. When you serve a gemma-2 model with vLLM without the FlashInfer backend, vLLM automatically falls back to the xformers backend. Unfortunately, the xformers backend does not support attention logits soft capping. One way to serve on the xformers backend is to remove every component related to attention logits soft capping. The performance drop may be minimal for gemma-2 9b, but it will hugely impact gemma-2 27b, so be aware. CC @vkehfdl1
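For readers unfamiliar with the term, here is a minimal sketch of what attention logits soft capping computes: each raw logit is squashed through a scaled tanh so it stays inside (-cap, cap). The function name `soft_cap` is made up for illustration, and the cap of 50.0 is an assumption based on the `attn_logit_softcapping` value commonly seen in gemma-2 configs; this is not vLLM's implementation.

```python
import math


def soft_cap(logit: float, cap: float) -> float:
    """Squash a raw attention logit smoothly into the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)


# Assumption: 50.0 mirrors the attn_logit_softcapping value in gemma-2 configs.
ATTN_CAP = 50.0

for raw in (1.0, 50.0, 500.0):
    # Small logits pass through almost unchanged; large ones saturate near the cap.
    print(raw, soft_cap(raw, ATTN_CAP))
```

Removing the cap means large logits are no longer bounded, which is why dropping it changes the model's outputs rather than being a pure configuration tweak.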
@effortprogrammer
Well, if you look at the vLLM options, there is an environment variable that selects which attention backend to use (xformers, FlashInfer, etc.). Please note that to use FlashInfer, your GPU should support flash attention 2, and you need to make sure flash attention 2 is installed in your environment.
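A minimal sketch of that approach, assuming the variable in question is `VLLM_ATTENTION_BACKEND` and that `FLASHINFER` is an accepted value in the installed vLLM version (check the vLLM docs for your release). The helper `select_attention_backend` is hypothetical; the point is only that the variable must be set before the engine is constructed.

```python
import os


def select_attention_backend(backend: str = "FLASHINFER") -> str:
    """Set the attention backend for vLLM engines created later in this process.

    Assumption: vLLM reads VLLM_ATTENTION_BACKEND when the engine is built,
    so this must run before any LLM(...) call.
    """
    os.environ["VLLM_ATTENTION_BACKEND"] = backend
    return os.environ["VLLM_ATTENTION_BACKEND"]


chosen = select_attention_backend("FLASHINFER")
print(f"VLLM_ATTENTION_BACKEND={chosen}")

# Only after this point would the engine be built, for example through
# evaluator.start_trial(yaml_path) as in the traceback of this issue.
```

Setting the variable in the shell before launching the notebook or script should have the same effect, since the process inherits it.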
@effortprogrammer Thanks for the valuable information! @daegonYu You can easily set vLLM parameters in the YAML file. Just add the parameter name and value in the YAML file at the
It seems to be resolved by using a different
Describe the bug
ValueError: XFormers does not support attention logits soft capping.
Full Error log
{
"name": "ValueError",
"message": "XFormers does not support attention logits soft capping.",
"stack": "---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[7], line 3
1 import nest_asyncio
2 nest_asyncio.apply()
----> 3 evaluator.start_trial(yaml_path)
File /home3/dgon/NLP/gits/AutoRAG/autorag/evaluator.py:126, in Evaluator.start_trial(self, yaml_path)
124 \t\tprevious_result = self.qa_data
125 \tlogger.info(f"Running node line {node_line_name}...")
--> 126 \tprevious_result = run_node_line(node_line, node_line_dir, previous_result)
128 \ttrial_summary_df = self._append_node_line_summary(
129 \t\tnode_line_name, node_line_dir, trial_summary_df
130 \t)
132 trial_summary_df.to_csv(
133 \tos.path.join(self.project_dir, trial_name, "summary.csv"), index=False
134 )
File /home3/dgon/NLP/gits/AutoRAG/autorag/node_line.py:47, in run_node_line(nodes, node_line_dir, previous_result)
45 summary_lst = []
46 for node in nodes:
---> 47 \tprevious_result = node.run(previous_result, node_line_dir)
48 \tnode_summary_df = load_summary_file(
49 \t\tos.path.join(node_line_dir, node.node_type, "summary.csv")
50 \t)
51 \tbest_node_row = node_summary_df.loc[node_summary_df["is_best"]]
File /home3/dgon/NLP/gits/AutoRAG/autorag/schema/node.py:57, in Node.run(self, previous_result, node_line_dir)
55 logger.info(f"Running node {self.node_type}...")
56 input_modules, input_params = self.get_param_combinations()
---> 57 return self.run_node(
58 \tmodules=input_modules,
59 \tmodule_params=input_params,
60 \tprevious_result=previous_result,
61 \tnode_line_dir=node_line_dir,
62 \tstrategies=self.strategy,
63 )
File /home3/dgon/NLP/gits/AutoRAG/autorag/nodes/generator/run.py:46, in run_generator_node(modules, module_params, previous_result, node_line_dir, strategies)
43 \traise ValueError("You must have 'generation_gt' column in qa.parquet.")
44 generation_gt = list(map(lambda x: x.tolist(), qa_data["generation_gt"].tolist()))
---> 46 results, execution_times = zip(
47 \t*map(
48 \t\tlambda x: measure_speed(
49 \t\t\tx[0], project_dir=project_dir, previous_result=previous_result, **x[1]
50 \t\t),
51 \t\tzip(modules, module_params),
52 \t)
53 )
54 average_times = list(map(lambda x: x / len(results[0]), execution_times))
56 # get average token usage
File /home3/dgon/NLP/gits/AutoRAG/autorag/nodes/generator/run.py:48, in run_generator_node.<locals>.<lambda>(x)
43 \traise ValueError("You must have 'generation_gt' column in qa.parquet.")
44 generation_gt = list(map(lambda x: x.tolist(), qa_data["generation_gt"].tolist()))
46 results, execution_times = zip(
47 \t*map(
---> 48 \t\tlambda x: measure_speed(
49 \t\t\tx[0], project_dir=project_dir, previous_result=previous_result, **x[1]
50 \t\t),
51 \t\tzip(modules, module_params),
52 \t)
53 )
54 average_times = list(map(lambda x: x / len(results[0]), execution_times))
56 # get average token usage
File /home3/dgon/NLP/gits/AutoRAG/autorag/strategy.py:14, in measure_speed(func, *args, **kwargs)
10 """
11 Method for measuring execution speed of the function.
12 """
13 start_time = time.time()
---> 14 result = func(*args, **kwargs)
15 end_time = time.time()
16 return result, end_time - start_time
File /home3/dgon/NLP/gits/AutoRAG/autorag/utils/util.py:67, in result_to_dataframe.<locals>.decorator_result_to_dataframe.<locals>.wrapper(*args, **kwargs)
65 @functools.wraps(func)
66 def wrapper(*args, **kwargs) -> pd.DataFrame:
---> 67 \tresults = func(*args, **kwargs)
68 \tif len(column_names) == 1:
69 \t\tdf_input = {column_names[0]: results}
File /home3/dgon/NLP/gits/AutoRAG/autorag/nodes/generator/base.py:49, in generator_node.<locals>.wrapper(project_dir, previous_result, llm, **kwargs)
47 \treturn result
48 else:
---> 49 \treturn func(prompts=prompts, llm=llm, **kwargs)
File /home3/dgon/NLP/gits/AutoRAG/autorag/nodes/generator/vllm.py:38, in vllm(prompts, llm, **kwargs)
33 \traise ImportError(
34 \t\t"Please install vllm library. You can install it by running `pip install vllm`."
35 \t)
37 input_kwargs = deepcopy(kwargs)
---> 38 vllm_model = make_vllm_instance(llm, input_kwargs)
40 if "logprobs" not in input_kwargs:
41 \tinput_kwargs["logprobs"] = 1
File /home3/dgon/NLP/gits/AutoRAG/autorag/nodes/generator/vllm.py:74, in make_vllm_instance(llm, input_args)
72 \tif v is not None:
73 \t\tinput_kwargs[param] = v
---> 74 return LLM(model, **input_kwargs)
File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/entrypoints/llm.py:177, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, **kwargs)
153 raise TypeError(
154 "There is no need to pass vision-related arguments anymore.")
155 engine_args = EngineArgs(
156 model=model,
157 tokenizer=tokenizer,
(...)
175 **kwargs,
176 )
--> 177 self.llm_engine = LLMEngine.from_engine_args(
178 engine_args, usage_context=UsageContext.LLM_CLASS)
179 self.request_counter = Counter()
File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/engine/llm_engine.py:538, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers)
536 executor_class = cls._get_executor_cls(engine_config)
537 # Create the LLM engine.
--> 538 engine = cls(
539 **engine_config.to_dict(),
540 executor_class=executor_class,
541 log_stats=not engine_args.disable_log_stats,
542 usage_context=usage_context,
543 stat_loggers=stat_loggers,
544 )
546 return engine
File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/engine/llm_engine.py:305, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, speculative_config, decoding_config, observability_config, prompt_adapter_config, executor_class, log_stats, usage_context, stat_loggers, input_registry, step_return_finished_only)
301 self.input_registry = input_registry
302 self.input_processor = input_registry.create_input_processor(
303 model_config)
--> 305 self.model_executor = executor_class(
306 model_config=model_config,
307 cache_config=cache_config,
308 parallel_config=parallel_config,
309 scheduler_config=scheduler_config,
310 device_config=device_config,
311 lora_config=lora_config,
312 speculative_config=speculative_config,
313 load_config=load_config,
314 prompt_adapter_config=prompt_adapter_config,
315 observability_config=self.observability_config,
316 )
318 if not self.model_config.embedding_mode:
319 self._initialize_kv_caches()
File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/executor/executor_base.py:47, in ExecutorBase.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, speculative_config, prompt_adapter_config, observability_config)
45 self.prompt_adapter_config = prompt_adapter_config
46 self.observability_config = observability_config
---> 47 self._init_executor()
File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/executor/gpu_executor.py:40, in GPUExecutor._init_executor(self)
38 self.driver_worker = self._create_worker()
39 self.driver_worker.init_device()
---> 40 self.driver_worker.load_model()
File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/worker/worker.py:182, in Worker.load_model(self)
181 def load_model(self):
--> 182 self.model_runner.load_model()
File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/worker/model_runner.py:917, in GPUModelRunnerBase.load_model(self)
915 logger.info("Starting to load model %s...", self.model_config.model)
916 with CudaMemoryProfiler() as m:
--> 917 self.model = get_model(model_config=self.model_config,
918 device_config=self.device_config,
919 load_config=self.load_config,
920 lora_config=self.lora_config,
921 parallel_config=self.parallel_config,
922 scheduler_config=self.scheduler_config,
923 cache_config=self.cache_config)
925 self.model_memory_usage = m.consumed_memory
926 logger.info("Loading model weights took %.4f GB",
927 self.model_memory_usage / float(2**30))
File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py:19, in get_model(model_config, load_config, device_config, parallel_config, scheduler_config, lora_config, cache_config)
13 def get_model(*, model_config: ModelConfig, load_config: LoadConfig,
14 device_config: DeviceConfig, parallel_config: ParallelConfig,
15 scheduler_config: SchedulerConfig,
16 lora_config: Optional[LoRAConfig],
17 cache_config: CacheConfig) -> nn.Module:
18 loader = get_model_loader(load_config)
---> 19 return loader.load_model(model_config=model_config,
20 device_config=device_config,
21 lora_config=lora_config,
22 parallel_config=parallel_config,
23 scheduler_config=scheduler_config,
24 cache_config=cache_config)
File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py:341, in DefaultModelLoader.load_model(self, model_config, device_config, lora_config, parallel_config, scheduler_config, cache_config)
339 with set_default_torch_dtype(model_config.dtype):
340 with target_device:
--> 341 model = _initialize_model(model_config, self.load_config,
342 lora_config, cache_config,
343 scheduler_config)
344 model.load_weights(
345 self._get_weights_iterator(model_config.model,
346 model_config.revision,
(...)
349 "fall_back_to_pt_during_load",
350 True)), )
352 for _, module in model.named_modules():
File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py:170, in _initialize_model(model_config, load_config, lora_config, cache_config, scheduler_config)
167 """Initialize a model with the given configurations."""
168 model_class, _ = get_model_architecture(model_config)
--> 170 return build_model(
171 model_class,
172 model_config.hf_config,
173 cache_config=cache_config,
174 quant_config=_get_quantization_config(model_config, load_config),
175 lora_config=lora_config,
176 multimodal_config=model_config.multimodal_config,
177 scheduler_config=scheduler_config,
178 )
File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py:155, in build_model(model_class, hf_config, cache_config, quant_config, lora_config, multimodal_config, scheduler_config)
145 def build_model(model_class: Type[nn.Module], hf_config: PretrainedConfig,
146 cache_config: Optional[CacheConfig],
147 quant_config: Optional[QuantizationConfig], *,
148 lora_config: Optional[LoRAConfig],
149 multimodal_config: Optional[MultiModalConfig],
150 scheduler_config: Optional[SchedulerConfig]) -> nn.Module:
151 extra_kwargs = _get_model_initialization_kwargs(model_class, lora_config,
152 multimodal_config,
153 scheduler_config)
--> 155 return model_class(config=hf_config,
156 cache_config=cache_config,
157 quant_config=quant_config,
158 **extra_kwargs)
File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/model_executor/models/gemma2.py:329, in Gemma2ForCausalLM.__init__(failed resolving arguments)
327 assert config.tie_word_embeddings
328 self.quant_config = quant_config
--> 329 self.model = Gemma2Model(config, cache_config, quant_config)
330 self.logits_processor = LogitsProcessor(
331 config.vocab_size, soft_cap=config.final_logit_softcapping)
332 self.sampler = Sampler()
File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/model_executor/models/gemma2.py:255, in Gemma2Model.__init__(self, config, cache_config, quant_config)
249 self.config = config
251 self.embed_tokens = VocabParallelEmbedding(
252 config.vocab_size,
253 config.hidden_size,
254 )
--> 255 self.layers = nn.ModuleList([
256 Gemma2DecoderLayer(layer_idx, config, cache_config, quant_config)
257 for layer_idx in range(config.num_hidden_layers)
258 ])
259 self.norm = GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
261 # Normalize the embedding by sqrt(hidden_size)
262 # The normalizer's data type should be downcasted to the model's
263 # data type such as bfloat16, not float32.
264 # See huggingface/transformers#29402
File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/model_executor/models/gemma2.py:256, in <listcomp>(.0)
249 self.config = config
251 self.embed_tokens = VocabParallelEmbedding(
252 config.vocab_size,
253 config.hidden_size,
254 )
255 self.layers = nn.ModuleList([
--> 256 Gemma2DecoderLayer(layer_idx, config, cache_config, quant_config)
257 for layer_idx in range(config.num_hidden_layers)
258 ])
259 self.norm = GemmaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
261 # Normalize the embedding by sqrt(hidden_size)
262 # The normalizer's data type should be downcasted to the model's
263 # data type such as bfloat16, not float32.
264 # See huggingface/transformers#29402
File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/model_executor/models/gemma2.py:181, in Gemma2DecoderLayer.__init__(self, layer_idx, config, cache_config, quant_config)
179 super().__init__()
180 self.hidden_size = config.hidden_size
--> 181 self.self_attn = Gemma2Attention(
182 layer_idx=layer_idx,
183 config=config,
184 hidden_size=self.hidden_size,
185 num_heads=config.num_attention_heads,
186 num_kv_heads=config.num_key_value_heads,
187 head_dim=config.head_dim,
188 max_position_embeddings=config.max_position_embeddings,
189 rope_theta=config.rope_theta,
190 cache_config=cache_config,
191 quant_config=quant_config,
192 attn_logits_soft_cap=config.attn_logit_softcapping,
193 )
194 self.hidden_size = config.hidden_size
195 self.mlp = Gemma2MLP(
196 hidden_size=self.hidden_size,
197 intermediate_size=config.intermediate_size,
(...)
200 quant_config=quant_config,
201 )
File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/model_executor/models/gemma2.py:147, in Gemma2Attention.__init__(self, layer_idx, config, hidden_size, num_heads, num_kv_heads, head_dim, max_position_embeddings, rope_theta, cache_config, quant_config, attn_logits_soft_cap)
144 use_sliding_window = (layer_idx % 2 == 1
145 and config.sliding_window is not None)
146 del use_sliding_window # Unused.
--> 147 self.attn = Attention(self.num_heads,
148 self.head_dim,
149 self.scaling,
150 num_kv_heads=self.num_kv_heads,
151 cache_config=cache_config,
152 quant_config=quant_config,
153 logits_soft_cap=attn_logits_soft_cap)
File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/attention/layer.py:84, in Attention.__init__(self, num_heads, head_size, scale, num_kv_heads, alibi_slopes, cache_config, quant_config, blocksparse_params, logits_soft_cap, prefix)
79 attn_backend = get_attn_backend(num_heads, head_size, num_kv_heads,
80 sliding_window, dtype, kv_cache_dtype,
81 block_size, blocksparse_params
82 is not None)
83 impl_cls = attn_backend.get_impl_cls()
---> 84 self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
85 alibi_slopes, sliding_window, kv_cache_dtype,
86 blocksparse_params, logits_soft_cap)
File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/attention/backends/xformers.py:422, in XFormersImpl.__init__(self, num_heads, head_size, scale, num_kv_heads, alibi_slopes, sliding_window, kv_cache_dtype, blocksparse_params, logits_soft_cap)
419 raise ValueError(
420 "XFormers does not support block-sparse attention.")
421 if logits_soft_cap is not None:
--> 422 raise ValueError(
423 "XFormers does not support attention logits soft capping.")
424 self.num_heads = num_heads
425 self.head_size = head_size
ValueError: XFormers does not support attention logits soft capping."
}
Code where the bug happened
Desktop (please complete the following information):
AutoRAG 0.2.15
torch 2.4.0+cu118