[Installation&Bug]: First example. TypeError: CommonMetadataBuilder.build() missing 1 required positional argument: 'block_state' #2
Comments
I manually added ...

[rank0]: Traceback (most recent call last):
...

As a fix, I added ...
With the above changes, I managed to run it. However, this time I am getting a different error:

[rank0]: Traceback (most recent call last):
...

I investigated it. Best,
Hi, thanks for the interest! At the moment we only support the FlashAttention attention backend, which seems to be incompatible with your GPU architecture, as can be seen from these logs:

INFO 10-28 20:18:14 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-28 20:18:14 selector.py:116] Using XFormers backend.

I'll add clearer error reporting for this case, and will update you when support for the xformers backend is added.
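A quick way to confirm this up front is to check the GPU's compute capability: FlashAttention-2 needs Ampere (compute capability 8.0) or newer, while Volta and Turing GPUs report 7.0/7.5. Below is a minimal pre-flight check; it is an illustrative sketch, not the actual selection logic in vLLM's selector.py.

```python
import torch

# Pre-flight check: FlashAttention-2 requires an NVIDIA GPU with compute
# capability >= 8.0 (Ampere or newer). Volta (7.0) and Turing (7.5) GPUs
# are rejected, and vLLM falls back to another backend such as XFormers.
major, minor = torch.cuda.get_device_capability()
name = torch.cuda.get_device_name()
print(f"{name}: compute capability {major}.{minor}")

if (major, minor) >= (8, 0):
    print("FlashAttention-2 backend should be selectable on this GPU.")
else:
    print("FlashAttention-2 is unsupported here; vLLM will fall back to a "
          "different backend, which KV-Compress does not support yet.")
```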
Thank you for the quick response! Looking forward to seeing it for the xformers backend.
Thank you for the great work and the pre-print! I have a question about running the code, and I would appreciate it if you could answer it.
As for installation, I followed the standard steps.
Then, I tried a simple LongBench run:
python run_longbench.py --dataset narrativeqa --model llama3 --protected-window-size 8 --prefill-metric-collection-window-size 8 --max-cache-tokens 512
However, I am getting a "missing 1 required positional argument: 'block_state'" error. The full error trace is the following:

/workspace/vllm-kvcompress-main/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm.commit_id'
from vllm.version import __version__ as VLLM_VERSION
WARNING 10-28 20:18:13 config.py:632] Model has sliding window configured, but it will be disabled due to incompatibility with KV-Compress.
WARNING 10-28 20:18:13 config.py:380] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 10-28 20:18:13 llm_engine.py:219] Initializing an LLM engine (v0.6.0) with config: model='daryl149/llama-2-7b-chat-hf', speculative_config=None, tokenizer='daryl149/llama-2-7b-chat-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=daryl149/llama-2-7b-chat-hf, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=False)
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Allocating context_lens - Mem: 0.0
Allocating block table - Mem: 0.001048576
Allocating head bias - Mem: 0.13526732800000002
INFO 10-28 20:18:14 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-28 20:18:14 selector.py:116] Using XFormers backend.
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 10-28 20:18:15 model_runner.py:964] Starting to load model daryl149/llama-2-7b-chat-hf...
INFO 10-28 20:18:15 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-28 20:18:15 selector.py:116] Using XFormers backend.
INFO 10-28 20:18:15 weight_utils.py:236] Using model weights format ['*.bin']
Loading pt checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
/workspace/vllm-kvcompress-main/vllm/model_executor/model_loader/weight_utils.py:416: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state = torch.load(bin_file, map_location="cpu")
Loading pt checkpoint shards: 50% Completed | 1/2 [00:06<00:06, 6.06s/it]
Loading pt checkpoint shards: 100% Completed | 2/2 [00:08<00:00, 3.74s/it]
Loading pt checkpoint shards: 100% Completed | 2/2 [00:08<00:00, 4.08s/it]
INFO 10-28 20:18:23 model_runner.py:975] Loading model weights took 12.5518 GB
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/vllm-kvcompress-main/experiments/run_longbench.py", line 185, in
[rank0]: main(args)
[rank0]: File "/workspace/vllm-kvcompress-main/experiments/run_longbench.py", line 63, in main
[rank0]: model = LLM(
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/entrypoints/llm.py", line 177, in init
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/engine/llm_engine.py", line 584, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/engine/llm_engine.py", line 359, in init
[rank0]: self._initialize_kv_caches()
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/engine/llm_engine.py", line 494, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks(kv_metrics))
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/executor/gpu_executor.py", line 122, in determine_num_available_blocks
[rank0]: return self.driver_worker.determine_num_available_blocks(kv_metrics)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/worker/worker.py", line 237, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/worker/model_runner.py", line 1175, in profile_run
[rank0]: model_input = self.prepare_model_input(
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/worker/model_runner.py", line 1430, in prepare_model_input
[rank0]: model_input = self._prepare_model_input_tensors(
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/worker/model_runner.py", line 1091, in _prepare_model_input_tensors
[rank0]: return builder.build() # type: ignore
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/worker/model_runner.py", line 784, in build
[rank0]: attn_metadata = self.attn_metadata_builder.build(
[rank0]: TypeError: CommonMetadataBuilder.build() missing 1 required positional argument: 'block_state'
Could you help me with how to fix this?
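For context, the TypeError at the bottom of the trace is a plain signature mismatch: the metadata builder used here (CommonMetadataBuilder, the one the XFormers backend apparently ends up with) defines build() with a required block_state parameter that the call in model_runner.py does not supply. Below is a small, self-contained illustration of that failure mode; the class and parameter names are hypothetical stand-ins, not the actual vllm-kvcompress code.

```python
# Illustrative sketch of the signature mismatch behind the TypeError above.
# Names and signatures are hypothetical, not the actual vllm-kvcompress code.

class CommonMetadataBuilder:
    # Here build() requires block-state information, so block_state is a
    # required positional argument.
    def build(self, seq_lens, query_lens, block_state):
        return {"seq_lens": seq_lens,
                "query_lens": query_lens,
                "block_state": block_state}

builder = CommonMetadataBuilder()

# A call site written against the original two-argument signature omits
# block_state, which raises exactly the kind of error in the traceback:
try:
    builder.build([2048], [2048])
except TypeError as err:
    # e.g. "CommonMetadataBuilder.build() missing 1 required positional
    # argument: 'block_state'"
    print(err)
```

Until the xformers backend is supported, running on a GPU where the FlashAttention backend is selected avoids this mismatch entirely.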