
[Installation&Bug]: First example. TypeError: CommonMetadataBuilder.build() missing 1 required positional argument: 'block_state' #2

alpemreacar opened this issue Oct 28, 2024 · 4 comments

alpemreacar commented Oct 28, 2024

Thank you for the great work and the pre-print! I have a question about running the code, and I would appreciate it if you could answer it.

For installation, I followed the standard steps:

docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:24.04-py3
pip install -e .

Then I tried a simple LongBench run:

python run_longbench.py --dataset narrativeqa --model llama3 --protected-window-size 8 --prefill-metric-collection-window-size 8 --max-cache-tokens 512

However, I am getting a missing 1 required positional argument: 'block_state' error. The full error trace is the following:

/workspace/vllm-kvcompress-main/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm.commit_id'
from vllm.version import version as VLLM_VERSION
WARNING 10-28 20:18:13 config.py:632] Model has sliding window configured, but it will be disabled due to incompatibility with KV-Compress.
WARNING 10-28 20:18:13 config.py:380] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 10-28 20:18:13 llm_engine.py:219] Initializing an LLM engine (v0.6.0) with config: model='daryl149/llama-2-7b-chat-hf', speculative_config=None, tokenizer='daryl149/llama-2-7b-chat-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=daryl149/llama-2-7b-chat-hf, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=False)
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Allocating context_lens - Mem: 0.0
Allocating block table - Mem: 0.001048576
Allocating head bias - Mem: 0.13526732800000002
INFO 10-28 20:18:14 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-28 20:18:14 selector.py:116] Using XFormers backend.
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 10-28 20:18:15 model_runner.py:964] Starting to load model daryl149/llama-2-7b-chat-hf...
INFO 10-28 20:18:15 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-28 20:18:15 selector.py:116] Using XFormers backend.
INFO 10-28 20:18:15 weight_utils.py:236] Using model weights format ['*.bin']
Loading pt checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
/workspace/vllm-kvcompress-main/vllm/model_executor/model_loader/weight_utils.py:416: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
state = torch.load(bin_file, map_location="cpu")
Loading pt checkpoint shards: 50% Completed | 1/2 [00:06<00:06, 6.06s/it]
Loading pt checkpoint shards: 100% Completed | 2/2 [00:08<00:00, 3.74s/it]
Loading pt checkpoint shards: 100% Completed | 2/2 [00:08<00:00, 4.08s/it]

INFO 10-28 20:18:23 model_runner.py:975] Loading model weights took 12.5518 GB
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/vllm-kvcompress-main/experiments/run_longbench.py", line 185, in
[rank0]: main(args)
[rank0]: File "/workspace/vllm-kvcompress-main/experiments/run_longbench.py", line 63, in main
[rank0]: model = LLM(
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/entrypoints/llm.py", line 177, in init
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/engine/llm_engine.py", line 584, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/engine/llm_engine.py", line 359, in init
[rank0]: self._initialize_kv_caches()
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/engine/llm_engine.py", line 494, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks(kv_metrics))
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/executor/gpu_executor.py", line 122, in determine_num_available_blocks
[rank0]: return self.driver_worker.determine_num_available_blocks(kv_metrics)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/worker/worker.py", line 237, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/worker/model_runner.py", line 1175, in profile_run
[rank0]: model_input = self.prepare_model_input(
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/worker/model_runner.py", line 1430, in prepare_model_input
[rank0]: model_input = self._prepare_model_input_tensors(
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/worker/model_runner.py", line 1091, in _prepare_model_input_tensors
[rank0]: return builder.build() # type: ignore
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/worker/model_runner.py", line 784, in build
[rank0]: attn_metadata = self.attn_metadata_builder.build(
[rank0]: TypeError: CommonMetadataBuilder.build() missing 1 required positional argument: 'block_state'

Could you help me with how to fix this?

alpemreacar (Author) commented Oct 28, 2024

I manually added block_state=None at L785. Then I got another argument error:

[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/experiments/run_longbench.py", line 185, in
[rank0]: main(args)
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/experiments/run_longbench.py", line 63, in main
[rank0]: model = LLM(
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/entrypoints/llm.py", line 177, in init
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/engine/llm_engine.py", line 584, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/engine/llm_engine.py", line 359, in init
[rank0]: self._initialize_kv_caches()
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/engine/llm_engine.py", line 494, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks(kv_metrics))
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/executor/gpu_executor.py", line 122, in determine_num_available_blocks
[rank0]: return self.driver_worker.determine_num_available_blocks(kv_metrics)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/worker/worker.py", line 237, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/worker/model_runner.py", line 1175, in profile_run
[rank0]: model_input = self.prepare_model_input(
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/worker/model_runner.py", line 1430, in prepare_model_input
[rank0]: model_input = self._prepare_model_input_tensors(
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/worker/model_runner.py", line 1091, in _prepare_model_input_tensors
[rank0]: return builder.build() # type: ignore
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/worker/model_runner.py", line 784, in build
[rank0]: attn_metadata = self.attn_metadata_builder.build(
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/attention/backends/utils.py", line 264, in build
[rank0]: return self._metadata_cls( # type: ignore
[rank0]: TypeError: XFormersMetadata.init() missing 1 required positional argument: 'kv_cache_dtype'

As a fix, I added kv_cache_dtype='auto' at the corresponding L278. I then got a similar kv_cache_dtype error and added the same argument at L226 as well.
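
To show the pattern of both workarounds in a self-contained way, here is a toy example (the class and argument names below are made up for illustration, not taken from the repo): the builder signatures in this fork gained extra required parameters that the generic call sites don't pass, so I filled them in with defaults at the call sites.

# Toy illustration of the signature drift; names are mine, not the repo's.
class MetadataBuilder:
    def build(self, seq_lens, block_state):  # block_state became a required parameter
        return {"seq_lens": seq_lens, "block_state": block_state}

builder = MetadataBuilder()

try:
    builder.build([8, 8])  # call site not updated for the new parameter
except TypeError as err:
    print(err)  # ... missing 1 required positional argument: 'block_state'

# Workaround I applied at the call site: pass an explicit default.
print(builder.build([8, 8], block_state=None))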

alpemreacar (Author) commented:

With the above changes, I managed to run it. However, this time I am getting a different error:

[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/experiments/run_longbench.py", line 188, in
[rank0]: main(args)
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/experiments/run_longbench.py", line 172, in main
[rank0]: output = model.generate(prompt_token_ids=[input_ids],
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/utils.py", line 1038, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/entrypoints/llm.py", line 349, in generate
[rank0]: outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/entrypoints/llm.py", line 706, in _run_engine
[rank0]: step_outputs = self.llm_engine.step()
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/engine/llm_engine.py", line 1627, in step
[rank0]: output = self.model_executor.execute_model(
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/executor/gpu_executor.py", line 138, in execute_model
[rank0]: output = self.driver_worker.execute_model(execute_model_req)
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/worker/worker_base.py", line 313, in execute_model
[rank0]: inputs = self.prepare_input(execute_model_req)
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/worker/worker_base.py", line 301, in prepare_input
[rank0]: return self._get_driver_input_and_broadcast(execute_model_req)
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/worker/worker_base.py", line 263, in _get_driver_input_and_broadcast
[rank0]: self.model_runner.prepare_model_input(
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/worker/model_runner.py", line 1430, in prepare_model_input
[rank0]: model_input = self._prepare_model_input_tensors(
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/worker/model_runner.py", line 1091, in _prepare_model_input_tensors
[rank0]: return builder.build() # type: ignore
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/worker/model_runner.py", line 784, in build
[rank0]: attn_metadata = self.attn_metadata_builder.build(
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/attention/backends/utils.py", line 207, in build
[rank0]: self._add_seq_group(inter_data,
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/attention/backends/utils.py", line 190, in _add_seq_group
[rank0]: compute_slot_mapping(is_profile_run, self.slot_mapping, seq_id,
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/attention/backends/utils.py", line 119, in compute_slot_mapping
[rank0]: _compute_slot_mapping_numpy(slot_mapping, block_table, range_start,
[rank0]: File "/workspace/mlrsh-filer/alpacar/projects/kv-caching/vllm-kvcompress-main/vllm/attention/backends/utils.py", line 78, in _compute_slot_mapping_numpy
[rank0]: seq_slot_mapping_array = block_table_array[idx]
[rank0]: IndexError: too many indices for array: array is 0-dimensional, but 1 were indexed
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

I investigated it: block_table is not a list, so even indexing the 0th element fails. Could you help me solve this problem?
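
For reference, here is a minimal standalone reproduction of that numpy behaviour, assuming block_table ends up as None so that converting it to an array gives a 0-dimensional result (that assumption is mine; I have not confirmed where the value is lost):

import numpy as np

# Normal case: a per-sequence list of block numbers indexes fine.
block_table_array = np.asarray([0, 1, 2, 3])
print(block_table_array[1])  # -> 1

# Suspected case: block_table is None, so the array is 0-dimensional
# and any indexing reproduces the IndexError from the traceback above.
broken = np.asarray(None, dtype=object)
try:
    broken[0]
except IndexError as err:
    print(err)  # too many indices for array: array is 0-dimensional, but 1 were indexed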

Best,

IsaacRe (Owner) commented Oct 28, 2024

Hi, thanks for the interest!

At the moment we only support the FlashAttention backend, which appears to be incompatible with your GPU architecture, as these logs show:

INFO 10-28 20:18:14 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-28 20:18:14 selector.py:116] Using XFormers backend.

I'll add clearer error reporting for this case, and will update you when support for the xformers backend is added.
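
For reference, the kind of check I have in mind is roughly the following (just a sketch; the function name and where it hooks into backend selection are placeholders, not actual code from the repo):

# Illustrative sketch only: fail fast when KV-Compress is enabled but the
# selected attention backend is not FlashAttention (e.g. the XFormers
# fallback on Volta/Turing GPUs, as in the logs above).
def check_kvcompress_backend(backend_name: str) -> None:
    if backend_name != "FLASH_ATTN":
        raise NotImplementedError(
            "KV-Compress currently requires the FlashAttention backend, but "
            f"'{backend_name}' was selected. FlashAttention-2 is not available "
            "on Volta/Turing GPUs, so KV-Compress cannot run on this hardware yet."
        )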

alpemreacar (Author) commented:

Thank you for the quick response! Looking forward to seeing support for the xformers backend.
