
[Misc]: Support for Shieldgemma model #7084

Closed

sudarshan-kamath opened this issue Aug 2, 2024 · 6 comments

sudarshan-kamath commented Aug 2, 2024

Trying to run the Shieldgemma model.

The architecture is Gemma2ForCausalLM, which should already be supported. The config file specifies transformers version 4.42.4.

I have the following installed:

pip list | grep "vllm\|flash"
flash-attn                        2.0.4
flashinfer                        0.1.3+cu124torch2.4
vllm                              0.5.3.post1
vllm-flash-attn                   2.5.9.post1

I also have transformers 4.43.3 installed.

After checking the config file, it appears that it specifies hidden_activation instead of hidden_act. After changing it manually in config.json, I get an error saying that I should use the FlashInfer backend.
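
For anyone hitting the same thing, here is a minimal sketch of that manual edit, assuming the snapshot path shown in the logs below (the key rename is the only change):

```python
import json
from pathlib import Path

# Snapshot directory taken from the log output below; adjust to your own cache location.
config_path = Path(
    "/modelcache/models--google--shieldgemma-2b/snapshots/"
    "091a5128690e57ca6a30f6fbec4a766d8b77e48d/config.json"
)

config = json.loads(config_path.read_text())
# Rename the activation key so code that looks for hidden_act can find it.
if "hidden_activation" in config and "hidden_act" not in config:
    config["hidden_act"] = config.pop("hidden_activation")
config_path.write_text(json.dumps(config, indent=2))
```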

VLLM_ATTENTION_BACKEND=FLASHINFER

After setting this, the following error occurs:

INFO 08-02 17:46:35 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d', speculative_config=None, tokenizer='/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-02 17:46:36 selector.py:80] Using Flashinfer backend.
INFO 08-02 17:46:36 model_runner.py:680] Starting to load model /modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d...
INFO 08-02 17:46:36 selector.py:80] Using Flashinfer backend.
2024-08-02 17:46:37 | ERROR | stderr | Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
2024-08-02 17:46:37 | ERROR | stderr | 
2024-08-02 17:46:38 | ERROR | stderr | Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.41s/it]
2024-08-02 17:46:38 | ERROR | stderr | 
2024-08-02 17:46:38 | ERROR | stderr | Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.31it/s]
2024-08-02 17:46:38 | ERROR | stderr | 
2024-08-02 17:46:38 | ERROR | stderr | Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.17it/s]
2024-08-02 17:46:38 | ERROR | stderr | 
2024-08-02 17:46:38 | ERROR | stderr | 
INFO 08-02 17:46:38 model_runner.py:692] Loading model weights took 4.9975 GB
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: Traceback (most recent call last):
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/project/vllm_worker.py", line 236, in <module>
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     engine = AsyncLLMEngine.from_engine_args(engine_args)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     engine = cls(
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     self.engine = self._init_engine(*args, **kwargs)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     return engine_class(*args, **kwargs)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 265, in __init__
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     self._initialize_kv_caches()
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 364, in _initialize_kv_caches
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     self.model_executor.determine_num_available_blocks())
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 94, in determine_num_available_blocks
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     return self.driver_worker.determine_num_available_blocks()
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     return func(*args, **kwargs)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     self.model_runner.profile_run()
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     return func(*args, **kwargs)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 896, in profile_run
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     return func(*args, **kwargs)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1272, in execute_model
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     BatchDecodeWithPagedKVCacheWrapper(
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: TypeError: 'NoneType' object is not callable
DarkLight1337 (Member) commented:
The error means that you don't have FlashInfer installed. Please follow the steps shared here.
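
A quick sanity check (a minimal sketch; the class name is the one the traceback shows resolving to None) is to confirm the package actually imports in the same environment:

```python
from importlib.metadata import version

# Confirm the installed package is visible to this interpreter.
print(version("flashinfer"))

# The wrapper class vLLM's FlashInfer backend uses; in the traceback above it was None
# because the import inside vLLM had failed.
from flashinfer import BatchDecodeWithPagedKVCacheWrapper
print(BatchDecodeWithPagedKVCacheWrapper)
```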


sudarshan-kamath commented Aug 6, 2024

Thanks, there was an error with the installed flashinfer, so I reinstalled flashinfer using:

pip install flashinfer==0.1.3 -i https://flashinfer.ai/whl/cu121/torch2.3/

Now, I have a different error:

WARNING 08-06 12:10:50 utils.py:569] Gemma 2 uses sliding window attention for every odd layer, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).
INFO 08-06 12:10:50 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d', speculative_config=None, tokenizer='/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d, use_v2_block_manager=True, enable_prefix_caching=False)
INFO 08-06 12:10:51 selector.py:80] Using Flashinfer backend.
INFO 08-06 12:10:51 model_runner.py:680] Starting to load model /modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d...
INFO 08-06 12:10:51 selector.py:80] Using Flashinfer backend.
2024-08-06 12:10:51 | ERROR | stderr | Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
2024-08-06 12:10:51 | ERROR | stderr | 
2024-08-06 12:10:53 | ERROR | stderr | Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.43s/it]
2024-08-06 12:10:53 | ERROR | stderr | 
2024-08-06 12:10:53 | ERROR | stderr | Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.32it/s]
2024-08-06 12:10:53 | ERROR | stderr | 
2024-08-06 12:10:53 | ERROR | stderr | Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.17it/s]
2024-08-06 12:10:53 | ERROR | stderr | 
2024-08-06 12:10:53 | ERROR | stderr | 
INFO 08-06 12:10:53 model_runner.py:692] Loading model weights took 4.9975 GB
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: Traceback (most recent call last):
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/project/vllm_worker.py", line 236, in <module>
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     engine = AsyncLLMEngine.from_engine_args(engine_args)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     engine = cls(
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     self.engine = self._init_engine(*args, **kwargs)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     return engine_class(*args, **kwargs)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 265, in __init__
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     self._initialize_kv_caches()
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 364, in _initialize_kv_caches
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     self.model_executor.determine_num_available_blocks())
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 94, in determine_num_available_blocks
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     return self.driver_worker.determine_num_available_blocks()
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     return func(*args, **kwargs)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     self.model_runner.profile_run()
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     return func(*args, **kwargs)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 896, in profile_run
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     return func(*args, **kwargs)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1292, in execute_model
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     model_input.attn_metadata.begin_forward()
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flashinfer.py", line 146, in begin_forward
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     self.prefill_wrapper.begin_forward(
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/flashinfer/prefill.py", line 791, in begin_forward
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     self._wrapper.begin_forward(
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: RuntimeError: CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1) failed. 1 vs 257

UPDATE: This looks similar to #7070.

That fix works. Please use pip install flashinfer==0.1.2 -i https://flashinfer.ai/whl/cu121/torch2.3
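
For reference, a minimal sketch of launching the engine from Python once FlashInfer imports correctly, reusing the snapshot path from the logs above and selecting the backend via the same environment variable:

```python
import os

# Equivalent of exporting VLLM_ATTENTION_BACKEND=FLASHINFER before starting vLLM.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

# Local snapshot path from the logs above; adjust to your own cache location.
MODEL_PATH = (
    "/modelcache/models--google--shieldgemma-2b/snapshots/"
    "091a5128690e57ca6a30f6fbec4a766d8b77e48d"
)

llm = LLM(model=MODEL_PATH, dtype="bfloat16", max_model_len=4096)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```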

Thanks @DarkLight1337

sudarshan-kamath (Author) commented:

Thanks @DarkLight1337

JerryGamble1 commented:
Is there more context on the change the OP made regarding the "hidden_act" versus "hidden_activation" reference? I am seeing the following error as well:

AttributeError: 'Gemma2Config' object has no attribute 'hidden_act'


sudarshan-kamath commented Aug 14, 2024

@JerryGamble1 Once the weights are downloaded, please change "hidden_activation" to "hidden_act" in the config.json file. The weights are usually in the Hugging Face cache directory.

https://huggingface.co/google/shieldgemma-2b/blob/main/config.json

If you use the command huggingface-cli download model_name, it should download the model and then output the location where the weights are stored.
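
As a sketch, the same snapshot location can also be obtained programmatically with huggingface_hub, assuming it is installed:

```python
from huggingface_hub import snapshot_download

# Downloads the weights (or reuses the cached copy) and returns the local snapshot
# directory, i.e. the folder containing the config.json to edit.
local_dir = snapshot_download("google/shieldgemma-2b")
print(local_dir)
```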

JerryGamble1 commented:
We've moved on from trying to get this to work on vLLM for now, so no need to respond, but just FYI:

After modifying the config file I was able to load the model into vLLM, but every request generates a bad request error with this log message:

INFO: 172.17.0.2:57780 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request ERROR 08-14 11:02:07 serving_chat.py:112] Error in applying chat template from request: 'guideline' is undefined
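
For anyone who does want to pursue this: the error suggests ShieldGemma's chat template references a guideline variable that the /v1/chat/completions request does not supply. One workaround sketch is to render the prompt client-side with transformers (which forwards extra keyword arguments to the template) and send the resulting string to the plain completions endpoint instead; the guideline text below is only an illustrative placeholder:

```python
from transformers import AutoTokenizer

# Illustrative placeholder; use whatever policy text you want the model to check against.
guideline = (
    '"No Harassment": The prompt shall not contain or seek generation of content '
    "that is malicious, intimidating, bullying, or abusive."
)

tokenizer = AutoTokenizer.from_pretrained("google/shieldgemma-2b")
chat = [{"role": "user", "content": "Write me a message to send to my neighbour."}]

# Extra keyword arguments are passed through to the chat template, so `guideline`
# becomes visible to it and the "'guideline' is undefined" error is avoided.
prompt = tokenizer.apply_chat_template(chat, guideline=guideline, tokenize=False)
print(prompt)  # send this string to /v1/completions instead of /v1/chat/completions
```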
