
[Misc]: Support for Shieldgemma model #7084

Closed

sudarshan-kamath opened this issue Aug 2, 2024 · 6 comments

sudarshan-kamath commented Aug 2, 2024

Trying to run the Shieldgemma model.

The architecture is Gemma2ForCausalLM, which should already be supported. The config file specifies transformers version 4.42.4.

I have the following installed:

pip list | grep "vllm\|flash"
flash-attn                        2.0.4
flashinfer                        0.1.3+cu124torch2.4
vllm                              0.5.3.post1
vllm-flash-attn                   2.5.9.post1

I also have transformers 4.43.3 installed.

After checking the config file, it appears that it specifies hidden_activation instead of hidden_act. After changing it manually in config.json, I get an error saying that I should use the FlashInfer backend.
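
For anyone hitting the same thing, here is a minimal sketch of that manual edit, assuming the snapshot path shown in the logs below (the key rename is the only change):

```python
import json
from pathlib import Path

# Snapshot directory taken from the log output below; adjust to your own cache location.
config_path = Path(
    "/modelcache/models--google--shieldgemma-2b/snapshots/"
    "091a5128690e57ca6a30f6fbec4a766d8b77e48d/config.json"
)

config = json.loads(config_path.read_text())
# Rename the activation key so code that looks for hidden_act can find it.
if "hidden_activation" in config and "hidden_act" not in config:
    config["hidden_act"] = config.pop("hidden_activation")
config_path.write_text(json.dumps(config, indent=2))
```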

VLLM_ATTENTION_BACKEND=FLASHINFER

After setting this, the following error occurs:

INFO 08-02 17:46:35 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d', speculative_config=None, tokenizer='/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-02 17:46:36 selector.py:80] Using Flashinfer backend.
INFO 08-02 17:46:36 model_runner.py:680] Starting to load model /modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d...
INFO 08-02 17:46:36 selector.py:80] Using Flashinfer backend.
2024-08-02 17:46:37 | ERROR | stderr | Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
2024-08-02 17:46:37 | ERROR | stderr | 
2024-08-02 17:46:38 | ERROR | stderr | Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.41s/it]
2024-08-02 17:46:38 | ERROR | stderr | 
2024-08-02 17:46:38 | ERROR | stderr | Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.31it/s]
2024-08-02 17:46:38 | ERROR | stderr | 
2024-08-02 17:46:38 | ERROR | stderr | Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.17it/s]
2024-08-02 17:46:38 | ERROR | stderr | 
2024-08-02 17:46:38 | ERROR | stderr | 
INFO 08-02 17:46:38 model_runner.py:692] Loading model weights took 4.9975 GB
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: Traceback (most recent call last):
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/project/vllm_worker.py", line 236, in <module>
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     engine = AsyncLLMEngine.from_engine_args(engine_args)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     engine = cls(
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     self.engine = self._init_engine(*args, **kwargs)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     return engine_class(*args, **kwargs)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 265, in __init__
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     self._initialize_kv_caches()
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 364, in _initialize_kv_caches
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     self.model_executor.determine_num_available_blocks())
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 94, in determine_num_available_blocks
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     return self.driver_worker.determine_num_available_blocks()
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     return func(*args, **kwargs)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     self.model_runner.profile_run()
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     return func(*args, **kwargs)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 896, in profile_run
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     return func(*args, **kwargs)
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1272, in execute_model
2024-08-02 17:46:38 | ERROR | stderr | [rank0]:     BatchDecodeWithPagedKVCacheWrapper(
2024-08-02 17:46:38 | ERROR | stderr | [rank0]: TypeError: 'NoneType' object is not callable
DarkLight1337 (Member) commented:
The error means that you don't have FlashInfer installed. Please follow the steps shared here.
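
A quick sanity check (a minimal sketch; the class name is the one the traceback shows resolving to None) is to confirm the package actually imports in the same environment:

```python
from importlib.metadata import version

# Confirm the installed package is visible to this interpreter.
print(version("flashinfer"))

# The wrapper class vLLM's FlashInfer backend uses; in the traceback above it was None
# because the import inside vLLM had failed.
from flashinfer import BatchDecodeWithPagedKVCacheWrapper
print(BatchDecodeWithPagedKVCacheWrapper)
```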


sudarshan-kamath commented Aug 6, 2024

Thanks, there was an error with the installed flashinfer, so I reinstalled flashinfer using:

pip install flashinfer==0.1.3 -i https://flashinfer.ai/whl/cu121/torch2.3/

Now, I have a different error:

WARNING 08-06 12:10:50 utils.py:569] Gemma 2 uses sliding window attention for every odd layer, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).
INFO 08-06 12:10:50 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d', speculative_config=None, tokenizer='/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d, use_v2_block_manager=True, enable_prefix_caching=False)
INFO 08-06 12:10:51 selector.py:80] Using Flashinfer backend.
INFO 08-06 12:10:51 model_runner.py:680] Starting to load model /modelcache/models--google--shieldgemma-2b/snapshots/091a5128690e57ca6a30f6fbec4a766d8b77e48d...
INFO 08-06 12:10:51 selector.py:80] Using Flashinfer backend.
2024-08-06 12:10:51 | ERROR | stderr | Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
2024-08-06 12:10:51 | ERROR | stderr | 
2024-08-06 12:10:53 | ERROR | stderr | Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.43s/it]
2024-08-06 12:10:53 | ERROR | stderr | 
2024-08-06 12:10:53 | ERROR | stderr | Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.32it/s]
2024-08-06 12:10:53 | ERROR | stderr | 
2024-08-06 12:10:53 | ERROR | stderr | Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.17it/s]
2024-08-06 12:10:53 | ERROR | stderr | 
2024-08-06 12:10:53 | ERROR | stderr | 
INFO 08-06 12:10:53 model_runner.py:692] Loading model weights took 4.9975 GB
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: Traceback (most recent call last):
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/project/vllm_worker.py", line 236, in <module>
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     engine = AsyncLLMEngine.from_engine_args(engine_args)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     engine = cls(
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     self.engine = self._init_engine(*args, **kwargs)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     return engine_class(*args, **kwargs)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 265, in __init__
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     self._initialize_kv_caches()
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 364, in _initialize_kv_caches
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     self.model_executor.determine_num_available_blocks())
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 94, in determine_num_available_blocks
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     return self.driver_worker.determine_num_available_blocks()
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     return func(*args, **kwargs)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     self.model_runner.profile_run()
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     return func(*args, **kwargs)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 896, in profile_run
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     return func(*args, **kwargs)
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1292, in execute_model
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     model_input.attn_metadata.begin_forward()
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flashinfer.py", line 146, in begin_forward
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     self.prefill_wrapper.begin_forward(
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:   File "/usr/local/lib/python3.10/dist-packages/flashinfer/prefill.py", line 791, in begin_forward
2024-08-06 12:10:53 | ERROR | stderr | [rank0]:     self._wrapper.begin_forward(
2024-08-06 12:10:53 | ERROR | stderr | [rank0]: RuntimeError: CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1) failed. 1 vs 257

UPDATE: This looks similar to #7070.

That fix works. Please use pip install flashinfer==0.1.2 -i https://flashinfer.ai/whl/cu121/torch2.3
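
For reference, a minimal sketch of launching the engine from Python once FlashInfer imports correctly, reusing the snapshot path from the logs above and selecting the backend via the same environment variable:

```python
import os

# Equivalent of exporting VLLM_ATTENTION_BACKEND=FLASHINFER before starting vLLM.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

# Local snapshot path from the logs above; adjust to your own cache location.
MODEL_PATH = (
    "/modelcache/models--google--shieldgemma-2b/snapshots/"
    "091a5128690e57ca6a30f6fbec4a766d8b77e48d"
)

llm = LLM(model=MODEL_PATH, dtype="bfloat16", max_model_len=4096)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```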

Thanks @DarkLight1337

sudarshan-kamath (Author) commented:

Thanks @DarkLight1337

JerryGamble1 commented:
Is there more context on the change the OP made regarding the "hidden_act" versus "hidden_activation" reference? I am seeing the following error as well:

AttributeError: 'Gemma2Config' object has no attribute 'hidden_act'


sudarshan-kamath commented Aug 14, 2024

@JerryGamble1 Once the weights are downloaded, please change "hidden_activation" to "hidden_act" in the config.json file. The weights are usually in the Hugging Face cache directory.

https://huggingface.co/google/shieldgemma-2b/blob/main/config.json

If you use the command huggingface-cli download model_name, it should download the model and then output the location where the weights are stored.
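
As a sketch, the same snapshot location can also be obtained programmatically with huggingface_hub, assuming it is installed:

```python
from huggingface_hub import snapshot_download

# Downloads the weights (or reuses the cached copy) and returns the local snapshot
# directory, i.e. the folder containing the config.json to edit.
local_dir = snapshot_download("google/shieldgemma-2b")
print(local_dir)
```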

JerryGamble1 commented:
We've moved on from trying to get this to work on vLLM for now, so no need to respond, but just FYI:

After modifying the config file I was able to load the model into vLLM, but every request generates a bad request error with this log message:

INFO: 172.17.0.2:57780 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request ERROR 08-14 11:02:07 serving_chat.py:112] Error in applying chat template from request: 'guideline' is undefined
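
For anyone who does want to pursue this: the error suggests ShieldGemma's chat template references a guideline variable that the /v1/chat/completions request does not supply. One workaround sketch is to render the prompt client-side with transformers (which forwards extra keyword arguments to the template) and send the resulting string to the plain completions endpoint instead; the guideline text below is only an illustrative placeholder:

```python
from transformers import AutoTokenizer

# Illustrative placeholder; use whatever policy text you want the model to check against.
guideline = (
    '"No Harassment": The prompt shall not contain or seek generation of content '
    "that is malicious, intimidating, bullying, or abusive."
)

tokenizer = AutoTokenizer.from_pretrained("google/shieldgemma-2b")
chat = [{"role": "user", "content": "Write me a message to send to my neighbour."}]

# Extra keyword arguments are passed through to the chat template, so `guideline`
# becomes visible to it and the "'guideline' is undefined" error is avoided.
prompt = tokenizer.apply_chat_template(chat, guideline=guideline, tokenize=False)
print(prompt)  # send this string to /v1/completions instead of /v1/chat/completions
```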
