I'm wondering if I can get an easier pipeline by loading the AWQ weights with vLLM:
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

model_id = 'Efficient-Large-Model/VILA1.5-13b-AWQ'
llm = LLM(model=model_id, quantization="awq", dtype="half")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
The first issue seems to be that the `config.json` uses a model type called `llava_llama`, which `transformers` doesn't know about.
/home/ray/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 945, in from_pretrained
config_class = CONFIG_MAPPING[config_dict["model_type"]]
File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 647, in __getitem__
raise KeyError(key)
KeyError: 'llava_llama'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "//testvllm.py", line 13, in <module>
llm = LLM(model=model_id, quantization="awq", dtype="half")
File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 123, in __init__
self.llm_engine = LLMEngine.from_engine_args(
File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 272, in from_engine_args
engine_config = engine_args.create_engine_config()
File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 520, in create_engine_config
model_config = ModelConfig(
File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/config.py", line 121, in __init__
self.hf_config = get_config(self.model, trust_remote_code, revision,
File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 38, in get_config
raise e
File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 23, in get_config
config = AutoConfig.from_pretrained(
File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 947, in from_pretrained
raise ValueError(
ValueError: The checkpoint you are trying to load has model type `llava_llama` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
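(For context, the `llava_llama` type can be confirmed straight from the hub config; the snippet below is just a sketch using `huggingface_hub`, which the traceback shows is already installed.)

```python
# Sketch only: inspect the model_type that the checkpoint's config.json declares.
import json
from huggingface_hub import hf_hub_download

config_path = hf_hub_download("Efficient-Large-Model/VILA1.5-13b-AWQ", "config.json")
with open(config_path) as f:
    print(json.load(f)["model_type"])  # prints 'llava_llama', per the traceback above
```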
If I change the type in `config.json` to just `llava`, I get:
/home/ray/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
WARNING 05-09 09:38:26 config.py:205] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 05-09 09:38:26 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='Efficient-Large-Model/VILA1.5-13b-AWQ', speculative_config=None, tokenizer='Efficient-Large-Model/VILA1.5-13b-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=Efficient-Large-Model/VILA1.5-13b-AWQ)
Traceback (most recent call last):
File "//testvllm.py", line 13, in <module>
llm = LLM(model=model_id, quantization="awq", dtype="half")
File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 123, in __init__
self.llm_engine = LLMEngine.from_engine_args(
File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 292, in from_engine_args
engine = cls(
File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 150, in __init__
self._init_tokenizer()
File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 328, in _init_tokenizer
self.tokenizer = get_tokenizer_group(
File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/tokenizer_group/__init__.py", line 20, in get_tokenizer_group
return TokenizerGroup(**init_kwargs)
File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/tokenizer_group/tokenizer_group.py", line 23, in __init__
self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/tokenizer.py", line 92, in get_tokenizer
tokenizer = AutoTokenizer.from_pretrained(
File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 880, in from_pretrained
return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2073, in from_pretrained
raise EnvironmentError(
OSError: Can't load tokenizer for 'Efficient-Large-Model/VILA1.5-13b-AWQ'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'Efficient-Large-Model/VILA1.5-13b-AWQ' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.
This seems to suggest that the Llama tokenizer isn't in the llm directory? Do we need a `tokenizer.json` in the repo? Even if I add that, it still seems to have trouble loading the tokenizer.
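(As a sanity check, one could try loading the tokenizer directly with `transformers`, pointing it at the llm subfolder; the sketch below assumes the tokenizer files actually live there, and `subfolder` is a standard `from_pretrained` argument.)

```python
# Sketch: try loading the tokenizer from the llm/ subfolder of the hub repo.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "Efficient-Large-Model/VILA1.5-13b-AWQ", subfolder="llm"
)
print(type(tok).__name__)  # should report a Llama tokenizer class if the files are there
```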
Hi @kousun12, thanks for your interest in VILA! For the first question, which version of `transformers` are you using, and how did you install VILA? The model arch `llava_llama` should already be defined if you have installed VILA with the right version of `transformers`. For the second question, the tokenizer.json is here, under the llm folder rather than at the root of the model. You may need to modify the loading code for VILA to solve the problem. Please also note that the VILA model needs to be served with the newest TinyChat backend.
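If it helps, one way to point vLLM at that folder would be something like the sketch below (untested; it assumes llm/ holds a complete tokenizer, and it only addresses the tokenizer lookup, not whether vLLM supports the `llava_llama` architecture):

```python
# Sketch only: give vLLM an explicit tokenizer path pointing at the llm/ subfolder.
from huggingface_hub import snapshot_download
from vllm import LLM

local_dir = snapshot_download("Efficient-Large-Model/VILA1.5-13b-AWQ")
llm = LLM(
    model=local_dir,               # config/weights at the repo root
    tokenizer=f"{local_dir}/llm",  # tokenizer files live under llm/ (per the comment above)
    quantization="awq",
    dtype="half",
)
```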