working with VLLM #53

Open

kousun12 opened this issue May 9, 2024 · 2 comments
@kousun12

kousun12 commented May 9, 2024

I'm wondering if I can get an easier pipeline by loading the AWQ weights with vLLM:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is"
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
model_id = 'Efficient-Large-Model/VILA1.5-13b-AWQ'

llm = LLM(model=model_id, quantization="awq", dtype="half")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

The first issue seems to be that config.json specifies a model type called llava_llama, which transformers doesn't recognize.

/home/ray/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 945, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 647, in __getitem__
    raise KeyError(key)
KeyError: 'llava_llama'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "//testvllm.py", line 13, in <module>
    llm = LLM(model=model_id, quantization="awq", dtype="half")
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 123, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 272, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 520, in create_engine_config
    model_config = ModelConfig(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/config.py", line 121, in __init__
    self.hf_config = get_config(self.model, trust_remote_code, revision,
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 38, in get_config
    raise e
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 23, in get_config
    config = AutoConfig.from_pretrained(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 947, in from_pretrained
    raise ValueError(
ValueError: The checkpoint you are trying to load has model type `llava_llama` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
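
For reference, this is the kind of workaround I was considering (just a sketch, not verified: whether the checkpoint actually ships remote code for llava_llama is unclear, and the llava.model import path below is a guess):

from transformers import AutoConfig
from vllm import LLM

model_id = "Efficient-Large-Model/VILA1.5-13b-AWQ"

# Option 1: let transformers pull a custom config class from the repo itself,
# if the checkpoint ships one for `llava_llama` (unverified).
llm = LLM(model=model_id, quantization="awq", dtype="half",
          trust_remote_code=True)

# Option 2: register VILA's config class from a locally installed VILA package
# before constructing the engine (import path is hypothetical).
# from llava.model import LlavaLlamaConfig
# AutoConfig.register("llava_llama", LlavaLlamaConfig)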

If I change the model type in config.json to just llava, I get:

/home/ray/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
WARNING 05-09 09:38:26 config.py:205] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 05-09 09:38:26 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='Efficient-Large-Model/VILA1.5-13b-AWQ', speculative_config=None, tokenizer='Efficient-Large-Model/VILA1.5-13b-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=Efficient-Large-Model/VILA1.5-13b-AWQ)
Traceback (most recent call last):
  File "//testvllm.py", line 13, in <module>
    llm = LLM(model=model_id, quantization="awq", dtype="half")
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 123, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 292, in from_engine_args
    engine = cls(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 150, in __init__
    self._init_tokenizer()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 328, in _init_tokenizer
    self.tokenizer = get_tokenizer_group(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/tokenizer_group/__init__.py", line 20, in get_tokenizer_group
    return TokenizerGroup(**init_kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/tokenizer_group/tokenizer_group.py", line 23, in __init__
    self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/transformers_utils/tokenizer.py", line 92, in get_tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 880, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2073, in from_pretrained
    raise EnvironmentError(
OSError: Can't load tokenizer for 'Efficient-Large-Model/VILA1.5-13b-AWQ'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'Efficient-Large-Model/VILA1.5-13b-AWQ' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.

This seems to suggest that the Llama tokenizer isn't in the llm directory? Do we need a tokenizer.json in the repo? Even if I add one, it still has trouble loading the tokenizer.

@ys-2020
Collaborator

ys-2020 commented May 11, 2024

Hi @kousun12, thanks for your interest in VILA! For the first question: which version of transformers are you using, and how did you install VILA? The llava_llama model arch should already be defined if you have installed VILA together with the right version of transformers. For the second question: the tokenizer.json lives under the llm folder of the model repo, not at its root, so you may need to modify your loading code to point there. Please also make sure the VILA model is served with the newest TinyChat backend.
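
Roughly, the tokenizer path adjustment could look like the sketch below (an assumption-laden example, not a verified end-to-end fix: it only points vLLM's tokenizer argument at the downloaded llm/ subfolder and says nothing about whether vLLM can run this multimodal architecture):

from huggingface_hub import snapshot_download
from vllm import LLM

# Download the full repo locally, then point vLLM's tokenizer argument at the
# llm/ subfolder that actually contains the tokenizer files.
local_dir = snapshot_download("Efficient-Large-Model/VILA1.5-13b-AWQ")
llm = LLM(model=local_dir,
          tokenizer=f"{local_dir}/llm",
          quantization="awq",
          dtype="half",
          trust_remote_code=True)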

@Griffintaur

@ys-2020 Can you point out the key differences between the modelling falcon used in VILA and the modelling falcon in the transformers library?
