AWQ support #464

anslin-raj · 2024-05-14T19:19:31Z

I ran into an error with the vLLM framework when I tried to run inference on an Unsloth fine-tuned Llama 3 8B model...

Error:

(venv) ubuntu@ip-192-168-68-10:~/ans/vllm-server$ python -O -u -m vllm.entrypoints.openai.api_server --host=127.0.0.1 --port=8000 --model=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit --tokenizer=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit --dtype=half
INFO 05-14 09:46:09 api_server.py:151] vLLM API server version 0.4.1
INFO 05-14 09:46:09 api_server.py:152] args: Namespace(host='127.0.0.1', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit', tokenizer='/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit', skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, model_loader_extra_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 159, in <module>
    engine = AsyncLLMEngine.from_engine_args(
  File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 341, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 464, in create_engine_config
    model_config = ModelConfig(
  File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/config.py", line 115, in __init__
    self._verify_quantization()
  File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/config.py", line 160, in _verify_quantization
    raise ValueError(
ValueError: Unknown quantization method: bitsandbytes. Must be one of ['aqlm', 'awq', 'fp8', 'gptq', 'squeezellm', 'marlin'].

Code:

# Imports added for completeness. max_seq_length, dtype, load_in_4bit, dataset and
# RichProgressCallback are defined earlier in the original script and are not shown here.
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments

# Load the base model (4-bit when load_in_4bit is True).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Meta-Llama-3-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# Attach LoRA adapters to the attention and MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none", # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False, # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    callbacks = [RichProgressCallback],
    args = TrainingArguments(
        # num_train_epochs=1,
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # max_steps = 2048,
        max_steps = 5,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        # logging_dir=f"/home/ubuntu/ans/llama3_pipeline/fine_tuning/logs",
    ),
)

trainer_stats = trainer.train()

# Save the merged model as 4-bit; this is the checkpoint that vLLM rejects above.
if True:
    model.save_pretrained_merged(
        "/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit",
        tokenizer,
        save_method = "merged_4bit_forced",
    )

vLLM CLI:

python -O -u -m vllm.entrypoints.openai.api_server --host=127.0.0.1 --port=8000 --model=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit --tokenizer=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit

Package Versions:

unsloth 2024.4
vllm 0.4.1
NVIDIA-SMI 550.67
Driver Version 550.67
CUDA Version 12.4
Python 3.10.12
torch 2.2.1

Hardware used:

Tesla T4 GPU
Memory 32 GB
8 core CPU


Karry11 · 2024-05-15T14:55:11Z

I think you can refer to the answer in #253; it seems that vLLM currently only supports AWQ 4-bit or 8-bit quantization.
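
One way to confirm what vLLM is objecting to is to read the `quant_method` recorded in the saved checkpoint's `config.json`. The sketch below assumes the checkpoint was written through the Hugging Face `save_pretrained` path, which stores a `quantization_config` block for quantized models. The helper names are made up for illustration, and the supported list is copied from the vLLM 0.4.1 error message above rather than queried from vLLM itself.

```python
import json
from pathlib import Path

# Methods listed in the ValueError raised by vLLM 0.4.1 in this issue.
VLLM_0_4_1_METHODS = ["aqlm", "awq", "fp8", "gptq", "squeezellm", "marlin"]


def detect_quant_method(model_dir):
    """Return the quant_method stored in config.json, or None if there is none."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    quant_cfg = config.get("quantization_config") or {}
    return quant_cfg.get("quant_method")


def is_loadable_by_vllm_0_4_1(model_dir, supported=VLLM_0_4_1_METHODS):
    """Hypothetical helper: True if the checkpoint is unquantized or uses a listed method."""
    method = detect_quant_method(model_dir)
    return method is None or method in supported
```

For the `merged_4bit_forced` checkpoint in this issue, `detect_quant_method` would be expected to return `bitsandbytes`, which matches the ValueError.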

danielhanchen · 2024-05-15T19:12:56Z

You need to change merged_4bit_forced to merged_16bit.

anslin-raj · 2024-05-18T16:28:21Z

Thanks for the response @Karry11 @danielhanchen,

I tried merged_16bit, but it requires more VRAM and I only have 16 GB. Is there any other way to run the model in vLLM with a 4-bit quantization method?
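
Some rough arithmetic on why the 16-bit export does not fit on this card. This is a back-of-the-envelope sketch, not a measurement: it assumes roughly 8.03 billion parameters for Llama 3 8B, counts weights only (no KV cache, activations, or quantization scales), and treats the card's 16 GB as 16 GiB, which is slightly generous. The 0.9 factor is the `gpu_memory_utilization` value shown in the args log above.

```python
GIB = 1024 ** 3


def weight_gib(n_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


N_PARAMS = 8.03e9             # assumption: approximate Llama 3 8B parameter count
GPU_GIB = 16                  # assumption: "16 GB VRAM" treated as 16 GiB
GPU_MEMORY_UTILIZATION = 0.9  # from the vLLM args log in this issue

budget = GPU_GIB * GPU_MEMORY_UTILIZATION
fp16 = weight_gib(N_PARAMS, 16)
int4 = weight_gib(N_PARAMS, 4)

print(f"vLLM budget: {budget:.1f} GiB")
print(f"16-bit weights: {fp16:.1f} GiB, fits: {fp16 < budget}")
print(f"4-bit weights: {int4:.1f} GiB, fits: {int4 < budget}")
```

Under these assumptions the 16-bit weights alone come to about 15 GiB, which is over the roughly 14.4 GiB vLLM would claim, before any KV cache is allocated. The 4-bit weights need under 4 GiB.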

sparsh35 · 2024-05-24T03:42:58Z

Convert it to AWQ if you want to use vLLM; otherwise, use Unsloth inference for 4-bit models.
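
For reference, a hedged sketch of that conversion using the AutoAWQ package (`pip install autoawq`), which is one common way to produce AWQ checkpoints and is not part of Unsloth. It assumes the fine-tuned model was first saved with `save_method = "merged_16bit"`, since AutoAWQ quantizes from full-precision weights, and that a GPU is available. The paths and the function name are placeholders, and the quantization settings are the ones commonly shown in AutoAWQ's examples, not values tuned for this model.

```python
# Typical 4-bit AWQ settings as shown in AutoAWQ's examples
# (an assumption, not something tuned for this model).
AWQ_QUANT_CONFIG = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}


def quantize_to_awq(merged_16bit_dir: str, awq_out_dir: str) -> None:
    """Quantize a merged 16-bit checkpoint to AWQ and save it (hypothetical helper)."""
    # Imported lazily so this file can be loaded without autoawq installed.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(merged_16bit_dir)
    tokenizer = AutoTokenizer.from_pretrained(merged_16bit_dir)

    # Runs calibration and quantizes the linear layers; this is the slow step.
    model.quantize(tokenizer, quant_config=AWQ_QUANT_CONFIG)

    # Write the quantized weights and the tokenizer side by side.
    model.save_quantized(awq_out_dir)
    tokenizer.save_pretrained(awq_out_dir)
```

The output directory can then be passed to vLLM's `--model`; `awq` is among the methods vLLM lists in the error above.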

danielhanchen · 2024-05-24T10:27:39Z

Ye AWQ is nice :) We might be adding an AWQ option for exporting!

subhamiitk · 2024-05-24T18:56:53Z

What's the current best option if I have to serve this 4-bit fine-tuned model with vLLM? Is it to convert it to 16-bit and then run inference?

danielhanchen · 2024-05-25T09:33:18Z

@subhamiitk Use model.save_pretrained_merged("location", tokenizer, save_method = "merged_16bit",) then use vLLM
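
Put together, the two steps might look like the sketch below. The `save_pretrained_merged` call is the one quoted in this thread, and the server flags are the ones from the original post; the function names and the example path are placeholders for illustration.

```python
import shlex


def save_merged_16bit(model, tokenizer, out_dir: str) -> None:
    """Write merged 16-bit weights using the Unsloth call quoted above."""
    model.save_pretrained_merged(out_dir, tokenizer, save_method="merged_16bit")


def build_vllm_command(model_dir: str, host: str = "127.0.0.1",
                       port: int = 8000, dtype: str = "half") -> list[str]:
    """Rebuild the api_server command from the original post for a new model dir."""
    return [
        "python", "-O", "-u", "-m", "vllm.entrypoints.openai.api_server",
        f"--host={host}",
        f"--port={port}",
        f"--model={model_dir}",
        f"--tokenizer={model_dir}",
        f"--dtype={dtype}",
    ]


# Print the command as a single shell line; the path is a placeholder.
print(shlex.join(build_vllm_command("/path/to/vllm_merged_16bit")))
```

Keeping the command as a list makes it easy to hand to `subprocess.run` without worrying about shell quoting.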

anslin-raj · 2024-05-30T06:14:45Z

Thanks for the consideration @danielhanchen

wrisigo · 2024-06-25T15:10:08Z

vLLM's multi-LoRA deployment option, combined with PEFT's recently released support for training adapters on top of already AWQ-quantized models, opens up some really useful possibilities for inference. Mainly, budget GPUs could easily serve multiple adapters under one AWQ model, minimizing the memory footprint and thereby improving throughput.

Exporting an AWQ model is great, but I also see value in training adapters on already AWQ-quantized models. Is there any desire to support this? It would be killer to have Unsloth's performance boosts for this type of fine-tuning.
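
To put a rough number on the memory argument, the sketch below counts the LoRA parameters for the configuration in the original post (`r = 16` on all seven projection modules). The layer dimensions are assumptions taken from the published Llama 3 8B config (hidden size 4096, MLP size 14336, 8 KV heads of dimension 128, 32 layers), and the count covers adapter weights only, stored in 16-bit.

```python
# Assumed Llama 3 8B dimensions (from the published model config).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128
NUM_LAYERS = 32
RANK = 16  # r from the original post

# (in_features, out_features) for each LoRA target module in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, shapes, num_layers):
    """Each adapted layer adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


adapter_params = lora_param_count(RANK, TARGET_SHAPES, NUM_LAYERS)
adapter_mib = adapter_params * 2 / 1024 ** 2  # 2 bytes per 16-bit weight
print(f"{adapter_params:,} LoRA parameters, about {adapter_mib:.0f} MiB in 16-bit")
```

Under these assumptions each adapter is about 42 million parameters, or roughly 80 MiB, so several adapters add little on top of a single shared AWQ base model of a few GiB.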

danielhanchen · 2024-07-01T00:36:22Z

So sorry for the delay, I just relocated to SF. Exporting to AWQ is on the roadmap for now. Directly fine-tuning AWQ models could work as well, but it will require changing fast_dequantize.

anslin-raj · 2024-07-02T05:20:11Z

@danielhanchen no issues, thanks for the update... ✨

vladrad · 2024-07-03T21:59:43Z

Fine-tuning an AWQ model would be amazing. I see it has PEFT support in transformers (huggingface/transformers#28987).
This would be amazing to have; it would mean everyone can just work with AWQ models. @danielhanchen
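
A minimal sketch of what that could look like with plain transformers and PEFT, outside Unsloth. It assumes `autoawq`, `accelerate`, and `peft` are installed in versions that include the AWQ integration referenced above, and the model path is a placeholder. The LoRA settings mirror the ones in the original post, and the function name is made up for illustration. It does not use Unsloth's optimized code paths.

```python
# LoRA settings copied from the original post in this issue.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 16,
    "lora_dropout": 0.0,
    "bias": "none",
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
    "task_type": "CAUSAL_LM",
}


def load_awq_with_lora(awq_model_dir: str):
    """Load an AWQ checkpoint and attach trainable LoRA adapters (hypothetical helper)."""
    # Imported lazily so this file can be loaded without the GPU stack installed.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # The AWQ base weights stay frozen and quantized; only the adapters train.
    model = AutoModelForCausalLM.from_pretrained(awq_model_dir, device_map="auto")
    return get_peft_model(model, LoraConfig(**LORA_KWARGS))
```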

danielhanchen · 2024-07-04T05:43:55Z

I'll see what I can do!

vladrad · 2024-07-05T18:14:32Z

Thank you! Let me know if there is anything I can do to help test. I can write code as well; this stuff is not my specialty, but I'd love to learn! Feel free to point me somewhere. Being able to fine-tune an AWQ model on low-end hardware and then not having to wait an hour to convert it is going to be huge!

danielhanchen · 2024-07-06T03:24:30Z

Oh ye converting it to AWQ takes a lot of time!!

StrangeTcy · 2024-08-05T02:17:11Z

Waiting for automagic support of AWQ models as well.
Anything I can do to help or speed things along?

danielhanchen added the feature request Feature request pending on roadmap label May 24, 2024

danielhanchen changed the title ~~Faced an issue with - vllm - inference - llama3 - 8b - 4bit~~ AWQ support May 24, 2024
