Your current environment
The output of `python collect_env.py`: I could not run the script, so the relevant details are listed manually below.
OS: Ubuntu 22.04.5 LTS
CUDA Driver Version: 535.230.02
CUDA Version: 12.2
GPU Configuration: NVIDIA L40S 48GB
I am trying to serve the nomic-embed-text v1 and v1.5 embedding models through the vLLM Docker image, using the following command:
docker run --name nomic_inf -d --rm --runtime nvidia --gpus device=0 -v /var/models:/var/models -p 8070:8000 --ipc=host vllm/vllm-openai:latest --model /var/models/nomic-embed-text-v1.5 --served-model-name nomic-embed-text --task=embed --trust-remote-code
But I am getting the following error:
INFO 06-26 06:53:05 [__init__.py:239] Automatically detected platform cuda.
INFO 06-26 06:53:07 [api_server.py:1034] vLLM API server version 0.8.4
INFO 06-26 06:53:07 [api_server.py:1035] args: Namespace(host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/var/models/nomic-embed-text-v1.5', task='embed', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config=None, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='auto', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['nomic-embed-text'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
A new version of the following files was downloaded from https://huggingface.co/nomic-ai/nomic-bert-2048:
- configuration_hf_nomic_bert.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
INFO 06-26 06:53:07 [config.py:553] Found sentence-transformers tokenize configuration.
INFO 06-26 06:53:07 [config.py:2832] Downcasting torch.float32 to torch.float16.
INFO 06-26 06:53:13 [config.py:449] Found sentence-transformers modules configuration.
INFO 06-26 06:53:13 [config.py:469] Found pooling configuration.
WARNING 06-26 06:53:13 [arg_utils.py:1731] --task embed is not supported by the V1 Engine. Falling back to V0.
INFO 06-26 06:53:13 [api_server.py:246] Started engine process with PID 221
INFO 06-26 06:53:15 [__init__.py:239] Automatically detected platform cuda.
INFO 06-26 06:53:16 [llm_engine.py:243] Initializing a V0 LLM engine (v0.8.4) with config: model='/var/models/nomic-embed-text-v1.5', speculative_config=None, tokenizer='/var/models/nomic-embed-text-v1.5', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=nomic-embed-text, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=PoolerConfig(pooling_type='MEAN', normalize=False, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
INFO 06-26 06:53:17 [cuda.py:292] Using Flash Attention backend.
INFO 06-26 06:53:17 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 06-26 06:53:17 [model_runner.py:1110] Starting to load model /var/models/nomic-embed-text-v1.5...
A new version of the following files was downloaded from https://huggingface.co/nomic-ai/nomic-bert-2048:
- modeling_hf_nomic_bert.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Process SpawnProcess-1:
ERROR 06-26 06:53:18 [engine.py:448] NomicBertModel has no vLLM implementation and the Transformers implementation is not compatible with vLLM. Try setting VLLM_USE_V1=0.
ERROR 06-26 06:53:18 [engine.py:448] Traceback (most recent call last):
ERROR 06-26 06:53:18 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
ERROR 06-26 06:53:18 [engine.py:448] engine = MQLLMEngine.from_vllm_config(
ERROR 06-26 06:53:18 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-26 06:53:18 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config
ERROR 06-26 06:53:18 [engine.py:448] return cls(
ERROR 06-26 06:53:18 [engine.py:448] ^^^^
ERROR 06-26 06:53:18 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 82, in __init__
ERROR 06-26 06:53:18 [engine.py:448] self.engine = LLMEngine(*args, **kwargs)
ERROR 06-26 06:53:18 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-26 06:53:18 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 282, in __init__
ERROR 06-26 06:53:18 [engine.py:448] self.model_executor = executor_class(vllm_config=vllm_config, )
ERROR 06-26 06:53:18 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-26 06:53:18 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 06-26 06:53:18 [engine.py:448] self._init_executor()
ERROR 06-26 06:53:18 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
ERROR 06-26 06:53:18 [engine.py:448] self.collective_rpc("load_model")
ERROR 06-26 06:53:18 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 06-26 06:53:18 [engine.py:448] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 06-26 06:53:18 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-26 06:53:18 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2378, in run_method
ERROR 06-26 06:53:18 [engine.py:448] return func(*args, **kwargs)
ERROR 06-26 06:53:18 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-26 06:53:18 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 183, in load_model
ERROR 06-26 06:53:18 [engine.py:448] self.model_runner.load_model()
ERROR 06-26 06:53:18 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1113, in load_model
ERROR 06-26 06:53:18 [engine.py:448] self.model = get_model(vllm_config=self.vllm_config)
ERROR 06-26 06:53:18 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-26 06:53:18 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
ERROR 06-26 06:53:18 [engine.py:448] return loader.load_model(vllm_config=vllm_config)
ERROR 06-26 06:53:18 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-26 06:53:18 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 452, in load_model
ERROR 06-26 06:53:18 [engine.py:448] model = _initialize_model(vllm_config=vllm_config)
ERROR 06-26 06:53:18 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-26 06:53:18 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 123, in _initialize_model
ERROR 06-26 06:53:18 [engine.py:448] model_class, _ = get_model_architecture(model_config)
ERROR 06-26 06:53:18 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-26 06:53:18 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 104, in get_model_architecture
ERROR 06-26 06:53:18 [engine.py:448] architectures = resolve_transformers_arch(model_config, architectures)
ERROR 06-26 06:53:18 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-26 06:53:18 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 72, in resolve_transformers_arch
ERROR 06-26 06:53:18 [engine.py:448] raise ValueError(
ERROR 06-26 06:53:18 [engine.py:448] ValueError: NomicBertModel has no vLLM implementation and the Transformers implementation is not compatible with vLLM. Try setting VLLM_USE_V1=0.
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 450, in run_mp_engine
raise e
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
engine = MQLLMEngine.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 82, in __init__
self.engine = LLMEngine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 282, in __init__
self.model_executor = executor_class(vllm_config=vllm_config, )
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
self._init_executor()
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
self.collective_rpc("load_model")
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2378, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 183, in load_model
self.model_runner.load_model()
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1113, in load_model
self.model = get_model(vllm_config=self.vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
return loader.load_model(vllm_config=vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 452, in load_model
model = _initialize_model(vllm_config=vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 123, in _initialize_model
model_class, _ = get_model_architecture(model_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 104, in get_model_architecture
architectures = resolve_transformers_arch(model_config, architectures)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 72, in resolve_transformers_arch
raise ValueError(
ValueError: NomicBertModel has no vLLM implementation and the Transformers implementation is not compatible with vLLM. Try setting VLLM_USE_V1=0.
[rank0]:[W626 06:53:18.776894509 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1121, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
async with build_async_engine_client(args) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 269, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
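The error message itself suggests setting VLLM_USE_V1=0. If I understand correctly, that means passing the environment variable into the container, something like the following (a sketch of what I plan to try, not yet verified on my setup):

docker run --name nomic_inf -d --rm --runtime nvidia --gpus device=0 -e VLLM_USE_V1=0 -v /var/models:/var/models -p 8070:8000 --ipc=host vllm/vllm-openai:latest --model /var/models/nomic-embed-text-v1.5 --served-model-name nomic-embed-text --task=embed --trust-remote-code

Is this the right way to apply that setting, or is NomicBertModel simply not supported by this vLLM version?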
How would you like to use vllm
I want to run inference of nomic-ai/nomic-embed-text-v1.5 (and v1) as embedding models. I don't know how to integrate them with vLLM.
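For context, once the server comes up I intend to call the OpenAI-compatible embeddings endpoint, roughly like this (assuming the port mapping and served model name from the docker command above):

curl http://localhost:8070/v1/embeddings -H "Content-Type: application/json" -d '{"model": "nomic-embed-text", "input": "The food was delicious."}'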
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.