
Conversation

@MengqingCao (Collaborator) commented Apr 28, 2025

What this PR does / why we need it?

Fix the None output issue caused by the outputs patch.
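
For context, the symptom is that a patched output-processing path can return None instead of the request outputs, so llm.generate() yields nothing. Below is a minimal, hypothetical sketch of that failure mode and the kind of guard such a fix adds; the wrapper name and structure are illustrative only, not the actual vllm-ascend patch:

# Hypothetical sketch only -- not the actual vllm-ascend code.
# A monkey-patched output hook that forgets to return its result makes
# callers see None, which matches the symptom this PR fixes.
from typing import Any, Callable, List


def patch_process_outputs(
    original_fn: Callable[..., List[Any]],
) -> Callable[..., List[Any]]:
    """Wrap an output-processing function without swallowing its return value."""

    def wrapper(*args: Any, **kwargs: Any) -> List[Any]:
        outputs = original_fn(*args, **kwargs)
        # The essential point: propagate `outputs` instead of implicitly
        # returning None after doing extra platform-specific work.
        return outputs

    return wrapper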

How was this patch tested?

Test script:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
# Create an LLM.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Before this PR (the progress bar stays at 0% and no generated text is printed):

(atb) (base) cmq@cmq-docker:~/code/vllm-ascend$ python examples/offline_inference_npu.py 
INFO 04-28 13:47:37 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-28 13:47:37 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-28 13:47:37 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-28 13:47:37 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-28 13:47:37 __init__.py:44] plugin ascend loaded.
INFO 04-28 13:47:37 __init__.py:198] Platform plugin ascend is activated
INFO 04-28 13:47:37 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-28 13:47:37 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-28 13:47:37 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-28 13:47:37 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-28 13:47:37 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 04-28 13:47:37 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:CustomQwen2VLForConditionalGeneration.
WARNING 04-28 13:47:37 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 04-28 13:47:37 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-28 13:47:37 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-28 13:47:37 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 04-28 13:47:37 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 04-28 13:47:53 config.py:549] This model supports multiple tasks: {'generate', 'score', 'classify', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 04-28 13:47:53 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='Qwen/Qwen2.5-0.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-0.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-0.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, 
/home/cmq/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:292: ImportWarning: 
    *************************************************************************************************************
    The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
    The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
    The backend in torch.distributed.init_process_group set to hccl now..
    The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
    The device parameters have been replaced with npu in the function below:
    torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty
    *************************************************************************************************************
    
  warnings.warn(msg, ImportWarning)
/home/cmq/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:247: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
  warnings.warn(msg, RuntimeWarning)
WARNING 04-28 13:47:55 utils.py:2262] Methods add_lora,add_prompt_adapter,cache_config,compilation_config,current_platform,list_loras,list_prompt_adapters,load_config,pin_lora,pin_prompt_adapter,remove_lora,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffd0be3f340>
INFO 04-28 13:47:57 model_runner.py:822] Starting to load model Qwen/Qwen2.5-0.5B-Instruct...
INFO 04-28 13:47:58 weight_utils.py:254] Using model weights format ['*.safetensors']
INFO 04-28 13:47:59 weight_utils.py:304] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.06it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.06it/s]

INFO 04-28 13:48:00 model_runner.py:827] Loading model weights took 0.9277 GB
INFO 04-28 13:48:05 executor_base.py:111] # npu blocks: 34853, # CPU blocks: 2730
INFO 04-28 13:48:05 executor_base.py:116] Maximum concurrency for 32768 tokens per request: 136.14x
INFO 04-28 13:48:05 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 5.71 seconds
Processed prompts:   0%|                                                                          | 0/4 [00:02<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
/home/cmq/miniconda3/envs/atb/lib/python3.10/tempfile.py:869: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmp8hswfhml'>
  _warnings.warn(warn_message, ResourceWarning)

After this PR (all four prompts complete and generated text is printed):

(atb) (base) cmq@cmq-docker:~/code/vllm-ascend$ python examples/offline_inference_npu.py 
INFO 04-28 13:45:22 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-28 13:45:22 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-28 13:45:22 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-28 13:45:22 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-28 13:45:22 __init__.py:44] plugin ascend loaded.
INFO 04-28 13:45:22 __init__.py:198] Platform plugin ascend is activated
INFO 04-28 13:45:22 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-28 13:45:22 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-28 13:45:22 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-28 13:45:22 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-28 13:45:22 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 04-28 13:45:22 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:CustomQwen2VLForConditionalGeneration.
WARNING 04-28 13:45:22 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 04-28 13:45:22 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-28 13:45:22 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-28 13:45:22 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 04-28 13:45:22 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 04-28 13:45:37 config.py:549] This model supports multiple tasks: {'embed', 'reward', 'score', 'classify', 'generate'}. Defaulting to 'generate'.
INFO 04-28 13:45:37 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='Qwen/Qwen2.5-0.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-0.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-0.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, 
/home/cmq/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:292: ImportWarning: 
    *************************************************************************************************************
    The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
    The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
    The backend in torch.distributed.init_process_group set to hccl now..
    The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
    The device parameters have been replaced with npu in the function below:
    torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty
    *************************************************************************************************************
    
  warnings.warn(msg, ImportWarning)
/home/cmq/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:247: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
  warnings.warn(msg, RuntimeWarning)
WARNING 04-28 13:45:40 utils.py:2262] Methods add_lora,add_prompt_adapter,cache_config,compilation_config,current_platform,list_loras,list_prompt_adapters,load_config,pin_lora,pin_prompt_adapter,remove_lora,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffcfb017280>
INFO 04-28 13:45:41 model_runner.py:822] Starting to load model Qwen/Qwen2.5-0.5B-Instruct...
INFO 04-28 13:45:42 weight_utils.py:254] Using model weights format ['*.safetensors']
INFO 04-28 13:45:43 weight_utils.py:270] Time spent downloading weights for Qwen/Qwen2.5-0.5B-Instruct: 0.839531 seconds
INFO 04-28 13:45:45 weight_utils.py:304] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.58it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.58it/s]

INFO 04-28 13:45:45 model_runner.py:827] Loading model weights took 0.9277 GB
INFO 04-28 13:45:50 executor_base.py:111] # npu blocks: 34853, # CPU blocks: 2730
INFO 04-28 13:45:50 executor_base.py:116] Maximum concurrency for 32768 tokens per request: 136.14x
INFO 04-28 13:45:51 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 5.41 seconds
Processed prompts: 100%|████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.38it/s, est. speed input: 7.60 toks/s, output: 138.11 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Alex and I am a 17 year old male. I have been diagnosed with a rare genetic disorder called X-linked recessive. I have been told that I will not be able to have children. I have been told that I will not be able to have children because of the gene that I have. I have been told that I will not be able to have children because of the gene that I have. I have been told that I will not be able to have children because of the gene'
Prompt: 'The president of the United States is', Generated text: ' a very important person. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country'
Prompt: 'The capital of France is', Generated text: ' Paris. It is the largest city in Europe and the second largest city in the world. It is located in the south of France, on the banks of the Seine River. It is situated on the Île de la Cité, which is a small island in the center of the city. The city is surrounded by the Seine River, which flows through the city. The city is also surrounded by the Pyrenees mountains, which are located to the north of the city. The city'
Prompt: 'The future of AI is', Generated text: ' in the hands of the people. The future of AI is in the hands of the people. The future of AI is in the hands of the people. The future of AI is in the hands of the people. The future of AI is in the hands of the people. The future of AI is in the hands of the people. The future of AI is in the hands of the people. The future of AI is in the hands of the people. The future of AI is in the hands of'
/home/cmq/miniconda3/envs/atb/lib/python3.10/tempfile.py:869: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmplxgsfg2t'>
  _warnings.warn(warn_message, ResourceWarning)

Signed-off-by: MengqingCao <cmq0113@163.com>
@zouyida2052 (Contributor) commented:
Fine, I've tested it on Qwen2-VL and Qwen2.5-VL and both work as expected.

@MengqingCao (Collaborator, Author) commented:
> Fine, I've tested it on Qwen2-VL and Qwen2.5-VL and both work as expected.

Thanks for the check! I think this PR is ready to merge. @wangxiyuan @ganyi1996ppo

@wangxiyuan wangxiyuan merged commit f2e5501 into vllm-project:v0.7.3-dev Apr 28, 2025
11 checks passed
@wangxiyuan (Collaborator) commented:
Thanks for the quick fix.

@MengqingCao MengqingCao deleted the fix_output branch May 6, 2025 02:24