
Conversation

@MengqingCao (Collaborator) commented Apr 28, 2025

What this PR does / why we need it?

Fix the None output issue caused by the outputs patch.
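
For context, the symptom is that a patched output-processing path can return None instead of the request outputs, so llm.generate() yields nothing. Below is a minimal, hypothetical sketch of that failure mode and the kind of guard such a fix adds; the wrapper name and structure are illustrative only, not the actual vllm-ascend patch:

# Hypothetical sketch only -- not the actual vllm-ascend code.
# A monkey-patched output hook that forgets to return its result makes
# callers see None, which matches the symptom this PR fixes.
from typing import Any, Callable, List


def patch_process_outputs(
    original_fn: Callable[..., List[Any]],
) -> Callable[..., List[Any]]:
    """Wrap an output-processing function without swallowing its return value."""

    def wrapper(*args: Any, **kwargs: Any) -> List[Any]:
        outputs = original_fn(*args, **kwargs)
        # The essential point: propagate `outputs` instead of implicitly
        # returning None after doing extra platform-specific work.
        return outputs

    return wrapper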

How was this patch tested?

Test script:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
# Create an LLM.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Before this PR (the progress bar stays at 0% and no generated text is printed):

(atb) (base) cmq@cmq-docker:~/code/vllm-ascend$ python examples/offline_inference_npu.py 
INFO 04-28 13:47:37 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-28 13:47:37 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-28 13:47:37 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-28 13:47:37 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-28 13:47:37 __init__.py:44] plugin ascend loaded.
INFO 04-28 13:47:37 __init__.py:198] Platform plugin ascend is activated
INFO 04-28 13:47:37 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-28 13:47:37 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-28 13:47:37 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-28 13:47:37 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-28 13:47:37 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 04-28 13:47:37 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:CustomQwen2VLForConditionalGeneration.
WARNING 04-28 13:47:37 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 04-28 13:47:37 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-28 13:47:37 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-28 13:47:37 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 04-28 13:47:37 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 04-28 13:47:53 config.py:549] This model supports multiple tasks: {'generate', 'score', 'classify', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 04-28 13:47:53 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='Qwen/Qwen2.5-0.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-0.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-0.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, 
/home/cmq/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:292: ImportWarning: 
    *************************************************************************************************************
    The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
    The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
    The backend in torch.distributed.init_process_group set to hccl now..
    The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
    The device parameters have been replaced with npu in the function below:
    torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty
    *************************************************************************************************************
    
  warnings.warn(msg, ImportWarning)
/home/cmq/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:247: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
  warnings.warn(msg, RuntimeWarning)
WARNING 04-28 13:47:55 utils.py:2262] Methods add_lora,add_prompt_adapter,cache_config,compilation_config,current_platform,list_loras,list_prompt_adapters,load_config,pin_lora,pin_prompt_adapter,remove_lora,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffd0be3f340>
INFO 04-28 13:47:57 model_runner.py:822] Starting to load model Qwen/Qwen2.5-0.5B-Instruct...
INFO 04-28 13:47:58 weight_utils.py:254] Using model weights format ['*.safetensors']
INFO 04-28 13:47:59 weight_utils.py:304] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.06it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.06it/s]

INFO 04-28 13:48:00 model_runner.py:827] Loading model weights took 0.9277 GB
INFO 04-28 13:48:05 executor_base.py:111] # npu blocks: 34853, # CPU blocks: 2730
INFO 04-28 13:48:05 executor_base.py:116] Maximum concurrency for 32768 tokens per request: 136.14x
INFO 04-28 13:48:05 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 5.71 seconds
Processed prompts:   0%|                                                                          | 0/4 [00:02<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
/home/cmq/miniconda3/envs/atb/lib/python3.10/tempfile.py:869: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmp8hswfhml'>
  _warnings.warn(warn_message, ResourceWarning)

After this PR (all four prompts complete and generated text is printed):

(atb) (base) cmq@cmq-docker:~/code/vllm-ascend$ python examples/offline_inference_npu.py 
INFO 04-28 13:45:22 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-28 13:45:22 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-28 13:45:22 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-28 13:45:22 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-28 13:45:22 __init__.py:44] plugin ascend loaded.
INFO 04-28 13:45:22 __init__.py:198] Platform plugin ascend is activated
INFO 04-28 13:45:22 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-28 13:45:22 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-28 13:45:22 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-28 13:45:22 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-28 13:45:22 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 04-28 13:45:22 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:CustomQwen2VLForConditionalGeneration.
WARNING 04-28 13:45:22 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 04-28 13:45:22 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-28 13:45:22 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-28 13:45:22 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 04-28 13:45:22 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 04-28 13:45:37 config.py:549] This model supports multiple tasks: {'embed', 'reward', 'score', 'classify', 'generate'}. Defaulting to 'generate'.
INFO 04-28 13:45:37 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='Qwen/Qwen2.5-0.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-0.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-0.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, 
/home/cmq/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:292: ImportWarning: 
    *************************************************************************************************************
    The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
    The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
    The backend in torch.distributed.init_process_group set to hccl now..
    The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
    The device parameters have been replaced with npu in the function below:
    torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty
    *************************************************************************************************************
    
  warnings.warn(msg, ImportWarning)
/home/cmq/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:247: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
  warnings.warn(msg, RuntimeWarning)
WARNING 04-28 13:45:40 utils.py:2262] Methods add_lora,add_prompt_adapter,cache_config,compilation_config,current_platform,list_loras,list_prompt_adapters,load_config,pin_lora,pin_prompt_adapter,remove_lora,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffcfb017280>
INFO 04-28 13:45:41 model_runner.py:822] Starting to load model Qwen/Qwen2.5-0.5B-Instruct...
INFO 04-28 13:45:42 weight_utils.py:254] Using model weights format ['*.safetensors']
INFO 04-28 13:45:43 weight_utils.py:270] Time spent downloading weights for Qwen/Qwen2.5-0.5B-Instruct: 0.839531 seconds
INFO 04-28 13:45:45 weight_utils.py:304] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.58it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.58it/s]

INFO 04-28 13:45:45 model_runner.py:827] Loading model weights took 0.9277 GB
INFO 04-28 13:45:50 executor_base.py:111] # npu blocks: 34853, # CPU blocks: 2730
INFO 04-28 13:45:50 executor_base.py:116] Maximum concurrency for 32768 tokens per request: 136.14x
INFO 04-28 13:45:51 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 5.41 seconds
Processed prompts: 100%|████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.38it/s, est. speed input: 7.60 toks/s, output: 138.11 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Alex and I am a 17 year old male. I have been diagnosed with a rare genetic disorder called X-linked recessive. I have been told that I will not be able to have children. I have been told that I will not be able to have children because of the gene that I have. I have been told that I will not be able to have children because of the gene that I have. I have been told that I will not be able to have children because of the gene'
Prompt: 'The president of the United States is', Generated text: ' a very important person. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country. He is the leader of the country'
Prompt: 'The capital of France is', Generated text: ' Paris. It is the largest city in Europe and the second largest city in the world. It is located in the south of France, on the banks of the Seine River. It is situated on the Île de la Cité, which is a small island in the center of the city. The city is surrounded by the Seine River, which flows through the city. The city is also surrounded by the Pyrenees mountains, which are located to the north of the city. The city'
Prompt: 'The future of AI is', Generated text: ' in the hands of the people. The future of AI is in the hands of the people. The future of AI is in the hands of the people. The future of AI is in the hands of the people. The future of AI is in the hands of the people. The future of AI is in the hands of the people. The future of AI is in the hands of the people. The future of AI is in the hands of the people. The future of AI is in the hands of'
/home/cmq/miniconda3/envs/atb/lib/python3.10/tempfile.py:869: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmplxgsfg2t'>
  _warnings.warn(warn_message, ResourceWarning)

Signed-off-by: MengqingCao <cmq0113@163.com>
@zouyida2052 (Contributor) commented:
Fine, I've tested it on Qwen2-VL and Qwen2.5-VL and both work as expected.

@MengqingCao (Collaborator, Author) commented:
> Fine, I've tested it on Qwen2-VL and Qwen2.5-VL and both work as expected.

Thanks for the check! I think this PR is ready to merge. @wangxiyuan @ganyi1996ppo

@wangxiyuan wangxiyuan merged commit f2e5501 into vllm-project:v0.7.3-dev Apr 28, 2025
11 checks passed
@wangxiyuan (Collaborator) commented:
Thanks for the quick fix.

@MengqingCao MengqingCao deleted the fix_output branch May 6, 2025 02:24