MengqingCao commented Apr 28, 2025

Fix the triton placeholder patch period, i.e. the vllm version range in which the placeholder patch is applied.
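For context, here is a minimal sketch of what such a placeholder patch can look like, based only on the log messages below ("Using dummy decorators", "Triton module has been replaced with a placeholder"): a stub `triton` module with no-op decorators is installed for the vllm versions that need it. The names `_install_triton_placeholder` and `_PATCHED_VLLM_VERSIONS` are hypothetical; this is not the actual vllm-ascend implementation.

import importlib.util
import sys
import types

from vllm import __version__ as VLLM_VERSION

# Hypothetical names for illustration; the real patch lives in vllm_ascend's
# patch_tritonplaceholder.py and may differ in detail.
_PATCHED_VLLM_VERSIONS = ("0.8.4",)  # illustrative "patch period"; the real set is what this PR adjusts

def _noop_decorator(fn=None, **kwargs):
    # Accept both @triton.jit and @triton.jit(...) call styles; compile nothing.
    if fn is None:
        return lambda f: f
    return fn

def _install_triton_placeholder() -> None:
    """Install a minimal dummy `triton` module so model imports succeed without a GPU toolchain."""
    if importlib.util.find_spec("triton") is not None:
        return  # real triton is available, nothing to patch
    triton = types.ModuleType("triton")
    triton.jit = _noop_decorator
    triton.autotune = _noop_decorator
    triton.language = types.ModuleType("triton.language")
    sys.modules["triton"] = triton
    sys.modules["triton.language"] = triton.language

if VLLM_VERSION in _PATCHED_VLLM_VERSIONS:
    _install_triton_placeholder()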

Test script on v0.8.4:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
# Create an LLM.
llm = LLM(model="/home/xxx/cache/modelscope/models/OpenBMB/MiniCPM-2B-128k",
          trust_remote_code=True,
          max_model_len=1024)

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Result

INFO 04-28 11:10:48 [__init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-28 11:10:48 [__init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-28 11:10:48 [__init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-28 11:10:48 [__init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-28 11:10:48 [__init__.py:44] plugin ascend loaded.
INFO 04-28 11:10:48 [__init__.py:230] Platform plugin ascend is activated
INFO 04-28 11:10:51 [__init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-28 11:10:51 [__init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-28 11:10:51 [__init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-28 11:10:51 [__init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-28 11:10:51 [__init__.py:44] plugin ascend_enhanced_model loaded.
INFO 04-28 11:10:51 [patch_tritonplaceholder.py:33] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 04-28 11:10:51 [patch_tritonplaceholder.py:46] Triton is not installed. Using dummy decorators. Install it via `pip install triton` to enable kernel compilation.
INFO 04-28 11:10:51 [patch_tritonplaceholder.py:71] Triton module has been replaced with a placeholder.
WARNING 04-28 11:10:51 [_custom_ops.py:21] Failed to import from vllm._C with ImportError('libnuma.so.1: cannot open shared object file: No such file or directory')
WARNING 04-28 11:10:53 [registry.py:380] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 04-28 11:10:53 [registry.py:380] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:CustomQwen2VLForConditionalGeneration.
WARNING 04-28 11:10:53 [registry.py:380] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-28 11:10:53 [registry.py:380] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
INFO 04-28 11:10:53 [config.py:209] Replacing legacy 'type' key with 'rope_type'
INFO 04-28 11:11:09 [config.py:689] This model supports multiple tasks: {'reward', 'classify', 'embed', 'generate', 'score'}. Defaulting to 'generate'.
INFO 04-28 11:11:09 [arg_utils.py:1742] npu is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
INFO 04-28 11:11:09 [config.py:1747] Disabled the custom all-reduce kernel because it is not supported on current platform.
WARNING 04-28 11:11:09 [platform.py:129] NPU compilation support pending. Will be available in future CANN and torch_npu releases. Using default: enforce_eager=True
INFO 04-28 11:11:09 [platform.py:134] Compilation disabled, using eager mode by default
INFO 04-28 11:11:09 [llm_engine.py:243] Initializing a V0 LLM engine (v0.8.4) with config: model='/home/cmq/cache/modelscope/models/OpenBMB/MiniCPM-2B-128k', speculative_config=None, tokenizer='/home/xxx/cache/modelscope/models/OpenBMB/MiniCPM-2B-128k', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/cmq/cache/modelscope/models/OpenBMB/MiniCPM-2B-128k, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":0,"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, 
INFO 04-28 11:11:10 [config.py:209] Replacing legacy 'type' key with 'rope_type'
WARNING 04-28 11:11:10 [utils.py:2444] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffcf75131f0>
INFO 04-28 11:11:11 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-28 11:11:11 [model_runner.py:950] Starting to load model /home/cmq/cache/modelscope/models/OpenBMB/MiniCPM-2B-128k...
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:06<00:00,  6.16s/it]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:06<00:00,  6.16s/it]

INFO 04-28 11:11:18 [loader.py:458] Loading weights took 6.16 seconds
INFO 04-28 11:11:19 [model_runner.py:955] Loading model weights took 5.6661 GB
INFO 04-28 11:11:34 [executor_base.py:112] # npu blocks: 1066, # CPU blocks: 91
INFO 04-28 11:11:34 [executor_base.py:117] Maximum concurrency for 1024 tokens per request: 133.25x
INFO 04-28 11:11:34 [llm_engine.py:449] init engine (profile, create kv cache, warmup model) took 15.71 seconds
Processed prompts: 100%|█████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.29s/it, est. speed input: 5.04 toks/s, output: 77.48 toks/s]
Prompt: 'Hello, my name is', Generated text: " John and I am a 20 year old student at the University of California, Los Angeles. I am a very passionate and dedicated individual who is always willing to help others. I have been a tutor for over 5 years and have tutored students in a variety of subjects including math, science, and English. I have also been a teacher's assistant for 2 years and have taught students in grades 6-12. I have a passion for teaching and helping others learn."
Prompt: 'The president of the United States is', Generated text: " the head of state and head of government of the United States. The president is the commander-in-chief of the military and is responsible for the overall leadership of the country. The president is elected by the people and serves a four-year term. The president is also the head of the executive branch of the government and is responsible for making important decisions and implementing policies.\n\n- The president's role in the government\n  - The president is the head of the executive branch of the"
Prompt: 'The capital of France is', Generated text: ' Paris.\nParis is the capital of France.\nParis is the capital of France.\nThe capital of France is Paris. The capital of France is Paris.\nThe capital of France is Paris. The capital of France is Paris. The capital of France is Paris.\nThe capital of France is Paris. The capital of France is Paris. The capital of France is Paris. The capital of France is Paris. The capital of France is Paris. The capital of France is Paris. The capital'
Prompt: 'The future of AI is', Generated text: " bright, but it's not without its challenges.\nArtificial intelligence (AI) is a rapidly evolving field that has the potential to transform the way we live, work, and interact with technology. From self-driving cars to virtual assistants, AI is already making a significant impact on our daily lives. However, as AI continues to advance, there are also concerns about its potential to disrupt the job market and create new challenges for society.\nOne of the biggest challenges of AI is its potential"

Signed-off-by: MengqingCao <cmq0113@163.com>
wangxiyuan merged commit be9e3e8 into vllm-project:main Apr 28, 2025
13 of 16 checks passed
Yikun added a commit that referenced this pull request May 5, 2025
### What this PR does / why we need it?
Re-patch TritonPlaceholder on main to make CI happy
- Add the triton patch back until vllm-project/vllm#17446 is resolved
- Move patch_main before patch_common to resolve the MiniCPM triton import issue
- Add `0.8.5` and `0.8.5.post1` so the patch works on all 0.8.5 versions

Related:
- #704
- #690

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
All CI passed, including main

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
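For illustration, here is a rough sketch of the version gating and patch ordering described in the commit above. `patch_main` and `patch_common` are names taken from the commit message; everything else (the dispatch function, the version set) is a hypothetical reconstruction, not the actual vllm-ascend code.

from vllm import __version__ as VLLM_VERSION

# Versions assumed to still need the TritonPlaceholder patch; the follow-up
# commit extends this set with "0.8.5" and "0.8.5.post1".
_PATCH_VERSIONS = {"0.8.4", "0.8.5", "0.8.5.post1"}

def patch_main() -> None:
    """Early patches, e.g. installing the triton placeholder (hypothetical stub)."""

def patch_common() -> None:
    """Later, model-level patches that may import triton indirectly (hypothetical stub)."""

def apply_patches() -> None:
    if VLLM_VERSION not in _PATCH_VERSIONS:
        return
    # patch_main runs before patch_common so that models such as MiniCPM,
    # which import triton at module load time, already see the placeholder.
    patch_main()
    patch_common()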
MengqingCao deleted the tritonpatch branch May 6, 2025 02:26
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Oct 16, 2025
Fix triton placeholder patch period

chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Oct 16, 2025
Re-patch TritonPlaceholder on main to make CI happy (full commit message above)

Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
Fix triton placeholder patch period

Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
Re-patch TritonPlaceholder on main to make CI happy (full commit message above)