[ModelRunner]Add profile execute duration observation #1013
Merged
Commits (8)
52da0d4 Add profile execute duration observation (depeng1994)
b4c8344 add docs (depeng1994)
5f1f78d Add UT (depeng1994)
cd4885d Merge branch 'main' of github.com:vllm-project/vllm-ascend (depeng1994)
b5d8972 fix lint (depeng1994)
3d68df3 Merge branch 'main' into main (depeng1994)
0d448c7 Fix issue and conflict (depeng1994)
3df32b3 Merge branch 'main' into main (depeng1994)
@@ -12,4 +12,5 @@ using_evalscope
 :caption: Performance
 :maxdepth: 1
 performance_benchmark
+profile_execute_duration
 :::
34 changes: 34 additions & 0 deletions
docs/source/developer_guide/evaluation/profile_execute_duration.md
@@ -0,0 +1,34 @@
# Profile Execute Duration

The execution duration of each stage (including pre/post-processing, model forward, etc.) usually needs to be captured during a complete inference process. Typically, this is done by calling `torch.npu.synchronize()` and reading CPU timestamps around each stage, which adds host/device synchronization overhead.

**To reduce this overhead, this feature uses the NPU event timestamp mechanism to observe device execution time asynchronously.**

## Usage
* Set the environment variable `VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` to enable this feature.
* Use the non-blocking API `ProfileExecuteDuration().capture_async` to set observation points asynchronously wherever you need to observe the execution duration.
* Use the blocking API `ProfileExecuteDuration().pop_captured_sync` at an appropriate time to get and print the execution durations of all observed stages.
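
The three usage steps can be sketched as follows. This is a minimal, CPU-only stand-in and not the vllm-ascend implementation: `SketchDurationObserver`, its `enabled` flag, and the `time.perf_counter` timestamps are assumptions made so the example runs without an NPU. The real `ProfileExecuteDuration` is described above as using NPU event timestamps, with only the `pop_captured_sync` call blocking.

```python
import time
from contextlib import contextmanager


class SketchDurationObserver:
    """Hypothetical CPU-only stand-in for the capture/pop pattern."""

    def __init__(self, enabled: bool = True):
        self.enabled = enabled
        self._observations = []  # pending (tag, start, end) records

    @contextmanager
    def capture_async(self, tag: str):
        # Non-blocking step: only record timestamps, compute nothing yet.
        if not self.enabled:
            yield
            return
        start = time.perf_counter()
        try:
            yield
        finally:
            self._observations.append((tag, start, time.perf_counter()))

    def pop_captured_sync(self) -> dict:
        # Blocking step: turn all pending records into durations (ms)
        # and clear them so the next iteration starts empty.
        durations = {
            tag: (end - start) * 1000.0
            for tag, start, end in self._observations
        }
        self._observations.clear()
        return durations


observer = SketchDurationObserver(enabled=True)
with observer.capture_async("forward"):
    sum(i * i for i in range(100_000))  # placeholder workload
durations = observer.pop_captured_sync()
print(" ".join(f"[{tag}]:{ms:.2f}ms" for tag, ms in durations.items()))
```

Keeping the capture step free of any computation is the point of the pattern: the cost of resolving timestamps is paid once, in the blocking call, rather than at every observation point.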

## Example Output

```
5691:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms
5695:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: [post process]:14.29ms [prepare input and forward]:10.19ms [forward]:4.14ms
5697:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.81ms [prepare input and forward]:10.29ms [forward]:3.99ms
5701:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.10ms [prepare input and forward]:10.62ms [forward]:4.33ms
5705:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.65ms [prepare input and forward]:9.58ms [forward]:4.20ms
5709:(IntegratedWorker pid=1502343) Profile execute duration [Decode]: [post process]:14.43ms [prepare input and forward]:9.88ms [forward]:4.20ms
5711:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.89ms [prepare input and forward]:10.49ms [forward]:4.19ms
5715:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.14ms [prepare input and forward]:11.21ms [forward]:4.18ms
5719:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.71ms [prepare input and forward]:10.15ms [forward]:4.42ms
5723:(IntegratedWorker pid=1502401) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.31ms [forward]:4.25ms
5725:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.12ms [prepare input and forward]:10.33ms [forward]:4.24ms
5729:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.58ms [prepare input and forward]:10.85ms [forward]:4.32ms
5733:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:14.32ms [prepare input and forward]:9.79ms [forward]:4.28ms
5737:(IntegratedWorker pid=1502462) Profile execute duration [Decode]: [post process]:15.06ms [prepare input and forward]:9.89ms [forward]:4.32ms
5739:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.62ms [prepare input and forward]:10.48ms [forward]:4.27ms
5743:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.60ms [prepare input and forward]:10.71ms [forward]:4.61ms
5747:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.21ms [prepare input and forward]:10.10ms [forward]:4.52ms
5751:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:15.03ms [prepare input and forward]:10.00ms [forward]:4.42ms
```
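
To compare stages across iterations, log lines in this format can be aggregated with a short script. This is only a sketch: the regular expression assumes the `[stage]:<value>ms` layout of the sample output, and `mean_stage_durations` is a hypothetical helper, not part of vllm-ascend.

```python
import re
from collections import defaultdict

# Matches "[stage name]:12.34ms" pairs. The "[Decode]: " phase tag has no
# number directly after the colon, so it is skipped.
STAGE_RE = re.compile(r"\[([^\]]+)\]:(\d+(?:\.\d+)?)ms")


def mean_stage_durations(lines):
    """Return {stage: mean duration in ms} over all given log lines."""
    samples = defaultdict(list)
    for line in lines:
        for stage, value in STAGE_RE.findall(line):
            samples[stage].append(float(value))
    return {stage: sum(v) / len(v) for stage, v in samples.items()}


# Two lines copied from the sample output.
log_lines = [
    "5691:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: "
    "[post process]:14.17ms [prepare input and forward]:9.57ms [forward]:4.14ms",
    "5695:(IntegratedWorker pid=1502285) Profile execute duration [Decode]: "
    "[post process]:14.29ms [prepare input and forward]:10.19ms [forward]:4.14ms",
]
means = mean_stage_durations(log_lines)
for stage, ms in means.items():
    print(f"{stage}: {ms:.2f} ms")
```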
@@ -0,0 +1,62 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
# Adapted from vllm/tests/basic_correctness/test_basic_correctness.py
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import os
import time
from unittest.mock import patch

import torch
import vllm  # noqa: F401

from vllm_ascend.utils import ProfileExecuteDuration


@patch.dict(os.environ, {"VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE": "1"})
def test_execue_duration_enabled_discrepancy():
    a = torch.randn(10000, 10000).npu()
    b = torch.randn(10000, 10000).npu()

    # warmup
    torch.matmul(a, b)
    torch.npu.synchronize()

    cpu_start = time.perf_counter()
    with ProfileExecuteDuration().capture_async("forward"):
        torch.matmul(a, b)
        torch.npu.synchronize()
    cpu_duration = (time.perf_counter() - cpu_start) * 1000
    npu_durations = ProfileExecuteDuration().pop_captured_sync()
    assert npu_durations and 'forward' in npu_durations
    assert not ProfileExecuteDuration._observations

    # Assert discrepancy between CPU and NPU duration is within 50% roughly
    diff = abs(cpu_duration - npu_durations['forward']) / max(
        cpu_duration, npu_durations['forward'])
    assert diff <= 0.5, (
        f"CPU={cpu_duration:.2f}ms, NPU={npu_durations['forward']:.2f}ms")


def test_execue_duration_disabled():
    a = torch.randn(100, 100).npu()
    b = torch.randn(100, 100).npu()

    with ProfileExecuteDuration().capture_async("forward"):
        torch.matmul(a, b)
        torch.npu.synchronize()
    npu_durations = ProfileExecuteDuration().pop_captured_sync()
    assert not npu_durations
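
The tolerance check in the first test normalizes the gap between the CPU and NPU measurements by the larger of the two, so for positive durations the result lies in [0, 1). A small worked example, using made-up numbers and a hypothetical helper name `relative_discrepancy`:

```python
def relative_discrepancy(cpu_ms: float, npu_ms: float) -> float:
    """|cpu - npu| / max(cpu, npu), the metric the test bounds by 0.5."""
    return abs(cpu_ms - npu_ms) / max(cpu_ms, npu_ms)


# Made-up timings for illustration only (not measured on real hardware).
close = relative_discrepancy(10.0, 8.0)  # gap of 2 ms relative to 10 ms
far = relative_discrepancy(10.0, 4.0)  # gap of 6 ms relative to 10 ms
print(f"close={close:.2f} passes={close <= 0.5}")
print(f"far={far:.2f} passes={far <= 0.5}")
```

Dividing by the larger value makes the metric symmetric in its arguments, so it does not matter which of the two clocks reads higher.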
The doc is good, but we could provide an e2e guide to help devs understand. Such as:
We have already added observation points at the key stages of inference (including pre-processing, model forward, etc.), so you can run an inference script: