[Dlight][CPU] Add CPU Backend Support for GEMV Optimization #17663
Conversation
This PR adds Dlight CPU support with optimized GEMV scheduling,
including pattern detection, loop tiling, vectorization, and parallel
execution. It improves maintainability by refining target checks,
reduction handling, and scheduling logic.
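For illustration only, here is a minimal sketch of the kind of CPU GEMV schedule the description refers to, written against a toy PrimFunc. The shapes, the split factor of 8, and the loop order are assumptions made for this example, not the PR's actual scheduling rule:

    from tvm import tir
    from tvm.script import tir as T

    # Toy GEMV kernel: C[i] = sum_k A[i, k] * B[k]; shapes are illustrative only.
    @T.prim_func
    def gemv(a: T.handle, b: T.handle, c: T.handle):
        A = T.match_buffer(a, (4096, 4096), "float32")
        B = T.match_buffer(b, (4096,), "float32")
        C = T.match_buffer(c, (4096,), "float32")
        for i, k in T.grid(4096, 4096):
            with T.block("C"):
                vi, vk = T.axis.remap("SR", [i, k])
                with T.init():
                    C[vi] = T.float32(0)
                C[vi] = C[vi] + A[vi, vk] * B[vk]

    sch = tir.Schedule(gemv)
    block = sch.get_block("C")
    i, k = sch.get_loops(block)
    # Tile the spatial axis: the outer part runs across CPU threads,
    # the inner 8-wide slice becomes a SIMD vector once moved innermost.
    i_outer, i_inner = sch.split(i, factors=[None, 8])
    sch.reorder(i_outer, k, i_inner)
    sch.parallel(i_outer)
    sch.vectorize(i_inner)
    print(sch.mod)

The same primitives (split, reorder, parallel, vectorize) are roughly what a dlight rule applies automatically once it detects a GEMV-like block; the concrete factors here are placeholders.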
CPU: AMD Ryzen 9 7950X 16-Core Processor
MODEL: Qwen2-0.5B-q4f16_1-MLC
Prompt: What is the meaning of life?
Results:
Baseline:
prompt_tokens=27 completion_tokens=235 total_tokens=262
extra={'prompt_tokens': 27, 'completion_tokens': 235,
'prefill_tokens': 27, 'decode_tokens': 234, 'jump_forward_tokens': 0,
'prefill_tokens_per_s': 0.9777329325367138,
'decode_tokens_per_s': 0.558195154052001,
'end_to_end_latency_s': 446.823128383, 'ttft_s': 27.614902906,
'inter_token_latency_s': 1.9013750143957446}
Optimized:
usage: prompt_tokens=27 completion_tokens=227 total_tokens=254
extra={'prompt_tokens': 27, 'completion_tokens': 227,
'prefill_tokens': 27, 'decode_tokens': 226, 'jump_forward_tokens': 0,
'prefill_tokens_per_s': 1.0010420333327994,
'decode_tokens_per_s': 2.9349053824023454,
'end_to_end_latency_s': 103.976080401, 'ttft_s': 26.971894387,
'inter_token_latency_s': 0.4580444070528635}
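In short, the numbers above amount to roughly a 5.3x improvement in decode throughput (0.56 to 2.93 tokens/s) and about a 4.3x reduction in end-to-end latency (446.8 s to 104.0 s) on this model and CPU.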
Force-pushed 6bb3b44 to 34b4466
Force-pushed 34b4466 to e09b152
Also cc @HongHongHongL
python/tvm/dlight/cpu/gemv.py (outdated excerpt)

        return buffer_store.value.b


    def is_gemv(sch: tir.Schedule, block_info: BlockInfo) -> Optional[List[tir.Buffer]]:
Can we reuse gpu's util functions?
I'm saying that we could create a folder named something like "analysis" or "utils" under the dlight folder, shared by the different backends.
I agree this is a good suggestion; dlight.analysis sounds right.
Hi @Hzfengsy, I've created an analysis folder so that the CPU and GPU backends reuse the shared GEMV logic. Could you recheck it? Thanks.
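As a point of reference, here is a deliberately simplified sketch of what a shared GEMV detector along the lines of is_gemv might check. It takes a BlockRV directly instead of dlight's BlockInfo wrapper, and it illustrates the idea rather than reproducing the code in this PR:

    from typing import List, Optional

    from tvm import tir

    def is_gemv_sketch(
        sch: tir.Schedule, block: tir.schedule.BlockRV
    ) -> Optional[List[tir.Buffer]]:
        """Very rough GEMV check: spatial + reduction iterators, one write,
        and at least two read buffers (the matrix and the vector)."""
        stmt = sch.get(block)
        iter_types = {iv.iter_type for iv in stmt.iter_vars}
        if tir.IterVar.DataPar not in iter_types:
            return None
        if tir.IterVar.CommReduce not in iter_types:
            return None
        if len(stmt.writes) != 1 or len(stmt.reads) < 2:
            return None
        # Return the read buffers as the candidate GEMV operands.
        return [region.buffer for region in stmt.reads]

The real helper also inspects how each buffer is indexed to tell the matrix apart from the vector; that part is omitted here for brevity.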
python/tvm/dlight/cpu/gemv.py (outdated excerpt)

        return ret if 0 < len(ret) < len(block_stmt.reads) else None


    def normalize(  # pylint: disable=too-many-locals, use-a-generator
Maybe we can reuse this one as well
Force-pushed abc8ad4 to 33b406b
cc @Hzfengsy for another look
python/tvm/dlight/cpu/gemv.py (outdated excerpt)

        if not isinstance(func, tir.PrimFunc) or not self.is_target_available(target):
            return None
        sch = tir.Schedule(func)
        sch = tir.Schedule(func)
duplicated
python/tvm/dlight/cpu/utils.py (outdated excerpt)

        return loop.extent.value if isinstance(loop.extent, tir.IntImm) else loop.extent


    def auto_vectorize(sch: tir.Schedule, loop: tir.schedule.LoopRV, max_vec: int):
Is there any reason to keep another CPU copy? There is an identical file at python/tvm/dlight/gpu/utils.py.
Hi @Hzfengsy, I've removed the duplicated definition and file. Could you take a look? Thanks.
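For readers who did not open the file, here is a hedged sketch of what an auto_vectorize-style helper like the one referenced in the excerpt above typically does: split a loop by a capped vector length and vectorize the inner part. This is an illustration, not the dlight implementation:

    from tvm import tir

    def auto_vectorize_sketch(sch: tir.Schedule, loop: tir.schedule.LoopRV, max_vec: int) -> None:
        """Vectorize `loop` with a factor of at most `max_vec` when its extent is
        a compile-time constant; leave dynamically shaped loops untouched."""
        extent = sch.get(loop).extent
        if not isinstance(extent, tir.IntImm):
            return  # dynamic extent: skip vectorization in this sketch
        vec = min(max_vec, int(extent.value))
        if vec <= 1:
            return
        _, inner = sch.split(loop, factors=[None, vec])
        sch.vectorize(inner)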
Thanks @mengshyu!
Merged as [Dlight][CPU] Add CPU Backend Support for GEMV Optimization (#17663), squashing the following commits:
* [Dlight][CPU] Add CPU Backend Support for GEMV Optimization
* lint
* Add unit test
* Refactor analysis and scheduling utilities
* lint
* Fix duplicated schedule creation and utils.py