@mengshyu
Contributor

This PR adds Dlight CPU support with optimized GEMV scheduling, including pattern detection, loop tiling, vectorization, and parallel execution. It improves maintainability by refining target checks, reduction handling, and scheduling logic.
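To make the three loop transformations named above concrete, here is a minimal pure-Python sketch of a GEMV restructured into row tiles (the parallel unit) and fixed-width reduction chunks (the vectorizable unit). It is not the schedule this PR implements: `gemv_tiled`, `row_tile`, `vec_width`, and `workers` are hypothetical names and values chosen for illustration, and the thread pool only stands in for a parallel loop, since CPython threads do not give a real speedup here.

```python
from concurrent.futures import ThreadPoolExecutor


def gemv_naive(a, x):
    """Reference GEMV: y[i] = sum over k of a[i][k] * x[k]."""
    return [sum(a_ik * x_k for a_ik, x_k in zip(row, x)) for row in a]


def gemv_tiled(a, x, row_tile=4, vec_width=8, workers=4):
    """Same result as gemv_naive, with tiled and lane-structured loops.

    All sizes are illustrative assumptions, not values taken from the PR.
    """
    n_rows, n_cols = len(a), len(x)
    y = [0.0] * n_rows

    def run_row_tile(row_start):
        # Each row tile writes a disjoint slice of y, so tiles are independent.
        for i in range(row_start, min(row_start + row_tile, n_rows)):
            lanes = [0.0] * vec_width  # per-lane partial sums, like a SIMD register
            for k0 in range(0, n_cols, vec_width):
                # Fixed-width inner loop: the part a compiler would vectorize.
                for lane in range(min(vec_width, n_cols - k0)):
                    lanes[lane] += a[i][k0 + lane] * x[k0 + lane]
            y[i] = sum(lanes)  # final cross-lane reduction

    # The outer loop over row tiles is the one a scheduler would mark parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(run_row_tile, range(0, n_rows, row_tile)))
    return y


a = [[float(i + k) for k in range(10)] for i in range(6)]
x = [float(k % 3) for k in range(10)]
assert gemv_tiled(a, x) == gemv_naive(a, x)
```

The per-lane accumulators change the order of the additions, so with general floating-point inputs the tiled result can differ from the naive one in the last bits; the small integer-valued inputs above avoid that.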

CPU: AMD Ryzen 9 7950X 16-Core Processor
MODEL: Qwen2-0.5B-q4f16_1-MLC
Prompt: What is the meaning of life?

Results:
Baseline:
prompt_tokens=27 completion_tokens=235 total_tokens=262 extra={'prompt_tokens': 27, 'completion_tokens': 235, 'prefill_tokens': 27, 'decode_tokens': 234, 'jump_forward_tokens': 0, 'prefill_tokens_per_s': 0.9777329325367138,
'decode_tokens_per_s': 0.558195154052001,
'end_to_end_latency_s': 446.823128383, 'ttft_s': 27.614902906, 'inter_token_latency_s': 1.9013750143957446}

Optimized:
usage: prompt_tokens=27 completion_tokens=227 total_tokens=254 extra={'prompt_tokens': 27, 'completion_tokens': 227, 'prefill_tokens': 27, 'decode_tokens': 226, 'jump_forward_tokens': 0, 'prefill_tokens_per_s': 1.0010420333327994,
'decode_tokens_per_s': 2.9349053824023454,
'end_to_end_latency_s': 103.976080401, 'ttft_s': 26.971894387, 'inter_token_latency_s': 0.4580444070528635}
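
The two runs can be compared metric by metric. The short script below is only arithmetic on the figures reported above; the dictionary and helper names are made up for this illustration.

```python
# Figures copied from the Baseline and Optimized outputs above.
baseline = {
    "prefill_tokens_per_s": 0.9777329325367138,
    "decode_tokens_per_s": 0.558195154052001,
    "end_to_end_latency_s": 446.823128383,
    "ttft_s": 27.614902906,
    "inter_token_latency_s": 1.9013750143957446,
}
optimized = {
    "prefill_tokens_per_s": 1.0010420333327994,
    "decode_tokens_per_s": 2.9349053824023454,
    "end_to_end_latency_s": 103.976080401,
    "ttft_s": 26.971894387,
    "inter_token_latency_s": 0.4580444070528635,
}

# Throughput metrics improve when they go up; latency metrics when they go down.
HIGHER_IS_BETTER = {"prefill_tokens_per_s", "decode_tokens_per_s"}


def speedup(metric):
    """Improvement factor of the optimized run over the baseline (>1 is better)."""
    b, o = baseline[metric], optimized[metric]
    return o / b if metric in HIGHER_IS_BETTER else b / o


speedups = {metric: speedup(metric) for metric in baseline}
for metric, factor in speedups.items():
    print(f"{metric:>24}: {factor:.2f}x")
```

On these numbers decode throughput improves by roughly 5.3x and end-to-end latency by roughly 4.3x, while prefill throughput and time to first token change by only about 2%. The two runs generated different numbers of completion tokens (235 vs. 227), so the end-to-end latency comparison is not strictly like for like.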

@tqchen
Member

tqchen commented Feb 18, 2025

cc @Hzfengsy, can you help take a look? Also cc @tlopex

@Hzfengsy
Member

Also cc @HongHongHongL

return buffer_store.value.b


def is_gemv(sch: tir.Schedule, block_info: BlockInfo) -> Optional[List[tir.Buffer]]:
Member

Can we reuse the GPU util functions?

Member

That is, we could create a folder named something like "analysis" or "utils" under the dlight folder, shared by the different backends.

Member

I agree, this is a good suggestion; dlight.analysis sounds right.

Contributor Author

Hi @Hzfengsy, I've created an analysis folder so that the CPU and GPU backends reuse the shared GEMV logic. Could you recheck it? Thanks.
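
For readers unfamiliar with what a shared GEMV pattern check might look like, the sketch below is a toy model of the idea. It is not the `is_gemv` code under review, and it does not use TVM at all. The function name `find_vector_operands` and the way reads are represented (buffer name mapped to the set of block iterators that index it) are assumptions made for illustration. Only the final return condition mirrors the line quoted from the diff in this thread.

```python
from typing import Dict, FrozenSet, List, Optional


def find_vector_operands(
    block_iters: FrozenSet[str],
    reads: Dict[str, FrozenSet[str]],
) -> Optional[List[str]]:
    """Return the read buffers indexed by only some of the block iterators.

    Toy model: `reads` maps a buffer name to the iterators used to index it.
    How the real helper builds this list is an assumption here.
    """
    ret = [
        name
        for name, used in reads.items()
        if 0 < len(used) < len(block_iters)
    ]
    # Same shape as the condition quoted in the review hunk: at least one,
    # but not all, of the reads may be a vector-like operand.
    return ret if 0 < len(ret) < len(reads) else None


# y[i] += A[i, k] * x[k]: x uses only k, A uses both iterators.
gemv = find_vector_operands(
    frozenset({"i", "k"}),
    {"A": frozenset({"i", "k"}), "x": frozenset({"k"})},
)

# C[i, j] += A[i, k] * B[k, j]: every read uses a strict subset, so no match.
matmul = find_vector_operands(
    frozenset({"i", "j", "k"}),
    {"A": frozenset({"i", "k"}), "B": frozenset({"k", "j"})},
)

# z[i] = a[i] + b[i]: every read uses all iterators, so no match.
elementwise = find_vector_operands(
    frozenset({"i"}),
    {"a": frozenset({"i"}), "b": frozenset({"i"})},
)

print(gemv, matmul, elementwise)
```

Because a check of this kind depends only on how buffers are indexed and not on the target, it fits naturally in a backend-neutral module such as the analysis folder mentioned above.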

return ret if 0 < len(ret) < len(block_stmt.reads) else None


def normalize( # pylint: disable=too-many-locals, use-a-generator
Member

Maybe we can reuse this one as well

@tqchen
Member

tqchen commented Feb 21, 2025

cc @Hzfengsy for another look

if not isinstance(func, tir.PrimFunc) or not self.is_target_available(target):
    return None
sch = tir.Schedule(func)
sch = tir.Schedule(func)
Member

duplicated

return loop.extent.value if isinstance(loop.extent, tir.IntImm) else loop.extent


def auto_vectorize(sch: tir.Schedule, loop: tir.schedule.LoopRV, max_vec: int):
Member

Is there any reason to keep a separate CPU copy? The same file already exists at python/tvm/dlight/gpu/utils.py.

@mengshyu
Contributor Author

Hi @Hzfengsy, I've removed the duplicated definition and file. Could you take a look? Thanks.

@Hzfengsy merged commit 3c2a1ab into apache:main on Feb 25, 2025
10 checks passed
@Hzfengsy
Member

Thanks @mengshyu!

ShiboXing pushed a commit to ShiboXing/tvm that referenced this pull request Aug 10, 2025
…7663)

* [Dlight][CPU] Add CPU Backend Support for GEMV Optimization

* lint

* Add unit test

* Refactor analysis and scheduling utilities

* lint

* Fix duplicated schedule creation and utils.py