@shani-f shani-f commented Oct 30, 2025

Summary

This PR replaces the previous SYCL implementation of REPEAT_BACK that I added in PR #16734.
The functionality is unchanged, but the kernel is now significantly more efficient.


Changes

  • Rewrote ggml-sycl/repeat_back.cpp with a vectorized, cache-friendly kernel
  • Replaced four nested loops with a single fused loop
  • Removed expensive division and modulo operations from the hot path
  • Switched to byte-stride (nb0..nb3) addressing for correct tensor-layout access
  • Precomputed inverse sizes to avoid repeated integer math

Performance

  • ~3× fewer hot-path assembly instructions (verified in Compiler Explorer)
  • ~2× faster execution time on Intel UHD GPU
  • No additional memory allocation

Testing

  • All test-backend-ops -o REPEAT_BACK tests pass on the SYCL backend
  • Matches the CPU backend results bit-for-bit
  • Verified across different tensor sizes, dimensions, and repeat configurations

Compatibility

  • Supports F32 tensors
  • Works on Level-Zero and OpenCL backends
  • Follows the existing SYCL backend structure and coding style

@github-actions github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (GPU programming language) labels Oct 30, 2025
shani-f commented Nov 2, 2025

Hello @CISC @NeoZhangJianyu,
I applied the required fixes and pushed the updated version.
Please let me know if anything else is needed.
Thanks!

@NeoZhangJianyu NeoZhangJianyu left a comment


Good job!

Thank you!

@NeoZhangJianyu NeoZhangJianyu merged commit 7e99416 into ggml-org:master Nov 3, 2025
123 of 130 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Nov 3, 2025
* origin/master: (169 commits)
opencl: support imrope (ggml-org#16914)
fix: Viewing multiple PDF attachments (ggml-org#16974)
model-conversion : pass config to from_pretrained (ggml-org#16963)
server : add props.model_alias (ggml-org#16943)
ggml: CUDA: add head size 72 for flash-attn (ggml-org#16962)
mtmd: add --image-min/max-tokens (ggml-org#16921)
mtmd: pad mask for qwen2.5vl (ggml-org#16954)
ggml : LoongArch fixes (ggml-org#16958)
sync: minja (glm 4.6 & minmax m2 templates) (ggml-org#16949)
SYCL: optimized repeat_back kernel (3× fewer asm instructions, 2× faster) (ggml-org#16869)
feat(webui): improve LaTeX rendering with currency detection (ggml-org#16508)
test-backend-ops : fix segfault in moe-expert-reduce test in support mode and coverage (ggml-org#16936)
ci : disable failing riscv cross build (ggml-org#16952)
model: add Janus Pro for image understanding (ggml-org#16906)
clip : use FA (ggml-org#16837)
server : support unified cache across slots (ggml-org#16736)
common : move gpt-oss reasoning processing to init params (ggml-org#16937)
docs: remove llama_sampler_accept reference in sampling sample usage (ggml-org#16920)
CUDA: add FLOOR, CEIL, ROUND, TRUNC unary ops (ggml-org#16917)
devops: fix failing s390x docker build (ggml-org#16918)
...