Use select for masked single element load #1770

alexbaden · 2024-08-02T23:23:56Z

As part of my investigation into #1721, I found an opportunity to remove some branches in certain kernels by replacing a predicated load with a load+select. The NVIDIA path does something similar; the difference is that it can emit a masked load instruction. This patch removes quite a bit of branching from the sample kernel from #1721, reducing runtime from 0.090ms to 0.067ms.
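To make the difference between the two lowerings concrete, here is a minimal Python model of their semantics. It is only a sketch: the function names and the list standing in for device memory are hypothetical and are not taken from the patch.

```python
def predicated_load(memory, index, mask, other):
    """Branching form: memory is only touched when the mask is set."""
    if mask:
        return memory[index]
    return other


def load_then_select(memory, index, mask, other):
    """Branch-free form: the load always executes, then a select picks the result.

    Assumption: `index` is safe to read even when `mask` is False.
    """
    loaded = memory[index]           # unconditional load
    return loaded if mask else other  # models a select on the loaded value


memory = [10, 20, 30]  # hypothetical stand-in for device memory
print(predicated_load(memory, 1, True, 0), load_then_select(memory, 1, True, 0))
print(predicated_load(memory, 2, False, -1), load_then_select(memory, 2, False, -1))
```

Both forms return the same value whenever the address is readable. They differ only in whether memory is accessed when the mask is false.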

I intend to do more testing as well as look for other optimizations in similar kernels, so I am marking this as draft for now but putting it up so it can run through CI (my local env is not set up for testing against IPEX).

chengjunlu · 2024-08-05T04:44:25Z

@alexbaden
If IGC cannot optimize the branching with predication, a load with a mask can perform well.

Please use the GenISA intrinsic instead of loading memory without protection. An unprotected load may cause a page fault on the GPU if the address is not valid.

Here is the gather load with mask in GenISA.
https://github.com/intel/intel-graphics-compiler/blob/0d1c68f522be3684fa2a671f042bf5794dda1849/IGC/GenISAIntrinsics/generator/input/Intrinsic_definitions.yml#L9055

Here is the scatter store with the mask in GenISA.
https://github.com/intel/intel-graphics-compiler/blob/0d1c68f522be3684fa2a671f042bf5794dda1849/IGC/GenISAIntrinsics/generator/input/Intrinsic_definitions.yml#L9234
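The concern about unprotected loads can be illustrated with a small Python model of a per-lane masked gather. This sketches masked-gather semantics in general, not the signature or behavior of the GenISA intrinsics linked above. The function names are hypothetical, and an `IndexError` stands in for a GPU page fault.

```python
def masked_gather(memory, indices, mask, other):
    """Per-lane masked load: lanes whose mask bit is False never touch memory."""
    return [memory[i] if m else other for i, m in zip(indices, mask)]


def gather_then_select(memory, indices, mask, other):
    """Unprotected form: every lane loads first, then a select applies the mask."""
    loaded = [memory[i] for i in indices]  # raises IndexError on an invalid index
    return [v if m else other for v, m in zip(loaded, mask)]


memory = [1, 2, 3, 4]
indices = [0, 3, 99]        # the last lane points outside the buffer
mask = [True, True, False]  # and is masked off

print(masked_gather(memory, indices, mask, 0))

try:
    gather_then_select(memory, indices, mask, 0)
except IndexError:
    print("unprotected load touched an invalid address")
```

In this model the masked form skips the invalid lane entirely, while the load-then-select form faults even though the lane's result would have been discarded.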

whitneywhtsang · 2024-08-06T14:07:00Z

We had a discussion with the IGC team. The next step is for us to evaluate the performance impact of the GenISA intrinsic, and to open an IGC issue to have it exposed if it gives a performance gain.

etiotto · 2024-08-06T14:52:14Z

@alexbaden are you going to measure the performance impact of using the proposed IGC GenISA functions for that workload?

alexbaden · 2024-08-08T02:23:49Z

We tried the IGC function, but it is meant for masked loads for the render engine and not the GPGPU.
Using select between the other constant and the loaded value seems to work in local testing, but we are not yet sure what the implications are of issuing a load from what may be a null ptr / uninitialized memory, even if the result of that load is not used.
I did try selecting the ptr values (putting other into its own alloca), but the ptr has to be in global memory so this ended up being quite slow.
The current PR calls select on the value, not the ptr. Let's see if it passes CI and then decide how to proceed.
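The two variants described in this comment, select on the value versus select on the ptr, can be modeled in Python as follows. This is a sketch under stated assumptions: the function names, the list used as memory, and the `other_slot` index holding the fallback value are all hypothetical. The model does not capture the performance cost of placing the fallback in global memory.

```python
def select_value(memory, index, mask, other):
    """Select on the value: always load from `index`, then choose the result."""
    loaded = memory[index]  # may read an invalid address when mask is False
    return loaded if mask else other


def select_pointer(memory, index, mask, other_slot):
    """Select on the ptr: choose a known-valid address first, then load once."""
    address = index if mask else other_slot  # other_slot holds the fallback value
    return memory[address]


OTHER_SLOT = 3
memory = [7, 8, 9, -1]  # memory[OTHER_SLOT] stores the fallback value -1

print(select_value(memory, 1, True, -1), select_pointer(memory, 1, True, OTHER_SLOT))
print(select_pointer(memory, 99, False, OTHER_SLOT))  # never reads index 99

try:
    select_value(memory, 99, False, -1)
except IndexError:
    print("select on value still loaded from the invalid index")
```

Selecting the pointer keeps every load on a valid address, at the cost of needing the fallback value to live somewhere loadable. Selecting the value avoids that extra storage but leaves the load itself unguarded.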

vlad-penkin linked an issue Aug 4, 2024 that may be closed by this pull request

No perf advantage for torch.compile on examples from pytorch tutorial #1721

Open

alexbaden added 4 commits August 7, 2024 18:53

Use select for masked single element load

da0d746

[tmp]: enable sycl queue timing

16c6fa3

fixes and cleanups

2bc3b06

use select mask for scalar masked load

dce39b4

alexbaden force-pushed the alex/perf_improvements branch from 5363c6e to dce39b4 Compare August 8, 2024 01:53

dvrogozh mentioned this pull request Aug 9, 2024

No perf advantage for torch.compile on examples from pytorch tutorial #1721

Open

pbchekin changed the base branch from llvm-target to main September 14, 2024 00:01

alexbaden closed this Oct 9, 2024
