Boolean indexing improvements #1923
Conversation
Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_296 ran successfully.
Force-pushed from f9abe3e to a5b03ca
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_302 ran successfully.
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_313 ran successfully.
Incidentally, a nice improvement from the latest commit is the acceleration of cumulative sum compared to 0.18.3.
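The timings from the original comment are not reproduced here; below is a minimal sketch of how such a measurement could be taken with dpctl.tensor. The array size and device are illustrative assumptions, not the ones used in the comment.

```
import time
import dpctl.tensor as dpt

x = dpt.ones(2**24, dtype="f4")
dpt.cumulative_sum(x)            # warm-up run (kernel compilation, caching)

t0 = time.perf_counter()
r = dpt.cumulative_sum(x)
r.sycl_queue.wait()              # wait for the asynchronous kernel to finish
print(time.perf_counter() - t0)
```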
LGTM! This brings a nice performance boost
1. Use shared local memory to optimize access to neighboring elements of cumulative sums.
2. Introduce contig variant for masked_extract code.
3. Removed unused orthog_nelems functor argument, and added local_accessor argument instead.

The runtime of the example

```
import dpctl.tensor as dpt

x = dpt.ones(20241024, dtype='f4')
m = dpt.ones(x.size, dtype='b1')

%time x[m]
```

decreased from 41 ms on an Iris Xe WSL box to 37 ms.
Use local_accessor to improve memory bandwidth of the work-group.
Use shared local memory to improve global memory bandwidth.
Also implement get_lws to choose the local work-group size from given choices I0 > I1 > I2 > ...: if n > I0, use I0; if n > I1, use I1; and so on.
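A minimal sketch of that selection rule; the candidate sizes below are illustrative assumptions, not the values hard-coded in the kernels.

```
def get_lws(n, choices=(512, 256, 128)):
    # Walk candidates from largest to smallest and pick the first one
    # that the problem size n still exceeds; otherwise use the smallest.
    for lws in choices:
        if n > lws:
            return lws
    return choices[-1]
```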
The chunk update kernels processed consecutive elements in contiguous memory, hence the sub-group memory access pattern was sub-optimal (no coalescing). This PR changes these kernels to process n_wi elements which are a sub-group size apart, improving the memory access pattern.

Running a micro-benchmark based on code from gh-1249 (for shape = (n, n) where n = 4096) with this change:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.010703916665753004
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.01079747307597211
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.010864820314088353
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index.py
0.023878061203975922
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index.py
0.023666468500677083
```

while before:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.011415911812542213
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.011722088705196424
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu python index.py
0.030126182353813893
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu python index.py
0.030459783371986338
```

Running the same code using NumPy (same size):

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index_np.py
0.01416253090698134
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index_np.py
0.014979530811413296
```

The reason the Level-Zero device is slower has to do with a slow allocation/deallocation bug. The OpenCL device has better timing. With this change:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.015038836885381627
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.01527448468496678
```

before:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.01758851639115838
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.017089676241286926
```
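As an illustration of the access-pattern change, here is a simplified sketch; the index mapping, `lid`, `n_wi`, and `sg_size` are assumptions for illustration rather than the kernel's actual code.

```
def consecutive_chunk(lid, n_wi):
    # before: each work-item walks n_wi adjacent elements, so lanes of a
    # sub-group touch addresses that are n_wi apart (no coalescing)
    return [lid * n_wi + i for i in range(n_wi)]

def strided_chunk(lid, n_wi, sg_size):
    # after: a work-item's n_wi elements are sg_size apart, so on each
    # step adjacent lanes access adjacent addresses (coalesced)
    return [lid + i * sg_size for i in range(n_wi)]
```

For example, with sg_size = 16 and n_wi = 4, lanes 0 and 1 access {0, 16, 32, 48} and {1, 17, 33, 49} in the strided scheme, instead of {0, 1, 2, 3} and {4, 5, 6, 7}.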
Changed the left-over update kernel to use coalesced memory access.
…ed extract/place code
Force-pushed from f365bba to a8e7600
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_319 ran successfully.
LGTM!
Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_321 ran successfully.
Array API conformance test failures are unrelated. Merging.
This PR is motivated by gh-1249. It changes the masked extract, masked place and nonzero kernels to reduce global memory traffic through the use of shared local memory.

Kernels are changed from being range-based to being nd_range-based. Work-items of a work-group collectively load `lws + 1` elements of `cumulative_sum` values into a local accessor. Each work-item reads two values of this cumulative sum, and reading them from SLM reduces the number of global memory accesses by a factor of 2.

For masked extract, which is called in gh-1249, this PR introduces a contiguous `src` array specialization to improve performance, since `x_2d[m_2d]` runs slower than `x_2d_flat[m_2d_flat]`.
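For context, a sketch of the two indexing forms being compared; the array shape is an illustrative assumption based on the gh-1249 benchmark.

```
import dpctl.tensor as dpt

n = 4096
x_2d = dpt.ones((n, n), dtype="f4")
m_2d = dpt.ones((n, n), dtype="b1")

# 2D boolean indexing: previously slower than the flattened form
y = x_2d[m_2d]

# flattened boolean indexing over the same data
x_2d_flat = dpt.reshape(x_2d, -1)
m_2d_flat = dpt.reshape(m_2d, -1)
y_flat = x_2d_flat[m_2d_flat]
```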