
Boolean indexing improvements #1923

Merged: 8 commits into master, Dec 10, 2024
Conversation

@oleksandr-pavlyk (Collaborator) commented Dec 6, 2024

This PR is motivated by gh-1249. It changes the masked extract, masked place, and nonzero kernels to reduce global memory bandwidth through the use of shared local memory.

Kernels are changed from being range-based to being nd_range-based. Work-items of a work-group collectively load lws + 1 elements of the cumulative_sum values into a local accessor. Each work-item reads two values of this cumulative sum, and reading them from SLM reduces the number of global memory accesses by a factor of 2.

For masked extract, which is exercised in gh-1249, this PR introduces a specialization for contiguous src arrays to improve performance, since x_2d[m_2d] runs slower than x_2d_flat[m_2d_flat].
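The cooperative-load scheme described above can be modeled in plain Python (this is an illustrative sketch, not the actual SYCL kernel; the function name and layout are hypothetical). A work-group of size lws loads lws + 1 cumulative-sum values into shared local memory once; each work-item then reads its neighboring pair (slm[i], slm[i + 1]) from SLM instead of issuing two global-memory reads, halving global accesses.

```python
def simulate_workgroup(cumsum, group_start, lws):
    """Model one work-group of the masked-extract kernel.

    `cumsum` is an exclusive cumulative sum of the boolean mask with a
    trailing total, so cumsum[i] and cumsum[i + 1] delimit where masked
    element i lands in the output. Returns the (begin, end) pair each
    work-item would compute.
    """
    # Cooperative load: lws work-items bring lws + 1 values into SLM.
    slm = cumsum[group_start : group_start + lws + 1]
    results = []
    for lid in range(min(lws, len(slm) - 1)):
        # Each work-item reads two neighboring values from SLM; element
        # lid is copied to position `begin` iff end > begin (mask was set).
        begin, end = slm[lid], slm[lid + 1]
        results.append((begin, end))
    return results
```

A work-item with end > begin corresponds to a set mask bit, and `begin` is its destination offset in the extracted output.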

  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • Have you added documentation for your changes, if necessary?
  • Have you added your changes to the changelog?
  • If this PR is a work in progress, are you opening the PR as a draft?


github-actions bot commented Dec 6, 2024

Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞


github-actions bot commented Dec 6, 2024

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_296 ran successfully.
Passed: 894
Failed: 2
Skipped: 118

@coveralls (Collaborator) commented Dec 6, 2024

Coverage Status

Coverage remained the same at 87.659% when pulling a8e7600 on boolean-indexing-improvements into cb4a049 on master.

@oleksandr-pavlyk oleksandr-pavlyk force-pushed the boolean-indexing-improvements branch from f9abe3e to a5b03ca Compare December 7, 2024 20:30

github-actions bot commented Dec 7, 2024

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_302 ran successfully.
Passed: 896
Failed: 0
Skipped: 118



github-actions bot commented Dec 9, 2024

Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_313 ran successfully.
Passed: 895
Failed: 1
Skipped: 118

@oleksandr-pavlyk oleksandr-pavlyk marked this pull request as ready for review December 9, 2024 20:36
@oleksandr-pavlyk (Collaborator, Author) commented:

Incidentally, a nice improvement from the latest commit is the acceleration of cumulative sum:

In [1]: import dpctl.tensor as dpt

In [2]: x = dpt.ones(10**7, dtype="i4")

In [3]: m = dpt.cumulative_sum(x)

In [4]: %timeit dpt.cumulative_sum(x, out=m).sycl_queue.wait()
388 μs ± 1.31 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

whereas in 0.18.3:

In [6]: %timeit dpt.cumulative_sum(x, out=m).sycl_queue.wait()
648 μs ± 1.51 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

ndgrigorian
ndgrigorian previously approved these changes Dec 10, 2024

@ndgrigorian (Collaborator) left a comment:


LGTM! This brings a nice performance boost

1. Use shared local memory to optimize access to neighboring elements of cumulative sums.

2. Introduce a contig variant of the masked_extract code.

3. Remove the unused orthog_nelems functor argument and add a local_accessor argument instead.

The example

```
import dpctl.tensor as dpt
x = dpt.ones(20241024, dtype='f4')
m = dpt.ones(x.size, dtype='b1')
%time x[m]
```

decreased from 41 ms to 37 ms on an Iris Xe WSL box.
Use local_accessor to improve memory bandwidth of the work-group.

Use shared local memory to improve global memory bandwidth. Also implement get_lws to choose the local work-group size from given choices I0 > I1 > I2 > ...: if n > I0, use I0; otherwise, if n > I1, use I1; and so on.
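The selection rule above can be rendered as a small Python helper (a sketch of the rule only; the actual get_lws lives in dpctl's kernel sources, and the candidate sizes here are hypothetical):

```python
def get_lws(n, choices=(1024, 512, 256, 128)):
    """Pick a local work-group size from descending candidates
    I0 > I1 > ...: the first candidate that n exceeds, falling back
    to the smallest candidate for small n."""
    for lws in choices:
        if n > lws:
            return lws
    return choices[-1]
```

This favors the largest work-group that the problem size can still fill, which keeps the SLM cooperative load well utilized.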
The chunk update kernels processed consecutive elements in contiguous memory, hence the sub-group memory access pattern was sub-optimal (no coalescing).

This PR changes these kernels to process n_wi elements which are a sub-group size apart, improving the memory access pattern.
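The index arithmetic behind this change can be sketched as follows (an illustrative Python model, not the kernel code; function names are made up). In the old layout each work-item walks a private contiguous chunk of n_wi elements, so at any step the lanes of a sub-group touch addresses n_wi apart. In the new layout a work-item's k-th element is sg_size apart, so at step k the sub-group's lanes touch consecutive addresses and the loads coalesce.

```python
def consecutive_indices(base, lane, n_wi):
    # Old layout: each lane processes a private contiguous chunk,
    # so neighboring lanes are n_wi addresses apart at every step.
    return [base + lane * n_wi + k for k in range(n_wi)]

def strided_indices(base, lane, n_wi, sg_size):
    # New layout: a lane's elements are sg_size apart, so at step k
    # the sub-group's lanes cover sg_size consecutive addresses.
    return [base + lane + k * sg_size for k in range(n_wi)]
```

Both layouts cover the same n_wi * sg_size elements per sub-group; only the order of accesses changes.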

Running a micro-benchmark based on code from gh-1249 (for shape = (n, n) where n = 4096) with this change:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.010703916665753004
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.01079747307597211
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.010864820314088353
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index.py
0.023878061203975922
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index.py
0.023666468500677083
```

while before:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.011415911812542213
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=cuda:gpu python index.py
0.011722088705196424
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu python index.py
0.030126182353813893
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu python index.py
0.030459783371986338
```

Running the same code using NumPy (same size):

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index_np.py
0.01416253090698134
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ python index_np.py
0.014979530811413296
```

The reason the Level-Zero device is slower has to do with a slow allocation/deallocation bug.

The OpenCL device has better timing. With this change:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.015038836885381627
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.01527448468496678
```

before:

```
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.01758851639115838
(dev_dpctl) opavlyk@mtl-world:~/repos/dpctl$ ONEAPI_DEVICE_SELECTOR=opencl:gpu python index.py
0.017089676241286926
```
Changed the left-over update kernel to use coalesced memory access.
@oleksandr-pavlyk oleksandr-pavlyk force-pushed the boolean-indexing-improvements branch from f365bba to a8e7600 Compare December 10, 2024 20:14

Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_319 ran successfully.
Passed: 896
Failed: 0
Skipped: 118


@ndgrigorian (Collaborator) left a comment:


LGTM!


Array API standard conformance tests for dpctl=0.19.0dev0=py310h93fe807_321 ran successfully.
Passed: 893
Failed: 3
Skipped: 118

@oleksandr-pavlyk (Collaborator, Author) commented:

Array API conformance test failures are unrelated. Merging.

@oleksandr-pavlyk oleksandr-pavlyk merged commit 0bcd635 into master Dec 10, 2024
49 of 51 checks passed
@oleksandr-pavlyk oleksandr-pavlyk deleted the boolean-indexing-improvements branch December 10, 2024 22:51