Argsort performance improvement #1859
Conversation
Instead of implementing argsort as a sort over (index, value) structures with a subsequent projection to the index, it is now implemented as a sort over the linear indices themselves, using a dereferencing comparator, followed by a mapping from linear index to row-wise index. On Iris Xe, an argsort call took 215 ms to argsort 5670000 elements of type int32, and it now takes 117 ms. The new implementation also has a smaller temporary allocation footprint: previously it would allocate 2*(sizeof(ValueT) + sizeof(IndexT)), now it only allocates sizeof(IndexT) for storing linear indices.
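The core idea can be sketched on the host with the standard library (a minimal illustration only; the actual dpctl kernels are SYCL device code, and the names here are illustrative): rather than building an array of (index, value) pairs and sorting those, we sort an array of linear indices with a comparator that dereferences into the value array.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Sketch of argsort as a sort over linear indices with a dereferencing
// comparator. Only the index buffer is allocated; values are never moved.
std::vector<std::int64_t> argsort_1d(const std::vector<std::int32_t> &v)
{
    std::vector<std::int64_t> idx(v.size());
    std::iota(idx.begin(), idx.end(), 0); // linear indices 0..n-1

    // Dereferencing comparator: compares the values the indices point at,
    // so the sort permutes indices, not (index, value) structures.
    std::stable_sort(idx.begin(), idx.end(),
                     [&v](std::int64_t i, std::int64_t j) { return v[i] < v[j]; });
    return idx;
}
```

Because only `sizeof(IndexT)` per element is materialized (the index buffer), the temporary footprint shrinks relative to the pair-based scheme, which carried both a value and an index per element through the sort.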
Eliminate the use of temporary allocations altogether, cutting argsort execution time from 116 ms to 110 ms for a 5670000-element array of type int32_t.
Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_118 ran successfully.
Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_119 ran successfully.
The drop in coverage must be related to changes in the Coveralls analysis code. This PR did not change any of the files for which Coveralls reports reduced coverage.
This LGTM, the performance improvement is very nice. Thank you @oleksandr-pavlyk !
This change modifies the implementation of the tensor.argsort function, making it about 2x faster. Instead of implementing argsort as a sort over (index, value) structures with a subsequent projection to the index, it is now implemented as a sort over the linear indices themselves, using a dereferencing comparator, followed by a mapping from linear index to row-wise index.

On Iris Xe, a tensor.argsort call took 215 ms to find the sorting permutation for a vector of 5670000 elements of type int32, and it now takes 106 ms.

The new implementation no longer makes temporary allocations for storing indices. Previously, it would allocate 2*(sizeof(ValueT) + sizeof(IndexT)); now it just uses the output allocation.
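The final mapping step mentioned above can be illustrated with a small host-side sketch (an assumed layout for illustration, not the dpctl kernel itself): for a C-contiguous m-by-n array argsorted along the last axis, each linear index `li` reduces to the row-wise index `li % n`.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Sketch: argsort each row of a C-contiguous m x n array by sorting linear
// indices per row with a dereferencing comparator, then map each linear
// index to its row-wise index via modulo by the row length.
std::vector<std::int64_t> argsort_rows(const std::vector<std::int32_t> &a,
                                       std::size_t m, std::size_t n)
{
    std::vector<std::int64_t> idx(m * n);
    std::iota(idx.begin(), idx.end(), 0); // linear indices 0..m*n-1

    for (std::size_t r = 0; r < m; ++r) {
        auto first = idx.begin() + r * n;
        std::stable_sort(first, first + n,
                         [&a](std::int64_t i, std::int64_t j) { return a[i] < a[j]; });
    }

    // Map linear index -> row-wise index in place: no extra buffer is needed,
    // which is why the result can live directly in the output allocation.
    for (auto &li : idx)
        li %= static_cast<std::int64_t>(n);
    return idx;
}
```

Performing the mapping in place over the already-sorted index buffer is what allows the implementation to reuse the output allocation rather than a temporary one.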
Tests exist.