
Argsort performance improvement #1859

Merged
4 commits merged into master from argsort-performance-improvement
Oct 9, 2024

Conversation

oleksandr-pavlyk
Collaborator

@oleksandr-pavlyk oleksandr-pavlyk commented Oct 9, 2024

This change modifies the implementation of the tensor.argsort function, making it about 2x faster.

Instead of implementing argsort as a sort over (index, value) structures
with a subsequent projection to index, it is now implemented as a sort
over the linear indices themselves, using a comparator that dereferences
the values, with a subsequent mapping from linear index to row-wise index.

On Iris Xe, the tensor.argsort call took 215 ms to find the sorting permutation of a
vector of 5670000 elements of type int32; it now takes 106 ms.

The new implementation no longer makes temporary allocations for
storing indices. Previously, it would allocate
2*(sizeof(ValueT) + sizeof(IndexT)); now it uses only the output
allocation.

Existing tests cover this change.


  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • Have you added documentation for your changes, if necessary?
  • Have you added your changes to the changelog?
  • If this PR is a work in progress, are you opening the PR as a draft?


On Iris Xe, the argsort call took 215 ms to argsort 5670000 elements of
type int32; it now takes 117 ms.

The new implementation also has a smaller temporary allocation footprint.
Previously, it would allocate 2*(sizeof(ValueT) + sizeof(IndexT)); now
it only allocates sizeof(IndexT) for storing linear indices.

A follow-up commit eliminates the use of the temporary allocation altogether,
cutting argsort execution time from 116 ms to 110 ms for a 5670000-element
array of type int32_t.

github-actions bot commented Oct 9, 2024

Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞


github-actions bot commented Oct 9, 2024

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_118 ran successfully.
Passed: 894
Failed: 1
Skipped: 119


github-actions bot commented Oct 9, 2024

Array API standard conformance tests for dpctl=0.19.0dev0=py310hdf72452_119 ran successfully.
Passed: 894
Failed: 1
Skipped: 119

@coveralls
Collaborator

coveralls commented Oct 9, 2024

Coverage Status

coverage: 87.669% (-0.2%) from 87.907%
when pulling aeb1b1f on argsort-performance-improvement
into d5de65b on master.

@oleksandr-pavlyk
Collaborator Author

The drop in coverage must be related to changes in the Coveralls analysis code. This PR did not change the files for which Coveralls reports reduced coverage.

Collaborator

@ndgrigorian ndgrigorian left a comment


This LGTM, the performance improvement is very nice. Thank you @oleksandr-pavlyk !

@oleksandr-pavlyk oleksandr-pavlyk merged commit 2037d49 into master Oct 9, 2024
49 checks passed
@oleksandr-pavlyk oleksandr-pavlyk deleted the argsort-performance-improvement branch October 9, 2024 17:46
3 participants