Implement OA retrieve(_outer) and its multiset API #537

Draft: wants to merge 17 commits into base: dev

@sleeepyjack (Collaborator) commented Jul 11, 2024

WIP

Closes #465
Closes #489

@sleeepyjack added the "type: feature request", "P0: Must have", and "topic: static_multiset" labels on Jul 11, 2024
@sleeepyjack added this to the static_multiset milestone on Jul 11, 2024
@sleeepyjack self-assigned this on Jul 11, 2024
@PointKernel added the "helps: rapids" label on Jul 11, 2024
@sleeepyjack added the "help wanted" label on Jul 17, 2024

@sleeepyjack (Collaborator, Author) commented:

The outer test is still failing, and the new implementation is still 1.5x slower than the previous one. Apart from that, the other unit tests pass. The natural next steps are to fix the bug in retrieve_outer (shouldn't be a big deal) and then dive into optimizations. For the latter I could use a second pair of eyes since this kernel is notoriously complex.


int32_t constexpr block_size = cuco::detail::default_block_size();
// int32_t grid_size =
// detail::max_occupancy_grid_size(block_size,

@sleeepyjack (Collaborator, Author) commented on the lines above:

The max_occupancy version is always a bit slower. I still don't understand why that's the case.

@PointKernel (Member) replied:

I noticed it's a better choice only for several cases of the legacy static_multimap::pair_count.
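
For readers following this thread, the two launch strategies being compared can be sketched roughly as below. This is a minimal host-side sketch under stated assumptions, not the code in this PR: `grid_size_for_input` and `grid_size_for_occupancy` are hypothetical helper names, and the assumption is that the commented-out `detail::max_occupancy_grid_size` path sizes the grid to fill the device once and lets the kernel grid-stride over the input. In a real CUDA build, `blocks_per_sm` would typically come from `cudaOccupancyMaxActiveBlocksPerMultiprocessor` and `num_sms` from `cudaDeviceGetAttribute` with `cudaDevAttrMultiProcessorCount`.

```cpp
#include <cstdint>

// Hypothetical helper (not a cuco API): launch one cooperative group of
// `cg_size` threads per input key, so the grid grows with the input size.
constexpr std::int64_t grid_size_for_input(std::int64_t num_keys,
                                           std::int32_t block_size,
                                           std::int32_t cg_size = 1)
{
  // Ceiling division of the total thread count by the block size.
  return (num_keys * cg_size + block_size - 1) / block_size;
}

// Hypothetical helper (not a cuco API): launch just enough blocks to fill
// every SM once; the kernel is then assumed to grid-stride over the input.
constexpr std::int64_t grid_size_for_occupancy(std::int32_t blocks_per_sm,
                                               std::int32_t num_sms)
{
  return static_cast<std::int64_t>(blocks_per_sm) * num_sms;
}
```

With purely illustrative numbers (not measurements from this PR), 1,000,000 keys at a block size of 128 give 7813 blocks for the input-sized grid, while 16 resident blocks per SM on 108 SMs give 1728 blocks for the occupancy-sized grid. The second variant launches far fewer blocks and relies on each thread looping over several keys, which is one place where the two versions can behave differently.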

@PointKernel (Member) commented:

> For the latter I could use a second pair of eyes since this kernel is notoriously complex.

Commenting out the code part by part to find the largest bottleneck is probably the most efficient way.

Successfully merging this pull request may close these issues.

[FEATURE]: Add static_multiset::retrieve_outer
[FEATURE]: Add multiset host-bulk retrieve APIs
2 participants