Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve multi-CTA algorithm #492

Open
wants to merge 7 commits into
base: branch-25.02
Choose a base branch
from

Conversation

anaruse
Copy link
Contributor

@anaruse anaruse commented Nov 25, 2024

It has been reported that when the number of search results is large, for example 100, using the multi-CTA algorithm can cause a decrease in recall. This PR is intended to alleviate this low recall issue.

close #208

@anaruse anaruse requested a review from a team as a code owner November 25, 2024 10:46
Copy link

copy-pr-bot bot commented Nov 25, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the cpp label Nov 25, 2024
@tfeher tfeher added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Nov 25, 2024
@cjnolet
Copy link
Member

cjnolet commented Dec 3, 2024

/ok to test

Copy link
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @anaruse for this PR, it is great to see these improvements. Overall the changes look good, and the benchmarks that you have shared offline look very encouraging. I just have a few questions below.

@tfeher
Copy link
Contributor

tfeher commented Dec 5, 2024

@anaruse there are some unsigned commits, that blocks CI from testing the changes automatically. To fix this issue, could you rebase the PR?

@cjnolet
Copy link
Member

cjnolet commented Dec 5, 2024

/ok to test

…when the number of results is large

Fix some issues

Fix lower recall issue with new multi-cta algo

Removing redundant code and changing some parameters

Update cpp/src/neighbors/detail/cagra/search_plan.cuh

Co-authored-by: Tamas Bela Feher <tfeher@nvidia.com>

Remove an unnecessary line and satisfy clang-format
@anaruse anaruse force-pushed the improved_multi_cta_algo branch from 3dce160 to 6223fd2 Compare December 5, 2024 06:58
Copy link
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Akira for the updates, the PR looks good to me.

@cjnolet
Copy link
Member

cjnolet commented Dec 5, 2024

/merge

@cjnolet
Copy link
Member

cjnolet commented Dec 5, 2024

/ok to test

@tfeher
Copy link
Contributor

tfeher commented Dec 5, 2024

/ok to test

@cjnolet
Copy link
Member

cjnolet commented Dec 5, 2024

/merge

@cjnolet
Copy link
Member

cjnolet commented Dec 5, 2024

/ok to test

@cjnolet
Copy link
Member

cjnolet commented Dec 5, 2024

/merge

@cjnolet
Copy link
Member

cjnolet commented Dec 5, 2024

We are on the brink of missing code freeze for this PR. Please anyone reading this, don't click the "update" button. It inserts a merge commit, which reruns CI in its entirety and this is not needed to merge the PR. We can re-run individual flaky tests that fail without having to rerun the entire CI (the former takes minutes and the latter can take several hours).

@cjnolet
Copy link
Member

cjnolet commented Dec 6, 2024

@anaruse @tfeher CI seems to be running successfully for other PRs but the gtests seem to be consistently timing out for this PR. As far as I can tell, there's no updates to any of the tests, in this PR, but the timeouts don't seem flaky, they seem isolated to these changes, somehow.

We are pushing back code freeze by 1 day. Do you guys think we can still make this in time for 24.12?

@achirkin achirkin changed the base branch from branch-24.12 to branch-25.02 December 9, 2024 09:55
Handle the case when the search result contains invalid indices when building the updated graph in add_nodes.
For debugging purposes, fail if any invalid indices found; in future, we can replace RAFT_FAIL with RAFT_LOG_WARN to make the add_nodes routine more robust.
@achirkin
Copy link
Contributor

achirkin commented Dec 9, 2024

I took the liberty to add a workaround to add_nodes, which handles the case when CAGRA search doesn't return enough valid indices. With this, the tests should fail with a descriptive message in place of the segfault.
When we find the source of the bug, we can relax the RAFT_FAIL with RAFT_LOG_WARN.

@achirkin
Copy link
Contributor

achirkin commented Dec 9, 2024

/ok to test

@anaruse
Copy link
Contributor Author

anaruse commented Dec 10, 2024

The cause has not yet been identified, but it seems that this problem only occurs when the dimensionality of the dataset is 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cpp improvement Improves an existing functionality non-breaking Introduces a non-breaking change
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

[BUG] Decreasing recall when increasing threads in ANN benchmark
4 participants