Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KNN] Add comment and remove duplicate code #13594

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

dungba88
Copy link
Contributor

@dungba88 dungba88 commented Jul 20, 2024

Description

  • Add comments to methods for easier to understand their behaviors
  • Remove duplicate codes to createBitSet used in both the main top-k KNN query and vector similarity query
  • Also remove unused variable weight

To discuss: The createFilterWeight might also merit to be shared with the vector similarity query. There is a discrepancy between the 2 classes: the top-k KNN query enhance the filter so that only documents having values on the KNN field will be matched, but the vector similarity query doesn't. I think they should share similar behavior as they are both vector search.

Moreover, should we make their behavior more closely so that they can share/extend one another? The top-k KNN query has the main logic inside rewriteQuery while vector-similarity query has the main logic in Weight. (I now understand that for top-k Query we need to get top-k across all segments and thus cannot just do it in createWeights, which only acts as segment-level matcher). Query timeout check is also not currently supported by vector similarity query.

@kaivalnp
Copy link
Contributor

+1 to share as much logic as possible (including createFilterWeight). The FieldExistsQuery proposal (to only collect pre-filtered docs which have vectors) seems promising too

they can share/extend one another?

IMO we should not tie AbstractVectorSimilarityQuery and AbstractKnnVectorQuery because they compute results differently (in the scorer v/s rewrite) which has broader implications (eg the similarity query is cacheable, inherent parallelism across segments, etc)

Maybe some common KNN utility classes / functions instead?

@dungba88
Copy link
Contributor Author

I think common utility makes sense. I'll move both createFilterWeights and createBitSet to a utility class.

@dungba88
Copy link
Contributor Author

There were some recent commits I need to rebase first as well.

Copy link
Contributor

@kaivalnp kaivalnp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, thanks!

Comment on lines 24 to 25
/** Common utilities for KNN queries. */
class KnnQueryUtils {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Should we name it something like Ann* (Approximate Near Neighbor *) to fit with *VectorSimilarityQuery too? (Knn* sounds specific to Knn*VectorQuery)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense to me. I'll change it to AnnQueryUtils

Copy link

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Aug 22, 2024
@dungba88
Copy link
Contributor Author

Thanks Kaival for reviewing and approving.

Could someone from Lucene committers help review and merge this PR if it looks good?

import org.apache.lucene.util.Bits;

/** Common utilities for KNN queries. */
class KnnQueryUtils {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class KnnQueryUtils {
final class KnnQueryUtils {

import org.apache.lucene.util.Bits;

/** Common utilities for ANN queries. */
class AnnQueryUtils {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be a final class

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And can we name it KnnQueryUtils since all the queries are Knn queries?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change it to a final class.

It's also used by VectorSimilarityQuery which is not exactly KNN (k-nearest) though.

protected final int k;

/** the filter to executed before KNN search */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the filter is executed before search to gather the relevant bit set, we should be clear that when the filter is applied is up to the underlying knn index.

@@ -116,6 +115,7 @@ private TopDocs searchLeaf(
return results;
}

// Execute the filter if any and perform KNN search at each segment.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment doesn't make sense to me. If we really want to add java docs to private methods, lets fix the wording. Maybe something like

"Perform kNN search for the provided LeafReaderContext applying filterWeight as necessary"

@github-actions github-actions bot removed the Stale label Nov 15, 2024
Copy link

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Nov 29, 2024
@github-actions github-actions bot removed the Stale label Dec 15, 2024
@dungba88
Copy link
Contributor Author

@benwtrent sorry for the delay. I have addressed the comments. Can you please take a look? Thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants